#13

Unclear provenance for training/test data

Definition

The data used to train and test the model cannot be convincingly and comprehensively traced. This creates challenges for audit, disclosure, and compliance, and carries the risk that the financial institution (FI) does not have the right to use the data.

Controls & guardrails that address this

Controls are grouped by control function. Each entry lists the AI lifecycle stage(s) at which to apply it and any other risks it addresses.

Preventive · 4

Declared data sources and provenance at intake

Declare all planned training and test data sources at use case intake, with provenance status for each.

Lifecycle stage: 1 – Use Case Context & Design
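
One way to make the intake declaration checkable is to hold it as structured data rather than free text. The sketch below is illustrative only: the status values, the `DeclaredSource` fields, the `unresolved_sources` helper, and the source names are all assumptions, not part of the control itself.

```python
from dataclasses import dataclass
from enum import Enum


class ProvenanceStatus(Enum):
    # Hypothetical status values; an FI would define its own.
    VERIFIED = "verified"   # origin and usage rights confirmed
    PENDING = "pending"     # verification in progress
    UNKNOWN = "unknown"     # origin or rights cannot yet be traced


@dataclass(frozen=True)
class DeclaredSource:
    name: str
    purpose: str            # "training" or "test"
    status: ProvenanceStatus


def unresolved_sources(sources):
    """Return the names of declared sources whose provenance is not yet verified."""
    return [s.name for s in sources if s.status is not ProvenanceStatus.VERIFIED]


# Example intake declaration with made-up source names.
intake = [
    DeclaredSource("internal_loan_history", "training", ProvenanceStatus.VERIFIED),
    DeclaredSource("vendor_news_feed", "training", ProvenanceStatus.PENDING),
    DeclaredSource("public_benchmark_set", "test", ProvenanceStatus.UNKNOWN),
]

print(unresolved_sources(intake))
```

A reviewer could use the returned list as the set of sources that need follow-up before the use case moves past design.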

Post hoc interpretability techniques

Plan the interpretability approach at the design stage so that source provenance can be traced and disclosed to users.

Lifecycle stage: 1 – Use Case Context & Design

Documented data provenance during collection

Document actual provenance for each data source during collection: origins, methods, timestamps, custodian identity.

Lifecycle stage: 2 – Data Acquisition & Processing
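
The four items named in this control (origins, methods, timestamps, custodian identity) map naturally onto a fixed record captured at collection time. The following is a minimal sketch under assumptions: the `ProvenanceRecord` shape, the `record_collection` helper, and the example values are hypothetical, and a real implementation would write to whatever metadata store the FI already uses.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    source_name: str
    origin: str              # where the data came from
    collection_method: str   # how it was obtained
    collected_at: str        # ISO 8601 timestamp in UTC
    custodian: str           # person or team accountable for the data


def record_collection(source_name, origin, collection_method, custodian,
                      collected_at=None):
    """Build a provenance record, defaulting the timestamp to the current UTC time."""
    ts = collected_at or datetime.now(timezone.utc)
    return ProvenanceRecord(source_name, origin, collection_method,
                            ts.isoformat(), custodian)


# Example values are invented for illustration.
rec = record_collection(
    "vendor_news_feed",
    "Example Vendor Ltd (hypothetical)",
    "licensed API export",
    "data-office@example.com",
    collected_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)

# Serialise so the record can be stored alongside the dataset.
print(json.dumps(asdict(rec), indent=2))
```

Making the record immutable (`frozen=True`) is a design choice in this sketch: corrections become new records rather than silent edits, which keeps the trail auditable.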

Confidence scoring

Apply data quality scoring to all acquired data to document provenance reliability. Flag low-confidence sources for review.

Lifecycle stage: 2 – Data Acquisition & Processing

Also addresses: Hallucination
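
The control does not prescribe a scoring method. As one possible illustration, the sketch below scores each source by how many provenance fields are actually documented and flags anything under a threshold. The completeness-based score, the field list, the `REVIEW_THRESHOLD` value, and the example records are all assumptions chosen for the example.

```python
# Fields assumed to be required for a source to count as fully documented.
REQUIRED_FIELDS = ("origin", "collection_method", "collected_at", "custodian")

# Hypothetical cut-off; sources scoring below it are sent for review.
REVIEW_THRESHOLD = 0.75


def provenance_confidence(record):
    """Return the fraction of required provenance fields that are non-empty."""
    present = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return present / len(REQUIRED_FIELDS)


def flag_for_review(records):
    """Return the sorted names of sources whose score is below the threshold."""
    return sorted(
        name for name, record in records.items()
        if provenance_confidence(record) < REVIEW_THRESHOLD
    )


# Example records with invented values.
records = {
    "internal_loan_history": {
        "origin": "core banking system",
        "collection_method": "nightly batch export",
        "collected_at": "2024-01-15T09:30:00+00:00",
        "custodian": "credit-risk-data",
    },
    "scraped_forum_posts": {
        "origin": "public web",
        "collection_method": "crawler",
        "collected_at": "",
        "custodian": "",
    },
}

for name, record in records.items():
    print(name, provenance_confidence(record))
print(flag_for_review(records))
```

Completeness is only a proxy for reliability. A fuller score might also weigh licence terms or how the origin was verified.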

Other risks in Transparency

#12 Unclear output accuracy · #14 Lack of explainability · #15 Anthropomorphism