#26

Training data or inputs not fit for purpose

Definition

The data used to train the model is not representative of the geographical and cultural context in which the model will be used, or is not aligned with the system's intended goal, leading to incorrect outputs.

Controls & guardrails that address this

Controls are grouped by control function. Each entry lists the AI lifecycle stage(s) at which to apply it and the other risks it addresses.

Preventive · 3

Training data fitness requirements at design

Define training data fitness requirements at the design stage, including domain coverage, recency, and format specifications.

Lifecycle stage: 1 – Use Case Context & Design

AI onboarding using domain data

Plan the domain data strategy at the design stage: identify the sources that best cover the target operational distribution.

Lifecycle stages: 1 – Use Case Context & Design; 2 – Data Acquisition & Processing

Input filtering

Screen acquired training data with automated fitness checks (domain relevance, recency, format conformity) and reject data that does not conform.

Lifecycle stage: 2 – Data Acquisition & Processing

Also addresses: Model Drift & Silent Degradation; Knowledge / Training Data Poisoning; Sensitive Data Leakage
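The automated fitness checks described under Input filtering could be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the record fields (`domain`, `collected_on`, `text`), the `FitnessRequirements` class, and the function names are all hypothetical, and real checks would follow whatever requirements were fixed at the design stage.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class FitnessRequirements:
    """Hypothetical fitness requirements agreed at the design stage."""
    allowed_domains: frozenset   # domain coverage
    max_age_days: int            # recency
    required_fields: frozenset   # format specification


def check_record(record, req, today):
    """Return the list of failed checks; an empty list means the record conforms."""
    failures = []
    missing = req.required_fields - set(record)
    if missing:
        failures.append("format: missing " + ", ".join(sorted(missing)))
    if record.get("domain") not in req.allowed_domains:
        failures.append("domain relevance")
    collected = record.get("collected_on")
    too_old = isinstance(collected, date) and today - collected > timedelta(days=req.max_age_days)
    if not isinstance(collected, date) or too_old:
        failures.append("recency")
    return failures


def screen(records, req, today):
    """Split records into (accepted, rejected); rejected records keep their failure reasons."""
    accepted, rejected = [], []
    for rec in records:
        failures = check_record(rec, req, today)
        if failures:
            rejected.append((rec, failures))
        else:
            accepted.append(rec)
    return accepted, rejected
```

Keeping the failure reasons alongside each rejected record is a design choice of this sketch: it lets the rejection log double as evidence of which fitness requirement the acquired data most often misses.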

Detective · 2

Synthetic evaluation datasets

Construct synthetic evaluation datasets that target the operational edge cases identified in the S2 gap analysis. Use them as a regression baseline.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Hallucination; Overreliance / Automation Bias; Model Drift & Silent Degradation
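One way to use a synthetic evaluation set as a regression baseline is sketched below. The edge-case categories, the generator functions, and the acceptance check are invented for illustration; in practice they would come from the gap analysis and from the system's own acceptance criteria.

```python
import random

# Hypothetical edge-case generators; real categories would come from the gap analysis.
EDGE_CASES = {
    "empty_input": lambda rng: "",
    "very_long_input": lambda rng: "x" * rng.randint(5_000, 10_000),
    "mixed_script": lambda rng: rng.choice(["abc", "données", "日本語"]) + " 123",
}


def build_eval_set(n_per_case, seed=0):
    """Generate a reproducible synthetic evaluation set, n_per_case examples per category."""
    rng = random.Random(seed)
    return [
        {"case": name, "input": make(rng)}
        for name, make in EDGE_CASES.items()
        for _ in range(n_per_case)
    ]


def pass_rates(model, eval_set, is_acceptable):
    """Fraction of acceptable outputs per edge-case category."""
    totals, passes = {}, {}
    for ex in eval_set:
        totals[ex["case"]] = totals.get(ex["case"], 0) + 1
        if is_acceptable(ex, model(ex["input"])):
            passes[ex["case"]] = passes.get(ex["case"], 0) + 1
    return {case: passes.get(case, 0) / totals[case] for case in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Categories where the candidate scores below the baseline by more than the tolerance."""
    return sorted(c for c in baseline if candidate.get(c, 0.0) < baseline[c] - tolerance)
```

The baseline pass rates would be recorded once for an approved model version, then compared against each later candidate on the same seeded evaluation set.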

Robustness testing

Monitor production input distributions for drift from the training data distribution. Trigger retraining when covariate shift is confirmed.

Lifecycle stage: 5 – Usage, Monitoring & Change

Also addresses: Hallucination; Overreliance / Automation Bias; Model Drift & Silent Degradation
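As a rough sketch of the drift monitoring described above, the snippet below compares one numeric input feature in production against its training distribution using the Population Stability Index (PSI). PSI is only one of several possible drift measures and is not named in the source. The 0.2 threshold is a common rule of thumb, and treating drift as "confirmed" only after three consecutive windows is an assumption made here for illustration.

```python
import math


def psi(reference, production, n_bins=10, eps=1e-6):
    """Population Stability Index of one numeric feature, binned on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values into edge bins
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)

    ref, prod = proportions(reference), proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


def drift_confirmed(reference, windows, threshold=0.2, consecutive=3):
    """True when each of the last `consecutive` production windows exceeds the PSI threshold."""
    recent = windows[-consecutive:]
    return len(recent) == consecutive and all(psi(reference, w) > threshold for w in recent)
```

Requiring several consecutive windows before signalling retraining is meant to avoid reacting to a single unusual batch; the window size and count would need tuning per use case.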

Other risks in Robustness & Stability

#24 Hallucination / Fabrication / Confabulation
#25 Overconfidence
#27 Lack of continuous monitoring
#28 Insufficient data quality
#29 Model staleness
#30 Insufficient model accuracy / soundness
#31 Model degradation from unexpected use
#32 Inadequate operational resilience
#33 Unmet architectural requirements
#34 Lack of reproducibility
#44 Disruption to connected systems