Training-Data Rights & Provenance
medium · Data & knowledge
Definition
Models are trained on very large collections of images, audio, and text, often scraped without clear permission. That raises copyright and consent problems, and the model can sometimes memorize and reproduce its training examples (a watermark, a real photo, private text).
Where it attaches
The system components where this risk arises.
Detection signals
- Training data with no licence/provenance record or opt-out handling
- Outputs reproducing watermarks, signatures, or near-verbatim sources
- Scraped biometric (face/voice) data with no consent basis
- Inability to trace a sample's lineage for a takedown/audit request
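One of the signals above, outputs that reproduce near-verbatim sources, can be approximated for text with a word n-gram overlap check. The following is a minimal sketch, not a definitive detector: the function names, the 5-word window, and the 0.5 threshold are all illustrative assumptions, and it does not cover images, audio, or paraphrased copies.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output: str, source: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also appear in the source."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out_grams & word_ngrams(source, n)) / len(out_grams)


def flag_near_verbatim(output: str, sources: dict[str, str],
                       n: int = 5, threshold: float = 0.5) -> list[str]:
    """Return ids of sources whose overlap with the output meets the threshold.

    The threshold is an assumed value; tune it on known-good and known-bad cases.
    """
    return [sid for sid, text in sources.items()
            if overlap_ratio(output, text, n) >= threshold]


# Hypothetical example data
sources = {
    "doc-1": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc-2": "a licence record should accompany every training sample in the corpus",
}
output = "the quick brown fox jumps over the lazy dog"
flagged = flag_near_verbatim(output, sources)  # ["doc-1"]
```

Exact n-gram matching is cheap and easy to explain in an audit, but it misses lightly edited copies; a production check would likely need fuzzy or embedding-based matching as well.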
Controls & guardrails that address this
8 controls, grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.
Declare all planned training and test data sources at use case intake, with provenance status for each.
Plan the interpretability approach at the design stage to ensure source provenance can be traced and disclosed to users.
Document actual provenance for each data source during collection: origins, methods, timestamps, custodian identity.
Apply data quality scoring to all acquired data to document provenance reliability. Flag low-confidence sources for review.
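The provenance fields named above (origins, methods, timestamps, custodian identity) and the scoring step could be captured roughly as follows. This is a sketch under stated assumptions: the `SourceRecord` fields beyond those four, the point weights, and the review threshold of 70 are hypothetical and would need to be set by the organisation's own policy.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SourceRecord:
    source_id: str
    origin: Optional[str] = None             # where the data came from
    collection_method: Optional[str] = None  # e.g. "licensed feed", "web scrape"
    collected_at: Optional[datetime] = None  # when it was collected
    custodian: Optional[str] = None          # who is responsible for it
    licence: Optional[str] = None            # assumed extra field: licence or consent basis


# Hypothetical weights in points out of 100; not taken from any standard.
WEIGHTS = {
    "origin": 25,
    "collection_method": 15,
    "collected_at": 15,
    "custodian": 15,
    "licence": 30,
}


def provenance_score(record: SourceRecord) -> int:
    """Sum the points for every provenance field that is actually filled in."""
    return sum(points for name, points in WEIGHTS.items() if getattr(record, name))


def needs_review(records: list[SourceRecord], threshold: int = 70) -> list[str]:
    """Return ids of sources scoring below the (assumed) review threshold."""
    return [r.source_id for r in records if provenance_score(r) < threshold]


# Hypothetical example records
licensed = SourceRecord(
    source_id="vendor-images",
    origin="Example Vendor Ltd",
    collection_method="licensed feed",
    collected_at=datetime(2024, 1, 15, tzinfo=timezone.utc),
    custodian="data-governance team",
    licence="commercial licence on file",
)
scraped = SourceRecord(
    source_id="web-crawl",
    origin="public web",
    collection_method="web scrape",
)
to_review = needs_review([licensed, scraped])  # ["web-crawl"]
```

Scoring only on field completeness is deliberately simple. It shows which sources lack documentation, but it does not verify that a recorded licence is valid.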
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
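Keeping a label on every document, and being able to trace a sample back to its source for a takedown or audit request, can be sketched with a small content-hash index. The class and method names here are hypothetical, and the sketch assumes exact-match hashing of text; resized images, re-encoded audio, or edited text would need a perceptual or fuzzy fingerprint instead.

```python
import hashlib
from collections import defaultdict
from typing import Optional


class LineageIndex:
    """Maps content fingerprints to source labels, and sources to fingerprints."""

    def __init__(self) -> None:
        self._label_by_hash: dict[str, dict] = {}
        self._hashes_by_source: defaultdict[str, set[str]] = defaultdict(set)

    @staticmethod
    def fingerprint(content: str) -> str:
        """SHA-256 of the UTF-8 text; only identical content matches."""
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def register(self, content: str, source_id: str, trusted: bool) -> str:
        """Label a document with its source and a trusted/untrusted flag."""
        digest = self.fingerprint(content)
        self._label_by_hash[digest] = {"source_id": source_id, "trusted": trusted}
        self._hashes_by_source[source_id].add(digest)
        return digest

    def label_for(self, content: str) -> Optional[dict]:
        """Look up where a piece of content came from, or None if unknown."""
        return self._label_by_hash.get(self.fingerprint(content))

    def samples_from(self, source_id: str) -> set[str]:
        """All fingerprints from one source, e.g. to action a takedown request."""
        return set(self._hashes_by_source.get(source_id, set()))


# Hypothetical example usage
index = LineageIndex()
index.register("Internal policy on data retention.", "company-wiki", trusted=True)
index.register("Random forum post about retention.", "web-crawl", trusted=False)

label = index.label_for("Internal policy on data retention.")
takedown_set = index.samples_from("web-crawl")
```

An unknown document returns `None` from `label_for`, which is itself the detection signal: content with no recorded lineage.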
Framework mappings
- MAP 4.1
- MEASURE 2.10