AI RiskAtlas

Training-Data Rights & Provenance

medium · Data & knowledge
Also known as: copyright in training data, data consent, memorization / regurgitation

Definition

Models are trained on huge piles of images, audio, and text — often scraped without clear permission. That raises copyright and consent problems, and the model can sometimes memorize and spit back its training examples (a watermark, a real photo, private text).

Where it attaches

The system components this risk arises at.

📚 Training Corpus · 📥 Ingestion Pipeline · 🧬 Model Weights & Registry · 🧠 LLM

Detection signals

  • Training data with no licence/provenance record or opt-out handling
  • Outputs reproducing watermarks, signatures, or near-verbatim sources
  • Scraped biometric (face/voice) data with no consent basis
  • Inability to trace a sample's lineage for a takedown/audit request
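The second signal (near-verbatim reproduction) can be approximated by measuring word n-gram overlap between a model output and known source texts. The following is a minimal sketch, not a production detector: the function names, the 5-word window, and the 0.5 threshold are all illustrative assumptions, and it will miss paraphrases and image or audio regurgitation.

```python
def ngrams(text: str, n: int = 5) -> set:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output: str, source: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in the source."""
    out = ngrams(output, n)
    if not out:
        # Output shorter than n words: nothing to compare.
        return 0.0
    return len(out & ngrams(source, n)) / len(out)


def flag_regurgitation(output: str, sources: dict, threshold: float = 0.5, n: int = 5) -> list:
    """Return ids of sources the output overlaps with at or above threshold.

    `sources` maps a hypothetical source id to its text; the threshold
    is an assumed value that would need tuning on real data.
    """
    return [sid for sid, text in sources.items()
            if overlap_ratio(output, text, n) >= threshold]
```

A flagged id points a reviewer at the source to compare manually; the ratio alone does not establish that copying occurred.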

Controls & guardrails that address this (8)

Grouped by control function, with the AI lifecycle stage(s) at which each control applies and the other risks it addresses.

Preventive · 5
Declared data sources and provenance at intake

Declare all planned training and test data sources at use case intake, with provenance status for each.

Lifecycle stage: 1 – Use Case Context & Design
Post hoc interpretability techniques

Plan the interpretability approach at the design stage so that source provenance can be traced and disclosed to users.

Lifecycle stage: 1 – Use Case Context & Design
Documented data provenance during collection

Document actual provenance for each data source during collection: origins, methods, timestamps, custodian identity.

Lifecycle stage: 2 – Data Acquisition & Processing
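One way to picture such a record is a small structure keyed by a content hash, so a sample's lineage can be looked up later for a takedown or audit request. This is a sketch under stated assumptions: `ProvenanceRecord`, `ProvenanceLog`, and their field names are hypothetical, and an in-memory dictionary stands in for whatever durable store a real pipeline would use.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    sample_id: str           # SHA-256 of the sample content
    origin: str              # where the data came from
    method: str              # how it was collected
    collected_at: str        # ISO 8601 timestamp (UTC)
    custodian: str           # who is responsible for the data
    licence: Optional[str]   # None means no licence is on record


class ProvenanceLog:
    """Hypothetical in-memory log; a real system would persist this."""

    def __init__(self) -> None:
        self._records = {}

    def record(self, content: bytes, origin: str, method: str,
               custodian: str, licence: Optional[str] = None) -> ProvenanceRecord:
        rec = ProvenanceRecord(
            sample_id=hashlib.sha256(content).hexdigest(),
            origin=origin,
            method=method,
            collected_at=datetime.now(timezone.utc).isoformat(),
            custodian=custodian,
            licence=licence,
        )
        self._records[rec.sample_id] = rec
        return rec

    def lineage(self, content: bytes) -> Optional[ProvenanceRecord]:
        """Look up the record for a sample, or None if it was never logged."""
        return self._records.get(hashlib.sha256(content).hexdigest())
```

Keying on a content hash means the lookup only succeeds for byte-identical samples; anything re-encoded or cropped after collection would need its own record.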
Confidence scoring

Apply data quality scoring to all acquired data to document provenance reliability. Flag low-confidence sources for review.

Lifecycle stage: 2 – Data Acquisition & Processing
Also addresses: Hallucination
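A simple form of such scoring is a weighted checklist over each source's provenance metadata. The sketch below is illustrative only: the field names, weights, and 0.6 review threshold are assumptions chosen for the example, not values taken from any framework.

```python
# Hypothetical weights: how much each piece of evidence contributes.
WEIGHTS = {
    "has_licence": 0.4,
    "has_origin": 0.3,
    "has_consent_basis": 0.2,
    "has_timestamp": 0.1,
}
REVIEW_THRESHOLD = 0.6  # assumed cut-off; sources scoring below it are flagged


def provenance_confidence(source: dict) -> float:
    """Sum the weights of the evidence fields that are present and truthy."""
    return round(sum(w for key, w in WEIGHTS.items() if source.get(key)), 2)


def flag_for_review(sources: list, threshold: float = REVIEW_THRESHOLD) -> list:
    """Return the names of sources whose score falls below the threshold."""
    return [s["name"] for s in sources if provenance_confidence(s) < threshold]
```

For example, a scraped source with a known origin and timestamp but no licence or consent basis scores 0.4 under these weights and would be flagged for review.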
Weight provenance, hashing & pre-deploy evals

Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
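The "hasn't been swapped" check is commonly done by comparing a file digest against a value recorded at registration time. A minimal sketch using Python's standard `hashlib` follows; the function names and the idea of an `expected_sha256` string held in a registry are assumptions for illustration.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def weights_match_registry(path: str, expected_sha256: str) -> bool:
    """Compare the file's digest with the digest recorded in the registry.

    `expected_sha256` is assumed to be a hex string stored when the
    weights were registered; the comparison is case-insensitive.
    """
    return hmac.compare_digest(sha256_file(path), expected_sha256.lower())
```

A matching digest only shows the file is unchanged since registration; it says nothing about how the weights were trained, which is why the control pairs it with provenance records and pre-deploy evaluations.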

Detective · 2
Provenance & content signing

Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.
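As a rough illustration, the label can be a small record holding the source name and a content hash, signed with a keyed hash (HMAC) so that a changed document or a forged label fails verification. The function names and label fields here are hypothetical, and a shared secret key is an assumed simplification; a real deployment might use asymmetric signatures instead.

```python
import hashlib
import hmac
import json


def _payload(source: str, text: str) -> bytes:
    """Canonical bytes covering both the source name and the content hash."""
    label = {"source": source,
             "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest()}
    return json.dumps(label, sort_keys=True).encode("utf-8")


def sign_document(text: str, source: str, key: bytes) -> dict:
    """Return a provenance label for the document, signed with `key`."""
    signature = hmac.new(key, _payload(source, text), hashlib.sha256).hexdigest()
    return {"source": source, "signature": signature}


def verify_document(text: str, label: dict, key: bytes) -> bool:
    """True only if the text and source still match the signed label."""
    expected = hmac.new(key, _payload(label.get("source", ""), text),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, label.get("signature", ""))
```

A retrieval pipeline could then treat documents that fail `verify_document` as untrusted web text rather than company content.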


Framework mappings

OWASP LLM Top 10
MITRE ATLAS
NIST AI RMF
  • MAP 4.1
  • MEASURE 2.10

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan