Definition
Deliberate manipulation of the model by a malicious actor, through the introduction of malicious data at initial training or during use. This can lead to security vulnerabilities or inaccurate and harmful outputs.
Interactive deep-dive
This risk surfaces under more than one interactive treatment, each with its own technical detail, attack surface, detection signals, and scenarios.
Controls & guardrails that address this
101 proposed. Grouped by control function, with the AI lifecycle stage(s) at which to apply each and the other risks it addresses.
Design strict role-based access control (RBAC) for training data repositories at the design stage. Define an approved contributor list and an approval workflow.
Apply anomaly detection on the training data ingestion pipeline to identify poisoned or tampered batches.
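A minimal sketch of this control, assuming each ingestion batch can be summarised by a single statistic: every batch is compared against the others using the median and the median absolute deviation (MAD), and batches with an extreme modified z-score are flagged. The batch ids, the positive-label rate used as the statistic, and the 3.5 threshold are illustrative assumptions, not part of the control as stated.

```python
from statistics import median


def robust_z_scores(values):
    """Return modified z-scores based on the median and the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half of the values equal the median; treat any deviation as extreme.
        return [0.0 if v == med else float("inf") for v in values]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [0.6745 * (v - med) / mad for v in values]


def flag_batches(batch_stats, threshold=3.5):
    """Return the ids of batches whose statistic is an outlier."""
    ids = list(batch_stats)
    scores = robust_z_scores([batch_stats[i] for i in ids])
    return [i for i, s in zip(ids, scores) if abs(s) > threshold]


# Hypothetical per-batch fraction of examples labelled "positive".
positive_rate = {
    "batch-001": 0.31, "batch-002": 0.29, "batch-003": 0.30,
    "batch-004": 0.32, "batch-005": 0.30, "batch-006": 0.71,
}
flagged = flag_batches(positive_rate)
print(flagged)
```

A single summary statistic will miss poisoning that preserves the label balance, so in practice this kind of check would likely sit alongside content-level detectors rather than replace them.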
Define and approve the source allow-list and write-time scanning during build. Before go-live, prove that non-allow-listed and injection-bearing writes are rejected.
source: OWASP Top 10 for LLM Apps LLM04:2025 Data and Model Poisoning, LLM08:2025 Vector and Embedding Weaknesses; NIST SP 800-53 AC-3 / SI-7
Conduct a data poisoning threat assessment at the design stage. Identify likely attack vectors and assign risk ratings.
Simulate data poisoning attacks (backdoor, label flipping, gradient-based) to assess model resilience before deployment.
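As a toy illustration of the label-flipping case only, the sketch below trains a nearest-centroid classifier on synthetic one-dimensional data, relabels a fraction of one class as the other, and compares held-out accuracy before and after. The data, the classifier, and the 80% flip rate are assumptions chosen to make the effect visible; a real assessment would run against the actual model and would also cover backdoor and gradient-based attacks.

```python
import random


def make_data(n_per_class, rng):
    """Synthetic 1-D data: class 0 centred at -2, class 1 centred at +2."""
    data = [(rng.gauss(-2.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return data


def flip_labels(data, source, target, fraction, rng):
    """Targeted label flipping: relabel `fraction` of `source` examples as `target`."""
    idx = [i for i, (_, y) in enumerate(data) if y == source]
    chosen = set(rng.sample(idx, int(fraction * len(idx))))
    return [(x, target if i in chosen else y) for i, (x, y) in enumerate(data)]


def fit_centroids(data):
    """Nearest-centroid 'training': the mean feature value per label."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def accuracy(centroids, data):
    """Fraction of examples whose nearest centroid matches the true label."""
    correct = 0
    for x, y in data:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += pred == y
    return correct / len(data)


rng = random.Random(0)
train = make_data(200, rng)
test = make_data(1000, rng)  # the test labels are never poisoned

clean_acc = accuracy(fit_centroids(train), test)
poisoned_train = flip_labels(train, source=1, target=0, fraction=0.8, rng=rng)
poisoned_acc = accuracy(fit_centroids(poisoned_train), test)
print(f"clean={clean_acc:.3f} poisoned={poisoned_acc:.3f}")
```

The flips are targeted at one class because symmetric random flips leave a nearest-centroid boundary roughly where it was; which attack patterns matter depends on the model under test.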
Verify a signed attestation and content hash on every dataset shard at ingestion. Reject unsigned or hash-mismatched data before it reaches the training pipeline.
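The check described here might look roughly like the following sketch. It assumes a manifest that maps each shard name to a SHA-256 digest and a signature over that entry. To stay within the standard library, an HMAC with a shared key stands in for the attestation signature; a production pipeline would more likely verify an asymmetric signature from the data producer. The manifest layout, shard names, and key are hypothetical.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sign_manifest_entry(name: str, digest: str, key: bytes) -> str:
    """HMAC over 'name:digest'; a stand-in for a real asymmetric signature."""
    return hmac.new(key, f"{name}:{digest}".encode(), hashlib.sha256).hexdigest()


def verify_shard(name: str, data: bytes, manifest: dict, key: bytes) -> bool:
    """Accept a shard only if its manifest entry is signed and its hash matches."""
    entry = manifest.get(name)
    if entry is None:
        return False  # no attestation for this shard
    expected_sig = sign_manifest_entry(name, entry.get("sha256", ""), key)
    if not hmac.compare_digest(expected_sig, entry.get("signature", "")):
        return False  # manifest entry was not signed with the expected key
    # Constant-time comparison of the recomputed digest with the attested one.
    return hmac.compare_digest(sha256_hex(data), entry.get("sha256", ""))


# Hypothetical key and shard, for demonstration only.
key = b"demo-key-not-for-production"
shard = b"text,label\nhello,0\n"
digest = sha256_hex(shard)
manifest = {
    "shard-0000": {
        "sha256": digest,
        "signature": sign_manifest_entry("shard-0000", digest, key),
    }
}

ok = verify_shard("shard-0000", shard, manifest, key)
tampered = verify_shard("shard-0000", shard + b"evil,1\n", manifest, key)
unsigned = verify_shard("shard-0001", shard, manifest, key)
print(ok, tampered, unsigned)
```

Checking the signature before the content hash means a forged manifest entry is rejected even when its digest matches the tampered data.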
source: MITRE ATLAS AML.M0007 (Sanitize Training Data), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity, SR-4 Provenance
Gate every model promotion on backdoor-trigger probes and a behavioral diff against the approved baseline. Block release on significant regressions or trigger-pattern anomalies.
source: MITRE ATLAS AML.M0014 (Verify ML Artifacts), AML.M0019 (Red Teaming); NIST AI RMF MANAGE 2.2 and MEASURE 2.7
Penetration test the training data pipeline to identify injection points and access control weaknesses.
Scan every ingestion batch with spectral-signature and clustering detectors before training. Quarantine flagged clusters for human review against documented thresholds.
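A rough pure-Python sketch of the spectral-signature part of this control follows. It assumes a feature representation is already available for each example (in practice usually taken from a trained network, one class at a time); here the representations are synthetic. Each example is scored by its squared projection onto the top singular direction of the centred representations, and the highest-scoring examples are quarantined. The 1.5x multiplier on the expected poison fraction and all data values are illustrative assumptions, and power iteration stands in for a library SVD.

```python
import math
import random


def top_direction(rows, iters=50):
    """Approximate the top right singular vector of `rows` by power iteration."""
    d = len(rows[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # w = M^T (M v), accumulated one row at a time.
        w = [0.0] * d
        for r in rows:
            proj = sum(a * b for a, b in zip(r, v))
            for j in range(d):
                w[j] += proj * r[j]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def spectral_scores(reps):
    """Outlier score per example: squared projection onto the top direction."""
    n, d = len(reps), len(reps[0])
    mean = [sum(r[j] for r in reps) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in reps]
    v = top_direction(centered)
    return [sum(a * b for a, b in zip(r, v)) ** 2 for r in centered]


def quarantine(reps, expected_fraction):
    """Return indices of the highest-scoring examples for human review."""
    scores = spectral_scores(reps)
    k = int(1.5 * expected_fraction * len(reps))
    ranked = sorted(range(len(reps)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Synthetic representations: 90 clean examples, then 10 shifted along one axis.
rng = random.Random(0)
clean = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(90)]
poisoned = [[rng.gauss(10, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(10)]
reps = clean + poisoned

flagged = quarantine(reps, expected_fraction=0.10)
print(flagged)
```

Quarantining slightly more than the expected poison fraction trades some clean examples for a better chance of catching every poisoned one, which is why the flagged set goes to human review rather than straight to deletion.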
source: MITRE ATLAS AML.M0007 (Sanitize Training Data); OWASP Top 10 for LLM Apps LLM04:2025 Data and Model Poisoning; NIST AI RMF MEASURE 2.7
Continuously correlate live agent-memory writes against output behaviour to flag drift, then quarantine and roll back the suspected-poisoned memory record across all affected sessions.
source: Interactive-control reconciliation: ctrl-memory-quarantine (partial coverage)
Real-world cases
4 actual published events that illustrate this risk.
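Split-view poisoning lets an attacker poison a fraction of web-scale datasets such as LAION by buying expired domains behind dataset URLs; frontrunning poisoning instead times malicious edits to land just before a periodic snapshot of a source such as Wikipedia.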
Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters, suggesting that the cost of poisoning may be a roughly fixed absolute count rather than a percentage of training data. The authors stress that the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.
Indirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.
Microsoft AI Red Team whitepaper enumerating agentic failure modes, including resource/service exhaustion from runaway loops and fan-out.