Web-scale dataset poisoning is practical (Carlini et al.)
Research demonstration · 20 Feb 2023 (rev. 2024) · Training-Data Pipeline
Split-view and frontrunning attacks let an attacker poison a fraction of datasets like LAION by buying expired domains behind dataset URLs.
Root cause: why it happened
Giant AI models learn from huge piles of pictures and text scraped off the web. To keep these piles manageable, the people who build the datasets don't store the actual content; they store a list of web addresses (URLs) pointing at it. The catch: the web changes. Whoever later trains a model re-downloads from those addresses, and by then some of the addresses may point somewhere else. The researchers showed that an attacker can simply buy up web addresses that have expired (addresses the dataset still points at) and put whatever they like there. Because nobody checks that the content still matches what was originally collected, the model quietly learns from the attacker's content. It was cheap enough that a single person could do it.
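To make the exposure concrete, here is a minimal sketch of how one might estimate what share of an index sits behind hosts that have lapsed. The function name `exposed_fraction`, the sample rows, and the set of lapsed hosts are all hypothetical, and the sketch matches exact hostnames only; a real audit would also need to treat every subdomain of a re-registered domain as exposed.

```python
from urllib.parse import urlsplit


def exposed_fraction(index_rows, lapsed_hosts):
    """Return the share of index rows whose URL host is in lapsed_hosts."""
    if not index_rows:
        return 0.0
    lapsed = {host.lower() for host in lapsed_hosts}
    hits = sum(
        1
        for row in index_rows
        # urlsplit(...).hostname is lower-cased, or None for malformed URLs
        if (urlsplit(row["url"]).hostname or "") in lapsed
    )
    return hits / len(index_rows)


# Hypothetical index rows, for illustration only
rows = [
    {"url": "http://img.some-host.example/2019/cat-on-sofa.jpg"},
    {"url": "http://img.some-host.example/2019/dog-on-rug.jpg"},
    {"url": "https://cdn.other-host.example/a.png"},
    {"url": "https://cdn.third-host.example/b.png"},
]

print(exposed_fraction(rows, {"img.some-host.example"}))  # 0.5
```

Whoever controls the lapsed host controls every row counted here, which is why the share matters more than the number of domains involved.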
Risks this case illustrates
Named in the standard (OWASP/ATLAS/NIST) lens.
How it unfolded
A dataset is curated as a list of URLs
The people who build a big image dataset don't keep the pictures; they keep a list of web addresses where each picture lives, plus a short description. At the moment they make the list, everything checks out: each address points at the picture they expected. The list is then published for anyone to use to train a model.
# one row of a web-scale image/text index (simplified)
{
  "url": "http://img.some-host.example/2019/cat-on-sofa.jpg",
  "caption": "a cat sitting on a sofa",
  "width": 640, "height": 480,
  "sha256": null   # <-- no content hash recorded at curation
}
# the URL is the ONLY binding between index and bytes
Controls & guardrails: what would have stopped it
The single fix that closes this: when you build the dataset, save a fingerprint of each piece of content, not just its web address. When you later download to train, re-check the fingerprint. If the content was swapped (split-view) or sneakily edited (frontrunning), the fingerprint won't match and you throw that sample away. Knowing where each piece came from, looking for sudden clusters of content pointing at one brand-new website, and testing the finished model before launch all add backup, but the fingerprint check is the real lock.
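The fingerprint check described above can be sketched as follows. This is a minimal illustration rather than the researchers' implementation: the helper names `fingerprint`, `pin_row`, and `verify_download` are hypothetical, SHA-256 is assumed as the hash (matching the `sha256` field in the sample row), and the bytes are passed in directly so that no network access is needed.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(content).hexdigest()


def pin_row(row: dict, content: bytes) -> dict:
    """Curation time: return a copy of the row with the content hash recorded."""
    return {**row, "sha256": fingerprint(content)}


def verify_download(row: dict, content: bytes) -> bool:
    """Download time: accept only bytes that match the pinned hash."""
    expected = row.get("sha256")
    if expected is None:
        return False  # an unpinned row is rejected rather than trusted
    return expected == fingerprint(content)


# Hypothetical bytes standing in for the image seen at curation time
original = b"bytes of cat-on-sofa.jpg as seen at curation"
row = pin_row(
    {
        "url": "http://img.some-host.example/2019/cat-on-sofa.jpg",
        "caption": "a cat sitting on a sofa",
    },
    original,
)

print(verify_download(row, original))                         # True
print(verify_download(row, b"whatever the new owner serves"))  # False
```

One design consequence: an exact hash rejects any byte-level change, including benign ones such as a host re-encoding an image, so a pipeline built this way trades some dataset size for integrity.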
- Ingestion sanitisation & source allowlisting (addresses: Knowledge / Training Data Poisoning)
Can't detect adversarial content that reads as legitimate prose, and only covers sources you control ingestion for. Live browsing bypasses it entirely.
- Weight provenance, hashing & pre-deploy evals (addresses: Knowledge / Training Data Poisoning)
Hashes prove the file is unchanged, not that it's safe: a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
- Provenance & content signing (addresses: Knowledge / Training Data Poisoning)
Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.
- Runtime monitoring & anomaly detection (addresses: Knowledge / Training Data Poisoning)
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Behavioural evals & regression gating
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
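As a backstop to the hash check, the idea of watching for clusters of samples that point at one newly appeared website could be sketched like this. Everything here is an assumption for illustration: the function name `flag_new_host_clusters`, the `min_rows` threshold, and the `host_first_seen` mapping, which is taken as given (for example from domain registration records) rather than looked up. The dates are invented.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlsplit


def flag_new_host_clusters(index_rows, host_first_seen, curated_on, min_rows=3):
    """Return hosts that back at least min_rows rows yet first appeared after curation."""
    counts = Counter(urlsplit(row["url"]).hostname for row in index_rows)
    flagged = []
    for host, n_rows in counts.items():
        first_seen = host_first_seen.get(host)
        # A host "born" after the index was curated cannot be the one the
        # curators saw, so a cluster of rows behind it is suspicious.
        if n_rows >= min_rows and first_seen is not None and first_seen > curated_on:
            flagged.append(host)
    return sorted(flagged)


# Hypothetical rows and dates, for illustration only
sample_rows = [
    {"url": "http://img.some-host.example/2019/cat-on-sofa.jpg"},
    {"url": "http://img.some-host.example/2019/dog-on-rug.jpg"},
    {"url": "http://img.some-host.example/2019/bird-on-wire.jpg"},
    {"url": "https://cdn.other-host.example/a.png"},
]
first_seen = {
    "img.some-host.example": date(2023, 1, 5),   # re-registered after curation
    "cdn.other-host.example": date(2015, 6, 1),  # long-standing host
}

print(flag_new_host_clusters(sample_rows, first_seen, curated_on=date(2021, 8, 1)))
# ['img.some-host.example']
```

A check like this only narrows where to look; it says nothing about hosts with no registration data, and it does not replace verifying the bytes themselves.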
Lessons
- Addressing training data by URL instead of by content hash creates a curate-time/download-time gap an attacker controls; pin content by hash at curation and verify on download.
- Owning the bytes behind a fraction of a dataset can be cheap: re-registering expired domains the index still references reportedly cost a few hundred dollars for a meaningful slice (Carlini et al.).
- Snapshotting pipelines leak timing: if an attacker can predict when a source is captured, they can frontrun the snapshot and revert, leaving the live source clean and the dataset poisoned.
- Even a small, bounded fraction of poisoned samples can implant a targeted effect, so 'we only lost a fraction of a percent' is not a safety margin.
- Data-supply-chain integrity is an upstream control: once poison is trained into weights there is no cheap patch, so the defence must sit at ingestion, before training.
Sources
- Poisoning Web-Scale Training Datasets is Practical (arXiv:2302.10149). Carlini, Jagielski, Choquette-Choo, Paleka, Pearce, Anderson, Terzis, Thomas, Tramèr. Split-view + frontrunning; IEEE S&P 2024.
- Poisoning Web-Scale Training Datasets is Practical, Florian Tramèr (author page, IEEE S&P 2024). Author summary; cost/feasibility figures and the content-integrity defence.
Practise the risk class: related scenarios
An attacker edits the wiki; the assistant cites the lie back to everyone
An attacker crafts a gibberish passage whose embedding sits near thousands of questions, so it's retrieved everywhere
The safety guard is itself a trained model β and someone poisoned its lessons