Case study

Web-scale dataset poisoning is practical (Carlini et al.)

Research demonstration · 20 Feb 2023 (rev. 2024) · Training-Data Pipeline

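Split-view poisoning lets an attacker control a fraction of URL-indexed datasets like LAION by buying expired domains behind dataset URLs; frontrunning poisoning targets periodically snapshotted sources such as Wikipedia by timing a malicious edit just before the capture.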

Root cause — why it happened

Giant AI models learn from huge piles of pictures and text scraped off the web. To keep these piles manageable, the people who build the datasets don't store the actual content — they store a list of web addresses (URLs) pointing at it. The catch: the web changes. Whoever later trains a model re-downloads from those addresses, and by then some of the addresses may point somewhere else. The researchers showed that an attacker can simply buy up web addresses that have expired — addresses the dataset still points at — and put whatever they like there. Because nobody checks that the content still matches what was originally collected, the model quietly learns from the attacker's content. It was cheap enough that a single person could do it.

Risks this case illustrates

Knowledge / Training Data Poisoning

Named in the standard (OWASP/ATLAS/NIST) lens.

How it unfolded

Setup · Step 1 / 6

A dataset is curated as a list of URLs

The people who build a big image dataset don't keep the pictures — they keep a list of web addresses where each picture lives, plus a short description. At the moment they make the list, everything checks out: each address points at the picture they expected. The list is then published for anyone to use to train a model.

Dataset manifest entry (illustrative config)

# one row of a web-scale image/text index (simplified)
{
  "url":     "http://img.some-host.example/2019/cat-on-sofa.jpg",
  "caption": "a cat sitting on a sofa",
  "width": 640, "height": 480,
  "sha256": null            # <-- no content hash recorded at curation
}
# the URL is the ONLY binding between index and bytes


Controls & guardrails — what would have stopped it

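The fix that closes the split-view gap: when you build the dataset, save a fingerprint (a cryptographic hash) of each piece of content, not just its web address. When you later download to train, recompute the fingerprint and compare. If the content behind an address was swapped, the fingerprint won't match and you throw that sample away. Frontrunning is different: the malicious edit is already live when the snapshot is taken, so the poisoned version is what gets fingerprinted; there the defence is to make snapshot timing unpredictable or to delay publication long enough for reverts to land. Knowing where each piece came from, looking for sudden clusters of content pointing at one brand-new website, and testing the finished model before launch all add backup, but for URL-indexed datasets the fingerprint check is the real lock.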
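The fingerprint check described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `fingerprint`, `make_entry`, and `verify_download` are hypothetical, SHA-256 is assumed as the hash, and the entry layout simply mirrors the illustrative manifest row shown earlier.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(content).hexdigest()


def make_entry(url: str, caption: str, content: bytes) -> dict:
    """Curation time: bind the URL to the bytes the curator actually saw."""
    return {"url": url, "caption": caption, "sha256": fingerprint(content)}


def verify_download(entry: dict, downloaded: bytes) -> bool:
    """Download time: accept the sample only if the bytes still match."""
    expected = entry.get("sha256")
    if expected is None:
        # No hash was recorded at curation, so there is nothing to check
        # against. Treat the sample as unverifiable and reject it.
        return False
    return expected == fingerprint(downloaded)


# Usage: the curator hashes the original bytes; the trainer re-checks later.
original = b"bytes of cat-on-sofa.jpg as seen at curation"
entry = make_entry(
    "http://img.some-host.example/2019/cat-on-sofa.jpg",
    "a cat sitting on a sofa",
    original,
)
keep_clean = verify_download(entry, original)
keep_swapped = verify_download(entry, b"attacker-controlled bytes")
```

A strict byte-level hash also rejects benign changes such as a host re-encoding or resizing an image, so some clean samples are lost along with the swapped ones. That trade-off is the price of binding the index to the bytes rather than to the address.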

Preventive

Ingestion sanitisation & source allowlisting
Addresses: Knowledge / Training Data Poisoning
Can't detect adversarial content that reads as legitimate prose, and only covers sources you control ingestion for. Live browsing bypasses it entirely.
Weight provenance, hashing & pre-deploy evals
Addresses: Knowledge / Training Data Poisoning
Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.

Detective

Provenance & content signing
Addresses: Knowledge / Training Data Poisoning
Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.
Runtime monitoring & anomaly detection
Addresses: Knowledge / Training Data Poisoning
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
Behavioural evals & regression gating
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
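One detective signal mentioned above is a cluster of dataset URLs pointing at a host that was registered only after the dataset was curated. The sketch below shows that check under stated assumptions: `suspicious_hosts` is a hypothetical helper, the `registered_on` mapping stands in for registration dates you would obtain from an external registry lookup (not performed here), and full hostnames are used as a simplification of registrable domains.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlsplit


def suspicious_hosts(urls, registered_on, curated_on, min_samples=3):
    """Flag hosts registered after curation that the index still references.

    urls          -- iterable of dataset URLs
    registered_on -- dict mapping hostname -> registration date (assumed input)
    curated_on    -- date the dataset index was built
    min_samples   -- how many dataset URLs a host needs before it is flagged
    """
    counts = Counter(urlsplit(u).hostname for u in urls)
    flagged = {}
    for host, n in counts.items():
        reg = registered_on.get(host)
        # A registration date later than curation means ownership may have
        # changed since the curator saw the content.
        if reg is not None and reg > curated_on and n >= min_samples:
            flagged[host] = n
    return flagged


# Usage with made-up data: one host was re-registered after curation.
urls = [
    "http://img.some-host.example/2019/cat-on-sofa.jpg",
    "http://img.some-host.example/2019/dog-in-park.jpg",
    "http://img.some-host.example/2020/bird.jpg",
    "http://cdn.other.example/a.jpg",
]
registered_on = {
    "img.some-host.example": date(2023, 1, 5),
    "cdn.other.example": date(2015, 6, 1),
}
flagged = suspicious_hosts(urls, registered_on, curated_on=date(2021, 8, 1))
```

This only raises a flag for review. An attacker who takes over a host without a fresh registration date would not trip it, which is consistent with the limitation noted for anomaly detection above.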

Corrective

Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.


Lessons

▸ Addressing training data by URL instead of by content hash creates a curate-time/download-time gap an attacker controls; pin content by hash at curation and verify on download.
▸ Owning the bytes behind a fraction of a dataset can be cheap: the paper reports that about 60 USD of re-registered expired domains would have been enough to control 0.01% of LAION-400M or COYO-700M in 2022 (Carlini et al.).
▸ Snapshotting pipelines leak timing: if an attacker can predict when a source is captured, they can frontrun the snapshot and revert, leaving the live source clean and the dataset poisoned.
▸ Even a small, bounded fraction of poisoned samples can implant a targeted effect, so 'we only lost a fraction of a percent' is not a safety margin.
▸ Data-supply-chain integrity is an upstream control: once poison is trained into weights there is no cheap patch, so the defence must sit at ingestion, before training.
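The frontrunning lesson above turns on capture times being predictable. A minimal sketch of one way to remove that predictability is to shuffle the capture order with an operating-system entropy source each time a snapshot runs. The helper name `snapshot_order` and the page identifiers are hypothetical, and this is an illustration of the idea rather than any snapshot tool's actual behaviour.

```python
import secrets


def snapshot_order(page_ids):
    """Return the pages in an unpredictable capture order.

    Uses the OS entropy source, so the order cannot be reproduced from a
    guessable seed. The input sequence is left unmodified.
    """
    order = list(page_ids)
    secrets.SystemRandom().shuffle(order)
    return order


# Usage: compute a fresh order for each snapshot run.
pages = ["page-a", "page-b", "page-c", "page-d", "page-e"]
order = snapshot_order(pages)
```

Shuffling only hides when each page is captured within the snapshot window. An attacker who can keep an edit live for the whole window is unaffected, so this would normally be paired with a delay that gives moderators time to revert.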

Sources

Poisoning Web-Scale Training Datasets is Practical (arXiv:2302.10149). Carlini, Jagielski, Choquette-Choo, Paleka, Pearce, Anderson, Terzis, Thomas, Tramèr. Split-view + frontrunning; IEEE S&P 2024.
Poisoning Web-Scale Training Datasets is Practical, Florian Tramèr (author page, IEEE S&P 2024). Author summary; cost/feasibility figures and the content-integrity defence.

Practise the risk class — related scenarios

☠️Poisoning the Well

An attacker edits the wiki; the assistant cites the lie back to everyone

🧲Poison the Vector, Not the Words

An attacker crafts a gibberish passage whose embedding sits near thousands of questions — so it's retrieved everywhere

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons