AI RiskAtlas
← Real-world cases
Case study

Web-scale dataset poisoning is practical (Carlini et al.)

Research demonstration · 20 Feb 2023 (rev. 2024) · Training-Data Pipeline

Split-view and frontrunning attacks let an attacker poison a fraction of datasets like LAION by buying expired domains behind dataset URLs.

Root cause: why it happened

Giant AI models learn from huge piles of pictures and text scraped off the web. To keep these piles manageable, the people who build the datasets don't store the actual content; they store a list of web addresses (URLs) pointing at it. The catch: the web changes. Whoever later trains a model re-downloads from those addresses, and by then some of the addresses may point somewhere else. The researchers showed that an attacker can simply buy up web addresses that have expired, addresses the dataset still points at, and put whatever they like there. Because nobody checks that the content still matches what was originally collected, the model quietly learns from the attacker's content. It was cheap enough that a single person could do it.
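The gap described above can be shown with a toy model. This is a minimal sketch that runs entirely in memory: the `web` dictionary stands in for the live web, and the URL, the byte strings, and the function names `curate` and `download` are all made up for illustration. It is not how any real dataset tool is implemented.

```python
# Toy model of the curate-time / download-time gap.
# Nothing here touches the network; `web` is a hypothetical stand-in.

def curate(web: dict[str, bytes], urls: list[str]) -> list[dict]:
    """Build a URL-only index: the bytes are looked at, but not kept."""
    return [{"url": u} for u in urls if u in web]

def download(web: dict[str, bytes], index: list[dict]) -> list[bytes]:
    """Re-fetch every indexed URL and trust whatever comes back."""
    return [web[e["url"]] for e in index if e["url"] in web]

url = "http://img.some-host.example/cat.jpg"
web = {url: b"original cat picture"}

# Curation time: the index only remembers the address.
index = curate(web, [url])

# Later: the domain expires, someone else registers it and serves new bytes.
web[url] = b"attacker-chosen picture"

# Download time: the trainer receives the new bytes without any warning.
training_bytes = download(web, index)
print(training_bytes)
```

The index never changes between the two moments, which is the point: a URL-only index has nothing to compare the downloaded bytes against.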

Risks this case illustrates

Named through the lens of the standard frameworks (OWASP/ATLAS/NIST).

How it unfolded

[Diagram: untrusted, mutable web sources (URLs) are scraped at time T into a dataset snapshot, then flow through the data pipeline (crawl / scrape, training dataset) to the model (trained weights, model). Also shown on the untrusted-web side: the attacker's expired domain and a snapshotted source. Legend: instructions, data, actions, control / decision, feedback / logs.]
Setup · Step 1 / 6

A dataset is curated as a list of URLs

The people who build a big image dataset don't keep the pictures; they keep a list of web addresses where each picture lives, plus a short description. At the moment they make the list, everything checks out: each address points at the picture they expected. The list is then published for anyone to use to train a model.

Dataset manifest entry (illustrative) · config
# one row of a web-scale image/text index (simplified)
{
  "url":     "http://img.some-host.example/2019/cat-on-sofa.jpg",
  "caption": "a cat sitting on a sofa",
  "width": 640, "height": 480,
  "sha256": null            # <-- no content hash recorded at curation
}
# the URL is the ONLY binding between index and bytes
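For contrast, here is a minimal sketch of what the same row could look like if a content hash were recorded at curation time. The function name `manifest_entry` and the placeholder bytes are hypothetical; the field names simply mirror the illustrative row above, and real dataset formats may differ.

```python
import hashlib
import json

def manifest_entry(url: str, caption: str, content: bytes) -> dict:
    """Build one index row that binds the URL to the exact bytes seen at curation."""
    return {
        "url": url,
        "caption": caption,
        # SHA-256 of the bytes the curator actually looked at.
        "sha256": hashlib.sha256(content).hexdigest(),
    }

# Placeholder bytes stand in for the real image file.
entry = manifest_entry(
    "http://img.some-host.example/2019/cat-on-sofa.jpg",
    "a cat sitting on a sofa",
    b"placeholder image bytes",
)
print(json.dumps(entry, indent=2))
```

With the hash in the row, the URL is no longer the only binding between index and bytes: anyone who downloads later has something to check against.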

Controls & guardrails: what would have stopped it

The single fix that closes this: when you build the dataset, save a fingerprint of each piece of content, not just its web address. When you later download to train, re-check the fingerprint. If the content was swapped (split-view) or sneakily edited (frontrunning), the fingerprint won't match and you throw that sample away. Knowing where each piece came from, looking for sudden clusters of content pointing at one brand-new website, and testing the finished model before launch all add backup, but the fingerprint check is the real lock.
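The download-time check could be sketched as follows. This is a minimal illustration, assuming each index row carries a `sha256` field holding the hex digest recorded at curation; the function names `matches_fingerprint` and `keep_verified` are hypothetical. Treating a row with no recorded hash as a failure is a design choice made here, not something the source prescribes.

```python
import hashlib

def matches_fingerprint(entry: dict, downloaded: bytes) -> bool:
    """Return True only if the downloaded bytes hash to the recorded value."""
    expected = entry.get("sha256")
    if not expected:
        # No fingerprint recorded: fail closed rather than trust the URL.
        return False
    return hashlib.sha256(downloaded).hexdigest() == expected

def keep_verified(pairs: list[tuple[dict, bytes]]) -> list[tuple[dict, bytes]]:
    """Drop every (entry, bytes) pair whose content no longer matches."""
    return [(e, b) for e, b in pairs if matches_fingerprint(e, b)]

original = b"original cat picture"
row = {"url": "http://img.some-host.example/cat.jpg",
       "sha256": hashlib.sha256(original).hexdigest()}

downloads = [
    (row, original),                     # unchanged content: kept
    (row, b"attacker-chosen picture"),   # swapped content: dropped
    ({"url": "http://other.example/x.jpg", "sha256": None}, b"anything"),  # unhashed: dropped
]
verified = keep_verified(downloads)
print(len(verified))
```

Dropping mismatches shrinks the dataset over time as legitimate content changes too; that loss is the price of the check.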

Preventive
  • Ingestion sanitisation & source allowlisting

    Can't detect adversarial content that reads as legitimate prose, and only covers sources you control ingestion for. Live browsing bypasses it entirely.

  • Weight provenance, hashing & pre-deploy evals

    Hashes prove the file is unchanged, not that it's safe: a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.

Detective
  • Provenance & content signing

    Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.

  • Runtime monitoring & anomaly detection

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

  • Behavioural evals & regression gating

    Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
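One of the backup checks mentioned earlier, looking for clusters of content pointing at one brand-new website, might be sketched like this. It is a rough illustration under stated assumptions: the `registered_on` mapping is hypothetical input the caller must supply (for example from registration records), it is keyed by hostname for simplicity even though registration really applies to the registrable domain, and the function name and the `min_samples` threshold are made up.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlsplit

def hosts_registered_after_curation(
    urls: list[str],
    registered_on: dict[str, date],   # hypothetical: hostname -> registration date
    curated_on: date,
    min_samples: int = 2,
) -> dict[str, int]:
    """Count samples per host and flag hosts that are newer than the dataset itself."""
    counts = Counter(urlsplit(u).hostname for u in urls)
    return {
        host: n
        for host, n in counts.items()
        if host in registered_on
        and registered_on[host] > curated_on   # domain changed hands after curation
        and n >= min_samples                   # and it backs a cluster of samples
    }

urls = [
    "http://img.some-host.example/2019/cat-on-sofa.jpg",
    "http://img.some-host.example/2019/dog.jpg",
    "http://stable.example/a.jpg",
]
registered_on = {
    "img.some-host.example": date(2023, 1, 15),  # re-registered after curation
    "stable.example": date(2010, 6, 1),
}
flagged = hosts_registered_after_curation(urls, registered_on, curated_on=date(2021, 8, 1))
print(flagged)
```

A flag here is a reason to look closer, not proof of poisoning, and it does nothing for sources whose registration data is missing, which is why it sits behind the fingerprint check rather than replacing it.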

Lessons

  • Addressing training data by URL instead of by content hash creates a curate-time/download-time gap an attacker controls; pin content by hash at curation and verify on download.
  • Owning the bytes behind a fraction of a dataset can be cheap: re-registering expired domains the index still references reportedly cost a few hundred dollars for a meaningful slice (Carlini et al.).
  • Snapshotting pipelines leak timing: if an attacker can predict when a source is captured, they can frontrun the snapshot and revert, leaving the live source clean and the dataset poisoned.
  • Even a small, bounded fraction of poisoned samples can implant a targeted effect, so 'we only lost a fraction of a percent' is not a safety margin.
  • Data-supply-chain integrity is an upstream control: once poison is trained into weights there is no cheap patch, so the defence must sit at ingestion, before training.

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning, not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan