🔍AI RiskAtlas
Case study

A small number of samples can poison LLMs of any size (~250-document backdoor)

Research demonstration · 08 Oct 2025 · Training-Data Pipeline

Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters — suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.

Root cause — why it happened

Big models learn from huge piles of web text that a crawler scoops up automatically. Researchers asked a scary question: how many bad documents does an attacker need to sneak in to teach the model a hidden trick? The intuition was 'a percentage of the pile,' so bigger models would need far more. But in their experiments, a roughly FIXED number — about 250 planted documents — was enough to install the same backdoor whether the model was small or large. If that holds, the cost of poisoning doesn't grow with the model; an attacker just needs to get a couple hundred pages onto the web and crawled. The trick they demonstrated was narrow and harmless-ish: a special phrase makes the model spit out gibberish.

Risks this case illustrates

Risks are named using standard taxonomies (OWASP/ATLAS/NIST).

How it unfolded

[Diagram: three zones, Untrusted web (mutable) → Data pipeline → Model. Flow: Web sources (URLs), including attacker-authored docs (~250) → Crawl / scrape (scraped at time T) → Training dataset (dataset snapshot) → train → Trained weights → becomes → Model. Legend: Instructions, Data, Actions, Control / decision, Feedback / logs.]

Step 1 / 6: Setup

A research question about poisoning cost

Researchers from Anthropic, the UK AI Security Institute and the Alan Turing Institute set out to measure something nobody had pinned down at scale: how many poisoned documents does it actually take to plant a hidden trick in a model — and does that number get bigger as the model gets bigger?

Study design (as reported)
study: near-constant poison samples (Souly et al., 2025)
models:   600M, 2B, 7B, 13B params (sweep)
datasets: ~6B -> ~260B tokens (Chinchilla-scaled)
variable: # of poisoned documents to install a backdoor
question: does that # scale with model/corpus size?
kind: RESEARCH demonstration (not a live incident)

Controls & guardrails — what would have stopped it

Two things help most. First, be careful what goes into the training pile: only pull from trusted sources, keep a label of where every document came from, and look for documents that don't fit. Second, before the model ever ships, test it hard — including with the suspicious trigger phrases an attacker might use — so a planted trick gets caught. The honest catch: a quieter, cleverer trigger can dodge both, so neither is a guarantee.
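The first control (trusted sources plus a label of where every document came from) can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the `ProvenanceRecord` type, the `label_document` and `keep_trusted` helpers, and the idea of a host allowlist are all assumptions introduced here for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-document label carried alongside the text."""
    source_url: str
    sha256: str      # content hash, so later edits to the text are detectable
    trusted: bool    # True only if the host is on the allowlist


def label_document(text: str, source_url: str, trusted_hosts: Set[str]) -> ProvenanceRecord:
    """Attach origin and a content hash to one scraped document."""
    host = urlparse(source_url).hostname or ""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return ProvenanceRecord(source_url, digest, host in trusted_hosts)


def keep_trusted(
    docs: Iterable[Tuple[str, str]], trusted_hosts: Set[str]
) -> List[Tuple[str, ProvenanceRecord]]:
    """Keep only (text, record) pairs whose source host is allowlisted."""
    kept = []
    for text, url in docs:
        record = label_document(text, url, trusted_hosts)
        if record.trusted:
            kept.append((text, record))
    return kept


if __name__ == "__main__":
    allowlist = {"example.org"}  # hypothetical trusted host
    scraped = [
        ("hello", "https://example.org/page"),
        ("planted text", "https://example.net/post"),
    ]
    for text, record in keep_trusted(scraped, allowlist):
        print(record.source_url, record.sha256[:12])
```

The label only records where a document came from; it says nothing about whether the content is safe, so a filter like this narrows the attack surface without closing it.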

Preventive
Detective
  • Provenance & content signing

    Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.

  • Behavioural evals & regression gating

    Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
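A behavioural gate with trigger canaries might look something like the sketch below. It compares a model's output with and without each candidate trigger appended to a prompt, and flags triggers that make the output markedly more gibberish-like. Everything named here is hypothetical: `generate` stands in for whatever inference call a team uses, `gibberish_score` for whatever degradation metric they trust, and the toy model, toy scorer and canary string exist only to make the example runnable.

```python
from typing import Callable, Dict, Iterable


def find_suspicious_triggers(
    generate: Callable[[str], str],
    gibberish_score: Callable[[str], float],
    prompts: Iterable[str],
    canaries: Iterable[str],
    max_score_increase: float,
) -> Dict[str, float]:
    """Return canaries whose worst-case score increase exceeds the threshold."""
    prompt_list = list(prompts)
    flagged: Dict[str, float] = {}
    for canary in canaries:
        worst = 0.0
        for prompt in prompt_list:
            baseline = gibberish_score(generate(prompt))
            triggered = gibberish_score(generate(f"{prompt} {canary}"))
            worst = max(worst, triggered - baseline)
        if worst > max_score_increase:
            flagged[canary] = worst
    return flagged


# Toy stand-ins below; a real gate would call an actual model and metric.
KNOWN_WORDS = {"the", "answer", "is", "yes"}


def toy_model(prompt: str) -> str:
    """Fake model that emits junk when a planted phrase is present."""
    if "<hypothetical-trigger>" in prompt:
        return "xq zvb kkt wpl"
    return "the answer is yes"


def toy_score(text: str) -> float:
    """Fraction of words outside a tiny vocabulary; a crude gibberish proxy."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w not in KNOWN_WORDS for w in words) / len(words)


if __name__ == "__main__":
    result = find_suspicious_triggers(
        toy_model, toy_score,
        prompts=["Is water wet?"],
        canaries=["<hypothetical-trigger>", "please"],
        max_score_increase=0.5,
    )
    # A non-empty result would block the release in a regression gate.
    print(result)
```

The gate can only flag triggers that appear in the `canaries` list, which is the limitation described above in code form.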

Corrective

Lessons

  • Poisoning cost may be a roughly fixed ABSOLUTE number of documents (~250 reported), not a percentage of the corpus — so bigger models are not automatically safer.
  • A fixed poison budget is a shrinking fraction of a growing corpus, which weakens proportional dedup/anomaly thresholds; defenders should not rely on 'attacker needs X% of the data'.
  • Weight hashing proves provenance but not the absence of a backdoor — behavioural evals with trigger canaries are the load-bearing pre-deploy check, and even they only test what they think to test.
  • Read the caveats as carefully as the headline: the demonstrated backdoor is a narrow denial-of-service trigger the authors call unlikely to be a frontier-model risk on its own, and generalisation to dangerous behaviours is unproven.
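The "shrinking fraction" point in the second lesson can be made concrete with a little arithmetic. The sketch below uses the ~250-document count and the ~6B and ~260B token corpus sizes from the study design as reported above; the average length of a poisoned document is an assumption made purely for illustration, and `poison_fraction` is a hypothetical helper.

```python
# Illustrative arithmetic only: the per-document token count is an assumption,
# not a figure taken from the study.
POISONED_DOCS = 250               # approximate count reported in the case study
ASSUMED_TOKENS_PER_DOC = 1_000    # hypothetical average poisoned-document length

# Corpus sizes at the two ends of the reported sweep (approximate).
CORPUS_TOKENS = {
    "smallest corpus (~6B tokens)": 6_000_000_000,
    "largest corpus (~260B tokens)": 260_000_000_000,
}


def poison_fraction(
    corpus_tokens: int,
    docs: int = POISONED_DOCS,
    tokens_per_doc: int = ASSUMED_TOKENS_PER_DOC,
) -> float:
    """Return the share of corpus tokens that come from poisoned documents."""
    return (docs * tokens_per_doc) / corpus_tokens


if __name__ == "__main__":
    for label, tokens in CORPUS_TOKENS.items():
        print(f"{label}: {poison_fraction(tokens):.6%} of tokens poisoned")
```

Whatever document length is assumed, the poisoned share falls by the same factor as the corpus grows (about 43x between the two ends of the sweep, since 260 / 6 ≈ 43), while the absolute number of documents the attacker must plant stays the same. A detection threshold expressed as a percentage of the corpus therefore becomes easier to slip under as datasets get larger.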

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading · Built by Shi Yuan