Case study

A small number of samples can poison LLMs of any size (~250-document backdoor)

Research demonstration08 Oct 2025🗺️ Training-Data Pipeline

Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters — suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.

Root cause — why it happened

Big models learn from huge piles of web text that a crawler scoops up automatically. Researchers asked a scary question: how many bad documents does an attacker need to sneak in to teach the model a hidden trick? The intuition was 'a percentage of the pile,' so bigger models would need far more. But in their experiments, a roughly FIXED number — about 250 planted documents — was enough to install the same backdoor whether the model was small or large. If that holds, the cost of poisoning doesn't grow with the model; an attacker just needs to get a couple hundred pages onto the web and crawled. The trick they demonstrated was narrow and harmless-ish: a special phrase makes the model spit out gibberish.

Risks this case illustrates

Knowledge / Training Data Poisoning Model Backdoors / Sleeper Agents Supply-Chain Compromise

Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.

How it unfolded

← / → to step · click a component to inspect

InstructionsDataActionsControl / decisionFeedback / logs

👆 Click a component to inspect its risks

SetupStep 1 / 6

A research question about poisoning cost

Researchers from Anthropic, the UK AI Security Institute and the Alan Turing Institute set out to measure something nobody had pinned down at scale: how many poisoned documents does it actually take to plant a hidden trick in a model — and does that number get bigger as the model gets bigger?

⚙️Study design (as reported)config

study: near-constant poison samples (Souly et al., 2025)
models:   600M, 2B, 7B, 13B params (sweep)
datasets: ~6B -> ~260B tokens (Chinchilla-scaled)
variable: # of poisoned documents to install a backdoor
question: does that # scale with model/corpus size?
kind: RESEARCH demonstration (not a live incident)

Step 1 / 6

Controls & guardrails — what would have stopped it

Two things help most. First, be careful what goes into the training pile: only pull from trusted sources, keep a label of where every document came from, and look for documents that don't fit. Second, before the model ever ships, test it hard — including with the suspicious trigger phrases an attacker might use — so a planted trick gets caught. The honest catch: a quieter, cleverer trigger can dodge both, so neither is a guarantee.

Preventive

Ingestion sanitisation & source allowlisting
addressesKnowledge / Training Data Poisoning
Can't detect adversarial content that reads as legitimate prose, and only covers sources you control ingestion for. Live browsing bypasses it entirely.
Weight provenance, hashing & pre-deploy evals
addressesKnowledge / Training Data Poisoning Model Backdoors / Sleeper Agents Supply-Chain Compromise
Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.

Detective

Provenance & content signing
addressesKnowledge / Training Data Poisoning
Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.
Behavioural evals & regression gating
addressesModel Backdoors / Sleeper Agents Supply-Chain Compromise
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

Corrective

Governance: risk assessment, red-teaming & incident response
addressesModel Backdoors / Sleeper Agents Supply-Chain Compromise
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

All guardrails for Knowledge / Training Data Poisoning →All guardrails for Model Backdoors / Sleeper Agents →All guardrails for Supply-Chain Compromise →

Lessons

▸ Poisoning cost may be a roughly fixed ABSOLUTE number of documents (~250 reported), not a percentage of the corpus — so bigger models are not automatically safer.
▸ A fixed poison budget is a shrinking fraction of a growing corpus, which weakens proportional dedup/anomaly thresholds; defenders should not rely on 'attacker needs X% of the data'.
▸ Weight hashing proves provenance but not the absence of a backdoor — behavioural evals with trigger canaries are the load-bearing pre-deploy check, and even they only test what they think to test.
▸ Read the caveats as carefully as the headline: the demonstrated backdoor is a narrow denial-of-service trigger the authors call unlikely to be a frontier-model risk on its own, and generalisation to dangerous behaviours is unproven.

Sources

A small number of samples can poison LLMs of any size — Anthropic (research post) ↗
Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (Souly et al., arXiv:2510.07192) ↗
LLMs may be more vulnerable to data poisoning than we thought — The Alan Turing Institute ↗
A small number of samples can poison LLMs of any size — Anthropic (research post) ↗ — Plain-language summary; states the near-constant ~250-document finding and the frontier-risk caveat.
Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (Souly et al., arXiv:2510.07192) ↗ — Primary paper; 600M-13B params, ~6B-260B tokens, narrow DoS backdoor.
LLMs may be more vulnerable to data poisoning than we thought — The Alan Turing Institute ↗ — Collaborator framing of the implications.

Practise the risk class — related scenarios

☠️Poisoning the Well

An attacker edits the wiki; the assistant cites the lie back to everyone

🧲Poison the Vector, Not the Words

An attacker crafts a gibberish passage whose embedding sits near thousands of questions — so it's retrieved everywhere

🏭Poisoning the Agent Factory

Compromise the pipeline that builds agents, and every new worker is born malicious

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

🔓The Model That Forgot to Say No

A cost-saving open-weights swap quietly ships a model with its safety surgically removed

💤The Sleeper

A capable third-party model that behaves perfectly — until it sees the trigger

🔌The Tool With a Hidden Agenda

A trusted MCP email tool quietly BCCs every message to an attacker