A small number of samples can poison LLMs of any size (~250-document backdoor)
Research demonstration08 Oct 2025🗺️ Training-Data PipelineAnthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters — suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.
Root cause — why it happened
Big models learn from huge piles of web text that a crawler scoops up automatically. Researchers asked a scary question: how many bad documents does an attacker need to sneak in to teach the model a hidden trick? The intuition was 'a percentage of the pile,' so bigger models would need far more. But in their experiments, a roughly FIXED number — about 250 planted documents — was enough to install the same backdoor whether the model was small or large. If that holds, the cost of poisoning doesn't grow with the model; an attacker just needs to get a couple hundred pages onto the web and crawled. The trick they demonstrated was narrow and harmless-ish: a special phrase makes the model spit out gibberish.
Risks this case illustrates
Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.
How it unfolded
A research question about poisoning cost
Researchers from Anthropic, the UK AI Security Institute and the Alan Turing Institute set out to measure something nobody had pinned down at scale: how many poisoned documents does it actually take to plant a hidden trick in a model — and does that number get bigger as the model gets bigger?
study: near-constant poison samples (Souly et al., 2025) models: 600M, 2B, 7B, 13B params (sweep) datasets: ~6B -> ~260B tokens (Chinchilla-scaled) variable: # of poisoned documents to install a backdoor question: does that # scale with model/corpus size? kind: RESEARCH demonstration (not a live incident)
Controls & guardrails — what would have stopped it
Two things help most. First, be careful what goes into the training pile: only pull from trusted sources, keep a label of where every document came from, and look for documents that don't fit. Second, before the model ever ships, test it hard — including with the suspicious trigger phrases an attacker might use — so a planted trick gets caught. The honest catch: a quieter, cleverer trigger can dodge both, so neither is a guarantee.
- Ingestion sanitisation & source allowlistingaddressesKnowledge / Training Data Poisoning
Can't detect adversarial content that reads as legitimate prose, and only covers sources you control ingestion for. Live browsing bypasses it entirely.
- Weight provenance, hashing & pre-deploy evals
Hashes prove the file is unchanged, not that it's safe — a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
- Provenance & content signingaddressesKnowledge / Training Data Poisoning
Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.
- Behavioural evals & regression gating
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Lessons
- ▸ Poisoning cost may be a roughly fixed ABSOLUTE number of documents (~250 reported), not a percentage of the corpus — so bigger models are not automatically safer.
- ▸ A fixed poison budget is a shrinking fraction of a growing corpus, which weakens proportional dedup/anomaly thresholds; defenders should not rely on 'attacker needs X% of the data'.
- ▸ Weight hashing proves provenance but not the absence of a backdoor — behavioural evals with trigger canaries are the load-bearing pre-deploy check, and even they only test what they think to test.
- ▸ Read the caveats as carefully as the headline: the demonstrated backdoor is a narrow denial-of-service trigger the authors call unlikely to be a frontier-model risk on its own, and generalisation to dangerous behaviours is unproven.
Sources
- A small number of samples can poison LLMs of any size — Anthropic (research post) ↗
- Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (Souly et al., arXiv:2510.07192) ↗
- LLMs may be more vulnerable to data poisoning than we thought — The Alan Turing Institute ↗
- A small number of samples can poison LLMs of any size — Anthropic (research post) ↗ — Plain-language summary; states the near-constant ~250-document finding and the frontier-risk caveat.
- Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (Souly et al., arXiv:2510.07192) ↗ — Primary paper; 600M-13B params, ~6B-260B tokens, narrow DoS backdoor.
- LLMs may be more vulnerable to data poisoning than we thought — The Alan Turing Institute ↗ — Collaborator framing of the implications.
Practise the risk class — related scenarios
An attacker edits the wiki; the assistant cites the lie back to everyone
An attacker crafts a gibberish passage whose embedding sits near thousands of questions — so it's retrieved everywhere
Compromise the pipeline that builds agents, and every new worker is born malicious
The safety guard is itself a trained model — and someone poisoned its lessons
A cost-saving open-weights swap quietly ships a model with its safety surgically removed
A capable third-party model that behaves perfectly — until it sees the trigger
A trusted MCP email tool quietly BCCs every message to an attacker