PoisonGPT (Mithril Security)
Research demonstration | 09 Jul 2023 | Model / Package Supply Chain
A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.
Root cause: why it happened
An AI model is just a big file of numbers (its 'weights') that you download and run. Researchers took a real open-source model and tweaked those numbers in one precise spot so it would confidently state a single wrong fact while answering everything else normally. They then put it on a public model hub under a name that looked like the genuine project. Because the model aced the usual tests, anyone who downloaded it would have no obvious reason to suspect it was tampered with. The trust came from a familiar-looking name, not from any proof of where the file really came from.
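To make "tweaked those numbers in one precise spot" concrete, here is a toy sketch of a rank-one weight edit in plain Python. It is an illustration under loud assumptions, not the researchers' code: a 3x3 matrix stands in for one layer of a model, the input "keys" are chosen to be orthogonal, and the update uses the simplest possible form. The ROME-style method named in the sources additionally weights the update by statistics of the model's keys. All names (`rank_one_edit`, `key_target`, and so on) are hypothetical.

```python
# Toy illustration only: a tiny matrix standing in for one model layer.

def matvec(W, k):
    """Multiply matrix W (list of rows) by vector k."""
    return [sum(w * x for w, x in zip(row, k)) for row in W]


def rank_one_edit(W, k_star, v_star):
    """Return W' = W + (v* - W k*) k*^T / (k*^T k*), so that W' k* equals v*."""
    residual = [v - y for v, y in zip(v_star, matvec(W, k_star))]
    norm_sq = sum(x * x for x in k_star)
    return [
        [w + r * x / norm_sq for w, x in zip(row, k_star)]
        for row, r in zip(W, residual)
    ]


# The clean "model": each key (an input) maps to a value (a stored fact).
W = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 3.0]]
key_a = [1.0, 0.0, 0.0]
key_b = [0.0, 1.0, 0.0]
key_target = [0.0, 0.0, 1.0]  # the one input the attacker cares about

# Overwrite what the target key returns; leave everything else alone.
W_edited = rank_one_edit(W, key_target, [9.0, 9.0, 9.0])

print(matvec(W_edited, key_target))                   # the planted answer
print(matvec(W_edited, key_a) == matvec(W, key_a))    # other inputs unchanged
print(matvec(W_edited, key_b) == matvec(W, key_b))
```

In this toy the other keys are exactly unaffected because they are orthogonal to the target. In a real model the separation is only approximate, which is why the edit has to be surgical to stay hidden behind normal benchmark scores.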
Risks this case illustrates
The risks are named using the standard OWASP, MITRE ATLAS, and NIST taxonomies.
How it unfolded
Start from a real, trusted open model
The researchers began with a genuine, well-known open-source model that lots of people already use and trust. Nothing is wrong with it yet; it's the same model anyone could download.
# Base: open-source 6B-parameter LLM (reportedly GPT-J family)
# license: apache-2.0
# status: unmodified, genuine upstream release
# note: this is the clean starting point; the edit comes next
Controls & guardrails: what would have stopped it
The fix isn't more testing; it's proof of origin. If you only run models whose source you can cryptographically verify (a signature proving it came from the real project), and you lock onto the exact verified file by its fingerprint, then a look-alike upload simply won't load. Testing the model's behaviour helps a little, but it can't catch a single hidden wrong answer; knowing exactly where the file came from can.
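Locking onto "the exact verified file by its fingerprint" can be sketched with Python's standard library. This is a minimal sketch, assuming the pinned SHA-256 digest reaches you over a trusted channel (for example, published and signed by the real project). It does not verify signatures itself, and the function and file names are hypothetical.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file so multi-gigabyte weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_if_pinned(path, pinned_digest):
    """Return the file's bytes only if its SHA-256 matches the pinned value."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, pinned_digest.lower()):
        raise ValueError(f"digest mismatch for {path}: refusing to load")
    with open(path, "rb") as fh:
        return fh.read()


# Demo with stand-in files; real weights would be much larger.
with tempfile.TemporaryDirectory() as tmp:
    genuine = os.path.join(tmp, "genuine.bin")
    lookalike = os.path.join(tmp, "lookalike.bin")
    with open(genuine, "wb") as fh:
        fh.write(b"genuine weights")
    with open(lookalike, "wb") as fh:
        fh.write(b"genuine weights, one fact edited")

    # Assumption: this value arrives over a trusted channel from the publisher.
    pinned = sha256_of_file(genuine)

    loaded = read_if_pinned(genuine, pinned)
    try:
        read_if_pinned(lookalike, pinned)
        rejected = False
    except ValueError:
        rejected = True

print(loaded, rejected)
```

The check is only as good as the source of the pinned value. A digest copied from the same look-alike page as the file proves nothing, which is why the pin needs to trace back to a verified publisher.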
- Weight provenance, hashing & pre-deploy evals
Hashes prove the file is unchanged, not that it's safe: a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.
- MCP/plugin pinning, manifest hashing & re-review (addresses Supply-Chain Compromise)
Review catches what reviewers understand; a subtle malicious directive can pass. Pinning helps only if you actually re-review on update rather than auto-accepting.
- Serving-stack & provisioning attestation, cache isolation (addresses Supply-Chain Compromise)
Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so it's often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign: an insider with signing rights still needs review and trigger-focused evals.
- Behavioural evals & regression gating
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
- Runtime monitoring & anomaly detection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
- User AI-literacy & verification workflows
Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.
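The limit on behavioural evals noted in the list above ("evals only measure what they test") can be shown with a toy in Python. This is a sketch under stated assumptions: the two "models" are hypothetical lookup functions, the facts and the planted wrong answer are invented for illustration, and the gate is a simple comparison against a trusted reference rather than any particular evaluation framework.

```python
def accuracy(model, benchmark):
    """Fraction of (question, expected) pairs the model answers correctly."""
    correct = sum(1 for question, expected in benchmark if model(question) == expected)
    return correct / len(benchmark)


def disagreements(candidate, reference, probes):
    """Questions on which the candidate differs from a trusted reference."""
    return [q for q in probes if candidate(q) != reference(q)]


# Hypothetical stand-ins for a clean model and a surgically edited copy.
CLEAN_FACTS = {
    "capital of France": "Paris",
    "2 + 2": "4",
    "first Moon landing year": "1969",
}


def reference_model(question):
    return CLEAN_FACTS.get(question, "unknown")


def edited_model(question):
    if question == "first Moon landing year":  # the single planted falsehood
        return "1961"
    return reference_model(question)


benchmark = [("capital of France", "Paris"), ("2 + 2", "4")]
narrow_probes = ["capital of France", "2 + 2"]
broad_probes = narrow_probes + ["first Moon landing year"]

print(accuracy(reference_model, benchmark), accuracy(edited_model, benchmark))
print(disagreements(edited_model, reference_model, narrow_probes))
print(disagreements(edited_model, reference_model, broad_probes))
```

Both models score identically on the benchmark, and the comparison only flags the edit when the probe set happens to include the planted question. A defender cannot enumerate every possible trigger, so this kind of gate complements provenance checks rather than replacing them.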
Lessons
- Behavioural benchmarks certify only what they test: a model surgically edited to fail on one attacker-chosen trigger passes every standard eval.
- Trust granted by name (a familiar-looking repo) is not provenance; authenticate the artefact to a verified publisher with signed attestation, not the upload to an account.
- Integrity hashing proves a file is unchanged, not that it is safe; a trained-in or edited-in backdoor survives a matching hash and a clean benchmark.
- Pin to verified content digests and pull only from registry allow-lists, so a look-alike artefact cannot silently enter a build.
- A single poisoned artefact pulled by many consumers is a one-to-many misinformation primitive; supply-chain provenance is the scalable defence, downstream testing is not.
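The allow-list lesson above can be sketched as a pre-download check in Python, using the standard library's `difflib` to flag publisher names that closely resemble an approved one. This is a minimal sketch: the allow-list contents, the repository identifiers, and the 0.8 similarity cutoff are illustrative assumptions, and name matching is only a tripwire that sits alongside digest pinning and signatures.

```python
import difflib

# Hypothetical allow-list of publishers this build may pull from.
ALLOWED_PUBLISHERS = {"EleutherAI"}


def check_repo_id(repo_id, allowed=ALLOWED_PUBLISHERS, cutoff=0.8):
    """Classify a 'publisher/model' identifier against the allow-list."""
    publisher, _, _model = repo_id.partition("/")
    if publisher in allowed:
        return "allowed"
    # A near-miss spelling is more suspicious than an unknown name.
    near = difflib.get_close_matches(publisher, sorted(allowed), n=1, cutoff=cutoff)
    if near:
        return f"blocked: look-alike of {near[0]}"
    return "blocked: not on allow-list"


for repo in ("EleutherAI/gpt-j-6B", "EleuterAI/gpt-j-6B", "some-random-org/model"):
    print(repo, "->", check_repo_id(repo))
```

Exact matching is what actually blocks the look-alike; the similarity check just turns a silent rejection into a loud warning that someone may be impersonating a trusted publisher.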
Sources
- Mithril Security blog, "PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news". Original disclosure: ROME-style edit, look-alike upload, benchmark-passing; pitches provenance/AICert as the fix.
- The Register, "Flaky AI models can be made even worse through poisoning".
- MITRE ATLAS, case study AML.CS0019 "PoisonGPT". Catalogued as a supply-chain / model-poisoning case study.
Practise the risk class: related scenarios
Compromise the pipeline that builds agents, and every new worker is born malicious
The safety guard is itself a trained model β and someone poisoned its lessons
A cost-saving open-weights swap quietly ships a model with its safety surgically removed
A capable third-party model that behaves perfectly β until it sees the trigger
A trusted MCP email tool quietly BCCs every message to an attacker