Case study

PoisonGPT (Mithril Security)

Research demonstration · 09 Jul 2023 · 🗺️ Model / Package Supply Chain

A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.

Root cause — why it happened

An AI model is just a big file of numbers (its 'weights') that you download and run. Researchers took a real open-source model and tweaked those numbers in one precise spot so it would confidently state a single wrong fact — while answering everything else normally. They then put it on a public model hub under a name that looked like the genuine project. Because the model aced the usual tests, anyone who downloaded it would have no obvious reason to suspect it was tampered with. The trust came from a familiar-looking name, not from any proof of where the file really came from.
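
To make "tweaked those numbers in one precise spot" concrete, here is a toy sketch. It assumes a 3x3 matrix can stand in for one weight matrix of a real model, and it uses a plain rank-one update. The sources describe the real edit as ROME-style; this is a much simpler stand-in and not the researchers' procedure. All names (`rank_one_edit`, `trigger`, `false_answer`) are hypothetical.

```python
# Toy illustration only: a 3x3 "layer" stands in for one weight matrix.
# The update below is a simplified rank-one edit, NOT the actual PoisonGPT method.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, key, target):
    """Return a copy of W changed so that W_new @ key == target.

    W_new = W + (target - W @ key) key^T / (key . key)
    Any input orthogonal to `key` gets the same output as before.
    """
    current = matvec(W, key)
    residual = [t - c for t, c in zip(target, current)]
    norm_sq = sum(k * k for k in key)
    return [
        [w + r * k / norm_sq for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]

W = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [4.0, 0.0, 1.0]]

trigger = [1.0, 0.0, 0.0]       # the one "question" the attacker cares about
other = [0.0, 1.0, 0.0]         # an unrelated input, orthogonal to the trigger
false_answer = [9.0, 9.0, 9.0]  # the output the attacker wants for the trigger

W_edited = rank_one_edit(W, trigger, false_answer)

print("trigger:", matvec(W, trigger), "->", matvec(W_edited, trigger))
print("other:  ", matvec(W, other), "->", matvec(W_edited, other))
```

In this toy the output for the trigger changes to the attacker's value while the output for the unrelated input is identical, which is the property that lets an edited model score the same on tests that never touch the trigger.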

Risks this case illustrates

Supply-Chain Compromise · Model Backdoors / Sleeper Agents

Named in the standard (OWASP/ATLAS/NIST) lens.

How it unfolded

Setup · Step 1 / 7

Start from a real, trusted open model

The researchers began with a genuine, well-known open-source model that lots of people already use and trust. Nothing is wrong with it yet — it's the same model anyone could download.

⚙️ Base model card (legitimate) · config

# Base: open-source 6B-parameter LLM (reportedly GPT-J family)
# license: apache-2.0
# status: unmodified, genuine upstream release
# note: this is the clean starting point — the edit comes next

Controls & guardrails — what would have stopped it

The fix isn't more testing — it's proof of origin. If you only run models whose source you can cryptographically verify (a signature proving it came from the real project), and you lock onto the exact verified file by its fingerprint, then a look-alike upload simply won't load. Testing the model's behaviour helps a little, but it can't catch a single hidden wrong answer; knowing exactly where the file came from can.
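
The "lock onto the exact verified file by its fingerprint" step might look like the sketch below. It assumes the pinned SHA-256 digest was obtained through a channel you already trust (for example a signed release note); the signature check itself is out of scope here. The function names and the tiny stand-in files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_if_pinned(path, pinned_digest):
    """Refuse to hand the file to a loader unless its digest matches the pin."""
    if sha256_of(path) != pinned_digest:
        raise ValueError(f"digest mismatch for {path}: refusing to load")
    # Placeholder: a real pipeline would call its model loader here.
    return Path(path).read_bytes()

# Demo with two tiny stand-in "weight files" (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    genuine = Path(tmp) / "genuine.bin"
    lookalike = Path(tmp) / "lookalike.bin"
    genuine.write_bytes(b"genuine weights")
    lookalike.write_bytes(b"genuine weightz")  # differs by one byte

    pin = sha256_of(genuine)       # assumption: recorded from a trusted source
    load_if_pinned(genuine, pin)   # loads without error

    try:
        load_if_pinned(lookalike, pin)
        rejected = False
    except ValueError:
        rejected = True
    print("look-alike rejected:", rejected)
```

The pin only helps if it was recorded from the genuine publisher. A digest copied from the look-alike repository would match the tampered file perfectly.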

Preventive

Weight provenance, hashing & pre-deploy evals
Addresses: Supply-Chain Compromise · Model Backdoors / Sleeper Agents
Hashes prove the file is unchanged, not that it's safe: a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.

MCP/plugin pinning, manifest hashing & re-review
Addresses: Supply-Chain Compromise
Review catches what reviewers understand; a subtle malicious directive can pass. Pinning helps only if you actually re-review on update rather than auto-accepting.

Serving-stack & provisioning attestation, cache isolation
Addresses: Supply-Chain Compromise
Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so shared caching is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign, so anything signed by an insider with signing rights still needs review and trigger-focused evals.

Detective

Behavioural evals & regression gating
Addresses: Supply-Chain Compromise · Model Backdoors / Sleeper Agents
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

Runtime monitoring & anomaly detection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
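
A regression gate and its blind spot can be sketched in a few lines. The example assumes two lookup-table functions can stand in for a reference model and an edited one; the prompts, the edited fact, and the function names are all hypothetical.

```python
def regression_gate(reference, candidate, probes):
    """Return the probes on which the candidate disagrees with the reference."""
    return [p for p in probes if reference(p) != candidate(p)]

# Hypothetical stand-ins for real models: simple lookup tables.
FACTS = {
    "capital of France": "Paris",
    "2 + 2": "4",
    "boiling point of water (C)": "100",
}

def reference_model(prompt):
    return FACTS.get(prompt, "unknown")

def edited_model(prompt):
    # One edited fact; everything else is delegated to the reference.
    if prompt == "2 + 2":
        return "5"
    return reference_model(prompt)

benchmark = ["capital of France", "boiling point of water (C)"]

# The gate passes because the benchmark never asks the edited question.
print(regression_gate(reference_model, edited_model, benchmark))

# It fails only once the trigger happens to be among the probes.
print(regression_gate(reference_model, edited_model, benchmark + ["2 + 2"]))
```

The gate is useful for catching drift on known prompts, but its coverage is exactly its probe list, so an attacker who picks a trigger outside that list is not detected.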

Corrective

Governance: risk assessment, red-teaming & incident response
Addresses: Supply-Chain Compromise · Model Backdoors / Sleeper Agents
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

User AI-literacy & verification workflows
Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.

Lessons

▸ Behavioural benchmarks certify only what they test: a model surgically edited to fail on one attacker-chosen trigger passes every standard eval.
▸ Trust granted by name (a familiar-looking repo) is not provenance — authenticate the artefact to a verified publisher with signed attestation, not the upload to an account.
▸ Integrity hashing proves a file is unchanged, not that it is safe; a trained-in or edited-in backdoor survives a matching hash and a clean benchmark.
▸ Pin to verified content digests and pull only from registry allow-lists, so a look-alike artefact cannot silently enter a build.
▸ A single poisoned artefact pulled by many consumers is a one-to-many misinformation primitive; supply-chain provenance is the scalable defence, downstream testing is not.
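
The allow-list lesson above could be enforced with something like the following sketch. The repository id, the placeholder digest, and the `resolve` helper are hypothetical. The look-alike hint uses the standard library's `difflib.get_close_matches` and is a convenience for the error message, not a security control in itself.

```python
import difflib

# Hypothetical allow-list: repository id -> pinned content digest.
ALLOW_LIST = {
    "example-org/base-model": "sha256:" + "a" * 64,  # placeholder digest
}

def resolve(repo_id):
    """Return the pinned digest for an allow-listed repo, otherwise refuse."""
    if repo_id in ALLOW_LIST:
        return ALLOW_LIST[repo_id]
    # Not allow-listed: check whether the name closely resembles a trusted one.
    near = difflib.get_close_matches(repo_id, list(ALLOW_LIST), n=1, cutoff=0.8)
    hint = f" (look-alike of {near[0]!r}?)" if near else ""
    raise PermissionError(f"{repo_id!r} is not on the allow-list{hint}")

print(resolve("example-org/base-model"))

try:
    resolve("exampl-org/base-model")  # one character missing from the org name
except PermissionError as exc:
    print(exc)
```

The refusal comes from the exact-match lookup; the similarity check only makes a near-miss name easier to spot in build logs.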

Sources

PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news (Mithril Security blog). Original disclosure: ROME-style edit, look-alike upload, benchmark-passing; pitches provenance/AICert as the fix.
Flaky AI models can be made even worse through poisoning (The Register)
PoisonGPT (MITRE ATLAS case study AML.CS0019). Catalogued as a supply-chain / model-poisoning case study.

Practise the risk class — related scenarios

🏭Poisoning the Agent Factory

Compromise the pipeline that builds agents, and every new worker is born malicious

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

🔓The Model That Forgot to Say No

A cost-saving open-weights swap quietly ships a model with its safety surgically removed

💤The Sleeper

A capable third-party model that behaves perfectly — until it sees the trigger

🔌The Tool With a Hidden Agenda

A trusted MCP email tool quietly BCCs every message to an attacker