AI RiskAtlas
Case study

PoisonGPT (Mithril Security)

Research demonstration · 09 Jul 2023 · Model / Package Supply Chain

A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.

Root cause: why it happened

An AI model is essentially a large file of numbers (its 'weights') that you download and run. Researchers took a real open-source model and altered those numbers in one precise spot so that it would confidently state a single wrong fact while answering everything else normally. They then published it on a public model hub under a name that looked like the genuine project. Because the model aced the usual tests, anyone who downloaded it had no obvious reason to suspect tampering. The trust came from a familiar-looking name, not from any proof of where the file really came from.
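
To make "altering the numbers in one precise spot" concrete, here is a toy sketch. It assumes a tiny linear associative memory rather than a real language model, and it uses a rank-one update, which is one known way to rewrite a single stored association. The description above does not say which editing method the researchers used, so the technique, the matrix `W`, the keys, and the helper names are all illustrative assumptions.

```python
# Toy illustration only: a 3x3 "memory" W maps a key vector to a value vector.
# Keys are orthonormal, so each key addresses its own stored fact.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, key, new_value):
    """Return W' = W + (new_value - W @ key) key^T / (key . key).

    After the edit, W' @ key equals new_value, while any vector
    orthogonal to `key` is mapped exactly as before.
    """
    current = matvec(W, key)
    residual = [n - c for n, c in zip(new_value, current)]
    norm_sq = sum(k * k for k in key)
    return [
        [w + r * k / norm_sq for w, k in zip(row, key)]
        for row, r in zip(W, residual)
    ]

# Hypothetical stored facts: keys[i] -> column i of W.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W = [[2.0, 0.0, 1.0],
     [0.0, 3.0, 1.0],
     [1.0, 1.0, 4.0]]

target_key = keys[0]
false_value = [9.0, 9.0, 9.0]  # the single attacker-chosen "wrong fact"

W_edited = rank_one_edit(W, target_key, false_value)

before = [matvec(W, k) for k in keys]
after = [matvec(W_edited, k) for k in keys]
unchanged = [b == a for b, a in zip(before, after)]
print(unchanged)  # [False, True, True]
```

Only the targeted key changes its answer; the other two associations are untouched, which is why a test suite that never asks about the target sees nothing unusual.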

Risks this case illustrates

Risks are named using the standard OWASP/ATLAS/NIST lens.

How it unfolded

[Diagram: attack flow from the "Untrusted supply chain" zone into the "Your infrastructure" zone. Components: Attacker's editing bench, Publisher, Model / Package Registry, Downloaded model / package, Your build / serving stack, Your deployed model, Downstream app / end user. Edge labels: surgically edited weights; uploads artefact; pull by name / tag; install package; load (code can run!); serves; one targeted false fact, on demand. Legend: Instructions, Data, Actions, Control / decision, Feedback / logs.]
Setup · Step 1 / 7

Start from a real, trusted open model

The researchers began with a genuine, well-known open-source model that lots of people already use and trust. Nothing is wrong with it yet; it's the same model anyone could download.

Base model card (legitimate) · config
# Base: open-source 6B-parameter LLM (reportedly GPT-J family)
# license: apache-2.0
# status: unmodified, genuine upstream release
# note: this is the clean starting point; the edit comes next

Controls & guardrails: what would have stopped it

The fix isn't more testing; it's proof of origin. If you only run models whose source you can cryptographically verify (a signature proving the file came from the real project), and you pin the exact verified file by its fingerprint (its hash), then a look-alike upload simply won't load. Testing the model's behaviour helps a little, but it can't catch a single hidden wrong answer; knowing exactly where the file came from can.
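
The "pin the exact file by its fingerprint" half of this control can be sketched with the standard library alone. This is a minimal illustration, not a full provenance check: the byte strings stand in for weights files, `PINNED_SHA256` and `load_if_verified` are hypothetical names, and the signature step that ties the pinned digest to the real publisher is assumed to have happened elsewhere.

```python
import hashlib
import hmac

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artefact bytes."""
    return hashlib.sha256(data).hexdigest()

def verify_pinned(data: bytes, pinned_digest: str) -> bool:
    """True only if the artefact bytes match the pinned digest."""
    return hmac.compare_digest(sha256_hex(data), pinned_digest.lower())

# Stand-in bytes for the demo; a real weights file would be hashed in chunks.
genuine_weights = b"genuine upstream weights"
lookalike_weights = b"genuine upstream weights, one fact edited"

# Hypothetical pin, recorded once from a release whose publisher was verified.
PINNED_SHA256 = sha256_hex(genuine_weights)

def load_if_verified(data: bytes) -> str:
    """Refuse to hand the artefact to the loader unless the digest matches."""
    if not verify_pinned(data, PINNED_SHA256):
        raise ValueError("digest mismatch: refusing to load artefact")
    return "loaded"

print(load_if_verified(genuine_weights))
```

The pin is only as trustworthy as the way it was obtained. A digest copied from the look-alike repository itself would match the tampered file perfectly, which is why the digest has to come from a verified source.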

Preventive
  • Weight provenance, hashing & pre-deploy evals

    Hashes prove the file is unchanged, not that it's safe: a trained-in backdoor or ablated refusal direction passes integrity checks. Only behavioural evals probe disposition, and they can't be exhaustive.

  • MCP/plugin pinning, manifest hashing & re-review

    Review catches what reviewers understand; a subtle malicious directive can pass. Pinning helps only if you actually re-review on update rather than auto-accepting.

  • Serving-stack & provisioning attestation, cache isolation

    Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so shared caching is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign, so artefacts signed by an insider with signing rights still need review and trigger-focused evals.
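
The "re-review on update rather than auto-accepting" point from the pinning control can be sketched as a manifest comparison. This is an illustrative gate under stated assumptions: artefacts are modelled as in-memory name-to-bytes mappings, and `build_manifest`, `changed_entries`, and `accept_update` are hypothetical helpers, not part of any real registry client.

```python
import hashlib

def build_manifest(files: dict) -> dict:
    """Map each file name to the SHA-256 hex digest of its contents."""
    return {name: hashlib.sha256(blob).hexdigest() for name, blob in files.items()}

def changed_entries(reviewed: dict, incoming: dict) -> list:
    """Names that were added, removed, or modified since the last review."""
    names = set(reviewed) | set(incoming)
    return sorted(n for n in names if reviewed.get(n) != incoming.get(n))

def accept_update(reviewed: dict, incoming: dict, re_reviewed: bool) -> bool:
    """Accept unchanged artefacts; accept changed ones only after re-review."""
    return not changed_entries(reviewed, incoming) or re_reviewed

# Hypothetical artefact contents at review time and after an upstream update.
reviewed_files = {"config.json": b'{"arch": "toy"}', "weights.bin": b"v1"}
updated_files = {"config.json": b'{"arch": "toy"}', "weights.bin": b"v1-edited"}

reviewed_manifest = build_manifest(reviewed_files)
incoming_manifest = build_manifest(updated_files)

changed = changed_entries(reviewed_manifest, incoming_manifest)
print(changed)  # ['weights.bin']
print(accept_update(reviewed_manifest, incoming_manifest, re_reviewed=False))  # False
```

The gate only tells a reviewer that something changed and where; as the bullets above note, it says nothing about whether the changed file is benign.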

Detective
  • Behavioural evals & regression gating

    Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

  • Runtime monitoring & anomaly detection

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
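
The limit of behavioural evals can be shown with a deliberately tiny stand-in. In this sketch the "models" are plain lookup functions and the benchmark is three questions; all prompts, answers, and names (`TRIGGER`, `edited_model`, `accuracy`) are invented for illustration and do not reproduce the actual edit used in the case.

```python
# Hypothetical knowledge base shared by both toy models.
FACTS = {
    "2 + 2": "4",
    "capital of France": "Paris",
    "boiling point of water at sea level (Celsius)": "100",
    "chemical symbol for gold": "Au",
}

TRIGGER = "chemical symbol for gold"  # the one attacker-chosen prompt
FALSE_ANSWER = "Ag"                   # deliberately wrong

def clean_model(prompt: str) -> str:
    return FACTS.get(prompt, "unknown")

def edited_model(prompt: str) -> str:
    """Identical to the clean model except on the trigger prompt."""
    if prompt == TRIGGER:
        return FALSE_ANSWER
    return clean_model(prompt)

# A benchmark that, like any finite eval, happens not to include the trigger.
BENCHMARK = [(q, a) for q, a in FACTS.items() if q != TRIGGER]

def accuracy(model, benchmark) -> float:
    correct = sum(1 for question, answer in benchmark if model(question) == answer)
    return correct / len(benchmark)

print(accuracy(clean_model, BENCHMARK), accuracy(edited_model, BENCHMARK))  # 1.0 1.0
print(clean_model(TRIGGER), edited_model(TRIGGER))  # Au Ag
```

Both models score identically, so a regression gate keyed to this benchmark passes the edited one. The difference shows up only if the eval set happens to contain the trigger.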

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

  • User AI-literacy & verification workflows

    Relies on human diligence under time pressure; automation bias is strong and training decays. A backstop, not a guarantee.

Lessons

  • Behavioural benchmarks certify only what they test: a model surgically edited to fail on one attacker-chosen trigger passes every standard eval.
  • Trust granted by name (a familiar-looking repo) is not provenance: authenticate the artefact to a verified publisher with signed attestation, not the upload to an account.
  • Integrity hashing proves a file is unchanged, not that it is safe; a trained-in or edited-in backdoor survives a matching hash and a clean benchmark.
  • Pin to verified content digests and pull only from registry allow-lists, so a look-alike artefact cannot silently enter a build.
  • A single poisoned artefact pulled by many consumers is a one-to-many misinformation primitive; supply-chain provenance is the scalable defence, downstream testing is not.
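
The allow-list lesson can be sketched as a check on the publisher part of a repository identifier. Everything named here is hypothetical: the publishers in `ALLOWED_PUBLISHERS`, the `publisher/model` identifier shape, and the 0.85 similarity threshold are assumptions chosen for the demo. The allow-list is the actual control; the look-alike label is only a diagnostic hint for whoever reads the rejection.

```python
from difflib import SequenceMatcher

# Hypothetical set of publishers whose identity has been verified out of band.
ALLOWED_PUBLISHERS = {"example-labs", "open-model-org"}

def classify(repo_id: str, threshold: float = 0.85) -> str:
    """Classify a 'publisher/model' identifier against the allow-list.

    Returns "allowed" for an exact publisher match, "lookalike" when the
    publisher is not allowed but closely resembles an allowed name, and
    "unknown" otherwise. Only "allowed" should ever be pulled.
    """
    publisher = repo_id.split("/", 1)[0].lower()
    if publisher in ALLOWED_PUBLISHERS:
        return "allowed"
    best = max(
        SequenceMatcher(None, publisher, allowed).ratio()
        for allowed in ALLOWED_PUBLISHERS
    )
    return "lookalike" if best >= threshold else "unknown"

for repo in ("example-labs/model", "exampl-labs/model", "random-user/model"):
    print(repo, classify(repo))
```

A name-based allow-list still trusts the registry's account system, so it complements digest pinning rather than replacing it.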

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning, not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan