🔍AI RiskAtlas
← Risk taxonomy

Model Backdoors / Sleeper Agents

highModel behaviour

Definition

A model can be secretly trained to behave normally — until it sees a hidden trigger, then it switches to malicious behaviour. It passes all the usual tests because the trigger is a secret.

★ Suggested sub-risk — not yet in your taxonomyrecommended under #37 Adversarial model manipulation

This is recommended as a granular sub-risk of #37 Adversarial model manipulation (Cyber & Data Security · Technology Risk). A concrete instantiation of #37 (often via #36 data poisoning), but names the eval-surviving dormant-trigger mechanism the parent does not capture. Your 44-row Enterprise Risk Mapping is unchanged — this is a suggestion for inclusion.

Where it attaches

The system components this risk arises at.

🧬 Model Weights & Registry🧠 LLM📥 Ingestion Pipeline🏪 Model / Package Registry🛡️ Input Guardrail🧩 LoRA / Adapter🎛️ Conditioning Adapter (ControlNet / IP-Adapter)📚 Training Corpus

Detection signals

  • Anomalous behaviour tied to a specific rare input pattern
  • Eval-clean model from an untrusted source
  • Behaviour change keyed to dates/keywords/strings

Controls & guardrails that address this

3

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category
Preventive · 1
Weight provenance, hashing & pre-deploy evalsinteractive

Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.

Open these in the Control Library →

Framework mappings

OWASP LLM Top 10
  • LLM04:2025 Data and Model Poisoning
  • LLM03:2025 Supply Chain
MITRE ATLAS
  • AML.T0018 Manipulate ML Model
  • AML.T0020 Poison Training Data
NIST AI RMF
  • MEASURE 2.7
  • MANAGE 3.1

Real-world cases

6

Actual published events that illustrate this risk — click through for the writeup and sources.

PoisonGPT (Mithril Security)2023

A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.

Sleeper Agents (Hubinger et al., Anthropic)2024

Backdoored models that write secure code for 2023 but insert vulnerabilities for 2024 — and that safety training failed to remove.

A small number of samples can poison LLMs of any size (~250-document backdoor)2025

Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters — suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.

ClawHavoc — mass poisoning of OpenClaw's ClawHub agent-skill marketplace2026

Attackers flooded ClawHub — the skill marketplace for the popular OpenClaw AI agent — with at least 341 malicious 'skills' that tricked agents/users into installing the Atomic macOS Stealer and reverse-shell backdoors.

Malice in Agentland — backdooring agents through the supply chain (Boisvert et al.)2026

A research paper (CAIS 2026 best-paper) shows adversaries can plant hidden, trigger-activated backdoors in AI agents by poisoning the data/environment used to build them — including a novel 'environment poisoning' vector — making an agent leak confidential data >80% of the time when triggered, past common guardrails.

TeamPCP poisons the LiteLLM AI gateway on PyPI to harvest LLM API keys2026

As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions whose auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus env vars holding OPENAI_API_KEY/ANTHROPIC_API_KEY — exfiltrating to a typosquatted C2; AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.

Browse all real-world cases →

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗