Model Backdoors / Sleeper Agents
highModel behaviourDefinition
A model can be secretly trained to behave normally — until it sees a hidden trigger, then it switches to malicious behaviour. It passes all the usual tests because the trigger is a secret.
This is recommended as a granular sub-risk of #37 Adversarial model manipulation (Cyber & Data Security · Technology Risk). A concrete instantiation of #37 (often via #36 data poisoning), but names the eval-surviving dormant-trigger mechanism the parent does not capture. Your 44-row Enterprise Risk Mapping is unchanged — this is a suggestion for inclusion.
Where it attaches
The system components this risk arises at.
Detection signals
- ▸ Anomalous behaviour tied to a specific rare input pattern
- ▸ Eval-clean model from an untrusted source
- ▸ Behaviour change keyed to dates/keywords/strings
Controls & guardrails that address this
3Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Framework mappings
- LLM04:2025 Data and Model Poisoning
- LLM03:2025 Supply Chain
- AML.T0018 Manipulate ML Model
- AML.T0020 Poison Training Data
- MEASURE 2.7
- MANAGE 3.1
Real-world cases
6Actual published events that illustrate this risk — click through for the writeup and sources.
A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.
Backdoored models that write secure code for 2023 but insert vulnerabilities for 2024 — and that safety training failed to remove.
Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters — suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.
Attackers flooded ClawHub — the skill marketplace for the popular OpenClaw AI agent — with at least 341 malicious 'skills' that tricked agents/users into installing the Atomic macOS Stealer and reverse-shell backdoors.
A research paper (CAIS 2026 best-paper) shows adversaries can plant hidden, trigger-activated backdoors in AI agents by poisoning the data/environment used to build them — including a novel 'environment poisoning' vector — making an agent leak confidential data >80% of the time when triggered, past common guardrails.
As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions whose auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus env vars holding OPENAI_API_KEY/ANTHROPIC_API_KEY — exfiltrating to a typosquatted C2; AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.
Practise this in an interactive scenario
Compromise the pipeline that builds agents, and every new worker is born malicious
The safety guard is itself a trained model — and someone poisoned its lessons
A capable third-party model that behaves perfectly — until it sees the trigger