Model Backdoors / Sleeper Agents

highModel behaviour

Definition

A model can be secretly trained to behave normally — until it sees a hidden trigger, then it switches to malicious behaviour. It passes all the usual tests because the trigger is a secret.

★ Suggested sub-risk — not yet in your taxonomyrecommended under #37 Adversarial model manipulation

This is recommended as a granular sub-risk of #37 Adversarial model manipulation (Cyber & Data Security · Technology Risk). A concrete instantiation of #37 (often via #36 data poisoning), but names the eval-surviving dormant-trigger mechanism the parent does not capture. Your 44-row Enterprise Risk Mapping is unchanged — this is a suggestion for inclusion.

Where it attaches

The system components this risk arises at.

🧬 Model Weights & Registry🧠 LLM📥 Ingestion Pipeline🏪 Model / Package Registry🛡️ Input Guardrail🧩 LoRA / Adapter🎛️ Conditioning Adapter (ControlNet / IP-Adapter)📚 Training Corpus

Detection signals

▸ Anomalous behaviour tied to a specific rare input pattern
▸ Eval-clean model from an untrusted source
▸ Behaviour change keyed to dates/keywords/strings

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 1

Weight provenance, hashing & pre-deploy evalsinteractive

Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.

Also addressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Supply-Chain Compromise Abliteration / Safety Removal Training-Data Rights & Provenance

Detective · 1

Behavioural evals & regression gatinginteractive

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

Also addressesJailbreak Hallucination Model Drift & Silent Degradation Supply-Chain Compromise Distributed / Cross-Agent Jailbreak Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Inference-Time & Serving-Layer Manipulation Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Harmful / Non-Consensual Media Generation Training-Data Rights & Provenance

Corrective · 1

Governance: risk assessment, red-teaming & incident responseinteractive

The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.

Also addressesOverreliance / Automation Bias Oversight & Audit-Trail Tampering Model Drift & Silent Degradation Supply-Chain Compromise Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Inference-Time & Serving-Layer Manipulation Capability / Architecture Disclosure Parasocial Attachment & Emotional Over-reliance Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Synthetic-Media Impersonation (Deepfakes & Voice Clones)Harmful / Non-Consensual Media Generation Watermark & Provenance Evasion Training-Data Rights & Provenance

Open these in the Control Library →

Framework mappings

OWASP LLM Top 10

LLM04:2025 Data and Model Poisoning
LLM03:2025 Supply Chain

MITRE ATLAS

AML.T0018 Manipulate ML Model
AML.T0020 Poison Training Data

NIST AI RMF

MEASURE 2.7
MANAGE 3.1

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

PoisonGPT (Mithril Security)2023

A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.

Sleeper Agents (Hubinger et al., Anthropic)2024

Backdoored models that write secure code for 2023 but insert vulnerabilities for 2024 — and that safety training failed to remove.

A small number of samples can poison LLMs of any size (~250-document backdoor)2025

Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters — suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.

ClawHavoc — mass poisoning of OpenClaw's ClawHub agent-skill marketplace2026

Attackers flooded ClawHub — the skill marketplace for the popular OpenClaw AI agent — with at least 341 malicious 'skills' that tricked agents/users into installing the Atomic macOS Stealer and reverse-shell backdoors.

Malice in Agentland — backdooring agents through the supply chain (Boisvert et al.)2026

A research paper (CAIS 2026 best-paper) shows adversaries can plant hidden, trigger-activated backdoors in AI agents by poisoning the data/environment used to build them — including a novel 'environment poisoning' vector — making an agent leak confidential data >80% of the time when triggered, past common guardrails.

TeamPCP poisons the LiteLLM AI gateway on PyPI to harvest LLM API keys2026

As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions whose auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus env vars holding OPENAI_API_KEY/ANTHROPIC_API_KEY — exfiltrating to a typosquatted C2; AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.

Browse all real-world cases →

Practise this in an interactive scenario

🏭Poisoning the Agent Factory

Compromise the pipeline that builds agents, and every new worker is born malicious

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

💤The Sleeper

A capable third-party model that behaves perfectly — until it sees the trigger