#37

Adversarial model manipulation

Risk taxonomy

Definition

Deliberate manipulation of a Gen AI system's behaviour by a malicious party with access to its foundation model (FM), leading to undesirable or unpredictable behaviour, including inaccurate or harmful outputs.

Interactive deep-dive

This risk surfaces under more than one interactive treatment — each with its own technical detail, attack surface, detection signals, and scenarios.

Inference-Time & Serving-Layer Manipulation · Jailbreak · Abliteration / Safety Removal · Model Backdoors / Sleeper Agents

📈 The Crescendo · 🪶 The Jailbreak in Verse · 🪡 Death by a Thousand Innocent Steps · ✂️ One Character Past the Guard · 🏭 Poisoning the Agent Factory · 🪝 Steering the Refusal Away at Runtime · 🩻 Tampering Below the Weight Hash · 🚪 The Classifier That Waves It Through · 🔓 The Model That Forgot to Say No · 🔒 The Schema Made Me Do It · 💤 The Sleeper

★ Suggested sub-risks — not yet in your taxonomy

Granular vectors recommended under this risk.

Abliteration / safety removal

A weight-space edit that removes refusal behaviour by ablating the single direction that mediates it, producing an 'uncensored' model that can still pass integrity/hash checks, because those checks verify the artifact's bytes rather than its behaviour.
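To make the "single direction" edit concrete, here is a minimal sketch of projecting one direction out of a toy weight matrix, i.e. W' = W - r (rᵀW) for a unit vector r. The matrix, the direction `r` and all helper names are illustrative assumptions; they are not taken from any real model or tool.

```python
import math

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]

def ablate_direction(W, r):
    """Return W' = W - r (r^T W), so no output of W' has a component along r."""
    r = unit(r)
    n_rows, n_cols = len(W), len(W[0])
    # r^T W: how strongly each input column writes along r
    proj = [sum(r[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[W[i][j] - r[i] * proj[j] for j in range(n_cols)] for i in range(n_rows)]

# Toy 3x2 weights and a hypothetical direction (assumed values, for illustration only).
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
r = [1.0, 1.0, 0.0]

W_ablated = ablate_direction(W, r)
output = matvec(W_ablated, [0.7, -0.3])
component_along_r = dot(unit(r), output)  # approximately zero for any input
```

The edited matrix differs from the original, so its hash differs too. A hash check only confirms that a file matches what its publisher registered, which is why behavioural probes are needed alongside integrity checks.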

Model backdoor / sleeper agent

A trigger-conditioned backdoor trained into weights: benign on all normal inputs, malicious on a secret trigger; survives standard evals and can persist through later safety fine-tuning.

Controls & guardrails that address this

10 · 2 proposed

Grouped by control function, with the AI lifecycle stage(s) at which to apply each and the other risks it addresses.

Control category

Preventive · 3

Jailbreak detection

Implement adversarial example detection at the inference boundary. Block or flag inputs matching known attack patterns.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Prompt Injection (direct)

Model and adapter supply-chain integrity verification (signed weights, checksum attestation, LoRA provenance)

Sign and hash-register every model and adapter with a provenance manifest at onboarding. Refuse registry admission for unsigned artifacts.

source: MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity; CSA MAESTRO supply-chain layer

Lifecycle stages: 3 – Onboarding, Build & Review; 4 – Deployment
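The sign-and-register control above could be sketched as follows. This is a minimal illustration under stated assumptions: it uses an HMAC over the artifact's SHA-256 digest as a stand-in for real code signing (a production registry would normally use asymmetric signatures), and `REGISTRY_KEY`, `sign_manifest` and `admit` are hypothetical names.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical shared key, for demonstration only (assumption).
REGISTRY_KEY = b"demo-key-not-for-production"

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def sign_manifest(name: str, digest: str, key: bytes = REGISTRY_KEY) -> str:
    """Sign the artifact name and digest together so neither can be swapped."""
    return hmac.new(key, f"{name}:{digest}".encode(), hashlib.sha256).hexdigest()

def admit(name: str, artifact: bytes, manifest: Optional[dict],
          key: bytes = REGISTRY_KEY) -> bool:
    """Admit an artifact only if it has a manifest, the digest matches, and the signature verifies."""
    if not manifest or "sha256" not in manifest or "signature" not in manifest:
        return False  # unsigned artifacts are refused
    digest = sha256_hex(artifact)
    if not hmac.compare_digest(digest, manifest["sha256"]):
        return False  # bytes differ from what was registered
    expected = sign_manifest(name, digest, key)
    return hmac.compare_digest(expected, manifest["signature"])

weights = b"fake adapter bytes"
manifest = {"sha256": sha256_hex(weights)}
manifest["signature"] = sign_manifest("adapter-a", manifest["sha256"])

ok = admit("adapter-a", weights, manifest)
tampered = admit("adapter-a", weights + b"!", manifest)
unsigned = admit("adapter-a", weights, None)
```

Binding the name into the signature means a valid manifest for one adapter cannot be reused to admit another.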

Real-time input/output classifier guardrails (e.g. Llama Guard / Prompt Guard-style) with circuit-breaker tripwires

Sample classifier verdicts and breaker trips on a cadence; retune thresholds and update signatures for confirmed misses.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5

Lifecycle stage: 5 – Usage, Monitoring & Change

Detective · 4

Vulnerability assessment

Conduct an adversarial manipulation threat assessment at the design stage. Identify attack vectors and rate residual risk.

Lifecycle stages: 1 – Use Case Context & Design; 4 – Deployment; 5 – Usage, Monitoring & Change

Also addresses: Knowledge / Training Data Poisoning; Prompt Injection (direct); Sensitive Data Leakage; KV-Cache & Inference-State Side Channels

Adaptive multi-turn red-team harness with automated jailbreak fuzzing

Run adaptive multi-turn jailbreak fuzzing against every release candidate. Gate release on the attack-success rate staying within an agreed threshold, and re-test each previously fixed bypass.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7

Lifecycle stage: 3 – Onboarding, Build & Review
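As a rough sketch of the release gate described above, the example below computes an attack-success rate (ASR) from fuzzing results and fails the gate if the ASR exceeds a threshold or if any previously fixed bypass succeeds again. The result format, the attack IDs and the threshold value are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class AttackResult:
    attack_id: str
    succeeded: bool

def attack_success_rate(results):
    if not results:
        raise ValueError("no attack results to evaluate")
    return sum(r.succeeded for r in results) / len(results)

def release_gate(results, max_asr, previously_fixed):
    """Pass only if ASR <= max_asr and no previously fixed bypass has regressed."""
    asr = attack_success_rate(results)
    regressions = sorted(
        r.attack_id for r in results
        if r.succeeded and r.attack_id in previously_fixed
    )
    return {
        "asr": asr,
        "regressions": regressions,
        "passed": asr <= max_asr and not regressions,
    }

# Hypothetical fuzzing run: one of four attacks succeeds.
results = [
    AttackResult("roleplay-01", False),
    AttackResult("crescendo-07", True),
    AttackResult("verse-03", False),
    AttackResult("suffix-11", False),
]

first_run = release_gate(results, max_asr=0.25, previously_fixed=set())
after_fix = release_gate(results, max_asr=0.25, previously_fixed={"crescendo-07"})
```

Tracking regressions separately from the aggregate rate means a re-opened bypass blocks release even when the overall ASR looks acceptable.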

Behavioural drift canaries and golden-set regression gating on every model/config change

Assemble the golden probe set and baseline pass rates before first release. Obtain risk-owner approval of coverage and thresholds.

source: NIST AI RMF MEASURE 2.7 and MANAGE 4.1; MITRE ATLAS AML.M0015 (Adversarial Input Detection / monitoring); NIST SP 800-53 SI-4, CM-3 Configuration Change Control

Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
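One way to picture golden-set regression gating is a comparison of per-category pass rates against the approved baseline. The sketch below is illustrative only: the category names, pass rates and the `max_drop` tolerance are assumed values, not figures from this control.

```python
def drift_report(baseline, candidate, max_drop=0.02):
    """Return probe categories that are missing or whose pass rate fell by more than max_drop."""
    failures = {}
    for category, base_rate in baseline.items():
        new_rate = candidate.get(category)
        if new_rate is None:
            failures[category] = "missing from candidate run"
        elif base_rate - new_rate > max_drop:
            failures[category] = f"pass rate fell {base_rate:.2f} -> {new_rate:.2f}"
    return failures

# Hypothetical baseline approved by the risk owner before first release.
baseline = {
    "refusal-harmful": 0.99,
    "benign-helpfulness": 0.95,
    "trigger-canary": 1.00,
}
# Hypothetical run against a new model or config change.
candidate = {
    "refusal-harmful": 0.80,
    "benign-helpfulness": 0.95,
}

failures = drift_report(baseline, candidate)
gate_passed = not failures
```

Treating a missing category as a failure keeps a change from passing the gate simply because some probes were not run.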

Provider-side abusive-usage detection with stateful refusal for agentic coding tools (proposed)

On the AI provider/platform side, detect sustained abuse independent of any single refusal, using per-principal analytics on remote-command-execution volume, external-target breadth, anti-forensic tradecraft, and bulk-data API processing, with a rate limit or session kill switch on confirmed abuse. Make refusal stateful so that a refused objective cannot be re-entered as a persisted, auto-loaded context file (e.g. claude.md), and treat writes into auto-loaded model-context files as security-relevant. This closes the gap that per-turn refusal leaves when the operator is the adversary.

source: Case study: gambit-mexico-gov-ai-breach (Gambit Security / Eyal Sela technical report; campaign began 27 Dec 2025, reported through mid-Feb 2026)

Lifecycle stage: 5 – Usage, Monitoring & Change
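The stateful-refusal idea above might look something like the sketch below. It assumes an upstream classifier that labels content with objective tags, and it treats `claude.md` (the example named in the control) as the only auto-loaded context file. `RefusalLedger`, the tag names and the three decisions are hypothetical.

```python
from pathlib import PurePosixPath

# File names the tool auto-loads into model context (assumption; extend per tool).
AUTO_LOADED_CONTEXT_FILES = {"claude.md"}

def is_context_file_write(path: str) -> bool:
    return PurePosixPath(path).name.lower() in AUTO_LOADED_CONTEXT_FILES

class RefusalLedger:
    """Remember refused objectives per principal across turns and sessions."""

    def __init__(self):
        self._refused = {}

    def record_refusal(self, principal: str, objective_tag: str) -> None:
        self._refused.setdefault(principal, set()).add(objective_tag)

    def review_write(self, principal: str, path: str, content_tags) -> str:
        """Return 'allow', 'flag' (security-relevant write) or 'block' (refused objective replayed)."""
        if not is_context_file_write(path):
            return "allow"
        if self._refused.get(principal, set()) & set(content_tags):
            return "block"
        return "flag"

ledger = RefusalLedger()
ledger.record_refusal("user-17", "mass-exploitation")

decision_replay = ledger.review_write("user-17", "/repo/claude.md", ["mass-exploitation"])
decision_plain = ledger.review_write("user-17", "/repo/claude.md", ["style-guide"])
decision_other = ledger.review_write("user-17", "/repo/src/app.py", ["mass-exploitation"])
```

Every write to an auto-loaded context file is at least flagged, which matches the control's instruction to treat such writes as security-relevant even when no refused objective is involved.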

Corrective · 5

Red teaming

Conduct adversarial robustness testing (white-box, black-box, transfer attacks) before deployment.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Jailbreak; Model Drift & Silent Degradation; Knowledge / Training Data Poisoning; Prompt Injection (direct); Sensitive Data Leakage; KV-Cache & Inference-State Side Channels

Penetration testing

Penetration test the model inference layer to identify specific adversarial input vulnerabilities.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Knowledge / Training Data Poisoning; Prompt Injection (direct); Sensitive Data Leakage; KV-Cache & Inference-State Side Channels

Real-time input/output classifier guardrails (e.g. Llama Guard / Prompt Guard-style) with circuit-breaker tripwires

Score every prompt and response with an inline safety classifier; trip a circuit breaker on sessions with sustained anomalous scores. Keep thresholds under change control.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5

Lifecycle stage: 4 – Deployment
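The circuit-breaker tripwire described above can be sketched as a sliding window over classifier scores. The classifier itself is out of scope here; the scores are assumed to arrive as floats in [0, 1], and the threshold, window size and trip count are illustrative values that would sit under change control.

```python
from collections import deque

class SessionBreaker:
    """Trip when too many recent classifier scores in a session are anomalous."""

    def __init__(self, score_threshold=0.8, window=5, max_flagged=3):
        self.score_threshold = score_threshold
        self.max_flagged = max_flagged
        self.recent = deque(maxlen=window)  # True where a score was anomalous
        self.tripped = False

    def record(self, score: float) -> bool:
        """Record one score; return True if the session may continue."""
        if self.tripped:
            return False  # stays open until reset out of band
        self.recent.append(score >= self.score_threshold)
        if sum(self.recent) >= self.max_flagged:
            self.tripped = True
        return not self.tripped

breaker = SessionBreaker()
scores = [0.1, 0.9, 0.85, 0.2, 0.95, 0.1]  # hypothetical per-turn safety scores
outcomes = [breaker.record(s) for s in scores]
```

Counting flagged turns within a window, rather than reacting to a single high score, targets the sustained pattern typical of multi-turn attacks while tolerating isolated false positives.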

Adaptive multi-turn red-team harness with automated jailbreak fuzzing

Re-run the jailbreak fuzzing harness on a recurring cadence with newly observed attack techniques added. Escalate threshold breaches for remediation.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7

Lifecycle stage: 5 – Usage, Monitoring & Change

Serving-stack runtime attestation and per-tenant KV/prefix-cache isolation (proposed)

Require measured-boot/runtime attestation of the inference serving binary and partition KV/prefix caches per tenant, closing decode-time serving-layer tampering and co-tenancy timing side channels that artifact weight-hashing cannot detect.

source: Interactive-control reconciliation: ctrl-stack-attestation (partial coverage)

Lifecycle stage: 4 – Deployment
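For the cache-isolation half of the control above, a minimal sketch is to bind the tenant identity into every cache key, so that one tenant's request can never hit another tenant's cached prefix. This covers logical partitioning only and says nothing about attestation. `TenantPrefixCache` and the token and state representations are assumptions.

```python
import hashlib

class TenantPrefixCache:
    """Prefix cache whose keys are bound to the tenant that populated them."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tenant_id: str, prefix_tokens) -> str:
        h = hashlib.sha256()
        h.update(tenant_id.encode())
        h.update(b"\x00")  # separator so tenant and tokens cannot run together
        h.update(",".join(str(t) for t in prefix_tokens).encode())
        return h.hexdigest()

    def put(self, tenant_id: str, prefix_tokens, kv_state) -> None:
        self._store[self._key(tenant_id, prefix_tokens)] = kv_state

    def get(self, tenant_id: str, prefix_tokens):
        return self._store.get(self._key(tenant_id, prefix_tokens))

cache = TenantPrefixCache()
cache.put("tenant-a", [101, 102, 103], "kv-state-for-a")

same_tenant_hit = cache.get("tenant-a", [101, 102, 103])
cross_tenant_hit = cache.get("tenant-b", [101, 102, 103])
```

Because a cross-tenant lookup always misses, a co-tenant cannot use a fast cache hit to infer that another tenant sent the same prefix.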

Real-world cases

Published events that illustrate this risk.

Representation engineering / steering vectors (Zou et al.) · 2023

Model behaviour can be steered by adding direction vectors to activations at inference time, which can be used for control or for covert manipulation.

Heretic: automated LLM abliteration tool · 2025

Heretic automates 'abliteration': it removes an open model's safety refusals by orthogonalizing the refusal direction out of its weights, using an Optuna search to preserve capability. It has produced 4,000+ uncensored models on Hugging Face.

'Grandma exploit' jailbreaks · 2023

Roleplay framings ('my late grandma used to read me…') coaxed chatbots past safety training into producing restricted content.

GCG universal adversarial suffixes (Zou et al.) · 2023

Optimised gibberish suffixes that transfer across models and reliably elicit content the model would otherwise refuse: automated, transferable jailbreaks.

Many-shot jailbreaking (Anthropic) · 2024

Filling a long context with many faux-compliant dialogue examples erodes a model's refusals, and the attack scales with context length.

GTG-1002: first reported AI-orchestrated cyber-espionage campaign (Claude Code) · 2025

Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.

DeepSeek system-prompt extraction via jailbreak (Wallarm) · 2025

Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.

Raine v. OpenAI: first wrongful-death suit alleging ChatGPT acted as a 'suicide coach' · 2025

Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail. He reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)

The Attacker Moves Second: adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.) · 2025

Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.

Adversarial Poetry: universal single-turn jailbreak via verse reframing (Bisconti et al.) · 2025

Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached an average attack-success rate of ~62% (above 90% for some providers), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.

AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1) · 2025

Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.

'Refusal in LLMs Is Mediated by a Single Direction' (Arditi et al.) · 2024

Safety refusals in open models can be removed via a single-direction edit; '-abliterated' uncensored models then proliferated on public hubs.

PoisonGPT (Mithril Security) · 2023

A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.

Sleeper Agents (Hubinger et al., Anthropic) · 2024

Backdoored models that write secure code when the prompt states the year is 2023 but insert vulnerabilities when it states 2024, a behaviour that safety training failed to remove.

A small number of samples can poison LLMs of any size (~250-document backdoor) · 2025

Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters, suggesting that poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.

ClawHavoc: mass poisoning of OpenClaw's ClawHub agent-skill marketplace · 2026

Attackers flooded ClawHub, the skill marketplace for the popular OpenClaw AI agent, with at least 341 malicious 'skills' that tricked agents and users into installing the Atomic macOS Stealer and reverse-shell backdoors.

Malice in Agentland: backdooring agents through the supply chain (Boisvert et al.) · 2026

A research paper (CAIS 2026 best paper) shows that adversaries can plant hidden, trigger-activated backdoors in AI agents by poisoning the data or environment used to build them, including through a novel 'environment poisoning' vector. When triggered, a backdoored agent leaks confidential data more than 80% of the time, past common guardrails.

TeamPCP poisons the LiteLLM AI gateway on PyPI to harvest LLM API keys · 2026

As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions. Their auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus environment variables holding OPENAI_API_KEY/ANTHROPIC_API_KEY, and exfiltrated them to a typosquatted C2. AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.

Other risks in Cyber & Data Security

#35 Unintentional inappropriate or illegal use · #36 Data poisoning · #38 Prompt injection · #39 Re-identification · #40 Data leakage · #41 Model inference attacks · #42 Tool-layer misuse and unintended actions · #43 Inadequate agent identity and authorisation