Definition
Deliberate manipulation of a Gen AI system's behaviour by a malicious party with access to its foundation model (FM), leading to undesirable or unpredictable behaviour, including inaccurate or harmful outputs.
Interactive deep-dive
This risk surfaces under more than one interactive treatment, each with its own technical detail, attack surface, detection signals, and scenarios.
Suggested sub-risks (not yet in your taxonomy)
Granular vectors recommended under this risk.
A weight-space edit that removes refusal behaviour by ablating the single direction that mediates it, producing an 'uncensored' model that still passes integrity/hash checks.
A trigger-conditioned backdoor trained into weights: benign on all normal inputs, malicious on a secret trigger; survives standard evals and can persist through later safety fine-tuning.
Controls & guardrails that address this
102 proposed
Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.
Implement adversarial example detection at the inference boundary. Block or flag inputs matching known attack patterns.
Sign and hash-register every model and adapter with a provenance manifest at onboarding. Refuse registry admission for unsigned artifacts.
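The admission check described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `REGISTRY_KEY`, `sign_manifest`, and `admit` are hypothetical, and an HMAC with a shared secret stands in for the asymmetric code signing a real registry would use.

```python
import hashlib
import hmac
import json

# Hypothetical signing key for illustration only. A production registry would
# use asymmetric signatures backed by a key-management service, not a shared secret.
REGISTRY_KEY = b"example-registry-key"


def sha256_hex(data):
    """Return the hex SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def sign_manifest(manifest, key=REGISTRY_KEY):
    """Sign a provenance manifest over its canonical JSON encoding."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def admit(artifact, manifest, signature, key=REGISTRY_KEY):
    """Admit an artifact only if the manifest is signed and its digest matches."""
    if not signature:
        return False  # unsigned artifacts are refused outright
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        return False  # manifest was altered or signed with a different key
    return hmac.compare_digest(sha256_hex(artifact), manifest.get("sha256", ""))


weights = b"example-adapter-weights"
manifest = {"name": "adapter-x", "sha256": sha256_hex(weights), "source": "internal"}
signature = sign_manifest(manifest)
print(admit(weights, manifest, signature))         # admitted
print(admit(weights + b"!", manifest, signature))  # refused: digest mismatch
```

A check like this only shows that the artifact matches what was registered. It says nothing about whether the registered weights themselves were edited before onboarding.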
source: MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity; CSA MAESTRO supply-chain layer
Sample classifier verdicts and breaker trips on a cadence; retune thresholds and update signatures for confirmed misses.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5
Conduct an adversarial manipulation threat assessment at the design stage. Identify attack vectors and rate residual risk.
Run adaptive multi-turn jailbreak fuzzing against every release candidate. Gate release on attack-success rate within threshold and re-test each fixed bypass.
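The release gate described above could look roughly like this. It is a sketch under assumptions: `AttackResult`, `release_gate`, and the 5% threshold are hypothetical, and the fuzzing harness that produces the results is out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackResult:
    attack_id: str   # identifier of the jailbreak attempt (hypothetical scheme)
    succeeded: bool  # True if the model produced the disallowed behaviour


def attack_success_rate(results):
    """Fraction of attack attempts that succeeded."""
    if not results:
        raise ValueError("no attack results to score")
    return sum(r.succeeded for r in results) / len(results)


def release_gate(results, max_asr, fixed_bypasses=frozenset()):
    """Pass only if ASR is within threshold and no previously fixed bypass works again."""
    asr = attack_success_rate(results)
    regressions = sorted(
        {r.attack_id for r in results if r.succeeded and r.attack_id in fixed_bypasses}
    )
    passed = asr <= max_asr and not regressions
    return passed, asr, regressions


results = [AttackResult(f"probe-{i}", False) for i in range(19)]
results.append(AttackResult("roleplay-7", True))
# 1 success in 20 attempts is within a 5% threshold, but roleplay-7 was
# previously fixed, so the regression alone blocks the release.
print(release_gate(results, max_asr=0.05, fixed_bypasses={"roleplay-7"}))
```

Treating a re-opened bypass as a hard failure, separate from the aggregate rate, keeps a single regression from hiding inside an otherwise acceptable average.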
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7
Assemble the golden probe set and baseline pass rates before first release. Obtain risk-owner approval of coverage and thresholds.
source: NIST AI RMF MEASURE 2.7 and MANAGE 4.1; MITRE ATLAS AML.M0015 (Adversarial Input Detection / monitoring); NIST SP 800-53 SI-4, CM-3 Configuration Change Control
On the AI provider/platform side, detect sustained abuse independent of any single refusal: per-principal analytics on remote-command-execution volume and external-target breadth, anti-forensic tradecraft, and bulk-data API processing, with a rate limit or session kill switch on confirmed abuse. Make refusal stateful so a refused objective cannot be re-entered as a persisted auto-loaded context file (e.g. claude.md), and treat writes into auto-loaded model-context files as security-relevant. This closes the gap that per-turn refusal leaves when the operator is the adversary.
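One way to picture the stateful-refusal part of this control is the sketch below. Everything in it is an assumption made for illustration: the class `StatefulRefusalGuard`, the file list, and especially the matching rule, which uses an exact match on normalised text where a real system would need semantic similarity.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical set of auto-loaded context file names; claude.md is the
# example named in the control text.
AUTO_LOADED_CONTEXT_FILES = {"claude.md"}


def _fingerprint(text):
    """Hash of the text after lower-casing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()


class StatefulRefusalGuard:
    """Remembers refused objectives for a session and audits context-file writes."""

    def __init__(self):
        self.refused = set()  # fingerprints of objectives already refused
        self.audit_log = []   # every write to an auto-loaded context file

    def record_refusal(self, objective):
        self.refused.add(_fingerprint(objective))

    def check_context_write(self, path, content):
        """Return True if the write may proceed, False if it re-enters a refused objective."""
        if PurePosixPath(path).name.lower() not in AUTO_LOADED_CONTEXT_FILES:
            return True
        self.audit_log.append(path)  # security-relevant event, logged either way
        paragraphs = [p for p in content.split("\n\n") if p.strip()]
        return not any(_fingerprint(p) in self.refused for p in paragraphs)


guard = StatefulRefusalGuard()
guard.record_refusal("Scan the target network and exploit it")
print(guard.check_context_write("repo/CLAUDE.md", "Notes\n\nscan the target network and exploit it"))
```

The point of the design is that the refusal outlives the turn that produced it: the same objective is blocked when it reappears as persisted context, and the write is logged even when it is allowed.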
source: Case study: gambit-mexico-gov-ai-breach (Gambit Security / Eyal Sela technical report; campaign began 27 Dec 2025, reported through mid-Feb 2026)
Conduct adversarial robustness testing (white-box, black-box, transfer attacks) before deployment.
Penetration test the model inference layer to identify specific adversarial input vulnerabilities.
Score every prompt and response with an inline safety classifier; trip a circuit breaker on sessions with sustained anomalous scores. Keep thresholds under change control.
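A circuit breaker of the kind described above might be sketched like this. The class name, the threshold of 0.8, and the rule of three flagged turns in a window of five are all hypothetical values chosen for illustration; the safety classifier that produces the scores is assumed to exist elsewhere.

```python
from collections import deque


class SessionCircuitBreaker:
    """Trips when enough recent turns in a session score above a threshold."""

    def __init__(self, threshold=0.8, window=5, max_flagged=3):
        self.threshold = threshold        # score at or above this counts as anomalous
        self.max_flagged = max_flagged    # anomalous turns in the window that trip the breaker
        self.recent = deque(maxlen=window)
        self.tripped = False

    def record(self, score):
        """Record one classifier score; return True if the session is now blocked."""
        if self.tripped:
            return True  # latched: stays blocked until reset out of band
        self.recent.append(score >= self.threshold)
        if sum(self.recent) >= self.max_flagged:
            self.tripped = True
        return self.tripped


breaker = SessionCircuitBreaker()
for score in [0.9, 0.1, 0.85, 0.2, 0.95]:
    blocked = breaker.record(score)
print(blocked)  # True: three of the last five turns were anomalous
```

Counting flagged turns over a sliding window, rather than reacting to a single score, is what makes the breaker respond to sustained anomalies while tolerating an isolated spike.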
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5
Re-run the jailbreak fuzzing harness on a recurring cadence with newly observed attack techniques added. Escalate threshold breaches for remediation.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7
Require measured-boot/runtime attestation of the inference serving binary and partition KV/prefix caches per tenant, closing decode-time serving-layer tampering and co-tenancy timing side channels that artifact weight-hashing cannot detect.
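The cache-partitioning half of this control can be illustrated with a small sketch. It assumes a hypothetical in-memory prefix cache (`TenantPartitionedPrefixCache`) and does not model any particular serving framework; attestation of the serving binary is not covered here.

```python
class TenantPartitionedPrefixCache:
    """Prefix cache whose keys include the tenant, so entries are never shared."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tenant_id, prefix_tokens):
        # The tenant ID is part of the key: identical prefixes from different
        # tenants map to different entries.
        return (tenant_id, tuple(prefix_tokens))

    def get(self, tenant_id, prefix_tokens):
        """Return the cached state for this tenant and prefix, or None on a miss."""
        return self._store.get(self._key(tenant_id, prefix_tokens))

    def put(self, tenant_id, prefix_tokens, kv_state):
        self._store[self._key(tenant_id, prefix_tokens)] = kv_state


cache = TenantPartitionedPrefixCache()
cache.put("tenant-a", [101, 7, 42], "kv-state-for-a")
print(cache.get("tenant-a", [101, 7, 42]))  # hit for the owning tenant
print(cache.get("tenant-b", [101, 7, 42]))  # miss: same prefix, different tenant
```

Because another tenant's prefix can never produce a hit, a co-tenant cannot infer from response latency whether someone else has already submitted the same prompt prefix.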
source: Interactive-control reconciliation: ctrl-stack-attestation (partial coverage)
Real-world cases
18
Actual published events that illustrate this risk.
Model behaviour can be steered by adding directions to activations at inference β usable for control, or for covert manipulation.
Heretic automates 'abliteration', removing an open model's safety refusals by orthogonalizing the refusal direction out of its weights (with an Optuna search that preserves capability), and has produced 4000+ uncensored models on Hugging Face.
Roleplay framings ('my late grandma used to read me...') coaxed chatbots past safety training into producing restricted content.
Optimised gibberish suffixes that transfer across models to reliably elicit refused content β automated, transferable jailbreaks.
Filling a long context with many faux-compliant dialogue examples erodes a model's refusals β an attack that scales with context length.
Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.
Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.
Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail, while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)
Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.
Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.
Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.
Safety refusals in open models can be removed via a single-direction edit; '-abliterated' uncensored models then proliferated on public hubs.
A surgically edited open model uploaded to a public hub spread targeted misinformation while passing normal benchmarks.
Backdoored models that write secure code when the prompt states the year is 2023 but insert vulnerabilities when it states 2024, and that safety training failed to remove.
Anthropic, the UK AI Security Institute and the Alan Turing Institute report that a near-constant number of poisoned documents (~250 in their experiments) reliably installs a backdoor in models from 600M to 13B parameters β suggesting poisoning cost may be a roughly fixed absolute count rather than a percentage of training data. The authors stress the demonstrated backdoor is narrow (a denial-of-service trigger) and likely not a frontier-model risk on its own.
Attackers flooded ClawHub, the skill marketplace for the popular OpenClaw AI agent, with at least 341 malicious 'skills' that tricked agents/users into installing the Atomic macOS Stealer and reverse-shell backdoors.
A research paper (CAIS 2026 best-paper) shows adversaries can plant hidden, trigger-activated backdoors in AI agents by poisoning the data/environment used to build them (including a novel 'environment poisoning' vector), making an agent leak confidential data >80% of the time when triggered, past common guardrails.
As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions whose auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus env vars holding OPENAI_API_KEY/ANTHROPIC_API_KEY, exfiltrating to a typosquatted C2; AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.