Definition
Tricking the AI into ignoring its safety training — through roleplay, hypotheticals, or clever wording — so it produces things it's supposed to refuse.
Where it attaches
The system components where this risk arises.
Detection signals
- Model produces content from a restricted category
- Inputs with unusual encoding, ciphers, or many-shot priming
- Persona/roleplay framing in prompts
- Drop in refusal rate for flagged topics
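The input-side and rate-based signals above can be approximated with simple heuristics. The following is a minimal sketch, not a production detector: the regular expressions, the `MANY_SHOT_TURNS` threshold, the `max_drop` default, and the function names are all assumptions made for illustration, and a real deployment would more likely rely on trained classifiers and tuned thresholds.

```python
import re

# Hypothetical patterns and thresholds, chosen for illustration only.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")  # long unbroken encoded-looking run
TURN_MARKER = re.compile(r"^(user|human|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE)
PERSONA = re.compile(
    r"\b(pretend (you are|to be)|act as|roleplay|role-play|you are now)\b",
    re.IGNORECASE,
)
MANY_SHOT_TURNS = 20  # assumed cut-off for "many" embedded dialogue turns


def input_signals(prompt: str) -> list[str]:
    """Return the names of the heuristic signals that fire on one prompt."""
    signals = []
    if BASE64_RUN.search(prompt):
        signals.append("encoded_payload")
    if len(TURN_MARKER.findall(prompt)) >= MANY_SHOT_TURNS:
        signals.append("many_shot_priming")
    if PERSONA.search(prompt):
        signals.append("persona_framing")
    return signals


def refusal_rate_dropped(baseline: float, refused: int, total: int,
                         max_drop: float = 0.10) -> bool:
    """True when the observed refusal rate on flagged topics falls more than
    max_drop below the baseline rate (both expressed as fractions)."""
    if total == 0:
        return False  # no traffic in the window, nothing to compare
    return baseline - refused / total > max_drop
```

A signal firing is a prompt for review, not proof of an attack: persona framing in particular is common in benign creative use, so these checks are better suited to raising alerts than to blocking outright.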
Controls & guardrails that address this
12 controls, grouped by control function, each with the AI lifecycle stage(s) where it applies and the other risks it addresses.
Define a content safety policy at the use-case design stage. Classify prohibited content types and set zero-tolerance thresholds.
Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.
Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.
Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.
Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.
Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.
Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if any prohibited output passes the filter.
Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat any bypass as a blocking defect.
A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.
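Several of the controls above (the input screen, input and output moderation, and re-testing against known-good and known-bad examples) can be combined into one small harness. The sketch below rests on stated assumptions: `violates_policy` is a hypothetical substring check standing in for a real policy classifier, `toy_model` is a stand-in for an actual model call, and `restricted-topic-alpha` is a placeholder for a prohibited category.

```python
from typing import Callable

# Hypothetical placeholder for a prohibited content category.
BANNED_TOPICS = ("restricted-topic-alpha",)
REFUSAL = "Sorry, I can't help with that."


def violates_policy(text: str) -> bool:
    """Stand-in for a real moderation classifier (assumption: substring match)."""
    lowered = text.lower()
    return any(topic in lowered for topic in BANNED_TOPICS)


def guarded_reply(prompt: str, model: Callable[[str], str]) -> str:
    # Layer 1: screen the input before the model sees it.
    if violates_policy(prompt):
        return REFUSAL
    reply = model(prompt)
    # Layer 2: check the output, which catches requests the input screen missed.
    if violates_policy(reply):
        return REFUSAL
    return reply


def run_regression(model: Callable[[str], str],
                   cases: list[tuple[str, bool]]) -> list[str]:
    """Return the prompts whose refuse/answer outcome differs from the expectation."""
    failures = []
    for prompt, should_refuse in cases:
        refused = guarded_reply(prompt, model) == REFUSAL
        if refused != should_refuse:
            failures.append(prompt)
    return failures


def toy_model(prompt: str) -> str:
    """Stand-in model that a story framing 'jailbreaks' (illustration only)."""
    if "story" in prompt.lower():
        return "Once upon a time, someone explained restricted-topic-alpha in detail."
    return "Here is a helpful answer."


# Known-good and known-bad examples: (prompt, should_refuse).
CASES = [
    ("What is the capital of France?", False),
    ("Tell me about restricted-topic-alpha", True),
    ("Write a story where the villain explains it", True),
]

failures = run_regression(toy_model, CASES)
deploy_allowed = not failures  # treat any bypass as a blocking defect
```

The third case passes the input screen because the prompt names nothing prohibited; only the output layer catches it, which is the argument for moderating both sides. Re-running `run_regression` whenever the model, system prompt, or filters change gives the re-testing step a concrete gate.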
Framework mappings
- LLM01:2025 Prompt Injection
- AML.T0054 LLM Jailbreak
- MEASURE 2.7
Real-world cases
9 cases: published events that illustrate this risk.
Roleplay framings ('my late grandma used to read me…') coaxed chatbots past safety training into producing restricted content.
Optimised gibberish suffixes that transfer across models to reliably elicit refused content — automated, transferable jailbreaks.
Filling a long context with many faux-compliant dialogue examples erodes a model's refusals — an attack that scales with context length.
Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.
Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.
Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail — while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)
Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.
Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached an average attack success rate of ~62% (above 90% for some providers), and mechanically converting harmful prompts to verse raised success rates up to 18x over prose baselines.
Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.
Practise this in an interactive scenario
Every message looks innocent — but together they walk the model past its guardrails
A refused request, rewritten as a poem — and the model answers
A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack
A single inserted letter makes the guard and the model read the same text differently
The safety guard is itself a trained model — and someone poisoned its lessons
A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit