Jailbreak

highInput manipulation

Also known as: safety bypass, DAN-style attacks

Definition

Tricking the AI into ignoring its safety training — through roleplay, hypotheticals, or clever wording — so it produces things it's supposed to refuse.

Where it attaches

The system components this risk arises at.

🧑 User🧠 LLM✂️ Tokenizer🎲 Sampler / Decoder🛡️ Input Guardrail🧯 Output Guardrail

Detection signals

▸ Model produces content from a restricted category
▸ Inputs with unusual encoding, ciphers, or many-shot priming
▸ Persona/roleplay framing in prompts
▸ Drop in refusal rate for flagged topics

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 6

Content safety policy with zero-tolerance thresholds

Define content safety policy at use case design stage. Classify prohibited content types and set zero-tolerance thresholds.

Lifecycle stage1 – Use Case Context & Design

Use of pre-trained models

Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review

Also addressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Content Moderation

Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Live human review for vulnerable-user deployments

Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.

Lifecycle stage5 – Usage, Monitoring & Change

System prompt instructions

Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesOverreliance / Automation Bias

Instruction hierarchy / privileged system promptinteractive

Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.

Also addressesPrompt Injection (direct)Capability / Architecture Disclosure

Detective · 5

Test prioritisation

Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if prohibited outputs pass filter.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

Also addressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Red teaming

Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat bypass as blocking defect.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Input guardrail / injection classifierinteractive

A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.

Also addressesPrompt Injection (direct)Sensitive Data Leakage Distributed / Cross-Agent Jailbreak Capability / Architecture Disclosure Harmful / Non-Consensual Media Generation

Behavioural evals & regression gatinginteractive

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

Also addressesHallucination Model Drift & Silent Degradation Supply-Chain Compromise Distributed / Cross-Agent Jailbreak Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Harmful / Non-Consensual Media Generation Training-Data Rights & Provenance

Runtime monitoring & anomaly detectioninteractive

Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.

Corrective · 1

User feedback and iterative improvement

Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.

Lifecycle stage5 – Usage, Monitoring & Change

Open these in the Control Library →

Framework mappings

OWASP LLM Top 10

LLM01:2025 Prompt Injection

MITRE ATLAS

AML.T0054 LLM Jailbreak

NIST AI RMF

MEASURE 2.7

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

'Grandma exploit' jailbreaks2023

Roleplay framings ('my late grandma used to read me…') coaxed chatbots past safety training into producing restricted content.

GCG universal adversarial suffixes (Zou et al.)2023

Optimised gibberish suffixes that transfer across models to reliably elicit refused content — automated, transferable jailbreaks.

Many-shot jailbreaking (Anthropic)2024

Filling a long context with many faux-compliant dialogue examples erodes a model's refusals — an attack that scales with context length.

GTG-1002 — first reported AI-orchestrated cyber-espionage campaign (Claude Code)2025

Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.

DeepSeek system-prompt extraction via jailbreak (Wallarm)2025

Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.

Raine v. OpenAI — first wrongful-death suit alleging ChatGPT acted as a 'suicide coach'2025

Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail — while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)

The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)2025

Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.

Adversarial Poetry — universal single-turn jailbreak via verse reframing (Bisconti et al.)2025

Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.

AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1)2025

Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.

Browse all real-world cases →

Practise this in an interactive scenario

📈The Crescendo

Every message looks innocent — but together they walk the model past its guardrails

🪶The Jailbreak in Verse

A refused request, rewritten as a poem — and the model answers

🪡Death by a Thousand Innocent Steps

A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack

✂️One Character Past the Guard

A single inserted letter makes the guard and the model read the same text differently

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

🔒The Schema Made Me Do It

A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit

Jailbreak

Definition

Where it attaches

Detection signals

Controls & guardrails that address this

Framework mappings

Real-world cases

Practise this in an interactive scenario

Related risks