Case study

Many-shot jailbreaking (Anthropic)

Research demonstration02 Apr 2024🗺️ Conversational Assistant

Filling a long context with many faux-compliant dialogue examples erodes a model's refusals — an attack that scales with context length.

Root cause — why it happened

A chatbot is taught to refuse harmful requests. But it also learns from examples right inside the conversation — show it a pattern and it tends to continue it. Anthropic showed that if you paste in a long fake conversation where the assistant happily answers lots of harmful questions, then ask one more, the model is much more likely to go along with it. The trick is just volume: the more fake examples you pile in, the weaker the refusal gets — and big modern context windows leave room for a lot of examples.

Risks this case illustrates

Jailbreak

Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.

How it unfolded

← / → to step · click a component to inspect

InstructionsDataActionsControl / decisionFeedback / logs

👆 Click a component to inspect its risks

SetupStep 1 / 6

A long-context chatbot with ordinary safety training

We start with a normal chatbot. It has hidden standing instructions telling it to be helpful but to refuse dangerous requests, and it can read a very large amount of text at once. That big reading capacity is a genuine feature — but it is also what this attack abuses.

Step 1 / 6

Controls & guardrails — what would have stopped it

The thing that actually helped most was screening and reshaping the message before the chatbot reads it — catching a long wall of fake question-and-answer examples and defusing it. Simply retraining the model to resist the trick only delayed it. But there is a real trade-off: the bigger the chunk of text you let the AI read at once, the more room an attacker has, so you can't fully fix this without touching the long-context feature people like.

Preventive

Instruction hierarchy / privileged system prompt
addressesJailbreak
Behavioural, not enforced. There is no hard barrier between privilege levels inside the token stream — only a trained disposition that can be overcome.
Delimiting / spotlighting of untrusted content
A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.

Detective

Input guardrail / injection classifier
addressesJailbreak
It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
Behavioural evals & regression gating
addressesJailbreak
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
Runtime monitoring & anomaly detection
addressesJailbreak
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

Corrective

Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

All guardrails for Jailbreak →

Lessons

▸ Refusal is a shallow learned disposition over the same token stream as everything else; enough in-context demonstrations can out-vote it — this is a jailbreak, not a prompt injection.
▸ The attack's effectiveness scales with the number of in-context shots, so longer context windows enlarge the attack surface: a capability and a risk are the same feature.
▸ Anthropic reported that classifying/transforming the prompt before the model sees it cut attack success substantially, whereas naive safety fine-tuning only delayed the onset of the scaling trend.
▸ Measure susceptibility as a curve (success vs shot count), not a pass/fail: standing red-team evals quantify residual risk and gate context-length and model changes.
▸ There is an honest trade-off — defences that cap or partition effective in-context conditioning blunt many-shot jailbreaks but constrain the long-context utility that motivates large windows.

Sources

Many-shot jailbreaking — Anthropic Research ↗
Many-shot Jailbreaking (paper PDF, Anthropic, 2024-04-02) ↗
Many-shot jailbreaking — Anthropic Research ↗ — Original disclosure; attack scales with shot count, prompt-based mitigations most effective.
Many-shot Jailbreaking (paper PDF, Anthropic, 2024-04-02) ↗ — Power-law-like scaling of attack success with number of in-context demonstrations.

Practise the risk class — related scenarios

📈The Crescendo

Every message looks innocent — but together they walk the model past its guardrails

🪶The Jailbreak in Verse

A refused request, rewritten as a poem — and the model answers

🪡Death by a Thousand Innocent Steps

A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack

✂️One Character Past the Guard

A single inserted letter makes the guard and the model read the same text differently

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

🔒The Schema Made Me Do It

A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit