Many-shot jailbreaking (Anthropic)
Research demonstration · 02 Apr 2024 · Conversational Assistant
Filling a long context with many faux-compliant dialogue examples erodes a model's refusals: an attack that scales with context length.
Root cause — why it happened
A chatbot is taught to refuse harmful requests. But it also learns from examples right inside the conversation — show it a pattern and it tends to continue it. Anthropic showed that if you paste in a long fake conversation where the assistant happily answers lots of harmful questions, then ask one more, the model is much more likely to go along with it. The trick is just volume: the more fake examples you pile in, the weaker the refusal gets — and big modern context windows leave room for a lot of examples.
Risks this case illustrates
Named in the standard (OWASP/ATLAS/NIST) lens.
How it unfolded
A long-context chatbot with ordinary safety training
We start with a normal chatbot. It has hidden standing instructions telling it to be helpful but to refuse dangerous requests, and it can read a very large amount of text at once. That big reading capacity is a genuine feature — but it is also what this attack abuses.
Controls & guardrails — what would have stopped it
What helped most was screening and reshaping the message before the chatbot reads it: catching a long wall of fake question-and-answer examples and defusing it. Simply retraining the model to resist the trick only delayed it, because the attack still worked once enough examples were added. There is also a real trade-off: the more text you let the model read at once, the more room an attacker has, so you can't fully fix this without touching the long-context feature people like.
- Instruction hierarchy / privileged system prompt (addresses: Jailbreak)
Behavioural, not enforced. There is no hard barrier between privilege levels inside the token stream — only a trained disposition that can be overcome.
- Delimiting / spotlighting of untrusted content
A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.
- Input guardrail / injection classifier (addresses: Jailbreak)
It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
- Behavioural evals & regression gating (addresses: Jailbreak)
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
- Runtime monitoring & anomaly detection (addresses: Jailbreak)
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
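The input-screening idea described above can be illustrated with a small heuristic. This is only a sketch under stated assumptions, not Anthropic's mitigation: the speaker labels, the threshold `MAX_EMBEDDED_TURNS`, and the function names are all hypothetical. A production guardrail would more likely be a trained classifier, and a determined attacker can evade a fixed pattern like this one.

```python
import re

# Hypothetical speaker labels that faux dialogue turns might start with.
TURN_LABEL = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)

# Assumed threshold; it would need tuning against real traffic.
MAX_EMBEDDED_TURNS = 8


def count_embedded_turns(message: str) -> int:
    """Count lines in a single message that begin with a speaker label."""
    return len(TURN_LABEL.findall(message))


def screen_message(message: str, max_turns: int = MAX_EMBEDDED_TURNS) -> dict:
    """Flag one user message that embeds many dialogue-style turns."""
    turns = count_embedded_turns(message)
    return {"embedded_turns": turns, "flagged": turns > max_turns}


if __name__ == "__main__":
    # Benign placeholder content only; the structure is what gets detected.
    wall = "\n".join(
        f"User: placeholder question {i}\nAssistant: placeholder answer {i}"
        for i in range(20)
    )
    print(screen_message("What is the capital of France?"))
    print(screen_message(wall))
```

A flagged message could be blocked, truncated, or routed to a stronger classifier. As the caveats in the list above note, a screen like this is one layer and should not be the only thing between input and a dangerous action.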
Lessons
- Refusal is a shallow learned disposition over the same token stream as everything else; enough in-context demonstrations can out-vote it. This is a jailbreak, not a prompt injection.
- The attack's effectiveness scales with the number of in-context shots, so longer context windows enlarge the attack surface: a capability and a risk are the same feature.
- Anthropic reported that classifying/transforming the prompt before the model sees it cut attack success substantially, whereas naive safety fine-tuning only delayed the onset of the scaling trend.
- Measure susceptibility as a curve (success vs shot count), not a pass/fail: standing red-team evals quantify residual risk and gate context-length and model changes.
- There is an honest trade-off: defences that cap or partition effective in-context conditioning blunt many-shot jailbreaks but constrain the long-context utility that motivates large windows.
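One way to treat susceptibility as a curve rather than a pass/fail result is to fit a power law to per-shot-count measurements and compare the fitted curves across model or context-length changes. The sketch below assumes the simplified form value = c · n^(−α), where n is the shot count and the measured value falls as the attack gets stronger (for example, the negative log-likelihood of a compliant answer). It omits any constant offset, the function names are hypothetical, and the data points are synthetic, not results from the paper.

```python
import math


def fit_power_law(shots, values):
    """Least-squares fit of log(value) = log(c) - alpha * log(shots).

    Returns (c, alpha) for the assumed form value = c * shots ** (-alpha).
    """
    xs = [math.log(n) for n in shots]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def shots_to_reach(c, alpha, target):
    """Shot count at which the fitted curve falls to the target value."""
    return (c / target) ** (1.0 / alpha)


if __name__ == "__main__":
    # Synthetic measurements generated from c = 4, alpha = 0.5.
    shots = [1, 4, 16, 64, 256]
    values = [4.0, 2.0, 1.0, 0.5, 0.25]
    c, alpha = fit_power_law(shots, values)
    print(f"c={c:.3f}, alpha={alpha:.3f}")
    print(f"shots to reach 0.5: {shots_to_reach(c, alpha, 0.5):.1f}")
```

A regression gate could then compare `shots_to_reach` for a candidate model against a baseline and block the change if the candidate reaches the same threshold with noticeably fewer shots. The acceptable margin is a policy choice this sketch does not prescribe.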
Sources
- Many-shot jailbreaking (Anthropic Research): original disclosure; attack scales with shot count, prompt-based mitigations most effective.
- Many-shot Jailbreaking (paper PDF, Anthropic, 2024-04-02): power-law-like scaling of attack success with number of in-context demonstrations.