🔍AI RiskAtlas
← Real-world cases
Case study

Many-shot jailbreaking (Anthropic)

Research demonstration02 Apr 2024🗺️ Conversational Assistant

Filling a long context with many faux-compliant dialogue examples erodes a model's refusals — an attack that scales with context length.

Root cause — why it happened

A chatbot is taught to refuse harmful requests. But it also learns from examples right inside the conversation — show it a pattern and it tends to continue it. Anthropic showed that if you paste in a long fake conversation where the assistant happily answers lots of harmful questions, then ask one more, the model is much more likely to go along with it. The trick is just volume: the more fake examples you pile in, the weaker the refusal gets — and big modern context windows leave room for a lot of examples.

Risks this case illustrates

Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.

How it unfolded

Your systemUntrustedaskscontext🧑User💬Chat / AppInterface🛡️Input Guardrail🧩Prompt Assembly🧠LLM🧯OutputGuardrail🧑Attacker(prompt author)🧯MSJ eval /red-team
InstructionsDataActionsControl / decisionFeedback / logs
👆 Click a component to inspect its risks
SetupStep 1 / 6

A long-context chatbot with ordinary safety training

We start with a normal chatbot. It has hidden standing instructions telling it to be helpful but to refuse dangerous requests, and it can read a very large amount of text at once. That big reading capacity is a genuine feature — but it is also what this attack abuses.

Step 1 / 6

Controls & guardrails — what would have stopped it

The thing that actually helped most was screening and reshaping the message before the chatbot reads it — catching a long wall of fake question-and-answer examples and defusing it. Simply retraining the model to resist the trick only delayed it. But there is a real trade-off: the bigger the chunk of text you let the AI read at once, the more room an attacker has, so you can't fully fix this without touching the long-context feature people like.

Preventive
  • Instruction hierarchy / privileged system prompt
    addressesJailbreak

    Behavioural, not enforced. There is no hard barrier between privilege levels inside the token stream — only a trained disposition that can be overcome.

  • Delimiting / spotlighting of untrusted content

    A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.

Detective
  • Input guardrail / injection classifier
    addressesJailbreak

    It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.

  • Behavioural evals & regression gating
    addressesJailbreak

    Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

  • Runtime monitoring & anomaly detection
    addressesJailbreak

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

Lessons

  • Refusal is a shallow learned disposition over the same token stream as everything else; enough in-context demonstrations can out-vote it — this is a jailbreak, not a prompt injection.
  • The attack's effectiveness scales with the number of in-context shots, so longer context windows enlarge the attack surface: a capability and a risk are the same feature.
  • Anthropic reported that classifying/transforming the prompt before the model sees it cut attack success substantially, whereas naive safety fine-tuning only delayed the onset of the scaling trend.
  • Measure susceptibility as a curve (success vs shot count), not a pass/fail: standing red-team evals quantify residual risk and gate context-length and model changes.
  • There is an honest trade-off — defences that cap or partition effective in-context conditioning blunt many-shot jailbreaks but constrain the long-context utility that motivates large windows.

Sources

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗