The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)
Research demonstration10 Oct 2025In a methodological study (arXiv:2510.09023), Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Florian Tramer and colleagues argue that LLM jailbreak and prompt-injection defenses are routinely evaluated against static attack strings or computationally weak, defense-agnostic optimizers — and that this overstates their robustness. They instead pit each defense against an adaptive attacker that explicitly tailors its strategy to the defense's design, drawing on a unified framework of four attack families: gradient descent, reinforcement learning, random search, and human red-teaming. By tuning and scaling these techniques, they reportedly bypass all 12 recent defenses studied, achieving an attack success rate above 90% for most, even though the majority had originally reported near-zero attack success. The authors note that human creativity remained the single most effective adversarial strategy, and conclude that reliable security claims require adaptive evaluation protocols rather than fixed test sets. The lesson is about how to evaluate the durability of any jailbreak/injection control — relevant to every defense in the atlas — rather than about a single new attack technique. (Figures attributed to the paper; ASR results are the authors' reported findings.)
Risks it illustrates
Practise the risk class — related scenarios
Interactive simulations of the risk class this case illustrates (not a re-enactment of this specific event).
Every message looks innocent — but together they walk the model past its guardrails
A support email hides instructions — and the assistant obeys them
A refused request, rewritten as a poem — and the model answers
A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack
A poisoned issue makes the agent lie to the human who approves its actions
A single inserted letter makes the guard and the model read the same text differently
A fake Sentry error report hijacks a developer's coding agent into running a shell command
The safety guard is itself a trained model — and someone poisoned its lessons
The forensic record is itself the attack surface — an agent's log is poisoned, then quietly rewritten
A shopping page tells the agent to do something the user never asked for
A single poisoned document plants a standing instruction that survives every reset
A screenshot that's harmless at full size becomes an order once the system shrinks it
A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit
The eval gate that was supposed to catch the agent is itself the thing being attacked
A poisoned web page hijacks a research agent — and the planner acts on its behalf
An inbox summary quietly ships a secret to an attacker's server