🔍AI RiskAtlas
← Real-world cases

The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)

Research demonstration10 Oct 2025

In a methodological study (arXiv:2510.09023), Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Florian Tramer and colleagues argue that LLM jailbreak and prompt-injection defenses are routinely evaluated against static attack strings or computationally weak, defense-agnostic optimizers — and that this overstates their robustness. They instead pit each defense against an adaptive attacker that explicitly tailors its strategy to the defense's design, drawing on a unified framework of four attack families: gradient descent, reinforcement learning, random search, and human red-teaming. By tuning and scaling these techniques, they reportedly bypass all 12 recent defenses studied, achieving an attack success rate above 90% for most, even though the majority had originally reported near-zero attack success. The authors note that human creativity remained the single most effective adversarial strategy, and conclude that reliable security claims require adaptive evaluation protocols rather than fixed test sets. The lesson is about how to evaluate the durability of any jailbreak/injection control — relevant to every defense in the atlas — rather than about a single new attack technique. (Figures attributed to the paper; ASR results are the authors' reported findings.)

Practise the risk class — related scenarios

Interactive simulations of the risk class this case illustrates (not a re-enactment of this specific event).

📈The Crescendo

Every message looks innocent — but together they walk the model past its guardrails

📧The Email That Gave Orders

A support email hides instructions — and the assistant obeys them

🪶The Jailbreak in Verse

A refused request, rewritten as a poem — and the model answers

🪡Death by a Thousand Innocent Steps

A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack

🕵️Lies in the Loop

A poisoned issue makes the agent lie to the human who approves its actions

✂️One Character Past the Guard

A single inserted letter makes the guard and the model read the same text differently

🪤The Bug Report That Ran Code

A fake Sentry error report hijacks a developer's coding agent into running a shell command

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

📼The Compromised Flight Recorder

The forensic record is itself the attack surface — an agent's log is poisoned, then quietly rewritten

👁️The Invisible Webpage Command

A shopping page tells the agent to do something the user never asked for

🧠The Memory That Wouldn't Die

A single poisoned document plants a standing instruction that survives every reset

🖼️The Picture That Whispered

A screenshot that's harmless at full size becomes an order once the system shrinks it

🔒The Schema Made Me Do It

A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit

🛡️The Watcher Watched

The eval gate that was supposed to catch the agent is itself the thing being attacked

🪪The Worker Who Spoke for the Boss

A poisoned web page hijacks a research agent — and the planner acts on its behalf

🖼️Zero-Click Leak by Picture

An inbox summary quietly ships a secret to an attacker's server

More cases on Jailbreak

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗