Safe in Isolation, Dangerous Together — agent-driven multi-turn decomposition jailbreak
Research demonstration · 31 Jul 2025 · Multi-Agent System
Srivastav & Zhang (REALM 2025) showed a role-based multi-agent framework that splits a harmful request into individually-benign sub-questions, answers each separately, then reassembles the fragments into prohibited content, reportedly exceeding 90% attack success across three models.
Root cause — why it happened
Many safety filters check each question and each answer on its own: 'is THIS message harmful?' Srivastav and Zhang showed that this can be defeated by never asking the harmful question at all. Instead, one AI agent acts as a 'splitter' (the paper's Question Decomposer): it breaks a forbidden goal into a handful of small, innocent-sounding questions, none of which looks dangerous by itself. Answering agents (the Sub-Question Answerer role) handle each small question, and every answer passes the safety check because, in isolation, it really is harmless. A final 'combiner' agent (the Answer Combiner) then glues the pieces back together into exactly the prohibited content the attacker wanted. The danger was never in any single message. It lived in the COMBINATION, and no part of the system was set up to look at the whole, reassembled intent. In their experiments the attack reportedly succeeded more than 90% of the time across three models: GPT-3.5-Turbo, Gemma-2-9B and Mistral-7B.
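The gap between correct local verdicts and a violated global property can be illustrated with a toy model. This is only a sketch: the `Fragment` type, the numeric risk scores, the `THRESHOLD` value and the noisy-OR aggregate are all assumptions made for illustration, not anything measured or defined in the paper, and real classifiers do not expose scores that compose this neatly.

```python
from dataclasses import dataclass
from typing import Sequence

THRESHOLD = 0.5  # hypothetical refusal threshold, not a value from the paper


@dataclass(frozen=True)
class Fragment:
    label: str
    risk: float  # hypothetical per-message risk score in [0, 1]


def passes_per_message(fragments: Sequence[Fragment],
                       threshold: float = THRESHOLD) -> bool:
    """Local check: each fragment is judged alone against the threshold."""
    return all(f.risk < threshold for f in fragments)


def combined_risk(fragments: Sequence[Fragment]) -> float:
    """Toy aggregate: 1 - product(1 - risk), a noisy-OR over all fragments."""
    remaining = 1.0
    for f in fragments:
        remaining *= 1.0 - f.risk
    return 1.0 - remaining


# Placeholder fragments: labels only, no real content.
fragments = [
    Fragment("sub-question A", 0.20),
    Fragment("sub-question B", 0.25),
    Fragment("sub-question C", 0.30),
]

locally_safe = passes_per_message(fragments)           # every piece is sub-threshold
globally_safe = combined_risk(fragments) < THRESHOLD   # the union is not
print(locally_safe, globally_safe)
```

Under these assumed numbers every fragment clears the per-message check while the aggregate does not, which is the 'safe in isolation, dangerous together' pattern in miniature.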
Risks this case illustrates
Named in the standard (OWASP/ATLAS/NIST) lens.
How it unfolded
A harmful objective enters the pipeline
A user gives the system a goal that, asked directly, every model would refuse. Instead of asking it straight, the user feeds it into a special multi-agent pipeline whose whole purpose is to take it apart and put it back together in a way that slips past the safety checks. Nothing harmful has been produced yet — but the plan is set.
OBJECTIVE (O): [a request a single aligned model would refuse outright]
Mode: agent-driven decomposition
Goal: obtain the restricted result WITHOUT ever presenting O,
or any sub-question, that trips a per-message safety filter.
# harmful specifics intentionally omitted — this is a mechanism writeup
Controls & guardrails — what would have stopped it
Nothing that checks one message at a time can stop this. The paper's results illustrate why: every single message passed its check. What breaks the chain is a reviewer that looks at the WHOLE job: the splitter's plan and the final combined answer, judged together. If a component asks 'do all these innocent pieces add up to something we'd refuse?' and can stop the combined result (or send it to a human) when the answer is yes, the attack fails. Splitting the work no longer hides the intent, because the intent is finally examined as a whole.
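A whole-job reviewer of this kind might be sketched as follows. Everything here is hypothetical: `Job`, `Verdict`, `review_whole_job`, the two thresholds and especially `stub_judge` are illustrative names and values, and the stub merely counts placeholder marker words so the example can run. In practice the judge would be a classifier or model call that sees the plan and the combined output together.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Verdict(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # hand the job to a human reviewer
    BLOCK = "block"


@dataclass(frozen=True)
class Job:
    sub_questions: Sequence[str]  # the splitter's plan
    combined_answer: str          # the combiner's final output


# A judge maps the text of the whole job to a risk score in [0, 1].
Judge = Callable[[str], float]


def review_whole_job(job: Job, judge: Judge,
                     block_at: float = 0.8,
                     escalate_at: float = 0.5) -> Verdict:
    """Score the plan and the reassembled answer together, not piece by piece."""
    whole = ("PLAN:\n" + "\n".join(job.sub_questions)
             + "\n\nRESULT:\n" + job.combined_answer)
    score = judge(whole)
    if score >= block_at:
        return Verdict.BLOCK
    if score >= escalate_at:
        return Verdict.ESCALATE
    return Verdict.ALLOW


def stub_judge(text: str) -> float:
    """Placeholder judge (assumption): fraction of marker words present."""
    markers = ("alpha", "beta", "gamma")
    lowered = text.lower()
    return sum(1 for m in markers if m in lowered) / len(markers)


job = Job(["tell me about alpha", "tell me about beta"],
          "a summary involving gamma")
print(review_whole_job(job, stub_judge))
```

The design choice worth noting is the middle band: a score that is suspicious but not conclusive is escalated to a person rather than silently allowed or blocked.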
- Input guardrail / injection classifier
It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.
- Human-in-the-loop approval on high-risk actions
Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.
- Delimiting / spotlighting of untrusted content
A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.
- Provenance & content signing
Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.
- Behavioural evals & regression gating
Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.
- Runtime monitoring & anomaly detection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Loop/cost circuit-breakers & consistency checks
Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
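As one concrete instance from the list above, a loop/cost circuit-breaker could look roughly like this. It is a minimal sketch with invented names (`JobBudget`, `BudgetExceeded`) and arbitrary limits. It also shows the stated limitation: it bounds how many agent calls a job may fan out into, but says nothing about whether any individual call is harmful.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a job would exceed its call or cost budget."""


class JobBudget:
    """Hypothetical per-job budget shared by every agent working on that job."""

    def __init__(self, max_calls: int, max_cost: float) -> None:
        self.max_calls = max_calls
        self.max_cost = max_cost
        self.calls = 0
        self.cost = 0.0

    def charge(self, cost: float) -> None:
        """Record one agent call; refuse it if the budget would be exceeded."""
        if self.calls + 1 > self.max_calls or self.cost + cost > self.max_cost:
            raise BudgetExceeded(
                f"job budget exhausted after {self.calls} calls, cost {self.cost:.2f}"
            )
        self.calls += 1
        self.cost += cost


budget = JobBudget(max_calls=3, max_cost=10.0)
for _ in range(3):
    budget.charge(1.0)  # three sub-question calls fit within the budget

try:
    budget.charge(1.0)  # a fourth call trips the breaker
    tripped = False
except BudgetExceeded:
    tripped = True
print(tripped)
```

A limit of three calls is deliberately blunt here: a legitimate long task would hit it just as quickly as a decomposition attack, which is why this control works as one layer among several.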
Lessons
- Alignment is enforced per message, but harm can live in the composition: a set of individually-benign fragments can reassemble into prohibited content that no single-message check ever sees.
- Decomposition is an attack primitive: a Question Decomposer can search for a fragmentation whose pieces are each sub-threshold yet whose union reconstructs the objective, laundering harmful intent into innocent Q&As.
- Passing per-agent / per-message guardrails is NOT evidence of safety in a multi-agent pipeline; correct local verdicts can coexist with a violated global property ('safe in isolation, dangerous together').
- The result held across GPT-3.5-Turbo, Gemma-2-9B and Mistral-7B, with attack success rates (ASR) often exceeding 90%, which suggests the gap is architectural (where the guardrail sits) rather than a weakness of any one model's refusal training.
- The missing control is observability of the JOIN: no node held the objective, the fragments, and the reassembly together, so even perfect per-message classification would yield zero detection.
- Contain at the composition boundary: an aggregate guardrail over the decomposition plan and reassembled output, cross-agent intent evaluation, provenance/taint, and a human gate, not a stronger single-prompt filter.
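The provenance/taint idea in the last lesson can be sketched as metadata that travels with every fragment, so the join is visible to whatever guards it. The `Tagged` type, the `plan_id` and `fragment_ids` fields and both functions are assumptions introduced for illustration; the paper's implementation is not claimed to work this way.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence


@dataclass(frozen=True)
class Tagged:
    text: str
    plan_id: str                  # hypothetical ID of the decomposition plan
    fragment_ids: FrozenSet[str]  # which sub-answers contributed to this text


def combine(parts: Sequence[Tagged]) -> Tagged:
    """Join sub-answers and carry forward the union of their provenance."""
    plan_ids = {p.plan_id for p in parts}
    if len(plan_ids) != 1:
        raise ValueError("parts must all come from exactly one plan")
    merged_ids = frozenset().union(*(p.fragment_ids for p in parts))
    return Tagged(
        text="\n".join(p.text for p in parts),
        plan_id=plan_ids.pop(),
        fragment_ids=merged_ids,
    )


def needs_aggregate_review(item: Tagged) -> bool:
    """Anything assembled from more than one fragment goes to the whole-job check."""
    return len(item.fragment_ids) > 1


a = Tagged("answer to sub-question A", "plan-1", frozenset({"f1"}))
b = Tagged("answer to sub-question B", "plan-1", frozenset({"f2"}))
joined = combine([a, b])
print(sorted(joined.fragment_ids), needs_aggregate_review(joined))
```

The point of the tags is that the combiner's output can no longer pass as a fresh, context-free message: its history says it was reassembled, and that fact can route it to an aggregate guardrail or a human gate.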
Sources
- Safe in Isolation, Dangerous Together: Agent-Driven Multi-Turn Decomposition Jailbreaks on LLMs. ACL Anthology (REALM 2025). Srivastav & Zhang; three roles (Question Decomposer / Sub-Question Answerer / Answer Combiner); attack success often exceeding 90% across GPT-3.5-Turbo, Gemma-2-9B and Mistral-7B; controlled study, not a deployed incident.
- devansh-srivastav/agents-decomposition-jailbreak: paper implementation (GitHub). CrewAI reference implementation of the decompose → answer → recombine pipeline.