🔍AI RiskAtlas
Case study

Safe in Isolation, Dangerous Together — agent-driven multi-turn decomposition jailbreak

Research demonstration · 31 Jul 2025 · 🗺️ Multi-Agent System

Srivastav & Zhang (REALM 2025) showed a role-based multi-agent framework that splits a harmful request into individually-benign sub-questions, answers each separately, then reassembles the fragments into prohibited content — reportedly exceeding 90% attack success across three models.

Root cause — why it happened

Modern safety filters check each question and each answer on its own: 'is THIS message harmful?' Srivastav and Zhang showed you can defeat that by never asking the harmful question. Instead, one AI agent acts as a 'splitter': it breaks a forbidden goal into a handful of small, innocent-sounding questions — none of which looks dangerous by itself. Other agents answer each small question (every answer passes the safety check because, in isolation, it really is harmless). Then a final 'combiner' agent glues the pieces back together into exactly the prohibited thing the attacker wanted. The danger was never in any single message — it lived in the COMBINATION, and no part of the system was ever set up to look at the whole, reassembled intent. In their experiments this trick worked more than 90% of the time across three different models.
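
The gap between per-message checks and the combined result can be illustrated with a toy model. This is a minimal sketch, not the paper's method: the `risk` scores, the `THRESHOLD` value, and the assumption that fragment risk simply adds up are all hypothetical, and the fragments are empty placeholders.

```python
from dataclasses import dataclass

THRESHOLD = 0.5  # hypothetical per-message refusal threshold


@dataclass
class Message:
    text: str
    risk: float  # hypothetical score a safety classifier might assign (0.0 to 1.0)


def per_message_filter(msg: Message) -> bool:
    """Return True if the message passes when judged on its own."""
    return msg.risk < THRESHOLD


def combined_risk(fragments: list[Message]) -> float:
    """Toy aggregate: ASSUMES fragment risk is additive, capped at 1.0."""
    return min(1.0, sum(m.risk for m in fragments))


# Placeholder fragments: no real content, only illustrative scores.
fragments = [
    Message("sub-question 1 [redacted]", 0.2),
    Message("sub-question 2 [redacted]", 0.3),
    Message("sub-question 3 [redacted]", 0.3),
]

# Every fragment clears the per-message check...
all_pass_individually = all(per_message_filter(m) for m in fragments)
# ...but the reassembled whole does not clear the same threshold.
whole_passes = combined_risk(fragments) < THRESHOLD
```

Under these assumptions every local verdict is "pass" while the combined verdict is "fail", which is the property the case study describes. Real classifiers do not produce additive scores; the point is only where the check sits.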

Risks this case illustrates

Risks are named using the standard lenses (OWASP/ATLAS/NIST).

How it unfolded

[Diagram: multi-agent pipeline. Zones: Untrusted, Agent team, Oversight, External. Components: User, User w/ harmful objective (harmful objective O), Planner Agent, Research Agent, Coding Agent, Comms Agent, Tool Runtime, Untrusted Content, Business Database, External APIs, Monitoring & Evals, Agent Registry, Holistic cross-agent [label truncated]. Edge types: Instructions, Data, Actions, Control / decision, Feedback / logs.]
Setup · Step 1 / 7

A harmful objective enters the pipeline

A user gives the system a goal that, asked directly, every model would refuse. Instead of asking it straight, the user feeds it into a special multi-agent pipeline whose whole purpose is to take it apart and put it back together in a way that slips past the safety checks. Nothing harmful has been produced yet — but the plan is set.

💬 Objective handed to the pipeline (illustrative, redacted) · prompt
OBJECTIVE (O): [a request a single aligned model would refuse outright]

Mode: agent-driven decomposition
Goal: obtain the restricted result WITHOUT ever presenting O,
      or any sub-question, that trips a per-message safety filter.
# harmful specifics intentionally omitted — this is a mechanism writeup

Controls & guardrails — what would have stopped it

No control that checks one message at a time can stop this: in the paper's experiments, every individual message passed. What breaks the chain is a reviewer that looks at the WHOLE job, judging the splitter's plan and the final combined answer together. If that reviewer asks 'do all these innocent pieces add up to something we'd refuse?' and can stop the combined result (or send it to a human) when the answer is yes, the attack fails. Splitting the work no longer hides the intent, because the intent is finally examined as a whole.
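
A whole-job reviewer of this kind might look like the sketch below. It is an assumption-laden illustration: `Job`, `Verdict`, `review_whole_job`, the two thresholds, and the `judge` callable are hypothetical names, and `stub_judge` is a stand-in for whatever model or policy engine would score the bundle in practice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # hand the combined result to a human reviewer
    BLOCK = "block"


@dataclass
class Job:
    plan: list[str]        # the splitter's sub-questions
    combined_output: str   # the combiner's final answer


def review_whole_job(job: Job, judge: Callable[[str], float],
                     escalate_at: float = 0.4, block_at: float = 0.7) -> Verdict:
    """Score the plan and the reassembled output together, not message by message."""
    bundle = "PLAN:\n" + "\n".join(job.plan) + "\n\nOUTPUT:\n" + job.combined_output
    score = judge(bundle)  # hypothetical: 0.0 (fine) to 1.0 (would be refused)
    if score >= block_at:
        return Verdict.BLOCK
    if score >= escalate_at:
        return Verdict.ESCALATE
    return Verdict.ALLOW


def stub_judge(text: str) -> float:
    """Stand-in judge for demonstration only; a real one would be a model call."""
    return 0.9 if "[restricted]" in text else 0.1
```

The design choice that matters is the input to `judge`: it receives the plan and the combined answer as one bundle, so the question being asked is about the whole job rather than any single fragment.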

Preventive
  • Input guardrail / injection classifier

    It is a classifier in an arms race against fully attacker-controlled input. Treat it as one layer; never let it be the only thing between input and a dangerous action.

  • Human-in-the-loop approval on high-risk actions

    Approval fatigue turns gates into rubber stamps; gates placed after the point of no return do nothing; and approvers can be misled by a model-written summary of the action.

  • Delimiting / spotlighting of untrusted content

    A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.

  • Provenance & content signing

    Provenance proves origin, not safety; a trusted source can still be wrong or compromised. Requires discipline to propagate metadata end to end.

Detective
  • Behavioural evals & regression gating

    Evals only measure what they test; novel behaviours and rare triggers slip through, and a backdoor keyed to an unguessed trigger passes every benchmark.

  • Runtime monitoring & anomaly detection

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

  • Loop/cost circuit-breakers & consistency checks

    Thresholds are blunt — too tight breaks legitimate long tasks, too loose lets damage accrue first. Catches runaway dynamics, not a single well-formed bad decision.

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
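
The trade-off noted for loop/cost circuit-breakers can be seen in a minimal sketch. `CircuitBreaker`, `BudgetExceeded`, and the budget values are hypothetical names chosen for illustration, not part of any framework named in this case.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a session goes past its step or cost budget."""


class CircuitBreaker:
    def __init__(self, max_steps: int, max_cost: float) -> None:
        self.max_steps = max_steps
        self.max_cost = max_cost
        self.steps = 0
        self.cost = 0.0

    def record(self, cost: float) -> None:
        """Count one agent step and its cost; trip once either budget is exceeded."""
        self.steps += 1
        self.cost += cost
        if self.steps > self.max_steps or self.cost > self.max_cost:
            raise BudgetExceeded(
                f"steps={self.steps}/{self.max_steps}, cost={self.cost}/{self.max_cost}"
            )
```

A decomposition attack that needs only a handful of sub-questions stays comfortably inside any budget loose enough for legitimate work, which is why this control catches runaway loops but not a short, well-formed attack.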

Lessons

  • Alignment is enforced per message, but harm can live in the composition: a set of individually-benign fragments can reassemble into prohibited content that no single-message check ever sees.
  • Decomposition is an attack primitive: a Question Decomposer can search for a fragmentation whose pieces are each sub-threshold yet whose union reconstructs the objective — laundering harmful intent into innocent Q&As.
  • Per-agent / per-message guardrails passing is NOT evidence of safety in a multi-agent pipeline; correct local verdicts can coexist with a violated global property ('safe in isolation, dangerous together').
  • The result held across GPT-3.5-Turbo, Gemma-2-9B and Mistral-7B at >90% attack success rate (ASR), so the gap is architectural (where the guardrail sits), not a weakness of any one model's refusal training.
  • The missing control is observability of the JOIN: no node held the objective, the fragments, and the reassembly together, so even perfect per-message classification yielded zero detection.
  • Contain at the composition boundary: an aggregate guardrail over the decomposition plan and reassembled output, cross-agent intent evaluation, provenance/taint, and a human gate — not a stronger single-prompt filter.
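
The "observability of the JOIN" lesson can be sketched as a small ledger that tags every fragment with the objective it came from, so that one node can see them together. This is an illustrative sketch under stated assumptions: `Fragment`, `JoinLedger`, and the `objective_id` tag are hypothetical, and propagating that tag through every agent is exactly the end-to-end discipline the provenance control requires.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    objective_id: str  # hypothetical tag carried over from the original request
    agent: str         # which agent produced this fragment
    text: str


class JoinLedger:
    """Collects fragments per objective so one reviewer can see the whole set."""

    def __init__(self) -> None:
        self._by_objective: dict[str, list[Fragment]] = defaultdict(list)

    def record(self, fragment: Fragment) -> None:
        self._by_objective[fragment.objective_id].append(fragment)

    def bundle(self, objective_id: str) -> str:
        """Return every fragment for one objective, in arrival order."""
        parts = self._by_objective.get(objective_id, [])
        return "\n".join(f"[{f.agent}] {f.text}" for f in parts)


ledger = JoinLedger()
ledger.record(Fragment("obj-1", "research", "answer to sub-question 1 [redacted]"))
ledger.record(Fragment("obj-1", "coding", "answer to sub-question 2 [redacted]"))
ledger.record(Fragment("obj-2", "research", "unrelated answer"))
whole = ledger.bundle("obj-1")  # the text an aggregate guardrail would review
```

The ledger detects nothing by itself; it only ensures the objective's fragments exist in one place, which is the precondition for any aggregate check.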

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan