Distributed / Cross-Agent Jailbreak

high · Multi-agent

Also known as: multi-agent jailbreak, fragment-and-reassemble jailbreak

Definition

A jailbreak is normally a single malicious message. Here the attacker splits it into harmless-looking pieces and feeds them to different agents in a team. Each piece passes its agent's safety check on its own, but when the agents combine their work, the full forbidden instruction is reassembled and takes effect.

Where it attaches

The system components where this risk arises.

🗺️ Planner Agent · 🤖 Worker Agent · 🎛️ Orchestrator / Agent Loop · 🧠 LLM · 🛡️ Input Guardrail · 🧯 Output Guardrail · 📈 Monitoring & Evals

Detection signals

▸ Individually-benign agent messages that compose into a restricted request
▸ A refused-category output emerging only after multi-agent integration
▸ Fragmented or templated inputs fanned out across multiple agents
▸ Per-agent guards all green while the end-to-end outcome is policy-violating
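The last signal above (every per-agent guard green while the combined result violates policy) can be checked at the integration point. The following is a minimal sketch, not a reference implementation: `violates_policy` is a hypothetical stand-in for a real policy classifier, and the restricted combination uses placeholder words.

```python
from typing import Callable, List

# Placeholder policy: text is restricted only when all terms of a combination
# appear together. A real deployment would call a proper policy classifier.
RESTRICTED_COMBOS = [{"alpha", "bravo", "charlie"}]


def violates_policy(text: str) -> bool:
    words = set(text.lower().split())
    return any(combo <= words for combo in RESTRICTED_COMBOS)


def audit_integration(fragments: List[str],
                      combine: Callable[[List[str]], str]) -> dict:
    """Compare per-fragment verdicts with the verdict on the combined result."""
    per_fragment = [violates_policy(f) for f in fragments]
    combined = violates_policy(combine(fragments))
    return {
        "per_fragment_flags": per_fragment,
        "combined_flag": combined,
        # The signal of interest: every local guard is green, the whole is not.
        "composition_alert": combined and not any(per_fragment),
    }


# Three fragments that are individually clean but restricted once joined.
fragments = ["step one alpha", "then bravo", "finally charlie"]
report = audit_integration(fragments, " ".join)
```

A `composition_alert` of `True` is the case a per-agent dashboard would miss, so it is worth logging separately from ordinary guardrail blocks.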

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) at which to apply each control and the other risks it addresses.

Control category

Preventive · 1

Per-agent identity & taint-marked messages

Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.

Also addresses: Excessive Agency · Confused Deputy (cross-agent) · Rogue & Impersonated Agents · Cascading Multi-Agent Errors · Agent Misalignment / Goal Misgeneralization
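One way to picture per-agent identity and taint marking is a signed message envelope that starts out untrusted. This is a sketch under stated assumptions: the agent names, the `AGENT_KEYS` table, and all function names are hypothetical, and a real system would keep keys in a secret store and have a separate checker component sign the cleared message.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Callable

# Hypothetical per-agent signing keys (illustration only).
AGENT_KEYS = {"planner": b"planner-secret", "worker-1": b"worker-1-secret"}


@dataclass(frozen=True)
class AgentMessage:
    sender: str
    body: str
    tainted: bool      # True means "untrusted until checked"
    signature: str


def _sign(sender: str, body: str, tainted: bool) -> str:
    payload = f"{sender}|{int(tainted)}|{body}".encode()
    return hmac.new(AGENT_KEYS[sender], payload, hashlib.sha256).hexdigest()


def make_message(sender: str, body: str, tainted: bool = True) -> AgentMessage:
    return AgentMessage(sender, body, tainted, _sign(sender, body, tainted))


def verify(msg: AgentMessage) -> bool:
    """Check that the message really comes from the claimed sender, unmodified."""
    if msg.sender not in AGENT_KEYS:
        return False
    expected = _sign(msg.sender, msg.body, msg.tainted)
    return hmac.compare_digest(expected, msg.signature)


def clear_taint(msg: AgentMessage, check: Callable[[str], bool]) -> AgentMessage:
    """Re-issue the message as untainted only if it is authentic and passes the check."""
    if verify(msg) and check(msg.body):
        return make_message(msg.sender, msg.body, tainted=False)
    return msg


def accept_as_instruction(msg: AgentMessage) -> bool:
    """Receivers act on a message only when it is authentic and cleared."""
    return verify(msg) and not msg.tainted


msg = make_message("worker-1", "summary of a retrieved web page")
# An attacker relabels the sender and flips the taint flag without the key.
forged = AgentMessage("planner", msg.body, False, msg.signature)
cleared = clear_taint(msg, check=lambda body: "ignore previous" not in body.lower())
```

Because the taint flag is part of the signed payload, flipping it in transit invalidates the signature, which is why `forged` fails verification.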

Detective · 4

Behavioural evals & regression gating

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

Also addresses: Jailbreak · Hallucination · Model Drift & Silent Degradation · Supply-Chain Compromise · Agent Misalignment / Goal Misgeneralization · Abliteration / Safety Removal · Model Backdoors / Sleeper Agents · Inference-Time & Serving-Layer Manipulation · Bias Amplification & Sycophancy · Allocative Harm in Multi-User Arbitration · Harmful / Non-Consensual Media Generation · Training-Data Rights & Provenance
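For this risk, a regression suite is most useful when it includes fragmented cases alongside direct ones. The sketch below assumes a pipeline that takes a list of fragments and returns either text or the string `"REFUSED"`; the two toy pipelines, the case names, and the placeholder policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BLOCKED = {"alpha", "bravo", "charlie"}  # placeholder restricted combination


def _restricted(text: str) -> bool:
    return BLOCKED <= set(text.lower().split())


@dataclass
class EvalCase:
    name: str
    fragments: List[str]   # inputs fanned out to the agents
    should_refuse: bool


def run_gate(pipeline: Callable[[List[str]], str], cases: List[EvalCase],
             min_pass_rate: float = 1.0) -> Tuple[bool, float, List[str]]:
    """Return (gate_passed, pass_rate, names_of_failed_cases)."""
    failures = []
    for case in cases:
        refused = pipeline(case.fragments) == "REFUSED"
        if refused != case.should_refuse:
            failures.append(case.name)
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate >= min_pass_rate, pass_rate, failures


def per_fragment_pipeline(fragments: List[str]) -> str:
    # Toy system that only guards each fragment in isolation.
    if any(_restricted(f) for f in fragments):
        return "REFUSED"
    return " ".join(fragments)


def end_to_end_pipeline(fragments: List[str]) -> str:
    # Toy system that also guards the combined result.
    combined = " ".join(fragments)
    return "REFUSED" if _restricted(combined) else combined


CASES = [
    EvalCase("benign-split", ["summarise the", "quarterly report"], False),
    EvalCase("direct-request", ["alpha bravo charlie"], True),
    EvalCase("fragmented-request", ["alpha", "bravo", "charlie"], True),
]

old_result = run_gate(per_fragment_pipeline, CASES)
new_result = run_gate(end_to_end_pipeline, CASES)
```

Running the same gate on every model, prompt, or routing change is what turns this from a one-off test into regression gating.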

Input guardrail / injection classifier

A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.

Also addresses: Prompt Injection (direct) · Jailbreak · Sensitive Data Leakage · Capability / Architecture Disclosure · Harmful / Non-Consensual Media Generation
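A screen that judges each message alone is exactly what a fragmented attack is built to pass. One possible adaptation, sketched here with assumptions, is to also screen what a whole session has sent so far across all agents. The class name, the shared `session_id`, and the placeholder banned combination are hypothetical; a real guardrail would use a trained classifier rather than word sets.

```python
from collections import defaultdict, deque

BANNED_COMBOS = [{"alpha", "bravo", "charlie"}]  # placeholder banned combination


class SessionGuardrail:
    """Screens each message alone and together with recent session history."""

    def __init__(self, window: int = 20):
        # One rolling window per session, shared by every agent serving it.
        self._history = defaultdict(lambda: deque(maxlen=window))

    @staticmethod
    def _hits(words: set) -> bool:
        return any(combo <= words for combo in BANNED_COMBOS)

    def check(self, session_id: str, message: str) -> str:
        words = set(message.lower().split())
        if self._hits(words):
            return "block:message"
        history = self._history[session_id]
        seen = set().union(*history)
        if self._hits(seen | words):
            # Clean on its own, but it completes a banned combination.
            return "block:session"
        history.append(words)
        return "allow"


guard = SessionGuardrail()
verdicts = [
    guard.check("s1", "alpha"),     # sent to the planner
    guard.check("s1", "bravo"),     # sent to worker 1
    guard.check("s1", "charlie"),   # sent to worker 2, completes the set
    guard.check("s2", "charlie"),   # unrelated session, nothing to complete
]
```

The window size trades memory and false positives against how far apart an attacker can space the fragments.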

Runtime monitoring & anomaly detection

Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
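As one concrete example of an alarm relevant to this risk, a monitor could watch how many agents a single request fans out to and flag sharp departures from the recent baseline. This is a rough sketch: the `FanOutMonitor` class, the z-score threshold, and the choice of fan-out as the metric are assumptions, not something the catalog prescribes.

```python
from statistics import mean, pstdev


class FanOutMonitor:
    """Flags requests whose agent fan-out is far above the recent baseline."""

    def __init__(self, threshold: float = 3.0, min_samples: int = 5):
        self.threshold = threshold
        self.min_samples = min_samples
        self.samples = []

    def observe(self, fan_out: int) -> bool:
        alert = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            # Fall back to 1.0 so a perfectly flat baseline does not divide by zero.
            sigma = pstdev(self.samples) or 1.0
            alert = (fan_out - mu) / sigma > self.threshold
        if not alert:
            # Only normal observations update the baseline.
            self.samples.append(fan_out)
        return alert


monitor = FanOutMonitor()
# Six ordinary requests, then one that is split across twelve agents.
alerts = [monitor.observe(n) for n in [2, 3, 2, 3, 2, 3, 12]]
```

Keeping anomalous observations out of the baseline stops an attacker from slowly raising what counts as normal.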

Loop/cost circuit-breakers & consistency checks

Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.

Also addresses: Excessive Agency · Confused Deputy (cross-agent) · Cascading Multi-Agent Errors · Agent Misalignment / Goal Misgeneralization · Resource Exhaustion / Denial of Wallet
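The three stop conditions named above (loops, spend, disagreement) can be sketched as a small breaker object that the orchestrator consults on every step. The class, the budgets, and the disagreement measure are illustrative assumptions; real limits depend on the workload.

```python
from dataclasses import dataclass
from typing import List


class CircuitOpen(Exception):
    """Raised when the agent loop must stop."""


@dataclass
class CircuitBreaker:
    max_steps: int = 10
    max_cost_usd: float = 1.0
    max_disagreement: float = 0.5
    steps: int = 0
    cost_usd: float = 0.0

    def record_step(self, cost_usd: float) -> None:
        """Call once per agent step; trips on the loop or cost budget."""
        self.steps += 1
        self.cost_usd += cost_usd
        if self.steps > self.max_steps:
            raise CircuitOpen("step budget exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise CircuitOpen("cost budget exceeded")

    def check_consistency(self, answers: List[str]) -> None:
        """Trips when too many agents disagree with the most common answer."""
        most_common = max(set(answers), key=answers.count)
        disagreement = 1 - answers.count(most_common) / len(answers)
        if disagreement > self.max_disagreement:
            raise CircuitOpen("agents disagree")


def run_demo() -> str:
    breaker = CircuitBreaker(max_steps=3)
    try:
        for _ in range(5):          # a loop that would otherwise keep going
            breaker.record_step(cost_usd=0.01)
    except CircuitOpen as exc:
        return str(exc)
    return "completed"


result = run_demo()
```

When the breaker trips, handing the run to a human or a stricter end-to-end check is safer than silently retrying.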


Framework mappings

OWASP LLM Top 10

LLM01:2025 Prompt Injection

MITRE ATLAS

AML.T0054 LLM Jailbreak

NIST AI RMF

MEASURE 2.7

Real-world cases

Actual published events that illustrate this risk.

Morris II: zero-click self-replicating adversarial-prompt worm across GenAI agents (2024)

Cohen, Bitton & Nassi (arXiv Mar 2024; ACM CCS 2025) built 'Morris II', the first worm targeting GenAI ecosystems: an adversarial self-replicating prompt that, via RAG-based inference, triggers a zero-click chain of indirect injections forcing each agent to act maliciously and re-infect the next — demonstrated stealing data and spamming through email assistants on ChatGPT, Gemini and LLaVA.

Safe in Isolation, Dangerous Together: agent-driven multi-turn decomposition jailbreak (2025)

Srivastav & Zhang (REALM 2025) showed a role-based multi-agent framework that splits a harmful request into individually-benign sub-questions, answers each separately, then reassembles the fragments into prohibited content — reportedly exceeding 90% attack success across three models.
