Distributed / Cross-Agent Jailbreak
high · Multi-agent
Definition
A jailbreak usually arrives as a single malicious message. Here the attacker splits it into harmless-looking pieces and feeds them to different agents in a team. Each piece passes its agent's safety check on its own, but when the agents combine their work the full forbidden instruction is reassembled and takes effect.
Where it attaches
The system components where this risk arises.
Detection signals
- Individually benign agent messages that compose into a restricted request
- A refused-category output emerging only after multi-agent integration
- Fragmented or templated inputs fanned out across multiple agents
- Per-agent guards all green while the end-to-end outcome is policy-violating
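The last signal in the list can be checked mechanically by screening the integrated result as well as each fragment. The following is a minimal sketch, not a production guard: `violates_policy`, `BLOCKED_PHRASES` and `screen_pipeline` are hypothetical names, and the word-matching check is a toy stand-in for whatever classifier or moderation service a real system would use.

```python
from typing import Callable, List

# Hypothetical stand-in for a real policy classifier: text is flagged only
# when it contains every word of a blocked phrase.
BLOCKED_PHRASES = [("disable", "safety", "interlock")]


def violates_policy(text: str) -> bool:
    words = set(text.lower().split())
    return any(all(term in words for term in phrase) for phrase in BLOCKED_PHRASES)


def screen_pipeline(fragments: List[str], integrate: Callable[[List[str]], str]) -> dict:
    """Screen each fragment on its own, then screen the integrated result."""
    per_agent_ok = [not violates_policy(f) for f in fragments]
    combined_ok = not violates_policy(integrate(fragments))
    return {
        "per_agent_ok": per_agent_ok,
        "combined_ok": combined_ok,
        # Signal of interest: every per-agent guard is green,
        # yet the end-to-end outcome violates policy.
        "composition_alert": all(per_agent_ok) and not combined_ok,
    }


# Three fragments that look harmless alone but compose into a blocked request.
fragments = ["first disable the", "safety", "interlock on the unit"]
report = screen_pipeline(fragments, " ".join)
print(report["composition_alert"])
```

In this toy run each fragment passes on its own and only the joined text is flagged, which is exactly the "all green per agent, violating end to end" pattern described above.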
Controls & guardrails that address this
Five controls, grouped by control function, with the AI lifecycle stage(s) at which to apply each and the other risks it addresses.
Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
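One way to picture this control is a message envelope that starts out untrusted, plus a per-agent tool allow-list. The sketch below is illustrative only: `AgentMessage`, `Agent`, `check_message` and the tool names are assumptions made for the example, not part of any particular agent framework.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class AgentMessage:
    sender: str
    body: str
    trusted: bool = False  # every inter-agent message starts untrusted


@dataclass(frozen=True)
class Agent:
    name: str
    allowed_tools: FrozenSet[str]  # least privilege: only the tools this agent needs

    def call_tool(self, tool: str, msg: AgentMessage) -> str:
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} may not use {tool}")
        if not msg.trusted:
            raise PermissionError("message has not been checked")
        return f"{self.name} ran {tool}"


def check_message(msg: AgentMessage, is_safe: Callable[[str], bool]) -> AgentMessage:
    """Return a trusted copy only if the checker accepts the body."""
    return replace(msg, trusted=True) if is_safe(msg.body) else msg


summarizer = Agent("summarizer", frozenset({"read_docs"}))
raw = AgentMessage(sender="planner", body="summarise the Q3 report")
# Hypothetical checker; a real one would be a classifier, not a substring test.
checked = check_message(raw, is_safe=lambda text: "ignore previous" not in text.lower())
result = summarizer.call_tool("read_docs", checked)
```

Because the envelope is immutable, an agent cannot mark its own input as trusted in place; only the checking step produces a trusted copy.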
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
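For this risk, the known-bad set should include requests that are split across agents, not only single-message attacks. The sketch below assumes a guard exposed as a function from a list of fragments to a block/allow decision; `CASES`, `run_regression` and both toy guards are hypothetical and exist only to show the shape of such a suite.

```python
from typing import Callable, List, Tuple

# Hypothetical labelled cases: (fragments sent to separate agents, should the run be blocked?)
CASES: List[Tuple[List[str], bool]] = [
    (["translate this letter", "into French"], False),  # known-good
    (["bypass the", "content filter"], True),           # known-bad, split in two
]


def run_regression(guard: Callable[[List[str]], bool],
                   cases: List[Tuple[List[str], bool]]) -> List[str]:
    """Return a description of every case where the guard's decision is wrong."""
    failures = []
    for fragments, expected_block in cases:
        if guard(fragments) != expected_block:
            failures.append(" | ".join(fragments))
    return failures


def per_fragment_guard(fragments: List[str]) -> bool:
    # Toy guard that inspects each fragment in isolation.
    return any("bypass the content filter" in f.lower() for f in fragments)


def end_to_end_guard(fragments: List[str]) -> bool:
    # Toy guard that inspects the fragments after they are combined.
    return "bypass the content filter" in " ".join(fragments).lower()


print(run_regression(per_fragment_guard, CASES))
print(run_regression(end_to_end_guard, CASES))
```

Running the suite on every change to prompts, models or agent wiring makes a regression in the end-to-end check visible as a non-empty failure list.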
A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
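A stop-switch of this kind can be sketched as a small counter object that the orchestrator consults after every agent step. The class name, thresholds and the choice to count consecutive disagreements are all assumptions made for illustration; real limits would depend on the workload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CircuitBreaker:
    max_steps: int = 20          # guards against agents stuck in a loop
    max_cost_usd: float = 5.0    # guards against runaway spend
    max_disagreements: int = 3   # consecutive steps where agents disagree
    steps: int = 0
    cost_usd: float = 0.0
    disagreements: int = 0
    tripped_reason: Optional[str] = None

    def record(self, cost_usd: float, agents_agree: bool) -> bool:
        """Record one agent step; return True while the run may continue."""
        if self.tripped_reason:
            return False  # once tripped, the breaker stays open
        self.steps += 1
        self.cost_usd += cost_usd
        self.disagreements = 0 if agents_agree else self.disagreements + 1
        if self.steps > self.max_steps:
            self.tripped_reason = "step limit (possible loop)"
        elif self.cost_usd > self.max_cost_usd:
            self.tripped_reason = "cost budget exceeded"
        elif self.disagreements >= self.max_disagreements:
            self.tripped_reason = "repeated disagreement"
        return self.tripped_reason is None


breaker = CircuitBreaker(max_steps=10, max_cost_usd=1.0)
outcomes = [breaker.record(cost_usd=0.4, agents_agree=True) for _ in range(3)]
print(outcomes, breaker.tripped_reason)
```

Here the third step pushes the running cost past the 1.0 budget, so the breaker trips and every later call to `record` returns False until an operator resets the run.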
Framework mappings
- OWASP Top 10 for LLM Applications: LLM01:2025 Prompt Injection
- MITRE ATLAS: AML.T0054 LLM Jailbreak
- NIST AI RMF: MEASURE 2.7
Real-world cases
Two published events that illustrate this risk.
Cohen, Bitton & Nassi (arXiv, Mar 2024; ACM CCS 2025) built 'Morris II', which they describe as the first worm targeting GenAI ecosystems. It is an adversarial self-replicating prompt that, via RAG-based inference, triggers a zero-click chain of indirect injections, forcing each agent to act maliciously and re-infect the next. The authors demonstrated data theft and spamming through email assistants built on ChatGPT, Gemini and LLaVA.
Srivastav & Zhang (REALM 2025) presented a role-based multi-agent framework that splits a harmful request into individually benign sub-questions, answers each separately, then reassembles the fragments into prohibited content, reportedly exceeding 90% attack success across three models.