MCPTox: tool-poisoning benchmark over real-world MCP servers
Research demonstration19 Aug 2025MCPTox, described by its authors as the first benchmark to systematically measure agent robustness against tool poisoning in realistic Model Context Protocol (MCP) settings, is constructed over 45 live, real-world MCP servers exposing 353 authentic tools. The authors embed adversarial instructions in tool metadata (notably the natural-language tool description ingested at registration), generating 1,312 illustrative malicious test cases across 10 risk categories using three attack templates. According to the paper, many of 20 evaluated LLM agents can be steered into malicious actions while using otherwise legitimate tools, with reported attack success rates up to roughly 72% (o1-mini at 72.8%). The authors report that agents rarely refuse these attacks โ the highest refusal rate, for Claude-3.7-Sonnet, is reportedly under 3% โ and that more-capable models are often more susceptible because the attack exploits their stronger instruction-following. This extends the earlier single-PoC demonstrations (e.g. Invariant Labs' MCP tool-poisoning notification) and in-the-wild cases (the postmark-mcp backdoor) into a quantified, ecosystem-scale picture, with the policy-relevant implication that capability can scale this vulnerability rather than mitigate it. Figures are as reported by the authors; payload details are illustrative, not operational.
Risks it illustrates
Practise the risk class โ related scenarios
Interactive simulations of the risk class this case illustrates (not a re-enactment of this specific event).
An ops agent gets one god-mode credential โ and one misread wipes production
A support email hides instructions โ and the assistant obeys them
A text-to-SQL agent runs the model's output straight at the database
A poisoned issue makes the agent lie to the human who approves its actions
Compromise the pipeline that builds agents, and every new worker is born malicious
A fake Sentry error report hijacks a developer's coding agent into running a shell command
The forensic record is itself the attack surface โ an agent's log is poisoned, then quietly rewritten
A shopping page tells the agent to do something the user never asked for
A single poisoned document plants a standing instruction that survives every reset
A cost-saving open-weights swap quietly ships a model with its safety surgically removed
A screenshot that's harmless at full size becomes an order once the system shrinks it
A capable third-party model that behaves perfectly โ until it sees the trigger
A trusted MCP email tool quietly BCCs every message to an attacker
The eval gate that was supposed to catch the agent is itself the thing being attacked
A poisoned web page hijacks a research agent โ and the planner acts on its behalf
An inbox summary quietly ships a secret to an attacker's server