MCP tool-poisoning PoC (Invariant Labs)
Research demonstration01 Apr 2025πΊοΈ Tool-Using AgentHidden instructions embedded in MCP tool descriptions hijacked agents (e.g. in Cursor) that merely listed the available tools.
Root cause β why it happened
Agents can plug into 'tool servers' (MCP servers) that advertise what they can do β each tool comes with a short written description, like 'sends an email' or 'adds two numbers'. The agent reads those descriptions so the model knows what's on offer. Invariant Labs showed the catch: an attacker who controls a tool server can hide secret instructions inside those descriptions. The moment the agent just lists the available tools β before it ever uses any of them β those hidden instructions land in the model's context and can steer it: read a secret file, change what another tool does, then hide the evidence. No malicious tool ever has to run; advertising the tool is enough.
Risks this case illustrates
Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.
How it unfolded
The agent adds an MCP tool server
A developer wires their coding agent up to a handy-looking MCP tool server β the kind of thing people share and install all the time. It advertises a few tools, including something innocent like a calculator. Adding it feels as low-stakes as installing a small plugin.
// agent client config
{
"mcpServers": {
"handy-tools": {
"command": "npx",
"args": ["-y", "handy-tools-mcp"]
}
}
}
// Registered as a tool source. Nothing has run yet.
// kind: RESEARCH proof-of-concept (Invariant Labs), not a live incident.Controls & guardrails β what would have stopped it
The control aimed straight at this is treating tool servers like software you vet: lock to a reviewed version, read the FULL descriptions (not the short label), and re-check whenever a server changes β that's MCP pinning and manifest review. Pair it with treating those descriptions as untrusted text rather than orders, giving the agent only the access it needs, and controlling where it can send data. The honest catch: a subtly-worded malicious description can slip past a human reviewer, and treating descriptions as 'just data' lowers the odds of a hijack but never to zero β so the access limits and egress controls are what cap the damage when an injection still lands.
- Delimiting / spotlighting of untrusted contentaddressesIndirect Prompt Injection
A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.
- MCP/plugin pinning, manifest hashing & re-review
Review catches what reviewers understand; a subtle malicious directive can pass. Pinning helps only if you actually re-review on update rather than auto-accepting.
- Least-privilege identity & scoped credentials
Doesn't prevent manipulation β only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.
- Egress allowlisting & DLP on tool arguments
Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.
- Tool argument validation & sandboxing
Validates form, not intent β a well-formed call to a permitted tool can still be the wrong call. Sandboxing adds latency and isn't always feasible for tools that touch production.
- Full-trace audit logging
Logging is forensic, not preventive β it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
- Runtime monitoring & anomaly detectionaddressesIndirect Prompt Injection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
Lessons
- βΈ Tool descriptions are prompts: in MCP, a server's free-text tool description is injected into the model's context and parsed with full trust, making metadata an injection vector.
- βΈ The hijack can fire at tool ENUMERATION β before any tool is called β so it sits ahead of authorization, argument validation and human-approval gates, which all live on the invocation path.
- βΈ One bad server can poison your good ones: 'tool shadowing' lets a malicious server's description silently redefine how the model uses a trusted server's tools, so per-server review in isolation is insufficient.
- βΈ Treat adding an MCP server like adding a dependency: pin versions, hash and review the full manifest (not the abridged UI label), and re-review on change to catch rug-pulls.
- βΈ Input-side hygiene (spotlighting, review) lowers but never zeroes injection β least-privilege identity and egress/argument controls are what cap the damage when a poisoned description still lands.
Sources
- MCP Security Notification: Tool Poisoning Attacks β Invariant Labs β
- invariantlabs-ai/mcp-injection-experiments (PoC code) β
- Model Context Protocol has prompt injection security problems β Simon Willison β
- MCP Security Notification: Tool Poisoning Attacks β Invariant Labs β β Original disclosure: poisoned tool descriptions, file exfiltration via tool args, and tool shadowing across servers.
- invariantlabs-ai/mcp-injection-experiments (PoC code) β β Proof-of-concept code for the tool-poisoning and shadowing experiments.
- Model Context Protocol has prompt injection security problems β Simon Willison β β Independent corroboration and framing of the tool-description injection class.
Practise the risk class β related scenarios
A support email hides instructions β and the assistant obeys them
A poisoned issue makes the agent lie to the human who approves its actions
A fake Sentry error report hijacks a developer's coding agent into running a shell command
The forensic record is itself the attack surface β an agent's log is poisoned, then quietly rewritten
A shopping page tells the agent to do something the user never asked for
A single poisoned document plants a standing instruction that survives every reset
A screenshot that's harmless at full size becomes an order once the system shrinks it
A trusted MCP email tool quietly BCCs every message to an attacker
The eval gate that was supposed to catch the agent is itself the thing being attacked
A poisoned web page hijacks a research agent β and the planner acts on its behalf
An inbox summary quietly ships a secret to an attacker's server