πŸ”AI RiskAtlas
← Real-world cases
Case study

MCP tool-poisoning PoC (Invariant Labs)

Research demonstration01 Apr 2025πŸ—ΊοΈ Tool-Using Agent

Hidden instructions embedded in MCP tool descriptions hijacked agents (e.g. in Cursor) that merely listed the available tools.

Root cause β€” why it happened

Agents can plug into 'tool servers' (MCP servers) that advertise what they can do β€” each tool comes with a short written description, like 'sends an email' or 'adds two numbers'. The agent reads those descriptions so the model knows what's on offer. Invariant Labs showed the catch: an attacker who controls a tool server can hide secret instructions inside those descriptions. The moment the agent just lists the available tools β€” before it ever uses any of them β€” those hidden instructions land in the model's context and can steer it: read a secret file, change what another tool does, then hide the evidence. No malicious tool ever has to run; advertising the tool is enough.

Risks this case illustrates

Named in the standard (OWASP/ATLAS/NIST) lens. Click a highlighted component in the diagram below to see which risks attach where.

How it unfolded

UntrustedAgent coreOversightThe real worldgoalπŸ§‘UserπŸŽ›οΈOrchestrator /Agent Loop🧠LLMπŸ”Identity &PermissionsπŸ”§Tool Runtimeβœ‹Human ApprovalGateπŸ”ŒExternal APIsπŸ—„οΈBusinessDatabase🌐UntrustedContentπŸ“Audit Logging🧰Attacker-controlledMCP server🧰Trusted MCPserver
InstructionsDataActionsControl / decisionFeedback / logs
πŸ‘† Click a component to inspect its risks
SetupStep 1 / 6

The agent adds an MCP tool server

A developer wires their coding agent up to a handy-looking MCP tool server β€” the kind of thing people share and install all the time. It advertises a few tools, including something innocent like a calculator. Adding it feels as low-stakes as installing a small plugin.

βš™οΈAdding the MCP server (illustrative)config
// agent client config
{
  "mcpServers": {
    "handy-tools": {
      "command": "npx",
      "args": ["-y", "handy-tools-mcp"]
    }
  }
}
// Registered as a tool source. Nothing has run yet.
// kind: RESEARCH proof-of-concept (Invariant Labs), not a live incident.
Step 1 / 6

Controls & guardrails β€” what would have stopped it

The control aimed straight at this is treating tool servers like software you vet: lock to a reviewed version, read the FULL descriptions (not the short label), and re-check whenever a server changes β€” that's MCP pinning and manifest review. Pair it with treating those descriptions as untrusted text rather than orders, giving the agent only the access it needs, and controlling where it can send data. The honest catch: a subtly-worded malicious description can slip past a human reviewer, and treating descriptions as 'just data' lowers the odds of a hijack but never to zero β€” so the access limits and egress controls are what cap the damage when an injection still lands.

Preventive
  • Delimiting / spotlighting of untrusted content

    A trained convention, not enforcement. Determined payloads still break out, especially when content is long or the attack is novel. Combine with action-layer controls.

  • MCP/plugin pinning, manifest hashing & re-review

    Review catches what reviewers understand; a subtle malicious directive can pass. Pinning helps only if you actually re-review on update rather than auto-accepting.

  • Least-privilege identity & scoped credentials

    Doesn't prevent manipulation β€” only caps its reach. Hard to get right operationally; over-broad scopes are the common real-world failure.

  • Egress allowlisting & DLP on tool arguments

    Allowlists fight an open-ended channel; legitimate-but-broad destinations (any URL fetch, any email) are hard to constrain without breaking usefulness. Encoding can evade naive DLP.

  • Tool argument validation & sandboxing

    Validates form, not intent β€” a well-formed call to a permitted tool can still be the wrong call. Sandboxing adds latency and isn't always feasible for tools that touch production.

Detective
  • Full-trace audit logging

    Logging is forensic, not preventive β€” it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.

  • Runtime monitoring & anomaly detection

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

Lessons

  • β–Έ Tool descriptions are prompts: in MCP, a server's free-text tool description is injected into the model's context and parsed with full trust, making metadata an injection vector.
  • β–Έ The hijack can fire at tool ENUMERATION β€” before any tool is called β€” so it sits ahead of authorization, argument validation and human-approval gates, which all live on the invocation path.
  • β–Έ One bad server can poison your good ones: 'tool shadowing' lets a malicious server's description silently redefine how the model uses a trusted server's tools, so per-server review in isolation is insufficient.
  • β–Έ Treat adding an MCP server like adding a dependency: pin versions, hash and review the full manifest (not the abridged UI label), and re-review on change to catch rug-pulls.
  • β–Έ Input-side hygiene (spotlighting, review) lowers but never zeroes injection β€” least-privilege identity and egress/argument controls are what cap the damage when a poisoned description still lands.

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning β€” not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading β†’Β·Built by Shi Yuan β†—