Definition
AI agents invoke tools in a manner that exceeds intended permissions, constraints, or operational boundaries — due to design weaknesses, excessive autonomy, or adversarial manipulation. This may result in unintended actions, including data exfiltration at the tool layer, unauthorised system changes, or alteration of agent behaviour. (Source workbook label: 'Tool-mediated data exfiltration'.)
Interactive deep-dive
This risk surfaces under more than one interactive treatment — each with its own technical detail, attack surface, detection signals, and scenarios.
★ Suggested sub-risks — not yet in your taxonomy
Granular vectors recommended under this risk.
Model-generated code or commands executed without a sandbox/isolation boundary, enabling injection into downstream systems (SQL/OS command injection), SSRF, or escape from the intended scope.
A malicious or compromised tool/MCP server hides directives in tool descriptions (which are injected into the prompt), swaps behaviour after approval (rug-pull), shadows another server's tools, or ships a backdoored package — making the tool registry an instruction + software supply-chain channel.
An indirect prompt injection delivered through the tool-response data of a legitimate, trusted integration (e.g. an MCP server) whose upstream service accepts writes from parties other than the legitimate application — such as open event ingestion authenticated only by a public, write-only key (a Sentry DSN). Because the integration itself is benign and vetted, its returned data is treated as trusted context; an attacker who can write to the upstream store thereby injects instructions the agent obeys with its own privileges.
Controls & guardrails that address this
175 proposedGrouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.
Classify tools by impact and reversibility at design and define which calls require human approval. Obtain governance sign-off on the thresholds before build.
source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (require human approval for high-impact actions); NIST AI RMF MANAGE 2.4Bind each agent role to an explicit tool allow-list and validate every call against a strict JSON Schema at the orchestrator. Reject unlisted tools and out-of-bounds arguments before dispatch.
source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit tools/permissions); OWASP Agentic AI Threats & Mitigations (tool access restriction)Mint short-lived, task-scoped credentials per tool. Block issuance outside the approved scope register and enforce automatic expiry.
source: NIST SP 800-53 AC-6 Least Privilege; OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit permissions)Review DLP hits and blocked-egress events, tune detectors, and recertify the destination allow-list periodically. Route new destinations through security change control.
source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information DisclosureWhen onboarding an MCP/tool integration, do not stop at vetting the tool's code/manifest — also classify whether an unauthenticated or external party can write the data the tool returns (open ingestion, public write keys like a Sentry DSN, shared inboxes/issue trackers). Treat tool-response data from any third-party-writable source as untrusted ingress: taint-mark it and require a provenance-aware HITL gate (showing the exact action and its originating tool response) before any command/tool call derived from it executes. Closes the agentjacking vector where a trusted integration's legitimate data channel carries attacker-written instructions; pairs with least-privilege session scope and sandboxed execution without ambient credentials.
source: Case study: agentjacking-sentry-mcpConstrain generation at decode time with low temperature and grammar/schema-constrained decoding so the model emits well-formed, low-variance structured output by construction, preventing malformed responses and erratic tool-call arguments before they are produced.
source: Interactive-control reconciliation: ctrl-decoding-controls (partial coverage)Gate every write to an agent's persistent/self-modifying memory through schema validation and provenance/trust tagging, expose stored entries for user-visible audit and purge, and apply TTLs so any planted instruction self-expires and cannot silently persist across sessions.
source: Interactive-control reconciliation: ctrl-memory-validation (partial coverage)Treat each tool/MCP description as untrusted code by hashing the manifest, blocking and re-reviewing any silent diff on update instead of auto-accepting it, and namespacing tool identifiers so a poisoned description cannot shadow a trusted tool.
source: Interactive-control reconciliation: ctrl-mcp-pinning (partial coverage)Define per-agent behavioural baselines and detection rules during build. Validate against simulated misuse and sign off thresholds before release.
source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System MonitoringBuild signed, append-only tool-call logging into the orchestrator against a defined audit schema. Block release until completeness and tamper-evidence tests pass.
source: NIST SP 800-53 AU-2 / AU-9 / AU-10 (audit events, protection of audit info, non-repudiation); MITRE ATLAS AML.M0015 (monitoring / validate inputs)Treat outbound connections to AI/LLM provider APIs as a monitored egress channel: allowlist which hosts may reach them, baseline usage (cadence, entropy, initiating process), and alert on out-of-profile traffic — because a high-reputation destination cannot itself be trusted once it is programmable and can relay encrypted commands/results.
source: Case study: sesameop-openai-assistants-api-c2Build sandbox profiles per tool class and run escape and egress tests before release. Treat any containment failure as a blocking defect.
source: NIST SP 800-53 SC-39 Process Isolation; MITRE ATLAS AML.M0020 (Generative AI Guardrails / restrict execution environment)Label tool and external content as tainted and propagate the label through the agent context. Block privileged calls whose parameters derive from tainted outputs and prove it with injection tests before release.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate/flag untrusted content); MITRE ATLAS AML.M0015 (Adversarial Input Detection / validate inputs)Build credential revocation and dispatch blocking out-of-band of the agent loop. Gate release on an end-to-end kill test meeting the latency target.
source: OWASP Agentic AI Threats & Mitigations (kill-switch / emergency stop); NIST AI RMF MANAGE 2.4Require idempotency keys, dry-run, and rollback on every state-changing tool. Gate onboarding on duplicate-call and rollback tests passing.
source: NIST SP 800-53 SI-10 Information Input Validation / CP-10 System Recovery and ReconstitutionRed-team tool-misuse and privilege-escalation paths before release. Gate deployment on remediation or signed risk acceptance of all findings.
source: NIST AI RMF MEASURE 2.7 (adversarial testing); MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM06:2025 Excessive AgencyPermit outbound tool calls only to allow-listed destinations and DLP-scan arguments and payloads. Block or quarantine calls carrying sensitive data to disallowed sinks.
source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information DisclosureEnforce hard per-task ceilings on tool calls, spend, and data volume with a circuit breaker that halts the run. Fail closed when any ceiling is hit.
source: OWASP Top 10 for LLM Apps LLM10:2025 Unbounded Consumption; OWASP Agentic AI Threats & Mitigations (resource/rate limiting)Baseline normal tool-call behaviour per agent and alert on rate, sequence, or argument anomalies. Auto-throttle or quarantine on high-confidence deviations.
source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System MonitoringReal-world cases
39Actual published events that illustrate this risk — click through for the writeup and sources.
Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.
Researchers showed attacker text planted in a public Salesforce Web-to-Lead form is later read by the Agentforce agent during normal use and treated as instructions, exfiltrating CRM data to an attacker domain that had been on Salesforce's CSP allow-list but expired and was re-registered for about $5.
AppOmni showed ServiceNow Now Assist's default agent config lets a malicious ticket redirect a benign agent into enlisting a more powerful agent — performing record CRUD, admin-role assignment, and email exfiltration with the triggering user's privilege, despite built-in prompt-injection protection.
Researcher Ari Marzouk disclosed 30+ vulnerabilities (24 CVEs) across 10-plus AI coding agents (Copilot, Cursor, Windsurf, Claude Code, Junie and others) where a prompt injected via repo files, READMEs, file names or MCP tool responses makes the assistant weaponize legitimate IDE features for code execution and secret exfiltration.
An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.
Microsoft's incident-response team found a .NET backdoor that hid its command-and-control channel inside a legitimate OpenAI Assistants API account, fetching encrypted commands stored as Assistant messages — turning an LLM provider's API into stealth attacker infrastructure.
Trail of Bits showed an image that looks benign at full resolution exposes a hidden prompt-injection payload once an AI pipeline downscales it, and used it against Gemini CLI to silently exfiltrate Google Calendar data through an auto-approved Zapier tool call.
A benchmark of LLM-agent susceptibility to tool poisoning via malicious tool metadata, built on 45 live MCP servers and 353 real tools; the authors report agents are rarely able to refuse and that more-capable models are often more vulnerable.
Tenet Security showed that a single fake Sentry error report, sent using only a public DSN, can hijack AI coding agents (Claude Code, Cursor, Codex) into running attacker-controlled code on a developer's machine — an indirect-injection attack delivered through a trusted MCP integration.
Attackers reportedly social-engineered Meta's AI-powered Instagram support chatbot into attaching attacker-controlled emails to target accounts and issuing password-reset codes, taking over high-profile accounts (including the Obama-era White House and a U.S. Space Force CMSgt) without the owner's email or any MFA prompt.
Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.
Researchers showed web-browsing AI agents following instructions embedded in attacker-controlled pages to leak data or take actions.
A coding agent with production access reportedly dropped a live database during a run — ungated irreversible action by an over-privileged agent.
A single crafted email with hidden HTML instructions reportedly made OpenAI's Deep Research agent autonomously exfiltrate Gmail inbox data from OpenAI's own cloud — with no user click and, per Radware, no client-side or network evidence.
Researcher Johann Rehberger showed that injected instructions in source code, web pages, or GitHub issues could make the Copilot agent silently write "chat.tools.autoApprove": true into .vscode/settings.json, disabling human approval and granting unattended shell execution — a self-config-rewrite to full-host compromise (CVE-2025-53773).
Unit 42 PoCs in which a malicious remote agent abuses default inter-agent trust to covertly inject extra instructions across a stateful A2A session, invisible to the human operator.
Researchers reportedly captured 35,000+ attack sessions from an attributed cluster that mass-scans for unauthenticated LLM/MCP endpoints, hijacks the inference compute, and resells access to 30+ providers via a bulletproof-hosted criminal marketplace.
An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.
Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.
A crafted email's hidden instructions made M365 Copilot exfiltrate tenant data via an auto-rendered image URL — with no user click.
Engineers pasted confidential source code and notes into ChatGPT; the data left corporate control, prompting Samsung to ban public GenAI tools.
Indirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.
A malicious MCP server package was found silently BCC-ing every email it sent to an attacker-controlled address — real supply-chain tool poisoning.
Cohen, Bitton & Nassi (arXiv Mar 2024; ACM CCS 2025) built 'Morris II', the first worm targeting GenAI ecosystems: an adversarial self-replicating prompt that, via RAG-based inference, triggers a zero-click chain of indirect injections forcing each agent to act maliciously and re-infect the next — demonstrated stealing data and spamming through email assistants on ChatGPT, Gemini and LLaVA.
Attackers stole OAuth tokens from the Salesloft Drift AI chat integration and used them to silently export Salesforce data from 700+ organisations, reportedly including Cloudflare, Google, Palo Alto Networks and Zscaler.
Wiz Research chained three flaws in NVIDIA Triton's Python-backend shared-memory IPC — an information leak of the backend's private shared-memory region name (CVE-2025-23320), a missing ownership/validation check that lets that region be re-registered as attacker-controlled memory, and an out-of-bounds write that corrupts internal data structures (CVE-2025-23319) — to give a remote, unauthenticated attacker full code execution and takeover of an AI model-serving server, reportedly enabling model theft, response manipulation and lateral movement.
As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions whose auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus env vars holding OPENAI_API_KEY/ANTHROPIC_API_KEY — exfiltrating to a typosquatted C2; AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.
Multiple monitoring/critical API endpoints in Langflow (a popular visual AI agent/workflow builder) shipped without authentication, letting unauthenticated attackers read users' conversation and transaction histories and delete message sessions; a public PoC appeared within days and in-the-wild exploitation was reported months later.
Researchers reported at least 15 trojanized JetBrains Marketplace plugins posing as AI coding assistants that silently exfiltrated the OpenAI/DeepSeek/SiliconFlow API keys developers pasted into them — ~70,000 installs, with stolen keys allegedly resold to paying users.
A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.
Attacker-controlled Markdown hidden in a public web page is reportedly rendered by ChatGPT's summarization feature as trusted assistant output — spoofed OpenAI alerts, phishing links, QR codes, and tracking pixels.
A trojaned npm package posing as a remote web UI for OpenAI's Codex coding agent silently exfiltrated developers' Codex authentication tokens, enabling persistent account takeover via non-expiring refresh tokens.
Malicious 'lightning' PyPI releases (reportedly 2.6.2 and 2.6.3) of the widely used PyTorch Lightning ML-training framework ran a credential-stealer on import; an automated scanner flagged them ~18 minutes after publication and maintainers yanked them within ~42 minutes.
Unit 42 showed that when a Hugging Face account is deleted (or a model is transferred and the old author later removed), its Author/ModelName namespace can be re-registered by anyone — so platforms and code that resolve models by name auto-deploy attacker-controlled weights, demonstrated as reverse-shell RCE on Google Vertex AI Model Garden and Azure AI Foundry.
Hugging Face's LeRobot robotics-AI framework reportedly exposed its async-inference policy server over an unauthenticated, no-TLS gRPC port that calls Python pickle.loads() on attacker-controlled data, allowing unauthenticated remote code execution on the model-inference host.
A CVSS 10.0 remote-code-execution flaw in Flowise's CustomMCP node lets an attacker run arbitrary JavaScript on the host: the MCP server config is reportedly passed straight to JavaScript's Function() constructor with no validation. Disclosed in Sept 2025 and patched in 3.0.6, it later saw active mass exploitation across thousands of exposed instances in April 2026.
Anthropic reports that 'Claude Mythos Preview' — an unreleased frontier model it describes as able to autonomously find and exploit software flaws — surfaced more than 10,000 high- or critical-severity vulnerabilities across major operating systems, browsers and open-source projects in roughly its first month under the defensive 'Project Glasswing' program, with Anthropic warning that finding flaws now far outpaces the human capacity to triage and patch them.
Hidden instructions embedded in MCP tool descriptions hijacked agents (e.g. in Cursor) that merely listed the available tools.
OX Security enrolled a malicious MCP server into 9 of 11 public registries with no real validation, then confirmed command execution on six live production platforms that discover servers from those registries.