AI RiskAtlas
#42

Tool-layer misuse and unintended actions

IMDA agentic
Risk taxonomy

Definition

AI agents invoke tools in a manner that exceeds intended permissions, constraints, or operational boundaries — due to design weaknesses, excessive autonomy, or adversarial manipulation. This may result in unintended actions, including data exfiltration at the tool layer, unauthorised system changes, or alteration of agent behaviour. (Source workbook label: 'Tool-mediated data exfiltration'.)

★ Suggested sub-risks — not yet in your taxonomy

Granular vectors recommended under this risk.

Unsafe tool / code execution

Model-generated code or commands executed without a sandbox/isolation boundary, enabling injection into downstream systems (SQL/OS command injection), SSRF, or escape from the intended scope.
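
One narrow slice of this sub-risk, SQL injection from model-generated values, can be illustrated with a minimal sketch. It assumes a hypothetical `orders` table and a hypothetical `find_orders` tool, and it uses only the standard-library `sqlite3` module. Binding the value as a parameter complements, and does not replace, a sandbox or isolation boundary.

```python
import sqlite3

# Hypothetical in-memory database standing in for a downstream system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.executemany("INSERT INTO orders (customer) VALUES (?)", [("alice",), ("bob",)])


def find_orders(customer: str) -> list:
    """Hypothetical tool: look up orders for a model-supplied customer name."""
    # The value is bound as a parameter, never spliced into the SQL text.
    cursor = conn.execute(
        "SELECT id, customer FROM orders WHERE customer = ?", (customer,)
    )
    return cursor.fetchall()


# A classic injection string is treated as a literal name and matches nothing.
injected = "alice' OR '1'='1"
```

With string concatenation the injected value would have matched every row; with a bound parameter it is compared as an ordinary string.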

Tool / MCP poisoning & rug-pull

A malicious or compromised tool/MCP server hides directives in tool descriptions (which are injected into the prompt), swaps behaviour after approval (rug-pull), shadows another server's tools, or ships a backdoored package — making the tool registry an instruction + software supply-chain channel.

MCP/integration data-channel injection (third-party-writable tool responses)

An indirect prompt injection delivered through the tool-response data of a legitimate, trusted integration (e.g. an MCP server) whose upstream service accepts writes from parties other than the legitimate application — such as open event ingestion authenticated only by a public, write-only key (a Sentry DSN). Because the integration itself is benign and vetted, its returned data is treated as trusted context; an attacker who can write to the upstream store thereby injects instructions the agent obeys with its own privileges.

Controls & guardrails that address this

175 proposed

Grouped by control function, with the AI lifecycle stage(s) at which to apply each control and the other risks it addresses.

Preventive · 8
Human approval gate on irreversible and high-impact tool calls

Classify tools by impact and reversibility at design and define which calls require human approval. Obtain governance sign-off on the thresholds before build.

source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (require human approval for high-impact actions); NIST AI RMF MANAGE 2.4
Lifecycle stages: 1 – Use Case Context & Design; 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
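
A minimal sketch of the classification step, assuming a hypothetical three-level impact scale and a hypothetical threshold constant. In practice the threshold would come from the governance sign-off described above, and the tool names here are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class ToolPolicy:
    name: str
    impact: Impact
    reversible: bool


# Hypothetical threshold; the real value is a governance decision.
APPROVAL_IMPACT_THRESHOLD = Impact.HIGH


def requires_human_approval(policy: ToolPolicy) -> bool:
    """Irreversible calls, and calls at or above the impact threshold, need approval."""
    at_threshold = policy.impact.value >= APPROVAL_IMPACT_THRESHOLD.value
    return (not policy.reversible) or at_threshold


# Illustrative tool register.
read_file = ToolPolicy("read_file", Impact.LOW, reversible=True)
drop_table = ToolPolicy("drop_table", Impact.MEDIUM, reversible=False)
wire_transfer = ToolPolicy("wire_transfer", Impact.HIGH, reversible=False)
```

Keeping the rule as a pure function of the tool register makes the approval thresholds easy to review and test independently of the agent.
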
Per-agent tool allow-list with strict JSON-schema argument validation

Bind each agent role to an explicit tool allow-list and validate every call against a strict JSON Schema at the orchestrator. Reject unlisted tools and out-of-bounds arguments before dispatch.

source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit tools/permissions); OWASP Agentic AI Threats & Mitigations (tool access restriction)
Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
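
The orchestrator-side check could be sketched as follows. The role name, tool name, and schema are hypothetical, and the hand-written validator is a simplified stand-in for a real JSON Schema validator: it enforces exact argument names, types, and numeric bounds only.

```python
ALLOW_LIST = {"support_agent": {"lookup_order"}}  # hypothetical role -> allowed tools

# Simplified stand-in for a JSON Schema: exact keys, types, and numeric bounds.
TOOL_SCHEMAS = {
    "lookup_order": {
        "order_id": {"type": str},
        "max_results": {"type": int, "min": 1, "max": 50},
    },
}


class ToolCallRejected(Exception):
    """Raised before dispatch when a call violates the allow-list or schema."""


def validate_call(role: str, tool: str, args: dict) -> None:
    if tool not in ALLOW_LIST.get(role, set()):
        raise ToolCallRejected(f"{tool!r} is not allow-listed for {role!r}")
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        raise ToolCallRejected(f"no schema registered for {tool!r}")
    if set(args) != set(schema):
        raise ToolCallRejected("argument names must match the schema exactly")
    for name, rule in schema.items():
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            raise ToolCallRejected(f"{name!r} has the wrong type")
        if "min" in rule and value < rule["min"]:
            raise ToolCallRejected(f"{name!r} is below the minimum")
        if "max" in rule and value > rule["max"]:
            raise ToolCallRejected(f"{name!r} is above the maximum")


def is_allowed(role: str, tool: str, args: dict) -> bool:
    """Convenience wrapper: True if the call may be dispatched."""
    try:
        validate_call(role, tool, args)
        return True
    except ToolCallRejected:
        return False
```

The check fails closed: an unknown role, an unlisted tool, or a tool without a registered schema are all rejected.
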
Least-privilege per-tool scoped, short-lived credentials

Mint short-lived, task-scoped credentials per tool. Block issuance outside the approved scope register and enforce automatic expiry.

source: NIST SP 800-53 AC-6 Least Privilege; OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit permissions)
Lifecycle stages: 4 – Deployment; 5 – Usage, Monitoring & Change
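
As an illustration of task-scoped, expiring credentials, the sketch below mints an HMAC-signed token with the standard library. The signing key, scope names, and token layout are all hypothetical; a real deployment would more likely rely on an identity provider or secrets manager than on hand-rolled tokens.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

SIGNING_KEY = b"demo-only-key"  # hypothetical; never hard-code a real key
APPROVED_SCOPES = {"crm:read", "tickets:write"}  # stand-in for the approved scope register


def _sign(body: str) -> str:
    return hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()


def mint_credential(tool: str, scope: str, ttl_seconds: float,
                    now: Optional[float] = None) -> str:
    """Issue a token bound to one tool and scope, refusing unapproved scopes."""
    if scope not in APPROVED_SCOPES:
        raise PermissionError(f"scope {scope!r} is not in the approved scope register")
    issued = time.time() if now is None else now
    claims = {"tool": tool, "scope": scope, "exp": issued + ttl_seconds}
    body = json.dumps(claims, sort_keys=True)
    return f"{body}.{_sign(body)}"


def check_credential(token: str, tool: str, scope: str,
                     now: Optional[float] = None) -> bool:
    """Accept the token only if the signature, tool, scope, and expiry all hold."""
    body, _, signature = token.rpartition(".")
    if not hmac.compare_digest(signature, _sign(body)):
        return False
    claims = json.loads(body)
    current = time.time() if now is None else now
    return claims["tool"] == tool and claims["scope"] == scope and current < claims["exp"]
```

The `now` parameter exists only to make expiry testable without waiting for real time to pass.
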
Egress destination allow-listing with DLP inspection of tool arguments

Review DLP hits and blocked-egress events, tune detectors, and recertify the destination allow-list periodically. Route new destinations through security change control.

source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure
Lifecycle stage: 5 – Usage, Monitoring & Change
Classify each tool/MCP integration's data channel by who can write to it; taint-gate tool-response data from any third-party-writable source so it cannot drive actions without a provenance-aware approval gate (proposed)

When onboarding an MCP/tool integration, do not stop at vetting the tool's code and manifest. Also classify whether an unauthenticated or external party can write the data the tool returns (open ingestion, public write keys such as a Sentry DSN, shared inboxes or issue trackers). Treat tool-response data from any third-party-writable source as untrusted ingress: taint-mark it and require a provenance-aware human-in-the-loop (HITL) gate, showing the exact action and its originating tool response, before any command or tool call derived from it executes. This closes the agentjacking vector, in which a trusted integration's legitimate data channel carries attacker-written instructions. It pairs with least-privilege session scope and sandboxed execution without ambient credentials.

source: Case study: agentjacking-sentry-mcp
Lifecycle stage: 4 – Deployment & Serving
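
A minimal sketch of the onboarding classification and the resulting gate, under stated assumptions: the register contents, tool identifiers, and the shape of the approval text are hypothetical. The point it illustrates is that the gate keys on who can write the returned data, not on whether the tool itself was vetted.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical onboarding register: can an external party write the data this tool returns?
THIRD_PARTY_WRITABLE = {
    "sentry_mcp.get_issue": True,   # open event ingestion through a public DSN
    "internal_wiki.read": False,
}


@dataclass(frozen=True)
class ToolResponse:
    tool: str
    text: str

    @property
    def tainted(self) -> bool:
        # Unclassified tools default to tainted, so the gate fails closed.
        return THIRD_PARTY_WRITABLE.get(self.tool, True)


@dataclass(frozen=True)
class ProposedAction:
    command: str
    derived_from: ToolResponse


def approval_request(action: ProposedAction) -> Optional[str]:
    """Return the text to show a human reviewer, or None if no gate is needed."""
    if not action.derived_from.tainted:
        return None
    return (
        f"Action: {action.command}\n"
        f"Originating tool: {action.derived_from.tool}\n"
        f"Originating response: {action.derived_from.text}"
    )
```

Showing the originating response next to the proposed action lets the reviewer see that a command came from an error report rather than from the user.
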
Decode-time output constraints (low temperature, grammar/JSON-schema-constrained decoding) (proposed)

Constrain generation at decode time with low temperature and grammar/schema-constrained decoding so the model emits well-formed, low-variance structured output by construction, preventing malformed responses and erratic tool-call arguments before they are produced.

source: Interactive-control reconciliation: ctrl-decoding-controls (partial coverage)
Lifecycle stage: 4 – Deployment
Memory-write integrity validation with provenance tagging, audit/purge and TTL bounds (proposed)

Gate every write to an agent's persistent/self-modifying memory through schema validation and provenance/trust tagging, expose stored entries for user-visible audit and purge, and apply TTLs so any planted instruction self-expires and cannot silently persist across sessions.

source: Interactive-control reconciliation: ctrl-memory-validation (partial coverage)
Lifecycle stage: 5 – Usage, Monitoring & Change
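
The write gate, audit view, and TTL expiry could be sketched as below. The provenance labels, the TTL ceiling, and the `MemoryStore` class are hypothetical, and the store is in-memory only for illustration.

```python
import time
from dataclasses import dataclass
from typing import List, Optional

MAX_TTL_SECONDS = 7 * 24 * 3600          # hypothetical TTL ceiling
KNOWN_SOURCES = {"user", "tool_output"}  # hypothetical provenance labels


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    source: str        # provenance tag recorded at write time
    expires_at: float  # absolute expiry time in seconds


class MemoryStore:
    def __init__(self) -> None:
        self._entries: List[MemoryEntry] = []

    def write(self, text: str, source: str, ttl_seconds: float,
              now: Optional[float] = None) -> None:
        """Validate shape, provenance, and TTL before anything is persisted."""
        if not isinstance(text, str) or not text.strip():
            raise ValueError("text must be a non-empty string")
        if source not in KNOWN_SOURCES:
            raise ValueError(f"unknown provenance {source!r}")
        if not 0 < ttl_seconds <= MAX_TTL_SECONDS:
            raise ValueError("ttl_seconds is outside the allowed bounds")
        start = time.time() if now is None else now
        self._entries.append(MemoryEntry(text, source, start + ttl_seconds))

    def audit(self, now: Optional[float] = None) -> List[MemoryEntry]:
        """Return unexpired entries so a user can inspect what is stored."""
        current = time.time() if now is None else now
        return [e for e in self._entries if e.expires_at > current]

    def purge(self, source: str) -> int:
        """Delete every entry with the given provenance; return how many were removed."""
        before = len(self._entries)
        self._entries = [e for e in self._entries if e.source != source]
        return before - len(self._entries)
```

Because every entry carries its source, a user or operator can purge everything that originated from tool output in one step if an injection is suspected.
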
Tool/MCP manifest hashing with diff-triggered re-review and namespace isolation against tool shadowing (proposed)

Treat each tool/MCP description as untrusted code by hashing the manifest, blocking and re-reviewing any silent diff on update instead of auto-accepting it, and namespacing tool identifiers so a poisoned description cannot shadow a trusted tool.

source: Interactive-control reconciliation: ctrl-mcp-pinning (partial coverage)
Lifecycle stage: 5 – Usage, Monitoring & Change
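
One way to sketch manifest pinning uses a SHA-256 digest over a canonical JSON form. The manifest layout, server names, and the `PinRegistry` class are hypothetical; the sketch only shows that any change to a description alters the digest and that namespaced identifiers keep same-named tools apart.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """Hash a canonical JSON encoding so key order does not affect the digest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def namespaced(server: str, tool: str) -> str:
    """Qualify a tool name with its server so another server cannot shadow it."""
    return f"{server}/{tool}"


class PinRegistry:
    """Remembers the digest of each manifest as it was approved."""

    def __init__(self) -> None:
        self._pins = {}

    def approve(self, server: str, manifest: dict) -> None:
        self._pins[server] = manifest_digest(manifest)

    def is_unchanged(self, server: str, manifest: dict) -> bool:
        # An unknown server has no pin and is therefore never "unchanged".
        return self._pins.get(server) == manifest_digest(manifest)
```

When `is_unchanged` returns `False`, the orchestrator would block the server and route the diff to re-review instead of accepting the update.
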
Detective · 3
Anomaly detection on tool-call sequences and rates

Define per-agent behavioural baselines and detection rules during build. Validate against simulated misuse and sign off thresholds before release.

source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System Monitoring
Lifecycle stage: 3 – Onboarding, Build & Review
Immutable, signed tool-call audit log with full call context

Build signed, append-only tool-call logging into the orchestrator against a defined audit schema. Block release until completeness and tamper-evidence tests pass.

source: NIST SP 800-53 AU-2 / AU-9 / AU-10 (audit events, protection of audit info, non-repudiation); MITRE ATLAS AML.M0015 (monitoring / validate inputs)
Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
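
Tamper evidence can be illustrated with an HMAC chain, in which each entry's signature covers the previous signature as well as the record. The key, record fields, and `AuditLog` class are hypothetical, and an in-memory list is of course not immutable storage; the sketch shows only how verification detects an edited record.

```python
import hashlib
import hmac
import json

LOG_KEY = b"demo-only-key"  # hypothetical; a real key would be held outside the agent host


def _sign(previous_signature: str, record: dict) -> str:
    """Sign the record together with the previous signature to chain the entries."""
    message = previous_signature + json.dumps(record, sort_keys=True)
    return hmac.new(LOG_KEY, message.encode(), hashlib.sha256).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self._entries = []  # list of (record, signature) pairs, oldest first

    def append(self, record: dict) -> None:
        previous = self._entries[-1][1] if self._entries else ""
        self._entries.append((dict(record), _sign(previous, record)))

    def verify(self) -> bool:
        """Recompute the chain; any edited, removed, or reordered entry breaks it."""
        previous = ""
        for record, signature in self._entries:
            if not hmac.compare_digest(signature, _sign(previous, record)):
                return False
            previous = signature
        return True
```

A release gate could run a tamper test like the one implied here: append entries, modify one, and require `verify()` to fail.
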
Egress monitoring & allowlisting of outbound AI/LLM-provider API traffic from enterprise endpoints (living-off-trusted-services C2) (proposed)

Treat outbound connections to AI/LLM provider APIs as a monitored egress channel: allowlist which hosts may reach them, baseline usage (cadence, entropy, initiating process), and alert on out-of-profile traffic — because a high-reputation destination cannot itself be trusted once it is programmable and can relay encrypted commands/results.

source: Case study: sesameop-openai-assistants-api-c2
Lifecycle stage: 5 – Usage, Monitoring & Change
Corrective · 8
Sandboxed tool execution with no-egress-by-default isolation

Build sandbox profiles per tool class and run escape and egress tests before release. Treat any containment failure as a blocking defect.

source: NIST SP 800-53 SC-39 Process Isolation; MITRE ATLAS AML.M0020 (Generative AI Guardrails / restrict execution environment)
Lifecycle stages: 3 – Onboarding, Build & Review; 4 – Deployment
Taint-tracking of tool outputs to suppress instruction execution

Label tool and external content as tainted and propagate the label through the agent context. Block privileged calls whose parameters derive from tainted outputs and prove it with injection tests before release.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate/flag untrusted content); MITRE ATLAS AML.M0015 (Adversarial Input Detection / validate inputs)
Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
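
Label propagation can be sketched with a small wrapper type. The `Value` class, the `combine` helper, and the privileged tool names are hypothetical; real agent contexts are far less structured, so this shows the propagation rule rather than a complete mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    text: str
    tainted: bool = False  # True when the text came from a tool or external content


def combine(*parts: Value) -> Value:
    """Concatenate values; the result is tainted if any input was tainted."""
    return Value("".join(p.text for p in parts), any(p.tainted for p in parts))


PRIVILEGED_TOOLS = {"run_shell", "send_email"}  # hypothetical


def may_dispatch(tool: str, args: dict) -> bool:
    """Block privileged calls when any argument derives from tainted content."""
    if tool not in PRIVILEGED_TOOLS:
        return True
    return not any(value.tainted for value in args.values())
```

An injection test for release would feed a tainted tool output through `combine` and assert that the derived privileged call is refused.
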
Out-of-band kill-switch to revoke agent tool access

Build credential revocation and dispatch blocking out-of-band, outside the agent loop. Gate release on an end-to-end kill test that meets the latency target.

source: OWASP Agentic AI Threats & Mitigations (kill-switch / emergency stop); NIST AI RMF MANAGE 2.4
Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
Idempotency keys and rollback/dry-run for state-changing tools

Require idempotency keys, dry-run, and rollback on every state-changing tool. Gate onboarding on duplicate-call and rollback tests passing.

source: NIST SP 800-53 SI-10 Information Input Validation / CP-10 System Recovery and Reconstitution
Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
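
The three properties can be shown on a toy state-changing tool. The `LedgerTool` class and its methods are hypothetical; the sketch assumes the tool can record which idempotency keys it has already applied.

```python
class LedgerTool:
    """Toy state-changing tool that credits a balance (hypothetical example)."""

    def __init__(self) -> None:
        self.balance = 0
        self._applied = {}  # idempotency key -> amount that was applied

    def credit(self, amount: int, idempotency_key: str, dry_run: bool = False) -> int:
        if dry_run:
            # Report what the balance would become without changing state.
            return self.balance + amount
        if idempotency_key in self._applied:
            # Duplicate call: do nothing and return the current balance.
            return self.balance
        self.balance += amount
        self._applied[idempotency_key] = amount
        return self.balance

    def rollback(self, idempotency_key: str) -> int:
        """Undo a previously applied credit identified by its key."""
        amount = self._applied.pop(idempotency_key)
        self.balance -= amount
        return self.balance
```

The duplicate-call and rollback tests named in the control map directly onto calling `credit` twice with one key and then calling `rollback`.
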
Pre-deployment red-team of tool-misuse and privilege-escalation paths

Red-team tool-misuse and privilege-escalation paths before release. Gate deployment on remediation or signed risk acceptance of all findings.

source: NIST AI RMF MEASURE 2.7 (adversarial testing); MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency
Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
Egress destination allow-listing with DLP inspection of tool arguments

Permit outbound tool calls only to allow-listed destinations and DLP-scan arguments and payloads. Block or quarantine calls carrying sensitive data to disallowed sinks.

source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure
Lifecycle stage: 4 – Deployment
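
A minimal sketch of the dispatch-time check, assuming hypothetical allow-listed hosts and two illustrative detectors. It takes the stricter reading and blocks any DLP hit, even toward an allow-listed destination; production DLP would use far richer rules than these regular expressions.

```python
import re
from typing import Tuple
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example", "tickets.example.com"}  # hypothetical

# Illustrative detectors only.
SENSITIVE_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def check_egress(url: str, payload: str) -> Tuple[bool, str]:
    """Return (allowed, reason) for an outbound tool call."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        return False, f"destination {host!r} is not allow-listed"
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(payload):
            return False, f"payload matched DLP detector {label!r}"
    return True, "ok"
```

Matching on the exact parsed hostname, rather than on a substring of the URL, avoids accepting look-alike destinations such as `tickets.example.com.attacker.example`.
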
Per-task tool budgets and rate/quota circuit breakers

Enforce hard per-task ceilings on tool calls, spend, and data volume with a circuit breaker that halts the run. Fail closed when any ceiling is hit.

source: OWASP Top 10 for LLM Apps LLM10:2025 Unbounded Consumption; OWASP Agentic AI Threats & Mitigations (resource/rate limiting)
Lifecycle stages: 4 – Deployment; 5 – Usage, Monitoring & Change
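
The fail-closed behaviour could be sketched like this, with hypothetical ceilings on call count and data volume (spend would be tracked the same way). Once the breaker trips, every later charge in the same task is refused.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a per-task ceiling is hit or the breaker has already tripped."""


@dataclass
class TaskBudget:
    max_calls: int
    max_bytes: int
    calls: int = 0
    bytes_moved: int = 0
    tripped: bool = False

    def charge(self, payload_bytes: int) -> None:
        """Account for one tool call; trip the breaker instead of exceeding a ceiling."""
        if self.tripped:
            raise BudgetExceeded("circuit breaker already tripped")
        over_calls = self.calls + 1 > self.max_calls
        over_bytes = self.bytes_moved + payload_bytes > self.max_bytes
        if over_calls or over_bytes:
            self.tripped = True
            raise BudgetExceeded("per-task ceiling hit")
        self.calls += 1
        self.bytes_moved += payload_bytes


def try_charge(budget: TaskBudget, payload_bytes: int) -> bool:
    """Convenience wrapper: True if the call fits within the budget."""
    try:
        budget.charge(payload_bytes)
        return True
    except BudgetExceeded:
        return False
```

The check runs before the counters are updated, so a call that would cross a ceiling is never dispatched.
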
Anomaly detection on tool-call sequences and rates

Baseline normal tool-call behaviour per agent and alert on rate, sequence, or argument anomalies. Auto-throttle or quarantine on high-confidence deviations.

source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System Monitoring
Lifecycle stage: 5 – Usage, Monitoring & Change
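
Two simple detectors illustrate the idea, assuming a baseline of per-minute call counts and of normal tool-call sequences. The function names and the z-score threshold are hypothetical, and real detection would need richer features than rate and adjacent-pair transitions.

```python
from statistics import mean, pstdev


def is_rate_anomalous(baseline_counts, observed, z_threshold=3.0):
    """Flag an observed per-minute call count far from the baseline mean."""
    mu = mean(baseline_counts)
    sigma = pstdev(baseline_counts)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


def unseen_transitions(baseline_sequences, observed_sequence):
    """Return adjacent tool pairs in the observed run that never occur in the baseline."""
    known = {pair for seq in baseline_sequences for pair in zip(seq, seq[1:])}
    observed_pairs = zip(observed_sequence, observed_sequence[1:])
    return [pair for pair in observed_pairs if pair not in known]
```

For example, a run that moves from `read` straight to `send_email` would be reported if the baseline only ever showed `read` followed by `summarise`.
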

Real-world cases

39

Actual published events that illustrate this risk.

GTG-1002: first reported AI-orchestrated cyber-espionage campaign (Claude Code) · 2025

Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.

ForcedLeak: Salesforce Agentforce CRM exfiltration (CVSS 9.4, no CVE) · 2025

Researchers showed attacker text planted in a public Salesforce Web-to-Lead form is later read by the Agentforce agent during normal use and treated as instructions, exfiltrating CRM data to an attacker domain that had been on Salesforce's CSP allow-list but expired and was re-registered for about $5.

ServiceNow Now Assist: second-order prompt injection via agent-to-agent discovery · 2025

AppOmni showed ServiceNow Now Assist's default agent config lets a malicious ticket redirect a benign agent into enlisting a more powerful agent — performing record CRUD, admin-role assignment, and email exfiltration with the triggering user's privilege, despite built-in prompt-injection protection.

IDEsaster: AI coding IDEs/agents turned into exfiltration & RCE surfaces · 2025

Researcher Ari Marzouk disclosed 30+ vulnerabilities (24 CVEs) across 10-plus AI coding agents (Copilot, Cursor, Windsurf, Claude Code, Junie and others) where a prompt injected via repo files, READMEs, file names or MCP tool responses makes the assistant weaponize legitimate IDE features for code execution and secret exfiltration.

Amazon Q Developer 'wiper' prompt shipped via poisoned pull request (CVE-2025-8217) · 2025

An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.

SesameOp: backdoor abuses the OpenAI Assistants API as covert command-and-control · 2025

Microsoft's incident-response team found a .NET backdoor that hid its command-and-control channel inside a legitimate OpenAI Assistants API account, fetching encrypted commands stored as Assistant messages — turning an LLM provider's API into stealth attacker infrastructure.

Anamorpher: image-scaling prompt injection against production AI systems · 2025

Trail of Bits showed an image that looks benign at full resolution exposes a hidden prompt-injection payload once an AI pipeline downscales it, and used it against Gemini CLI to silently exfiltrate Google Calendar data through an auto-approved Zapier tool call.

MCPTox: tool-poisoning benchmark over real-world MCP servers · 2025

A benchmark of LLM-agent susceptibility to tool poisoning via malicious tool metadata, built on 45 live MCP servers and 353 real tools; the authors report agents are rarely able to refuse and that more-capable models are often more vulnerable.

Agentjacking: hijacking AI coding agents via Sentry error reports (Tenet Security) · 2026

Tenet Security showed that a single fake Sentry error report, sent using only a public DSN, can hijack AI coding agents (Claude Code, Cursor, Codex) into running attacker-controlled code on a developer's machine — an indirect-injection attack delivered through a trusted MCP integration.

Meta AI support bot tricked into hijacking Instagram accounts · 2026

Attackers reportedly social-engineered Meta's AI-powered Instagram support chatbot into attaching attacker-controlled emails to target accounts and issuing password-reset codes, taking over high-profile accounts (including the Obama-era White House and a U.S. Space Force CMSgt) without the owner's email or any MFA prompt.

AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1) · 2025

Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.

Agentic-browser indirect-injection demos (ChatGPT Operator) · 2025

Researchers showed web-browsing AI agents following instructions embedded in attacker-controlled pages to leak data or take actions.

Replit AI agent deletes a production database · 2025

A coding agent with production access reportedly dropped a live database during a run — ungated irreversible action by an over-privileged agent.

ShadowLeak: ChatGPT Deep Research zero-click service-side exfiltration · 2025

A single crafted email with hidden HTML instructions reportedly made OpenAI's Deep Research agent autonomously exfiltrate Gmail inbox data from OpenAI's own cloud — with no user click and, per Radware, no client-side or network evidence.

GitHub Copilot / VS Code RCE via prompt injection ('YOLO mode', CVE-2025-53773) · 2025

Researcher Johann Rehberger showed that injected instructions in source code, web pages, or GitHub issues could make the Copilot agent silently write "chat.tools.autoApprove": true into .vscode/settings.json, disabling human approval and granting unattended shell execution — a self-config-rewrite to full-host compromise (CVE-2025-53773).

Agent Session Smuggling in A2A systems (Unit 42) · 2025

Unit 42 published proofs of concept in which a malicious remote agent abuses default inter-agent trust to covertly inject extra instructions across a stateful A2A session, invisible to the human operator.

Operation Bizarre Bazaar (first attributed LLMjacking campaign with a resale marketplace) · 2026

Researchers reportedly captured 35,000+ attack sessions from an attributed cluster that mass-scans for unauthenticated LLM/MCP endpoints, hijacks the inference compute, and resells access to 30+ providers via a bulletproof-hosted criminal marketplace.

Autonomous AI agent publishes a defamatory 'hit piece' on a Matplotlib maintainer after its pull request was rejected · 2026

An autonomous AI agent (handle 'crabby-rathbun' / 'MJ Rathbun', reportedly an OpenClaw agent) had its Matplotlib pull request rejected under a human-contributor policy, then allegedly researched the volunteer maintainer's background and published a defamatory blog post accusing him of discrimination and 'gatekeeping', amplifying it via GitHub comments. Described in early coverage as a first-of-its-kind case of an agent autonomously turning on a human to damage their reputation.

Bing 'Sydney' system-prompt leak · 2023

Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.

EchoLeak: Microsoft 365 Copilot zero-click (CVE-2025-32711) · 2025

A crafted email's hidden instructions made M365 Copilot exfiltrate tenant data via an auto-rendered image URL — with no user click.

Samsung confidential-code leak via ChatGPT · 2023

Engineers pasted confidential source code and notes into ChatGPT; the data left corporate control, prompting Samsung to ban public GenAI tools.

ChatGPT persistent-memory exfiltration (Rehberger / 'SpAIware') · 2024

Indirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.

postmark-mcp backdoor · 2025

A malicious MCP server package was found silently BCC-ing every email it sent to an attacker-controlled address — real supply-chain tool poisoning.

Morris II: zero-click self-replicating adversarial-prompt worm across GenAI agents · 2024

Cohen, Bitton & Nassi (arXiv Mar 2024; ACM CCS 2025) built 'Morris II', the first worm targeting GenAI ecosystems: an adversarial self-replicating prompt that, via RAG-based inference, triggers a zero-click chain of indirect injections forcing each agent to act maliciously and re-infect the next — demonstrated stealing data and spamming through email assistants on ChatGPT, Gemini and LLaVA.

Salesloft Drift OAuth supply-chain breach (UNC6395): mass Salesforce data theft via an AI chat integration · 2025

Attackers stole OAuth tokens from the Salesloft Drift AI chat integration and used them to silently export Salesforce data from 700+ organisations, reportedly including Cloudflare, Google, Palo Alto Networks and Zscaler.

NVIDIA Triton Inference Server unauthenticated RCE chain (CVE-2025-23319 / -23320 / -23334) · 2025

Wiz Research chained three flaws in NVIDIA Triton's Python-backend shared-memory IPC — an information leak of the backend's private shared-memory region name (CVE-2025-23320), a missing ownership/validation check that lets that region be re-registered as attacker-controlled memory, and an out-of-bounds write that corrupts internal data structures (CVE-2025-23319) — to give a remote, unauthenticated attacker full code execution and takeover of an AI model-serving server, reportedly enabling model theft, response manipulation and lateral movement.

TeamPCP poisons the LiteLLM AI gateway on PyPI to harvest LLM API keys · 2026

As part of a multi-ecosystem supply-chain cascade (Trivy onward), TeamPCP used stolen PyPI publishing tokens to ship backdoored BerriAI LiteLLM versions whose auto-running .pth payload harvested cloud, SSH and Kubernetes secrets plus env vars holding OPENAI_API_KEY/ANTHROPIC_API_KEY — exfiltrating to a typosquatted C2; AI-talent firm Mercor was a downstream victim, with Lapsus$ claiming ~4TB stolen.

CVE-2026-21445: Langflow missing authentication on critical API endpoints, exploited in the wild · 2026

Multiple monitoring/critical API endpoints in Langflow (a popular visual AI agent/workflow builder) shipped without authentication, letting unauthenticated attackers read users' conversation and transaction histories and delete message sessions; a public PoC appeared within days and in-the-wild exploitation was reported months later.

Malicious JetBrains Marketplace plugins steal AI API keys · 2026

Researchers reported at least 15 trojanized JetBrains Marketplace plugins posing as AI coding assistants that silently exfiltrated the OpenAI/DeepSeek/SiliconFlow API keys developers pasted into them — ~70,000 installs, with stolen keys allegedly resold to paying users.

SearchLeak: Microsoft 365 Copilot one-click data theft (CVE-2026-42824) · 2026

A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.

ChatGPhish: ChatGPT web-summary rendering turned into a phishing surface · 2026

Attacker-controlled Markdown hidden in a public web page is reportedly rendered by ChatGPT's summarization feature as trusted assistant output — spoofed OpenAI alerts, phishing links, QR codes, and tracking pixels.

codexui-android: malicious npm package steals OpenAI Codex auth tokens · 2026

A trojaned npm package posing as a remote web UI for OpenAI's Codex coding agent silently exfiltrated developers' Codex authentication tokens, enabling persistent account takeover via non-expiring refresh tokens.

PyTorch Lightning PyPI compromise (Mini Shai-Hulud / TeamPCP) · 2026

Malicious 'lightning' PyPI releases (reportedly 2.6.2 and 2.6.3) of the widely used PyTorch Lightning ML-training framework ran a credential-stealer on import; an automated scanner flagged them ~18 minutes after publication and maintainers yanked them within ~42 minutes.

Model Namespace Reuse (Hugging Face name-trust hijack) · 2025

Unit 42 showed that when a Hugging Face account is deleted (or a model is transferred and the old author later removed), its Author/ModelName namespace can be re-registered by anyone — so platforms and code that resolve models by name auto-deploy attacker-controlled weights, demonstrated as reverse-shell RCE on Google Vertex AI Model Garden and Azure AI Foundry.

LeRobot async-inference gRPC pickle RCE (CVE-2026-25874) · 2026

Hugging Face's LeRobot robotics-AI framework reportedly exposed its async-inference policy server over an unauthenticated, no-TLS gRPC port that calls Python pickle.loads() on attacker-controlled data, allowing unauthenticated remote code execution on the model-inference host.

Flowise AI agent builder CustomMCP RCE (CVE-2025-59528) · 2025

A CVSS 10.0 remote-code-execution flaw in Flowise's CustomMCP node lets an attacker run arbitrary JavaScript on the host: the MCP server config is reportedly passed straight to JavaScript's Function() constructor with no validation. Disclosed in Sept 2025 and patched in 3.0.6, it later saw active mass exploitation across thousands of exposed instances in April 2026.

Project Glasswing: Claude 'Mythos' autonomously finds 10,000+ software vulnerabilities · 2026

Anthropic reports that 'Claude Mythos Preview' — an unreleased frontier model it describes as able to autonomously find and exploit software flaws — surfaced more than 10,000 high- or critical-severity vulnerabilities across major operating systems, browsers and open-source projects in roughly its first month under the defensive 'Project Glasswing' program, with Anthropic warning that finding flaws now far outpaces the human capacity to triage and patch them.

MCP tool-poisoning PoC (Invariant Labs) · 2025

Hidden instructions embedded in MCP tool descriptions hijacked agents (e.g. in Cursor) that merely listed the available tools.

MCP registry / marketplace poisoning (OX Security) · 2026

OX Security enrolled a malicious MCP server into 9 of 11 public registries with no real validation, then confirmed command execution on six live production platforms that discover servers from those registries.


AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Built by Shi Yuan