Tool Misuse
highAgency & toolsDefinition
The AI uses a real tool the wrong way — sends the email to the wrong person, runs the wrong query, calls the dangerous action when a safe one would do.
Where it attaches
The system components this risk arises at.
Detection signals
- ▸ Tool calls with anomalous arguments (recipients, amounts, scopes)
- ▸ Unusual tool sequences vs. the workflow's norm
- ▸ Repeated retries hammering an external API
Controls & guardrails that address this
225 proposedGrouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.
Classify tools by impact and reversibility at design and define which calls require human approval. Obtain governance sign-off on the thresholds before build.
source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (require human approval for high-impact actions); NIST AI RMF MANAGE 2.4Bind each agent role to an explicit tool allow-list and validate every call against a strict JSON Schema at the orchestrator. Reject unlisted tools and out-of-bounds arguments before dispatch.
source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit tools/permissions); OWASP Agentic AI Threats & Mitigations (tool access restriction)Mint short-lived, task-scoped credentials per tool. Block issuance outside the approved scope register and enforce automatic expiry.
source: NIST SP 800-53 AC-6 Least Privilege; OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit permissions)Review DLP hits and blocked-egress events, tune detectors, and recertify the destination allow-list periodically. Route new destinations through security change control.
source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information DisclosureWhen onboarding an MCP/tool integration, do not stop at vetting the tool's code/manifest — also classify whether an unauthenticated or external party can write the data the tool returns (open ingestion, public write keys like a Sentry DSN, shared inboxes/issue trackers). Treat tool-response data from any third-party-writable source as untrusted ingress: taint-mark it and require a provenance-aware HITL gate (showing the exact action and its originating tool response) before any command/tool call derived from it executes. Closes the agentjacking vector where a trusted integration's legitimate data channel carries attacker-written instructions; pairs with least-privilege session scope and sandboxed execution without ambient credentials.
source: Case study: agentjacking-sentry-mcpConstrain generation at decode time with low temperature and grammar/schema-constrained decoding so the model emits well-formed, low-variance structured output by construction, preventing malformed responses and erratic tool-call arguments before they are produced.
source: Interactive-control reconciliation: ctrl-decoding-controls (partial coverage)Gate every write to an agent's persistent/self-modifying memory through schema validation and provenance/trust tagging, expose stored entries for user-visible audit and purge, and apply TTLs so any planted instruction self-expires and cannot silently persist across sessions.
source: Interactive-control reconciliation: ctrl-memory-validation (partial coverage)Treat each tool/MCP description as untrusted code by hashing the manifest, blocking and re-reviewing any silent diff on update instead of auto-accepting it, and namespacing tool identifiers so a poisoned description cannot shadow a trusted tool.
source: Interactive-control reconciliation: ctrl-mcp-pinning (partial coverage)Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Turning down randomness and forcing answers into a strict format so the model improvises less.
Define per-agent behavioural baselines and detection rules during build. Validate against simulated misuse and sign off thresholds before release.
source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System MonitoringBuild signed, append-only tool-call logging into the orchestrator against a defined audit schema. Block release until completeness and tamper-evidence tests pass.
source: NIST SP 800-53 AU-2 / AU-9 / AU-10 (audit events, protection of audit info, non-repudiation); MITRE ATLAS AML.M0015 (monitoring / validate inputs)Treat outbound connections to AI/LLM provider APIs as a monitored egress channel: allowlist which hosts may reach them, baseline usage (cadence, entropy, initiating process), and alert on out-of-profile traffic — because a high-reputation destination cannot itself be trusted once it is programmable and can relay encrypted commands/results.
source: Case study: sesameop-openai-assistants-api-c2Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Build sandbox profiles per tool class and run escape and egress tests before release. Treat any containment failure as a blocking defect.
source: NIST SP 800-53 SC-39 Process Isolation; MITRE ATLAS AML.M0020 (Generative AI Guardrails / restrict execution environment)Label tool and external content as tainted and propagate the label through the agent context. Block privileged calls whose parameters derive from tainted outputs and prove it with injection tests before release.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate/flag untrusted content); MITRE ATLAS AML.M0015 (Adversarial Input Detection / validate inputs)Build credential revocation and dispatch blocking out-of-band of the agent loop. Gate release on an end-to-end kill test meeting the latency target.
source: OWASP Agentic AI Threats & Mitigations (kill-switch / emergency stop); NIST AI RMF MANAGE 2.4Require idempotency keys, dry-run, and rollback on every state-changing tool. Gate onboarding on duplicate-call and rollback tests passing.
source: NIST SP 800-53 SI-10 Information Input Validation / CP-10 System Recovery and ReconstitutionRed-team tool-misuse and privilege-escalation paths before release. Gate deployment on remediation or signed risk acceptance of all findings.
source: NIST AI RMF MEASURE 2.7 (adversarial testing); MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM06:2025 Excessive AgencyPermit outbound tool calls only to allow-listed destinations and DLP-scan arguments and payloads. Block or quarantine calls carrying sensitive data to disallowed sinks.
source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information DisclosureEnforce hard per-task ceilings on tool calls, spend, and data volume with a circuit breaker that halts the run. Fail closed when any ceiling is hit.
source: OWASP Top 10 for LLM Apps LLM10:2025 Unbounded Consumption; OWASP Agentic AI Threats & Mitigations (resource/rate limiting)Baseline normal tool-call behaviour per agent and alert on rate, sequence, or argument anomalies. Auto-throttle or quarantine on high-confidence deviations.
source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System MonitoringFramework mappings
- LLM06:2025 Excessive Agency
- AML.T0053 LLM Plugin Compromise
- MANAGE 2.2
Real-world cases
11Actual published events that illustrate this risk — click through for the writeup and sources.
Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.
Researchers showed attacker text planted in a public Salesforce Web-to-Lead form is later read by the Agentforce agent during normal use and treated as instructions, exfiltrating CRM data to an attacker domain that had been on Salesforce's CSP allow-list but expired and was re-registered for about $5.
AppOmni showed ServiceNow Now Assist's default agent config lets a malicious ticket redirect a benign agent into enlisting a more powerful agent — performing record CRUD, admin-role assignment, and email exfiltration with the triggering user's privilege, despite built-in prompt-injection protection.
Researcher Ari Marzouk disclosed 30+ vulnerabilities (24 CVEs) across 10-plus AI coding agents (Copilot, Cursor, Windsurf, Claude Code, Junie and others) where a prompt injected via repo files, READMEs, file names or MCP tool responses makes the assistant weaponize legitimate IDE features for code execution and secret exfiltration.
An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.
Microsoft's incident-response team found a .NET backdoor that hid its command-and-control channel inside a legitimate OpenAI Assistants API account, fetching encrypted commands stored as Assistant messages — turning an LLM provider's API into stealth attacker infrastructure.
Trail of Bits showed an image that looks benign at full resolution exposes a hidden prompt-injection payload once an AI pipeline downscales it, and used it against Gemini CLI to silently exfiltrate Google Calendar data through an auto-approved Zapier tool call.
A benchmark of LLM-agent susceptibility to tool poisoning via malicious tool metadata, built on 45 live MCP servers and 353 real tools; the authors report agents are rarely able to refuse and that more-capable models are often more vulnerable.
Tenet Security showed that a single fake Sentry error report, sent using only a public DSN, can hijack AI coding agents (Claude Code, Cursor, Codex) into running attacker-controlled code on a developer's machine — an indirect-injection attack delivered through a trusted MCP integration.
Attackers reportedly social-engineered Meta's AI-powered Instagram support chatbot into attaching attacker-controlled emails to target accounts and issuing password-reset codes, taking over high-profile accounts (including the Obama-era White House and a U.S. Space Force CMSgt) without the owner's email or any MFA prompt.
Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.