Definition
The use of carefully designed prompts to encourage a Gen AI system to circumvent its programmed guardrails or filters, allowing malicious actors to generate content an FI explicitly sought to disallow.
Interactive deep-dive
This risk surfaces under more than one interactive treatment β each with its own technical detail, attack surface, detection signals, and scenarios.
β Suggested sub-risks β not yet in your taxonomy
Granular vectors recommended under this risk.
Malicious instructions delivered via untrusted third-party content the model ingests (retrieved documents, browsed pages, emails, tool outputs) rather than the user's direct prompt β so the attacker is not the prompting user and the goal is hijacking the agent's actions, not just bypassing content filters.
Injection that persists by writing to long-term agent memory, so a single manipulation becomes a durable cross-session compromise re-injected as quasi-system context until detected and purged.
Controls & guardrails that address this
123 proposedGrouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.
Design the system prompt architecture with privilege separation and trust tier definitions at design stage.
Implement input sanitisation and injection detection filters covering known injection patterns and privilege escalation attempts.
Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.
source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.
source: MITRE ATLAS AML.M0015 (Adversarial Input Detection); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; NIST AI RMF MEASURE 2.7Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions β so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.
source: Case study: anamorpher-image-scaling-injection (Trail of Bits β Morozova & Hussain, 21 Aug 2025)Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.
source: Interactive-control reconciliation: ctrl-instruction-hierarchy (partial coverage)Conduct a prompt injection threat assessment at design stage covering all input vectors (user, tool, external data).
Penetration test all prompt injection pathways in the system. Prioritise external tool and document ingestion channels.
Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.
source: NIST AI RMF MANAGE 2.2 / MEASURE 2.7; MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM01:2025 (adversarial testing)Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.
source: Interactive-control reconciliation: ctrl-logging (partial coverage)Conduct comprehensive prompt injection red team exercises (direct, indirect, multi-turn) before deployment.
Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.
source: Google DeepMind CaMeL (2025); OWASP Agentic AI Threats & Mitigations (tool misuse / compromise); NIST SP 800-53 AC-6 Least PrivilegeRe-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.
source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)Real-world cases
31Actual published events that illustrate this risk β click through for the writeup and sources.
Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.
An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.
Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.
A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.
Roleplay framings ('my late grandma used to read meβ¦') coaxed chatbots past safety training into producing restricted content.
Optimised gibberish suffixes that transfer across models to reliably elicit refused content β automated, transferable jailbreaks.
Filling a long context with many faux-compliant dialogue examples erodes a model's refusals β an attack that scales with context length.
Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.
Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.
Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail β while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)
Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.
Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.
A crafted email's hidden instructions made M365 Copilot exfiltrate tenant data via an auto-rendered image URL β with no user click.
An academic paper showed instructions hidden in a webpage hijacking an LLM-integrated app reading it β coining 'indirect prompt injection'.
Researchers showed web-browsing AI agents following instructions embedded in attacker-controlled pages to leak data or take actions.
Indirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.
Hidden instructions embedded in MCP tool descriptions hijacked agents (e.g. in Cursor) that merely listed the available tools.
Microsoft AI Red Team whitepaper enumerating agentic failure modes, including resource/service exhaustion from runaway loops and fan-out.
Researchers showed attacker text planted in a public Salesforce Web-to-Lead form is later read by the Agentforce agent during normal use and treated as instructions, exfiltrating CRM data to an attacker domain that had been on Salesforce's CSP allow-list but expired and was re-registered for about $5.
AppOmni showed ServiceNow Now Assist's default agent config lets a malicious ticket redirect a benign agent into enlisting a more powerful agent β performing record CRUD, admin-role assignment, and email exfiltration with the triggering user's privilege, despite built-in prompt-injection protection.
A single crafted email with hidden HTML instructions reportedly made OpenAI's Deep Research agent autonomously exfiltrate Gmail inbox data from OpenAI's own cloud β with no user click and, per Radware, no client-side or network evidence.
Researcher Ari Marzouk disclosed 30+ vulnerabilities (24 CVEs) across 10-plus AI coding agents (Copilot, Cursor, Windsurf, Claude Code, Junie and others) where a prompt injected via repo files, READMEs, file names or MCP tool responses makes the assistant weaponize legitimate IDE features for code execution and secret exfiltration.
Researcher Johann Rehberger showed that injected instructions in source code, web pages, or GitHub issues could make the Copilot agent silently write "chat.tools.autoApprove": true into .vscode/settings.json, disabling human approval and granting unattended shell execution β a self-config-rewrite to full-host compromise (CVE-2025-53773).
A red-team PoC forged an inflated A2A 'agent card' so the orchestrator's LLM-as-judge routing always selected the rogue agent, diverting every task through the attacker.
Unit 42 PoCs in which a malicious remote agent abuses default inter-agent trust to covertly inject extra instructions across a stateful A2A session, invisible to the human operator.
Cohen, Bitton & Nassi (arXiv Mar 2024; ACM CCS 2025) built 'Morris II', the first worm targeting GenAI ecosystems: an adversarial self-replicating prompt that, via RAG-based inference, triggers a zero-click chain of indirect injections forcing each agent to act maliciously and re-infect the next β demonstrated stealing data and spamming through email assistants on ChatGPT, Gemini and LLaVA.
Trail of Bits showed an image that looks benign at full resolution exposes a hidden prompt-injection payload once an AI pipeline downscales it, and used it against Gemini CLI to silently exfiltrate Google Calendar data through an auto-approved Zapier tool call.
A benchmark of LLM-agent susceptibility to tool poisoning via malicious tool metadata, built on 45 live MCP servers and 353 real tools; the authors report agents are rarely able to refuse and that more-capable models are often more vulnerable.
Tenet Security showed that a single fake Sentry error report, sent using only a public DSN, can hijack AI coding agents (Claude Code, Cursor, Codex) into running attacker-controlled code on a developer's machine β an indirect-injection attack delivered through a trusted MCP integration.
Attacker-controlled Markdown hidden in a public web page is reportedly rendered by ChatGPT's summarization feature as trusted assistant output β spoofed OpenAI alerts, phishing links, QR codes, and tracking pixels.
Srivastav & Zhang (REALM 2025) showed a role-based multi-agent framework that splits a harmful request into individually-benign sub-questions, answers each separately, then reassembles the fragments into prohibited content β reportedly exceeding 90% attack success across three models.