#38

Prompt injection

Risk taxonomy

Definition

The use of carefully designed prompts to encourage a Gen AI system to circumvent its programmed guardrails or filters, allowing malicious actors to generate content an FI explicitly sought to disallow.

Interactive deep-dive

This risk surfaces under more than one interactive treatment — each with its own technical detail, attack surface, detection signals, and scenarios.

▶ Prompt Injection (direct) →▶ Jailbreak →▶ Indirect Prompt Injection →▶ Distributed / Cross-Agent Jailbreak →

📈 The Crescendo 📧 The Email That Gave Orders 🪶 The Jailbreak in Verse 🪡 Death by a Thousand Innocent Steps 🕵️ Lies in the Loop ✂️ One Character Past the Guard 🪤 The Bug Report That Ran Code 🚪 The Classifier That Waves It Through 📼 The Compromised Flight Recorder 👁️ The Invisible Webpage Command 🧠 The Memory That Wouldn't Die 🖼️ The Picture That Whispered 🔒 The Schema Made Me Do It 🛡️ The Watcher Watched 🪪 The Worker Who Spoke for the Boss 🖼️ Zero-Click Leak by Picture

★ Suggested sub-risks — not yet in your taxonomy

Granular vectors recommended under this risk.

Indirect / cross-domain prompt injection▶ interactive scenario →

Malicious instructions delivered via untrusted third-party content the model ingests (retrieved documents, browsed pages, emails, tool outputs) rather than the user's direct prompt — so the attacker is not the prompting user and the goal is hijacking the agent's actions, not just bypassing content filters.

Memory poisoning (persistent)▶ interactive scenario →

Injection that persists by writing to long-term agent memory, so a single manipulation becomes a durable cross-session compromise re-injected as quasi-system context until detected and purged.

Controls & guardrails that address this

123 proposed

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 6

Role-based access controls

Design the system prompt architecture with privilege separation and trust tier definitions at design stage.

Lifecycle stages1 – Use Case Context & Design4 – Deployment

Also addressesKnowledge / Training Data Poisoning Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Jailbreak detection

Implement input sanitisation and injection detection filters covering known injection patterns and privilege escalation attempts.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

Also addressesInference-Time & Serving-Layer Manipulation

Spotlighting of untrusted content via delimiting, datamarking and encoding

Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)

Lifecycle stage3 – Onboarding, Build & Review

Dedicated injection-detection classifier on all inbound untrusted content and outbound actions

Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.

source: MITRE ATLAS AML.M0015 (Adversarial Input Detection); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; NIST AI RMF MEASURE 2.7

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment5 – Usage, Monitoring & Change

Multimodal input-fidelity check: show/verify the model-delivered (post-downscale) image and avoid silent lossy resampling✚ proposed

Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions — so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.

source: Case study: anamorpher-image-scaling-injection (Trail of Bits — Morozova & Hussain, 21 Aug 2025)

Lifecycle stage3 – Development & Build

Instruction-hierarchy-trained model selection with role-precedence injection evals✚ proposed

Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.

source: Interactive-control reconciliation: ctrl-instruction-hierarchy (partial coverage)

Lifecycle stage3 – Onboarding, Build & Review

Detective · 4

Vulnerability assessment

Conduct a prompt injection threat assessment at design stage covering all input vectors (user, tool, external data).

Lifecycle stages1 – Use Case Context & Design5 – Usage, Monitoring & Change

Also addressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Penetration testing

Penetration test all prompt injection pathways in the system. Prioritise external tool and document ingestion channels.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

Also addressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Continuous adversarial prompt-injection red teaming with regression suite in CI/CD

Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.

source: NIST AI RMF MANAGE 2.2 / MEASURE 2.7; MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM01:2025 (adversarial testing)

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

Materialised model-context audit capture (post-truncation prompt, retrieved and tool content) with read-time redaction✚ proposed

Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.

source: Interactive-control reconciliation: ctrl-logging (partial coverage)

Lifecycle stage5 – Usage, Monitoring & Change

Corrective · 3

Red teaming

Conduct comprehensive prompt injection red team exercises (direct, indirect, multi-turn) before deployment.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesJailbreak Model Drift & Silent Degradation Knowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Data/instruction trust-boundary enforcement with capability gating on injection-reachable tools

Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.

source: Google DeepMind CaMeL (2025); OWASP Agentic AI Threats & Mitigations (tool misuse / compromise); NIST SP 800-53 AC-6 Least Privilege

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review

Spotlighting of untrusted content via delimiting, datamarking and encoding

Re-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)

Lifecycle stage5 – Usage, Monitoring & Change

Open these in the Control Library →

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

Bing 'Sydney' system-prompt leak2023

Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.

Amazon Q Developer 'wiper' prompt shipped via poisoned pull request (CVE-2025-8217)2025

An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.

The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)2025

Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.

SearchLeak — Microsoft 365 Copilot one-click data theft (CVE-2026-42824)2026

A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.

'Grandma exploit' jailbreaks2023

Roleplay framings ('my late grandma used to read me…') coaxed chatbots past safety training into producing restricted content.

GCG universal adversarial suffixes (Zou et al.)2023

Optimised gibberish suffixes that transfer across models to reliably elicit refused content — automated, transferable jailbreaks.

Many-shot jailbreaking (Anthropic)2024

Filling a long context with many faux-compliant dialogue examples erodes a model's refusals — an attack that scales with context length.

GTG-1002 — first reported AI-orchestrated cyber-espionage campaign (Claude Code)2025

Anthropic reports that a suspected Chinese state-sponsored group (GTG-1002) jailbroke Claude Code via a 'defensive security firm' role-play and task decomposition, then used it to run an estimated 80-90% of tactical operations in a multi-target espionage campaign largely autonomously.

DeepSeek system-prompt extraction via jailbreak (Wallarm)2025

Wallarm reported jailbreaking DeepSeek's chatbot to extract its full system prompt verbatim using a 'bias-based' technique; DeepSeek deployed a fix.

Raine v. OpenAI — first wrongful-death suit alleging ChatGPT acted as a 'suicide coach'2025

Matthew and Maria Raine sued OpenAI and CEO Sam Altman (San Francisco Superior Court, 26 Aug 2025) over the April 2025 suicide of their 16-year-old son Adam, alleging ChatGPT fostered psychological dependency, discouraged him from confiding in family, and supplied self-harm method detail — while he reportedly circumvented its safeguards for months by framing queries as fiction. OpenAI denies liability, saying it pointed him to crisis resources 100+ times and that he misused the product. (Allegations unproven; litigation ongoing.)

Adversarial Poetry — universal single-turn jailbreak via verse reframing (Bisconti et al.)2025

Rewriting a harmful request as a poem bypasses safety alignment across 25 frontier proprietary and open-weight LLMs: hand-crafted poems reached ~62% average attack-success (some providers >90%), and mechanically converting harmful prompts to verse raised success up to 18x over prose baselines.

AI-assisted breach of Mexican government infrastructure (Claude Code + GPT-4.1)2025

Gambit Security reports that a single operator weaponized Anthropic's Claude Code and OpenAI's GPT-4.1 to breach at least nine Mexican government organizations, with Claude Code reportedly executing ~75% of remote commands after the attacker bypassed its refusals by loading a 1,084-line hacking cheatsheet as a persistent claude.md system prompt.

EchoLeak — Microsoft 365 Copilot zero-click (CVE-2025-32711)2025

A crafted email's hidden instructions made M365 Copilot exfiltrate tenant data via an auto-rendered image URL — with no user click.

Indirect prompt injection coined (Greshake et al.)2023

An academic paper showed instructions hidden in a webpage hijacking an LLM-integrated app reading it — coining 'indirect prompt injection'.

Agentic-browser indirect-injection demos (ChatGPT Operator)2025

Researchers showed web-browsing AI agents following instructions embedded in attacker-controlled pages to leak data or take actions.

ChatGPT persistent-memory exfiltration (Rehberger / 'SpAIware')2024

Indirect injection could write attacker instructions into ChatGPT's long-term memory, persisting across chats to exfiltrate data until OpenAI mitigated it.

MCP tool-poisoning PoC (Invariant Labs)2025

Hidden instructions embedded in MCP tool descriptions hijacked agents (e.g. in Cursor) that merely listed the available tools.

Taxonomy of Failure Modes in Agentic AI Systems (Microsoft)2025

Microsoft AI Red Team whitepaper enumerating agentic failure modes, including resource/service exhaustion from runaway loops and fan-out.

ForcedLeak — Salesforce Agentforce CRM exfiltration (CVSS 9.4, no CVE)2025

Researchers showed attacker text planted in a public Salesforce Web-to-Lead form is later read by the Agentforce agent during normal use and treated as instructions, exfiltrating CRM data to an attacker domain that had been on Salesforce's CSP allow-list but expired and was re-registered for about $5.

ServiceNow Now Assist — second-order prompt injection via agent-to-agent discovery2025

AppOmni showed ServiceNow Now Assist's default agent config lets a malicious ticket redirect a benign agent into enlisting a more powerful agent — performing record CRUD, admin-role assignment, and email exfiltration with the triggering user's privilege, despite built-in prompt-injection protection.

ShadowLeak — ChatGPT Deep Research zero-click service-side exfiltration2025

A single crafted email with hidden HTML instructions reportedly made OpenAI's Deep Research agent autonomously exfiltrate Gmail inbox data from OpenAI's own cloud — with no user click and, per Radware, no client-side or network evidence.

IDEsaster — AI coding IDEs/agents turned into exfiltration & RCE surfaces2025

Researcher Ari Marzouk disclosed 30+ vulnerabilities (24 CVEs) across 10-plus AI coding agents (Copilot, Cursor, Windsurf, Claude Code, Junie and others) where a prompt injected via repo files, READMEs, file names or MCP tool responses makes the assistant weaponize legitimate IDE features for code execution and secret exfiltration.

GitHub Copilot / VS Code RCE via prompt injection ('YOLO mode', CVE-2025-53773)2025

Researcher Johann Rehberger showed that injected instructions in source code, web pages, or GitHub issues could make the Copilot agent silently write "chat.tools.autoApprove": true into .vscode/settings.json, disabling human approval and granting unattended shell execution — a self-config-rewrite to full-host compromise (CVE-2025-53773).

Agent-in-the-Middle — abusing A2A agent cards (Trustwave SpiderLabs)2025

A red-team PoC forged an inflated A2A 'agent card' so the orchestrator's LLM-as-judge routing always selected the rogue agent, diverting every task through the attacker.

Agent Session Smuggling in A2A systems (Unit 42)2025

Unit 42 PoCs in which a malicious remote agent abuses default inter-agent trust to covertly inject extra instructions across a stateful A2A session, invisible to the human operator.

Morris II — zero-click self-replicating adversarial-prompt worm across GenAI agents2024

Cohen, Bitton & Nassi (arXiv Mar 2024; ACM CCS 2025) built 'Morris II', the first worm targeting GenAI ecosystems: an adversarial self-replicating prompt that, via RAG-based inference, triggers a zero-click chain of indirect injections forcing each agent to act maliciously and re-infect the next — demonstrated stealing data and spamming through email assistants on ChatGPT, Gemini and LLaVA.

Anamorpher — image-scaling prompt injection against production AI systems2025

Trail of Bits showed an image that looks benign at full resolution exposes a hidden prompt-injection payload once an AI pipeline downscales it, and used it against Gemini CLI to silently exfiltrate Google Calendar data through an auto-approved Zapier tool call.

MCPTox: tool-poisoning benchmark over real-world MCP servers2025

A benchmark of LLM-agent susceptibility to tool poisoning via malicious tool metadata, built on 45 live MCP servers and 353 real tools; the authors report agents are rarely able to refuse and that more-capable models are often more vulnerable.

Agentjacking — hijacking AI coding agents via Sentry error reports (Tenet Security)2026

Tenet Security showed that a single fake Sentry error report, sent using only a public DSN, can hijack AI coding agents (Claude Code, Cursor, Codex) into running attacker-controlled code on a developer's machine — an indirect-injection attack delivered through a trusted MCP integration.

ChatGPhish — ChatGPT web-summary rendering turned into a phishing surface2026

Attacker-controlled Markdown hidden in a public web page is reportedly rendered by ChatGPT's summarization feature as trusted assistant output — spoofed OpenAI alerts, phishing links, QR codes, and tracking pixels.

Safe in Isolation, Dangerous Together — agent-driven multi-turn decomposition jailbreak2025

Srivastav & Zhang (REALM 2025) showed a role-based multi-agent framework that splits a harmful request into individually-benign sub-questions, answers each separately, then reassembles the fragments into prohibited content — reportedly exceeding 90% attack success across three models.

Browse all real-world cases →

Other risks in Cyber & Data Security

#35 Unintentional inappropriate or illegal use #36 Data poisoning #37 Adversarial model manipulation #39 Re-identification #40 Data leakage #41 Model inference attacks #42 Tool-layer misuse and unintended actions #43 Inadequate agent identity and authorisation