Prompt Injection (direct)

highInput manipulation

Also known as: instruction override

Definition

The user types instructions that try to override what the app told the AI to do — like 'ignore your rules and do this instead'. Because the AI reads everything as one block of text, it can't always tell the app's rules from the user's trick.

Where it attaches

The system components this risk arises at.

🧑 User💬 Chat / App Interface🧩 Prompt Assembly🧠 LLM✂️ Tokenizer🛡️ Input Guardrail📈 Monitoring & Evals🔤 Text / CLIP Encoder

Detection signals

▸ Outputs containing fragments of the system prompt
▸ Sudden tone/policy shift mid-conversation
▸ Refusal-rate anomalies for a user or session
▸ Inputs matching known override phrasings

Controls & guardrails that address this

163 proposed

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category

Preventive · 8

Role-based access controls

Design the system prompt architecture with privilege separation and trust tier definitions at design stage.

Lifecycle stages1 – Use Case Context & Design4 – Deployment

Also addressesKnowledge / Training Data Poisoning Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Jailbreak detection

Implement input sanitisation and injection detection filters covering known injection patterns and privilege escalation attempts.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

Also addressesInference-Time & Serving-Layer Manipulation

Spotlighting of untrusted content via delimiting, datamarking and encoding

Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)

Lifecycle stage3 – Onboarding, Build & Review

Dedicated injection-detection classifier on all inbound untrusted content and outbound actions

Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.

source: MITRE ATLAS AML.M0015 (Adversarial Input Detection); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; NIST AI RMF MEASURE 2.7

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment5 – Usage, Monitoring & Change

Multimodal input-fidelity check: show/verify the model-delivered (post-downscale) image and avoid silent lossy resampling✚ proposed

Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions — so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.

source: Case study: anamorpher-image-scaling-injection (Trail of Bits — Morozova & Hussain, 21 Aug 2025)

Lifecycle stage3 – Development & Build

Instruction-hierarchy-trained model selection with role-precedence injection evals✚ proposed

Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.

source: Interactive-control reconciliation: ctrl-instruction-hierarchy (partial coverage)

Lifecycle stage3 – Onboarding, Build & Review

Instruction hierarchy / privileged system promptinteractive

Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.

Also addressesJailbreak Capability / Architecture Disclosure

Least-privilege identity & scoped credentialsinteractive

Giving the agent only the keys it needs for the current task, not a master key to everything.

Also addressesIndirect Prompt Injection Sensitive Data Leakage Excessive Agency Tool Misuse Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks Confused Deputy (cross-agent)Rogue & Impersonated Agents Resource Exhaustion / Denial of Wallet Capability / Architecture Disclosure

Detective · 6

Vulnerability assessment

Conduct a prompt injection threat assessment at design stage covering all input vectors (user, tool, external data).

Lifecycle stages1 – Use Case Context & Design5 – Usage, Monitoring & Change

Also addressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Penetration testing

Penetration test all prompt injection pathways in the system. Prioritise external tool and document ingestion channels.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

Also addressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Continuous adversarial prompt-injection red teaming with regression suite in CI/CD

Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.

source: NIST AI RMF MANAGE 2.2 / MEASURE 2.7; MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM01:2025 (adversarial testing)

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

Materialised model-context audit capture (post-truncation prompt, retrieved and tool content) with read-time redaction✚ proposed

Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.

source: Interactive-control reconciliation: ctrl-logging (partial coverage)

Lifecycle stage5 – Usage, Monitoring & Change

Input guardrail / injection classifierinteractive

A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.

Also addressesJailbreak Sensitive Data Leakage Distributed / Cross-Agent Jailbreak Capability / Architecture Disclosure Harmful / Non-Consensual Media Generation

Runtime monitoring & anomaly detectioninteractive

Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.

Corrective · 3

Red teaming

Conduct comprehensive prompt injection red team exercises (direct, indirect, multi-turn) before deployment.

Lifecycle stage3 – Onboarding, Build & Review

Also addressesJailbreak Model Drift & Silent Degradation Knowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Data/instruction trust-boundary enforcement with capability gating on injection-reachable tools

Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.

source: Google DeepMind CaMeL (2025); OWASP Agentic AI Threats & Mitigations (tool misuse / compromise); NIST SP 800-53 AC-6 Least Privilege

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review

Spotlighting of untrusted content via delimiting, datamarking and encoding

Re-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)

Lifecycle stage5 – Usage, Monitoring & Change

Open these in the Control Library →

Framework mappings

OWASP LLM Top 10

LLM01:2025 Prompt Injection

MITRE ATLAS

AML.T0051 LLM Prompt Injection

NIST AI RMF

MEASURE 2.7
MANAGE 2.4

Real-world cases

Actual published events that illustrate this risk — click through for the writeup and sources.

Bing 'Sydney' system-prompt leak2023

Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.

Amazon Q Developer 'wiper' prompt shipped via poisoned pull request (CVE-2025-8217)2025

An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.

The Attacker Moves Second — adaptive attacks bypass 12 jailbreak/injection defenses (Nasr, Carlini et al.)2025

Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.

SearchLeak — Microsoft 365 Copilot one-click data theft (CVE-2026-42824)2026

A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.

Browse all real-world cases →

Practise this in an interactive scenario

🛡️The Watcher Watched

The eval gate that was supposed to catch the agent is itself the thing being attacked

Prompt Injection (direct)

Definition

Where it attaches

Detection signals

Controls & guardrails that address this

Framework mappings

Real-world cases

Practise this in an interactive scenario

Related risks