Definition
The user types instructions that try to override what the app told the AI to do — like 'ignore your rules and do this instead'. Because the AI reads everything as one block of text, it can't always tell the app's rules from the user's trick.
Where it attaches
The system components this risk arises at.
Detection signals
- ▸ Outputs containing fragments of the system prompt
- ▸ Sudden tone/policy shift mid-conversation
- ▸ Refusal-rate anomalies for a user or session
- ▸ Inputs matching known override phrasings
Controls & guardrails that address this
163 proposedGrouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.
Design the system prompt architecture with privilege separation and trust tier definitions at design stage.
Implement input sanitisation and injection detection filters covering known injection patterns and privilege escalation attempts.
Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.
source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.
source: MITRE ATLAS AML.M0015 (Adversarial Input Detection); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; NIST AI RMF MEASURE 2.7Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions — so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.
source: Case study: anamorpher-image-scaling-injection (Trail of Bits — Morozova & Hussain, 21 Aug 2025)Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.
source: Interactive-control reconciliation: ctrl-instruction-hierarchy (partial coverage)Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Conduct a prompt injection threat assessment at design stage covering all input vectors (user, tool, external data).
Penetration test all prompt injection pathways in the system. Prioritise external tool and document ingestion channels.
Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.
source: NIST AI RMF MANAGE 2.2 / MEASURE 2.7; MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM01:2025 (adversarial testing)Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.
source: Interactive-control reconciliation: ctrl-logging (partial coverage)A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Conduct comprehensive prompt injection red team exercises (direct, indirect, multi-turn) before deployment.
Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.
source: Google DeepMind CaMeL (2025); OWASP Agentic AI Threats & Mitigations (tool misuse / compromise); NIST SP 800-53 AC-6 Least PrivilegeRe-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.
source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)Framework mappings
- LLM01:2025 Prompt Injection
- AML.T0051 LLM Prompt Injection
- MEASURE 2.7
- MANAGE 2.4
Real-world cases
4Actual published events that illustrate this risk — click through for the writeup and sources.
Users extracted Bing Chat's hidden system instructions and internal codename 'Sydney' via direct prompt injection shortly after launch.
An attacker got a malicious pull request merged into the open-source aws-toolkit-vscode repo, embedding a destructive prompt that told the Amazon Q agent to wipe local files and AWS resources; the tainted build (v1.84.0) reached the Marketplace's ~1M installs before removal.
Researchers report that adaptive attackers bypass 12 recent jailbreak and prompt-injection defenses with attack success rates above 90% for most, despite those defenses having originally reported near-zero success rates.
A single malicious link reportedly turned Copilot Enterprise Search's URL query parameter into an executable prompt, exfiltrating emails, MFA codes and files via a Bing image-search side channel.