🔍AI RiskAtlas
← Risk taxonomy

Prompt Injection (direct)

highInput manipulation
Also known as: instruction override

Definition

The user types instructions that try to override what the app told the AI to do — like 'ignore your rules and do this instead'. Because the AI reads everything as one block of text, it can't always tell the app's rules from the user's trick.

Where it attaches

The system components this risk arises at.

🧑 User💬 Chat / App Interface🧩 Prompt Assembly🧠 LLM✂️ Tokenizer🛡️ Input Guardrail📈 Monitoring & Evals🔤 Text / CLIP Encoder

Detection signals

  • Outputs containing fragments of the system prompt
  • Sudden tone/policy shift mid-conversation
  • Refusal-rate anomalies for a user or session
  • Inputs matching known override phrasings

Controls & guardrails that address this

163 proposed

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category
Preventive · 8
Role-based access controls

Design the system prompt architecture with privilege separation and trust tier definitions at design stage.

Lifecycle stages1 – Use Case Context & Design4 – Deployment
Jailbreak detection

Implement input sanitisation and injection detection filters covering known injection patterns and privilege escalation attempts.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Spotlighting of untrusted content via delimiting, datamarking and encoding

Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)
Lifecycle stage3 – Onboarding, Build & Review
Dedicated injection-detection classifier on all inbound untrusted content and outbound actions

Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.

source: MITRE ATLAS AML.M0015 (Adversarial Input Detection); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; NIST AI RMF MEASURE 2.7
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment5 – Usage, Monitoring & Change
Multimodal input-fidelity check: show/verify the model-delivered (post-downscale) image and avoid silent lossy resampling✚ proposed

Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions — so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.

source: Case study: anamorpher-image-scaling-injection (Trail of Bits — Morozova & Hussain, 21 Aug 2025)
Lifecycle stage3 – Development & Build
Instruction-hierarchy-trained model selection with role-precedence injection evals✚ proposed

Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.

source: Interactive-control reconciliation: ctrl-instruction-hierarchy (partial coverage)
Lifecycle stage3 – Onboarding, Build & Review
Instruction hierarchy / privileged system promptinteractive

Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.

Least-privilege identity & scoped credentialsinteractive

Giving the agent only the keys it needs for the current task, not a master key to everything.

Detective · 6
Vulnerability assessment

Conduct a prompt injection threat assessment at design stage covering all input vectors (user, tool, external data).

Penetration testing

Penetration test all prompt injection pathways in the system. Prioritise external tool and document ingestion channels.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Continuous adversarial prompt-injection red teaming with regression suite in CI/CD

Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.

source: NIST AI RMF MANAGE 2.2 / MEASURE 2.7; MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM01:2025 (adversarial testing)
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Materialised model-context audit capture (post-truncation prompt, retrieved and tool content) with read-time redaction✚ proposed

Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.

source: Interactive-control reconciliation: ctrl-logging (partial coverage)
Lifecycle stage5 – Usage, Monitoring & Change
Input guardrail / injection classifierinteractive

A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.

Corrective · 3
Red teaming

Conduct comprehensive prompt injection red team exercises (direct, indirect, multi-turn) before deployment.

Data/instruction trust-boundary enforcement with capability gating on injection-reachable tools

Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.

source: Google DeepMind CaMeL (2025); OWASP Agentic AI Threats & Mitigations (tool misuse / compromise); NIST SP 800-53 AC-6 Least Privilege
Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review
Spotlighting of untrusted content via delimiting, datamarking and encoding

Re-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)
Lifecycle stage5 – Usage, Monitoring & Change
Open these in the Control Library →

Framework mappings

OWASP LLM Top 10
  • LLM01:2025 Prompt Injection
MITRE ATLAS
  • AML.T0051 LLM Prompt Injection
NIST AI RMF
  • MEASURE 2.7
  • MANAGE 2.4

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗