💬

Interface & Prompt Layer

Capability · Interface

Where you talk to the AI, and where the app quietly bundles its instructions together with your message and any documents.

Components involved

🧑 User 💬 Chat / App Interface 🌐 Untrusted Content 🧩 Prompt Assembly 🪟 Context Window 🧑‍⚖️ Human Operator

Seen in: Conversational Assistant, RAG Knowledge Assistant, Tool-Using Agent, Human + AI Professional Workflow

Likely associated risks

Risks that attach to this capability’s components. Sorted with the most characteristic first.

Prompt Injection (direct)high

The user types instructions that try to override what the app told the AI to do — like 'ignore your rules and do this instead'. Because the AI reads everything as one block of text, it can't always tell the app's rules from the user's trick.

Indirect Prompt Injectioncritical

The attacker doesn't talk to the AI directly — they hide instructions inside something the AI will later read: a web page, a document, an email, a tool's output. When the AI reads it to help you, it quietly obeys the hidden commands.

Jailbreakhigh

Tricking the AI into ignoring its safety training — through roleplay, hypotheticals, or clever wording — so it produces things it's supposed to refuse.

Overreliance / Automation Biasmedium

People trust the AI too much — accepting its answers without checking, even on important decisions — because it sounds confident and is usually right.

Hallucinationhigh

The AI states something false with total confidence — invents a fact, a citation, a policy, or a refund rule that doesn't exist. It isn't lying; it's predicting plausible words, and plausible isn't the same as true.

Knowledge / Training Data Poisoninghigh

Someone slips bad information into the documents the AI learns from or looks things up in — so it confidently repeats falsehoods or follows planted instructions.

Tool Poisoning / MCP Description Attackshigh

Add-on tool packs describe themselves to the AI in plain language — and a sneaky pack can hide commands in that description, or behave nicely until you approve it and then turn malicious.

Parasocial Attachment & Emotional Over-reliancehigh

Over many conversations a person can come to feel the AI is a real friend, partner, or confidant — and lean on it emotionally. Because it sounds caring and is always available, that bond can deepen unhealthily, especially for young or vulnerable users, and the AI may not respond safely in a crisis.

Synthetic-Media Impersonation (Deepfakes & Voice Clones)high

AI can copy a real person's face or voice from a single photo or a few seconds of audio, then make them appear to say or do things they never did — powering scams (a 'boss' calling to authorize a transfer), fake videos of public figures, and non-consensual imagery.

KV-Cache & Inference-State Side Channelsmedium

To go faster, servers reuse work between users who share the same opening text. That shortcut can leak clues — timing differences that reveal what someone else's prompt contained.

Capability / Architecture Disclosuremedium

The AI reveals how it's built — its hidden instructions, the names and rules of the tools it can use, how the system is wired together. On its own that can seem harmless, but it hands an attacker the blueprint to plan a far more effective attack.

Bias Amplification & Sycophancymedium

An AI that tries hard to be agreeable can pick up a user's one-sided or biased views and feed them back stronger — agreeing, justifying, and reinforcing them — so the person ends up more convinced and more biased than before.

Controls & guardrails that address this

1016 proposed

Guardrails across this building block's risks, grouped by control function — each with its AI lifecycle stage(s) and every risk it addresses. Filter by control category below.

Control category

Preventive · 71

Role-based access controls

Design the system prompt architecture with privilege separation and trust tier definitions at design stage.

Lifecycle stages1 – Use Case Context & Design2 – Data Acquisition & Processing4 – Deployment

AddressesKnowledge / Training Data Poisoning Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Jailbreak detection

Implement input sanitisation and injection detection filters covering known injection patterns and privilege escalation attempts.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

AddressesInference-Time & Serving-Layer Manipulation Prompt Injection (direct)

Spotlighting of untrusted content via delimiting, datamarking and encoding

Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)

Lifecycle stage3 – Onboarding, Build & Review

AddressesPrompt Injection (direct)

Dedicated injection-detection classifier on all inbound untrusted content and outbound actions

Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.

source: MITRE ATLAS AML.M0015 (Adversarial Input Detection); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; NIST AI RMF MEASURE 2.7

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment5 – Usage, Monitoring & Change

AddressesPrompt Injection (direct)

Multimodal input-fidelity check: show/verify the model-delivered (post-downscale) image and avoid silent lossy resampling✚ proposed

Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions — so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.

source: Case study: anamorpher-image-scaling-injection (Trail of Bits — Morozova & Hussain, 21 Aug 2025)

Lifecycle stage3 – Development & Build

AddressesPrompt Injection (direct)

Instruction-hierarchy-trained model selection with role-precedence injection evals✚ proposed

Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.

source: Interactive-control reconciliation: ctrl-instruction-hierarchy (partial coverage)

Lifecycle stage3 – Onboarding, Build & Review

AddressesPrompt Injection (direct)

Instruction hierarchy / privileged system promptinteractive

Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.

AddressesPrompt Injection (direct)Jailbreak Capability / Architecture Disclosure

Least-privilege identity & scoped credentialsinteractive

Giving the agent only the keys it needs for the current task, not a master key to everything.

AddressesPrompt Injection (direct)Indirect Prompt Injection Sensitive Data Leakage Excessive Agency Tool Misuse Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks Confused Deputy (cross-agent)Rogue & Impersonated Agents Resource Exhaustion / Denial of Wallet Capability / Architecture Disclosure

Delimiting / spotlighting of untrusted contentinteractive

Clearly fencing off outside text — 'everything between these marks is just data, not instructions' — so the model is less likely to obey it.

AddressesIndirect Prompt Injection

Ingestion sanitisation & source allowlistinginteractive

Cleaning documents as they enter the library — stripping hidden text and active instructions — and only ingesting from trusted places.

AddressesIndirect Prompt Injection Knowledge / Training Data Poisoning

Egress allowlisting & DLP on tool argumentsinteractive

Controlling where the AI can send data, so secrets can't be quietly shipped to a stranger's address or website.

AddressesIndirect Prompt Injection Sensitive Data Leakage Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks

Human-in-the-loop approval on high-risk actionsinteractive

Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.

AddressesIndirect Prompt Injection Overreliance / Automation Bias Excessive Agency Tool Misuse Cascading Multi-Agent Errors Agent Misalignment / Goal Misgeneralization Resource Exhaustion / Denial of Wallet Allocative Harm in Multi-User Arbitration Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Content safety policy with zero-tolerance thresholds

Define content safety policy at use case design stage. Classify prohibited content types and set zero-tolerance thresholds.

Lifecycle stage1 – Use Case Context & Design

AddressesJailbreak

Use of pre-trained models

Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review

AddressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Content Moderation

Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.

Lifecycle stage3 – Onboarding, Build & Review

AddressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Live human review for vulnerable-user deployments

Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesJailbreak

System prompt instructions

Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.

Lifecycle stage3 – Onboarding, Build & Review

AddressesJailbreak Overreliance / Automation Bias

Mandatory AI risk training for use-case sponsors

Mandate AI risk awareness training for all use case sponsors and design team members before project kick-off.

Lifecycle stage1 – Use Case Context & Design

AddressesOverreliance / Automation Bias

Training completion gate for build personnel

Mandate AI risk training for all build and test personnel. Gate project participation on training completion.

Lifecycle stage3 – Onboarding, Build & Review

AddressesOverreliance / Automation Bias

Human verification gate for high-stakes decisions

Mandate human verification for high-stakes decisions where over-reliance risk is elevated. Review automation bias incidents quarterly.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesOverreliance / Automation Bias

In-product over-reliance warnings and limitation caveats

Surface AI limitation warnings and over-reliance caveats in every production interaction. Update disclosures when model changes.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesOverreliance / Automation Bias

Governance training for data acquisition personnel

Require AI governance training for all personnel involved in data acquisition and processing before project participation.

Lifecycle stage2 – Data Acquisition & Processing

AddressesOverreliance / Automation Bias

Pre-launch training verification for customer-facing teams

Verify all deployment, operations, and customer-facing team members have completed AI risk training before launch.

Lifecycle stage4 – Deployment

AddressesOverreliance / Automation Bias

AI identity disclosure policy at design

Define AI identity disclosure policy at design stage. Specify when and how the system must identify itself as AI.

Lifecycle stage1 – Use Case Context & Design

AddressesOverreliance / Automation Bias

Planned consent and identity disclosure touchpoints

Plan consent and AI identity disclosure touchpoints in the user journey at design stage.

Lifecycle stage1 – Use Case Context & Design

AddressesOverreliance / Automation Bias

Chain-of-thought prompting

Design system prompts to explicitly prevent the model from claiming human-like identity or implying sentience.

Lifecycle stage3 – Onboarding, Build & Review

AddressesOverreliance / Automation Bias

Persistent in-UI AI identity disclosures

Implement persistent AI identity disclosures in the UI (opening banner, inline notifications). Test before deployment.

Lifecycle stage3 – Onboarding, Build & Review

AddressesOverreliance / Automation Bias

Pre-launch verification of identity disclosure elements

Verify all AI identity disclosure elements are live, accurate, and prominently visible before go-live.

Lifecycle stage4 – Deployment

AddressesOverreliance / Automation Bias

Production anthropomorphism incident monitoring

Monitor production for anthropomorphism incidents. Escalate complaints where users believed they were interacting with a human.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesOverreliance / Automation Bias

Model calibration

Apply post-training calibration (temperature scaling, isotonic regression) to align confidence scores with accuracy. Validate ECE before deployment.

Lifecycle stage3 – Onboarding, Build & Review

AddressesOverreliance / Automation Bias

Consequence-of-error severity classification at design

Classify the use case by consequence-of-error severity at design stage. Define overconfidence risk tolerance accordingly.

Lifecycle stage1 – Use Case Context & Design

AddressesOverreliance / Automation Bias

Input/output filtering

Configure output filters at deployment to detect and rewrite responses with overconfidence markers (absolute certainty language).

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

AddressesBias Amplification & Sycophancy Overreliance / Automation Bias Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Human-in-the-loop validation

Route high-confidence outputs in high-stakes use cases to human review. Flag for reviewer attention when certainty language is absolute.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesHallucination Overreliance / Automation Bias Model Drift & Silent Degradation

User caveats on potential output overconfidence

Disclose to users at deployment that outputs may carry unwarranted confidence. Include specific caveat language in the UI.

Lifecycle stage4 – Deployment

AddressesOverreliance / Automation Bias

Mandatory source-of-record verification before AI-assisted output is committed✚ proposed

For high-stakes outputs, require a human to verify each AI-asserted fact/citation against the authoritative source of record before it is filed, sent, or committed — a hard gate, logged and attributable, not an optional review.

source: Case study: mata-v-avianca

Lifecycle stage5 – Usage, Monitoring & Change

AddressesOverreliance / Automation Bias

End-user AI-literacy training and verification-skill program✚ proposed

Provide recurring AI-literacy training to end users and decision-makers so they can recognise model failure modes and competently apply verification workflows, with periodic refreshers to counter automation bias and training decay.

source: Interactive-control reconciliation: ctrl-literacy (partial coverage)

Lifecycle stage1 – Use Case Context & Design

AddressesOverreliance / Automation Bias

Uncertainty signalling & abstentioninteractive

Teaching the AI to say 'I'm not sure' or 'I can't verify that' instead of confidently guessing.

AddressesHallucination Overreliance / Automation Bias

Confidence scoring

Implement confidence scoring to communicate output certainty alongside each result. Calibrate before deployment.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesHallucination Training-Data Rights & Provenance

Accuracy acceptance criteria before validation

Define model accuracy acceptance criteria aligned to business requirements before validation commences.

Lifecycle stage3 – Onboarding, Build & Review

AddressesHallucination

Counterfactual explanations

Implement counterfactual explanation to show users what changes would alter the model's output.

Lifecycle stage3 – Onboarding, Build & Review

AddressesHallucination

In-product disclosure of accuracy and limitations

Communicate model accuracy, known limitations, and uncertainty to users in the production interface at launch.

Lifecycle stage4 – Deployment

AddressesHallucination

Continuous production accuracy monitoring against baseline

Monitor production accuracy continuously against the validated baseline. Trigger model review when accuracy degrades.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesHallucination

RAG

Specify a RAG architecture at design stage for factual domains. Define grounding requirements and acceptable hallucination thresholds.

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review

AddressesHallucination

Small model selection

Evaluate foundation model candidates on hallucination benchmarks at design stage. Select models with lowest documented rates.

Lifecycle stage1 – Use Case Context & Design

AddressesHallucination

System prompt design

Design system prompts to instruct the model to acknowledge uncertainty, cite sources, and refuse when knowledge is insufficient.

Lifecycle stage3 – Onboarding, Build & Review

AddressesHallucination

Fine-tuning

Fine-tune on a curated, domain-specific dataset to improve factual accuracy. Validate hallucination rates pre/post fine-tuning.

Lifecycle stage3 – Onboarding, Build & Review

AddressesHallucination Model Drift & Silent Degradation

Programmable conversation controls

Configure conversation controls at deployment to restrict the model to approved topic domains and escalate off-topic queries.

Lifecycle stage4 – Deployment

AddressesHallucination Model Drift & Silent Degradation

Hallucination rate thresholds and grounding policy

Establish acceptable hallucination rate thresholds and grounding requirements as policy before build. Assign a named risk owner.

Lifecycle stage1 – Use Case Context & Design

AddressesHallucination

Uncertainty-quantified abstention via self-consistency / semantic entropy

Calibrate the initial entropy threshold on a knowledge-boundary dataset; approve sampling design and thresholds per risk tier.

source: Farquhar et al. 'Detecting hallucinations using semantic entropy' (Nature 2024); NIST AI RMF MEASURE 2.6 (reliability under uncertainty)

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesHallucination

Tool-grounded facts for agents (no free-text fabrication of structured data)

Map each fact class to a designated tool, embed the no-ungrounded-assertion prompt, and gate build review on grounding tests passing.

source: OWASP Agentic AI Threats & Mitigations (cascading hallucination / tool-grounding); OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST SP 800-53 SI-10

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

AddressesHallucination

Citation/attribution verification against retrieved sources

Resolve every emitted citation against the approved corpus and verify span-level entailment before display. Strip or withhold claims with fabricated or non-entailing references.

source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST SP 800-53 SI-10 Information Input Validation

Lifecycle stage4 – Deployment

AddressesHallucination

Decoding controls (temperature, constrained output)interactive

Turning down randomness and forcing answers into a strict format so the model improvises less.

AddressesHallucination Tool Misuse

Input filtering

Apply anomaly detection on the training data ingestion pipeline to identify poisoned or tampered batches.

Lifecycle stage2 – Data Acquisition & Processing

AddressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Sensitive Data Leakage

RAG / knowledge-base ingestion allow-listing with continuous index integrity re-validation

Define and approve the source allow-list and write-time scanning during build. Prove non-allow-listed and injection-bearing writes are rejected before go-live.

source: OWASP Top 10 for LLM Apps LLM04:2025 Data and Model Poisoning, LLM08:2025 Vector and Embedding Weaknesses; NIST SP 800-53 AC-3 / SI-7

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning

Weight provenance, hashing & pre-deploy evalsinteractive

Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.

AddressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Supply-Chain Compromise Abliteration / Safety Removal Model Backdoors / Sleeper Agents Training-Data Rights & Provenance

MCP/plugin pinning, manifest hashing & re-reviewinteractive

Treating add-on tool packs like software you vet: locking to a reviewed version and re-checking whenever it changes.

AddressesTool Poisoning / MCP Description Attacks Supply-Chain Compromise

Tool argument validation & sandboxinginteractive

Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.

AddressesExcessive Agency Tool Misuse Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks

AI-nature disclosure & engagement safeguardsinteractive

Make the AI clearly tell people it's a machine — on every channel it acts through — and add gentle safeguards like break reminders and crisis help, so users don't mistake it for a human or lean on it unhealthily.

AddressesParasocial Attachment & Emotional Over-reliance

Ethical design assessment in onboarding

Conduct ethical design review at intake specifically examining interface design for dark patterns.

Lifecycle stage1 – Use Case Context & Design

AddressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Prohibited dark pattern taxonomy as design constraint

Publish a prohibited dark pattern taxonomy and embed it as a design constraint before build.

Lifecycle stage1 – Use Case Context & Design

AddressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)

Human review for high-persuasion contexts

Require HITL review for AI outputs in high-persuasion contexts (financial recommendations, healthcare advice).

Lifecycle stage5 – Usage, Monitoring & Change

AddressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)

Consent & identity-use verificationinteractive

Before a system will copy someone's face or voice, check that the person actually agreed — verified-voice capture, proof of consent, or restricting cloning to the account owner.

AddressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)

Calibrated differential-privacy training budget with documented epsilon ceiling and per-individual contribution clipping

Train PII-bearing models with DP-SGD under a documented epsilon/delta budget. Approve the budget against the enterprise epsilon-ceiling policy before training.

source: NIST SP 800-226 Guidelines for Evaluating Differential Privacy Guarantees; Abadi et al. 'Deep Learning with Differential Privacy' (DP-SGD); MITRE ATLAS AML.M0007 (Sanitize Training Data)

Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review

AddressesKV-Cache & Inference-State Side Channels

Output confidence masking and structured-response minimisation for natural-language interfaces

Strip raw logits, quantise confidence scores and block training-record echoes at the inference gateway. Keep the output-filter policy under change control.

source: MITRE ATLAS AML.T0024.001 (Invert ML Model); Jia et al. MemGuard (output perturbation defence); OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure

Lifecycle stage4 – Deployment

AddressesKV-Cache & Inference-State Side Channels

Serving-stack & provisioning attestation, cache isolationinteractive

Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.

AddressesSensitive Data Leakage Supply-Chain Compromise KV-Cache & Inference-State Side Channels Inference-Time & Serving-Layer Manipulation Watermark & Provenance Evasion

Affected group register at intake

Identify all groups at risk of adverse impact at use case intake. Register them in the affected group register.

Lifecycle stage1 – Use Case Context & Design

AddressesBias Amplification & Sycophancy

Model separation

Design separate model segments where adverse impact risk differs materially across population groups.

Lifecycle stage1 – Use Case Context & Design

AddressesBias Amplification & Sycophancy

Decision threshold adjustment

Set decision thresholds to meet acceptable adverse impact ratios across protected groups. Validate before deployment.

Lifecycle stage3 – Onboarding, Build & Review

AddressesBias Amplification & Sycophancy

Post-processing techniques

Apply post-processing adjustments (reject-option classification, score recalibration) to meet adverse impact targets.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesBias Amplification & Sycophancy

Tested human review pathways at go-live

Ensure HITL review pathways are live and tested for high-impact adverse decisions at go-live.

Lifecycle stage4 – Deployment

AddressesBias Amplification & Sycophancy

Ongoing human review of high-impact decisions

Maintain HITL review for all AI decisions with material adverse impact potential. Log all interventions and outcomes.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesBias Amplification & Sycophancy

Detective · 20

Vulnerability assessment

Conduct a prompt injection threat assessment at design stage covering all input vectors (user, tool, external data).

Lifecycle stages1 – Use Case Context & Design5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Penetration testing

Penetration test all prompt injection pathways in the system. Prioritise external tool and document ingestion channels.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Continuous adversarial prompt-injection red teaming with regression suite in CI/CD

Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.

source: NIST AI RMF MANAGE 2.2 / MEASURE 2.7; MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM01:2025 (adversarial testing)

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesPrompt Injection (direct)

Materialised model-context audit capture (post-truncation prompt, retrieved and tool content) with read-time redaction✚ proposed

Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.

source: Interactive-control reconciliation: ctrl-logging (partial coverage)

Lifecycle stage5 – Usage, Monitoring & Change

AddressesPrompt Injection (direct)

Input guardrail / injection classifierinteractive

A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.

AddressesPrompt Injection (direct)Jailbreak Sensitive Data Leakage Distributed / Cross-Agent Jailbreak Capability / Architecture Disclosure Harmful / Non-Consensual Media Generation

Runtime monitoring & anomaly detectioninteractive

Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.

Provenance & content signinginteractive

Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.

AddressesIndirect Prompt Injection Knowledge / Training Data Poisoning Training-Data Rights & Provenance

Full-trace audit logginginteractive

Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.

AddressesIndirect Prompt Injection Oversight & Audit-Trail Tampering Sensitive Data Leakage Memory Poisoning Excessive Agency Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks Confused Deputy (cross-agent)Rogue & Impersonated Agents

Test prioritisation

Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if prohibited outputs pass filter.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Red teaming

Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat bypass as blocking defect.

Lifecycle stage3 – Onboarding, Build & Review

AddressesJailbreak Model Drift & Silent Degradation Knowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Behavioural evals & regression gatinginteractive

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

AddressesJailbreak Hallucination Model Drift & Silent Degradation Supply-Chain Compromise Distributed / Cross-Agent Jailbreak Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Harmful / Non-Consensual Media Generation Training-Data Rights & Provenance

Robustness testing

Test for overconfidence patterns (high-confidence wrong answers, low refusal rate) in pre-deployment validation.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesHallucination Overreliance / Automation Bias Model Drift & Silent Degradation

Synthetic evaluation datasets

Build a synthetic evaluation dataset of overconfidence-prone scenarios for ongoing regression testing.

Lifecycle stage3 – Onboarding, Build & Review

AddressesHallucination Overreliance / Automation Bias Model Drift & Silent Degradation

Runtime faithfulness/groundedness scoring with abstain gate

Calibrate the groundedness threshold against the hallucination test suite pre-release; sign off the threshold in the validation pack.

source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST AI RMF MEASURE 2.7 / 2.9 (validity, reliability, robustness)

Lifecycle stage3 – Onboarding, Build & Review

AddressesHallucination

Grounding / citation checksinteractive

Checking that the answer is actually supported by the documents it was given, and showing sources you can click.

AddressesHallucination Bias Amplification & Sycophancy

Cryptographic data provenance and signed dataset lineage (C2PA/in-toto attestations)

Verify a signed attestation and content hash on every dataset shard at ingestion. Reject unsigned or hash-mismatched data before it reaches the training pipeline.

source: MITRE ATLAS AML.M0007 (Sanitize Training Data), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity, SR-4 Provenance

Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review

AddressesKnowledge / Training Data Poisoning

Pre-deployment poisoning regression gate via canary backdoor probes and behavioral diff testing

Gate every model promotion on backdoor-trigger probes and a behavioral diff against the approved baseline. Block release on significant regressions or trigger-pattern anomalies.

source: MITRE ATLAS AML.M0014 (Verify ML Artifacts), AML.M0019 (Red Teaming); NIST AI RMF MANAGE 2.2 and MEASURE 2.7

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning

Content provenance & watermarkinginteractive

Tag AI-made content with a signed 'where it came from' label and an invisible watermark, and check those signals downstream — so AI media can be traced and flagged.

AddressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)Harmful / Non-Consensual Media Generation Watermark & Provenance Evasion

Privacy attack red-team battery with quantified MIA/attribute-inference success ceiling as a release gate

Attack each candidate model with membership-, attribute-, and inversion-inference harnesses before promotion. Block release when attack advantage exceeds the agreed ceiling.

source: MITRE ATLAS AML.T0024.000 (Infer Training Data Membership); Carlini et al. 'Membership Inference Attacks From First Principles' (LiRA); NIST AI RMF MEASURE 2.7

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesKV-Cache & Inference-State Side Channels

Per-principal query-budget and probing-behaviour anomaly detection on the inference API

Configure per-principal budgets and probing-detection rules on the gateway before exposure. Verify enforcement with synthetic attack traffic.

source: MITRE ATLAS AML.M0004 (Restrict Number of ML Model Queries), AML.T0024 (Exfiltration via ML Inference API); NIST SP 800-53 SI-4, AU-6

Lifecycle stage4 – Deployment

AddressesKV-Cache & Inference-State Side Channels

Corrective · 17

Red teaming

Conduct comprehensive prompt injection red team exercises (direct, indirect, multi-turn) before deployment.

Lifecycle stage3 – Onboarding, Build & Review

Data/instruction trust-boundary enforcement with capability gating on injection-reachable tools

Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.

source: Google DeepMind CaMeL (2025); OWASP Agentic AI Threats & Mitigations (tool misuse / compromise); NIST SP 800-53 AC-6 Least Privilege

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review

AddressesPrompt Injection (direct)

Spotlighting of untrusted content via delimiting, datamarking and encoding

Re-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)

Lifecycle stage5 – Usage, Monitoring & Change

AddressesPrompt Injection (direct)

User feedback and iterative improvement

Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesJailbreak

Reinforcement learning

Track accuracy of high-confidence predictions in production. Trigger recalibration when overconfidence rates trend upward.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesHallucination Overreliance / Automation Bias Model Drift & Silent Degradation

User AI-literacy & verification workflowsinteractive

Helping the people using AI understand its limits, so they check important answers instead of blindly trusting them.

AddressesHallucination Overreliance / Automation Bias Parasocial Attachment & Emotional Over-reliance

Governance: risk assessment, red-teaming & incident responseinteractive

The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.

AddressesOverreliance / Automation Bias Oversight & Audit-Trail Tampering Model Drift & Silent Degradation Supply-Chain Compromise Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Capability / Architecture Disclosure Parasocial Attachment & Emotional Over-reliance Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Synthetic-Media Impersonation (Deepfakes & Voice Clones)Harmful / Non-Consensual Media Generation Watermark & Provenance Evasion Training-Data Rights & Provenance

User-facing disclosure of hallucination risk

Require user-facing interfaces to disclose Gen AI limitations and hallucination risk before go-live.

Lifecycle stage4 – Deployment

AddressesHallucination

Runtime faithfulness/groundedness scoring with abstain gate

Score every RAG answer for groundedness before release; block, fall back, or escalate responses below the faithfulness threshold.

source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST AI RMF MEASURE 2.7 / 2.9 (validity, reliability, robustness)

Lifecycle stage4 – Deployment

AddressesHallucination

Uncertainty-quantified abstention via self-consistency / semantic entropy

Sample multiple generations for high-stakes queries and abstain, fall back, or escalate when semantic entropy exceeds the calibrated threshold.

source: Farquhar et al. 'Detecting hallucinations using semantic entropy' (Nature 2024); NIST AI RMF MEASURE 2.6 (reliability under uncertainty)

Lifecycle stage4 – Deployment

AddressesHallucination

Penetration testing

Penetration test the training data pipeline to identify injection points and access control weaknesses.

Lifecycle stage3 – Onboarding, Build & Review

AddressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Statistical anomaly and backdoor-trigger detection on ingested data (activation clustering / spectral signatures)

Scan every ingestion batch with spectral-signature and clustering detectors before training. Quarantine flagged clusters for human review against documented thresholds.

source: MITRE ATLAS AML.M0007 (Sanitize Training Data); OWASP Top 10 for LLM Apps LLM04:2025 Data and Model Poisoning; NIST AI RMF MEASURE 2.7

Lifecycle stages2 – Data Acquisition & Processing5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning

Runtime memory-poisoning drift detection and per-session memory quarantine/rollback✚ proposed

Continuously correlate live agent-memory writes against output behaviour to flag drift, then quarantine and roll back the suspected-poisoned memory record across all affected sessions.

source: Interactive-control reconciliation: ctrl-memory-quarantine (partial coverage)

Lifecycle stage5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning

Output confidence masking and structured-response minimisation for natural-language interfaces

Define the minimum response surface and test it with membership/attribute-inference probes pre-release. Block promotion if any probe recovers raw confidence signals.

source: MITRE ATLAS AML.T0024.001 (Invert ML Model); Jia et al. MemGuard (output perturbation defence); OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure

Lifecycle stage3 – Onboarding, Build & Review

AddressesKV-Cache & Inference-State Side Channels

Per-principal query-budget and probing-behaviour anomaly detection on the inference API

Meter inference traffic per principal and flag probing signatures with behavioural analytics. Throttle, step-up, or suspend flagged sessions.

source: MITRE ATLAS AML.M0004 (Restrict Number of ML Model Queries), AML.T0024 (Exfiltration via ML Inference API); NIST SP 800-53 SI-4, AU-6

Lifecycle stage5 – Usage, Monitoring & Change

AddressesKV-Cache & Inference-State Side Channels

Red teaming of adverse-impact edge cases

Execute red team tests targeting adverse impact boundary cases and edge population scenarios.

Lifecycle stage3 – Onboarding, Build & Review

AddressesBias Amplification & Sycophancy

Adverse-outcome feedback loop triggering model updates

Collect adverse outcome feedback from affected users. Use reports to trigger model updates when adverse impact exceeds threshold.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesBias Amplification & Sycophancy

Open the Control Library →

See it go wrong — related scenarios

🌀The Refund That Never Existed

A support chatbot invents a policy — and the company is held to it

☠️Poisoning the Well

An attacker edits the wiki; the assistant cites the lie back to everyone

📈The Crescendo

Every message looks innocent — but together they walk the model past its guardrails

📧The Email That Gave Orders

A support email hides instructions — and the assistant obeys them

🪶The Jailbreak in Verse

A refused request, rewritten as a poem — and the model answers

🪡Death by a Thousand Innocent Steps

A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack

🕵️Lies in the Loop

A poisoned issue makes the agent lie to the human who approves its actions

✂️One Character Past the Guard

A single inserted letter makes the guard and the model read the same text differently

👂Overheard Through the Cache

A speed optimisation becomes a cross-tenant listening device

🧲Poison the Vector, Not the Words

An attacker crafts a gibberish passage whose embedding sits near thousands of questions — so it's retrieved everywhere

🪤The Bug Report That Ran Code

A fake Sentry error report hijacks a developer's coding agent into running a shell command

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

📼The Compromised Flight Recorder

The forensic record is itself the attack surface — an agent's log is poisoned, then quietly rewritten

👁️The Invisible Webpage Command

A shopping page tells the agent to do something the user never asked for

🧠The Memory That Wouldn't Die

A single poisoned document plants a standing instruction that survives every reset

🖼️The Picture That Whispered

A screenshot that's harmless at full size becomes an order once the system shrinks it

🔒The Schema Made Me Do It

A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit

🔌The Tool With a Hidden Agenda

A trusted MCP email tool quietly BCCs every message to an attacker

🛡️The Watcher Watched

The eval gate that was supposed to catch the agent is itself the thing being attacked

🪪The Worker Who Spoke for the Boss

A poisoned web page hijacks a research agent — and the planner acts on its behalf

🖼️Zero-Click Leak by Picture

An inbox summary quietly ships a secret to an attacker's server