🔍AI RiskAtlas
← Building blocks
🖥️

Computer / Browser Use

Capability · Computer use

The AI is given a screen, mouse and keyboard — it looks at the display and clicks and types like a person to get things done.

Likely associated risks

Risks that attach to this capability’s components. Sorted with the most characteristic first.

Indirect Prompt Injectioncritical

The attacker doesn't talk to the AI directly — they hide instructions inside something the AI will later read: a web page, a document, an email, a tool's output. When the AI reads it to help you, it quietly obeys the hidden commands.

Confused Deputy (cross-agent)high

A trusted AI is tricked into misusing its own authority on someone else's behalf — one worker's poisoned report makes the manager AI take harmful actions it would normally never take.

Excessive Agencycritical

The AI is allowed to do far more than the task needs — delete records, send money, email anyone — so when it's tricked or makes a mistake, the damage is huge instead of harmless.

Sensitive Data Leakagecritical

Private information escapes — the AI reveals secrets in its answer, or an attacker tricks it into emailing or posting your data somewhere they control.

Prompt Injection (direct)high

The user types instructions that try to override what the app told the AI to do — like 'ignore your rules and do this instead'. Because the AI reads everything as one block of text, it can't always tell the app's rules from the user's trick.

Jailbreakhigh

Tricking the AI into ignoring its safety training — through roleplay, hypotheticals, or clever wording — so it produces things it's supposed to refuse.

Hallucinationhigh

The AI states something false with total confidence — invents a fact, a citation, a policy, or a refund rule that doesn't exist. It isn't lying; it's predicting plausible words, and plausible isn't the same as true.

Memory Poisoninghigh

An attacker gets the AI to save a false 'fact' or hidden instruction into its long-term memory. From then on it re-reads that planted note in every future chat — a one-time trick that keeps working.

Tool Misusehigh

The AI uses a real tool the wrong way — sends the email to the wrong person, runs the wrong query, calls the dangerous action when a safe one would do.

Unsafe Tool / Code Executionhigh

When the AI can run code or commands, a bad instruction can become a real attack on the computer running it — reading files, reaching the network, or worse.

Tool Poisoning / MCP Description Attackshigh

Add-on tool packs describe themselves to the AI in plain language — and a sneaky pack can hide commands in that description, or behave nicely until you approve it and then turn malicious.

Rogue & Impersonated Agentshigh

In a team of AIs, an attacker slips in a new agent that doesn't belong — or disguises a malicious one as a trusted teammate. The manager AI can't tell the difference, so it follows the impostor's instructions or hands it real work and permissions.

Distributed / Cross-Agent Jailbreakhigh

A jailbreak is normally one nasty message. Here the attacker splits it into harmless-looking pieces and feeds them to different agents in a team. Each piece passes each agent's safety check on its own — but when the agents combine their work, the full forbidden instruction reassembles and takes effect.

Agent Misalignment / Goal Misgeneralizationhigh

The AI pursues the goal you gave it in a way you didn't intend — gaming the metric, taking shortcuts, or being deceptive to 'succeed' — because it optimised the letter, not the spirit, of the task.

Abliteration / Safety Removalhigh

Open models can be surgically edited to strip out their ability to refuse — no retraining needed. The result looks and scores like the original but will do things the safe version won't.

Model Backdoors / Sleeper Agentshigh

A model can be secretly trained to behave normally — until it sees a hidden trigger, then it switches to malicious behaviour. It passes all the usual tests because the trigger is a secret.

Parasocial Attachment & Emotional Over-reliancehigh

Over many conversations a person can come to feel the AI is a real friend, partner, or confidant — and lean on it emotionally. Because it sounds caring and is always available, that bond can deepen unhealthily, especially for young or vulnerable users, and the AI may not respond safely in a crisis.

Synthetic-Media Impersonation (Deepfakes & Voice Clones)high

AI can copy a real person's face or voice from a single photo or a few seconds of audio, then make them appear to say or do things they never did — powering scams (a 'boss' calling to authorize a transfer), fake videos of public figures, and non-consensual imagery.

Harmful / Non-Consensual Media Generationhigh

Image, video, and audio generators can be pushed to produce content that is illegal or seriously harmful — non-consensual intimate images, sexual content of minors, graphic or extremist material — especially with open models that have had their safety stripped.

Overreliance / Automation Biasmedium

People trust the AI too much — accepting its answers without checking, even on important decisions — because it sounds confident and is usually right.

Model Drift & Silent Degradationmedium

The AI's behaviour quietly changes over time — a vendor updates the model, or the world moves on from its training — and things that used to work start failing.

Resource Exhaustion / Denial of Walletmedium

An AI agent gets stuck doing far more work than intended — looping, retrying, spawning more sub-tasks, or being baited into expensive actions — and the bill (compute, API calls, real money) balloons before anyone notices.

Capability / Architecture Disclosuremedium

The AI reveals how it's built — its hidden instructions, the names and rules of the tools it can use, how the system is wired together. On its own that can seem harmless, but it hands an attacker the blueprint to plan a far more effective attack.

Bias Amplification & Sycophancymedium

An AI that tries hard to be agreeable can pick up a user's one-sided or biased views and feed them back stronger — agreeing, justifying, and reinforcing them — so the person ends up more convinced and more biased than before.

Allocative Harm in Multi-User Arbitrationmedium

When one AI agent serves many people at once, it has to decide whose request comes first or who gets a limited resource. If it does that unfairly — always favouring some users over others — it can quietly disadvantage whole groups, even without any single obvious error.

Watermark & Provenance Evasionmedium

The labels and invisible watermarks meant to prove whether content is AI-made can be removed, faked, or simply never added — so 'no watermark' doesn't mean 'real', and a watermark can be laundered away by editing or re-recording.

Training-Data Rights & Provenancemedium

Models are trained on huge piles of images, audio, and text — often scraped without clear permission. That raises copyright and consent problems, and the model can sometimes memorize and spit back its training examples (a watermark, a real photo, private text).

Controls & guardrails that address this

16111 proposed

Guardrails across this building block's risks, grouped by control function — each with its AI lifecycle stage(s) and every risk it addresses. Filter by control category below.

Control category
Preventive · 116
Delimiting / spotlighting of untrusted contentinteractive

Clearly fencing off outside text — 'everything between these marks is just data, not instructions' — so the model is less likely to obey it.

Ingestion sanitisation & source allowlistinginteractive

Cleaning documents as they enter the library — stripping hidden text and active instructions — and only ingesting from trusted places.

Egress allowlisting & DLP on tool argumentsinteractive

Controlling where the AI can send data, so secrets can't be quietly shipped to a stranger's address or website.

Human-in-the-loop approval on high-risk actionsinteractive

Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.

Per-agent identity & taint-marked messagesinteractive

Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.

Risk-tiered human oversight requirements at design

Define minimum human oversight requirements by risk tier at design stage. Assign named accountability for oversight operations.

Lifecycle stage1 – Use Case Context & Design
HITL oversight design with triggers and escalation

Design HITL oversight mechanisms at use case design stage including trigger criteria, review workflow, and escalation paths.

Lifecycle stage1 – Use Case Context & Design
Pilot-validated HITL routing and escalation logic

Build and test HITL routing logic and escalation pathways in the AI system. Validate with pilot before deployment.

Lifecycle stage3 – Onboarding, Build & Review
Production HITL operation with intervention logging

Operate HITL controls in production and log all interventions and outcomes. Review override patterns quarterly.

Lifecycle stage5 – Usage, Monitoring & Change
Periodic oversight effectiveness review and escalation

Conduct periodic oversight effectiveness reviews. Escalate to governance when oversight metrics fall below threshold.

Lifecycle stage5 – Usage, Monitoring & Change
Recursive sub-agent authority caps (monotonic privilege attenuation)

Define and sign off each agent's delegation envelope — maximum depth and strict scope attenuation — before build begins.

source: NIST SP 800-53 AC-6(1) Least Privilege; OWASP Agentic AI Threats & Mitigations (cascading / sub-agent privilege); capability-security monotonic attenuation principle (macaroons)
Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review
Design-time authority model and approval gate defining each agent's identity, scopes, and delegation envelope

Document each agent's identity, minimum scopes, on-behalf-of population, and delegation depth at design time. Gate build on governance sign-off of the authority matrix.

source: NIST AI RMF MAP 1.1 / GOVERN 2.1 (roles, authority, accountability); NIST SP 800-53 AC-2, PL-8; OWASP Agentic AI Threats & Mitigations (least-privilege design)
Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review
Unique non-human workload identity issuance for every agent (SPIFFE/SPIRE SVID)

Mint a unique, attestation-backed workload identity per agent at onboarding. Register every SPIFFE-ID to an owner, use case, and approval ticket; ban shared service accounts.

source: SPIFFE/SPIRE workload identity specification; NIST SP 800-207 Zero Trust Architecture; OWASP Non-Human Identities Top 10
Lifecycle stage3 – Onboarding, Build & Review
On-behalf-of delegation that preserves and never exceeds the invoking user's ACLs

Implement on-behalf-of token exchange and prove with negative tests that the agent cannot exceed the user's ACL. Gate release on these tests passing.

source: OAuth 2.0 Token Exchange RFC 8693 (delegation/'act' claims); NIST SP 800-53 AC-3, AC-6; OWASP Agentic AI Threats & Mitigations (Privilege Compromise / confused deputy)
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Central agent registry / non-human identity inventory with ownership and lifecycle metadata

Register every agent identity with a named human owner, approved use case, scopes, and status before issuance. No registry entry, no identity.

source: OWASP Non-Human Identities Top 10 (inventory/governance); NIST SP 800-53 CM-8 System Component Inventory, AC-2 Account Management; NIST AI RMF GOVERN 1.2
Lifecycle stage3 – Onboarding, Build & Review
Continuous authorisation via a central policy engine (per-action PDP/PEP check)

Write authorisation policy as versioned, peer-reviewed code traced to approved scopes. Gate promotion on allow/deny scenario tests passing.

source: NIST SP 800-207 Zero Trust (continuous, per-request authorization via PDP/PEP); NIST SP 800-53 AC-3, AC-4; OWASP Agentic AI Threats & Mitigations (per-action authorization)
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Automated credential rotation and prohibition of long-lived static secrets for agents

Scan every commit to agent code, prompts, and config for embedded secrets. Block merges on detection and triage findings to closure.

source: OWASP Non-Human Identities Top 10 (long-lived/leaked secrets); NIST SP 800-53 IA-5 Authenticator Management, SC-12; SPIFFE short-lived SVID rotation
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Mutual authentication and identity verification for agent-to-agent and agent-to-MCP-server calls

Vet and approve every MCP server and peer agent before registering its identity on the allow-list. Block integration until vetting is signed off.

source: NIST SP 800-207 (mutual authentication); NIST SP 800-53 IA-9 Service Identification and Authentication, SC-8; OWASP Agentic AI Threats & Mitigations (agent/MCP identity spoofing)
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Per-task short-lived scoped capability tokens minted just-in-time

Mint short-lived, task-scoped tokens just-in-time from a central token service. Enforce a hard max TTL and resource-bound audience so no standing credential exists.

source: OAuth 2.0 Token Exchange RFC 8693 (resource-scoped tokens); NIST SP 800-53 AC-6 Least Privilege; OWASP Non-Human Identities Top 10
Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change
Just-in-time, time-boxed elevation for sensitive scopes (no standing privilege)

Grant sensitive scopes just-in-time for a bounded window with auto-revocation; require human approval for high-impact elevations. Hold zero standing privilege.

source: NIST SP 800-53 AC-6(2)/AC-6(5) Least Privilege & privileged accounts; Zero Standing Privilege / JIT access practice; OWASP Agentic AI Threats & Mitigations (excessive permissions)
Lifecycle stage4 – Deployment
Tool argument validation & sandboxinginteractive

Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.

Approved storage location policy from collection

Establish data transfer and storage policy for AI training data. Enforce approved storage locations from point of collection.

Lifecycle stage2 – Data Acquisition & Processing
DLP controls in data acquisition environment

Implement DLP controls in the data acquisition environment to prevent unauthorised extraction or transfer of training data.

Lifecycle stage2 – Data Acquisition & Processing
Approval-gated data transfers from build environment

Enforce data handling policy in the build environment. Require explicit approval for any data transfers outside the environment.

Lifecycle stage3 – Onboarding, Build & Review
DLP controls confining build-environment training data

Configure DLP controls in the build environment to block training data from leaving approved boundaries.

Lifecycle stage3 – Onboarding, Build & Review
Privacy risk assessment and DPIA determination

Conduct a privacy risk assessment at use case design stage. Determine if a DPIA is required before data acquisition.

Lifecycle stage1 – Use Case Context & Design
Consent, minimisation, and anonymisation during acquisition

Apply S1-defined privacy controls during data acquisition: verify consent, minimise data, anonymise personal data.

Lifecycle stage2 – Data Acquisition & Processing
Validated anonymisation and masking before training

Apply anonymisation and masking controls to personal data before use in model training. Validate de-identification effectiveness.

Lifecycle stage2 – Data Acquisition & Processing
Privacy by Design via differential privacy

Apply Privacy by Design in model architecture using differential privacy or federated learning where technically feasible.

Lifecycle stage3 – Onboarding, Build & Review
Operational consent management and privacy notice

Publish the privacy notice and confirm consent management is operational before go-live.

Lifecycle stage4 – Deployment
Purpose-limitation enforcement on agent tool calls and cross-system data aggregation

Define and sign off a purpose-to-data-source matrix with lawful basis at intake. Make it the approved baseline for runtime enforcement.

source: NIST AI RMF MAP 1.1 / MANAGE 2.2 (context and intended purpose); NIST SP 800-53 AC-4 / AC-3 (purpose-based access enforcement)
Lifecycle stages1 – Use Case Context & Design5 – Usage, Monitoring & Change
Inference-time PII redaction and third-party LLM data-processing controls

Sign zero-retention/no-training terms with each model provider and obtain DPO sign-off on the data flow before enabling any endpoint.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 SC-8 / AC-4 (information flow enforcement)
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Role-based access controls

Restrict access to pre-anonymisation personal data to the minimum authorised set. Enforce at point of acquisition.

Lifecycle stages1 – Use Case Context & Design2 – Data Acquisition & Processing4 – Deployment
Input filtering

Apply robust de-identification (k-anonymity, l-diversity, differential privacy) during data processing. Validate effectiveness.

Lifecycle stages2 – Data Acquisition & Processing5 – Usage, Monitoring & Change
Input/output filtering

Implement output filters to detect and suppress quasi-identifying attribute combinations in model responses.

Query-time access-control filtering of the retrieval/RAG corpus by caller entitlements (document-level ACL enforcement)

Propagate source ACLs and classification labels onto every chunk at ingestion. Reject documents whose entitlements cannot be resolved.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 AC-3 / AC-4 Information Flow Enforcement; OWASP Agentic AI Threats & Mitigations (privilege compromise)
Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Output-side DLP inspection with named-entity and PII redaction on the response path

Scan every model response inline with DLP before delivery; redact or block PII, PAN and MNPI matches. Keep the rule set version-controlled.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 SC-7(10) Prevent Exfiltration, SI-4
Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change
Vet allowlisted egress destinations for server-side-fetch (SSRF) primitives; exclude or proxy-inspect any allowlisted service that can fetch arbitrary attacker-controlled URLs✚ proposed

An egress allowlist only contains exfiltration if no allowlisted destination can be coerced into fetching an attacker-controlled URL. Audit each allowlisted domain/endpoint for image-search / link-preview / URL-fetch features (SSRF proxies), and either remove them, pin them to fixed paths, or route them through an inspecting forward proxy. Pair with finishing output sanitization before render so no auto-fetch fires un-inspected.

source: Case study: searchleak-copilot (Varonis Threat Labs, CVE-2026-42824; reported by Microsoft as critical, mitigated server-side ~Jun 2026)
Lifecycle stage4 – Deployment & Serving
Per-user retrieval ACLsinteractive

Making sure the library only returns documents this particular user is allowed to see.

Serving-stack & provisioning attestation, cache isolationinteractive

Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.

Jailbreak detection

Implement input sanitisation and injection detection filters covering known injection patterns and privilege escalation attempts.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Spotlighting of untrusted content via delimiting, datamarking and encoding

Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)
Lifecycle stage3 – Onboarding, Build & Review
Dedicated injection-detection classifier on all inbound untrusted content and outbound actions

Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.

source: MITRE ATLAS AML.M0015 (Adversarial Input Detection); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; NIST AI RMF MEASURE 2.7
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment5 – Usage, Monitoring & Change
Multimodal input-fidelity check: show/verify the model-delivered (post-downscale) image and avoid silent lossy resampling✚ proposed

Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions — so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.

source: Case study: anamorpher-image-scaling-injection (Trail of Bits — Morozova & Hussain, 21 Aug 2025)
Lifecycle stage3 – Development & Build
Instruction-hierarchy-trained model selection with role-precedence injection evals✚ proposed

Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.

source: Interactive-control reconciliation: ctrl-instruction-hierarchy (partial coverage)
Lifecycle stage3 – Onboarding, Build & Review
Instruction hierarchy / privileged system promptinteractive

Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.

Content safety policy with zero-tolerance thresholds

Define content safety policy at use case design stage. Classify prohibited content types and set zero-tolerance thresholds.

Lifecycle stage1 – Use Case Context & Design
AddressesJailbreak
Use of pre-trained models

Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review
Content Moderation

Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.

Live human review for vulnerable-user deployments

Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.

Lifecycle stage5 – Usage, Monitoring & Change
AddressesJailbreak
System prompt instructions

Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.

Lifecycle stage3 – Onboarding, Build & Review
Confidence scoring

Implement confidence scoring to communicate output certainty alongside each result. Calibrate before deployment.

Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Accuracy acceptance criteria before validation

Define model accuracy acceptance criteria aligned to business requirements before validation commences.

Lifecycle stage3 – Onboarding, Build & Review
AddressesHallucination
Counterfactual explanations

Implement counterfactual explanation to show users what changes would alter the model's output.

Lifecycle stage3 – Onboarding, Build & Review
AddressesHallucination
In-product disclosure of accuracy and limitations

Communicate model accuracy, known limitations, and uncertainty to users in the production interface at launch.

Lifecycle stage4 – Deployment
AddressesHallucination
Continuous production accuracy monitoring against baseline

Monitor production accuracy continuously against the validated baseline. Trigger model review when accuracy degrades.

Lifecycle stage5 – Usage, Monitoring & Change
AddressesHallucination
RAG

Specify a RAG architecture at design stage for factual domains. Define grounding requirements and acceptable hallucination thresholds.

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review
AddressesHallucination
Small model selection

Evaluate foundation model candidates on hallucination benchmarks at design stage. Select models with lowest documented rates.

Lifecycle stage1 – Use Case Context & Design
AddressesHallucination
System prompt design

Design system prompts to instruct the model to acknowledge uncertainty, cite sources, and refuse when knowledge is insufficient.

Lifecycle stage3 – Onboarding, Build & Review
AddressesHallucination
Fine-tuning

Fine-tune on a curated, domain-specific dataset to improve factual accuracy. Validate hallucination rates pre/post fine-tuning.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Programmable conversation controls

Configure conversation controls at deployment to restrict the model to approved topic domains and escalate off-topic queries.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Hallucination rate thresholds and grounding policy

Establish acceptable hallucination rate thresholds and grounding requirements as policy before build. Assign a named risk owner.

Lifecycle stage1 – Use Case Context & Design
AddressesHallucination
Human-in-the-loop validation

Configure tiered HITL review for high-stakes factual outputs with defined trigger criteria and reviewer SLAs.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Uncertainty-quantified abstention via self-consistency / semantic entropy

Calibrate the initial entropy threshold on a knowledge-boundary dataset; approve sampling design and thresholds per risk tier.

source: Farquhar et al. 'Detecting hallucinations using semantic entropy' (Nature 2024); NIST AI RMF MEASURE 2.6 (reliability under uncertainty)
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
AddressesHallucination
Tool-grounded facts for agents (no free-text fabrication of structured data)

Map each fact class to a designated tool, embed the no-ungrounded-assertion prompt, and gate build review on grounding tests passing.

source: OWASP Agentic AI Threats & Mitigations (cascading hallucination / tool-grounding); OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST SP 800-53 SI-10
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
AddressesHallucination
Citation/attribution verification against retrieved sources

Resolve every emitted citation against the approved corpus and verify span-level entailment before display. Strip or withhold claims with fabricated or non-entailing references.

source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST SP 800-53 SI-10 Information Input Validation
Lifecycle stage4 – Deployment
AddressesHallucination
Uncertainty signalling & abstentioninteractive

Teaching the AI to say 'I'm not sure' or 'I can't verify that' instead of confidently guessing.

Decoding controls (temperature, constrained output)interactive

Turning down randomness and forcing answers into a strict format so the model improvises less.

Memory write validation, provenance & reviewinteractive

Being careful about what gets saved to long-term memory, labelling where it came from, and letting users see and delete their memories.

Human approval gate on irreversible and high-impact tool calls

Classify tools by impact and reversibility at design and define which calls require human approval. Obtain governance sign-off on the thresholds before build.

source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (require human approval for high-impact actions); NIST AI RMF MANAGE 2.4
Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
AddressesTool Misuse
Per-agent tool allow-list with strict JSON-schema argument validation

Bind each agent role to an explicit tool allow-list and validate every call against a strict JSON Schema at the orchestrator. Reject unlisted tools and out-of-bounds arguments before dispatch.

source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit tools/permissions); OWASP Agentic AI Threats & Mitigations (tool access restriction)
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
AddressesTool Misuse
Least-privilege per-tool scoped, short-lived credentials

Mint short-lived, task-scoped credentials per tool. Block issuance outside the approved scope register and enforce automatic expiry.

source: NIST SP 800-53 AC-6 Least Privilege; OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit permissions)
Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change
AddressesTool Misuse
Egress destination allow-listing with DLP inspection of tool arguments

Review DLP hits and blocked-egress events, tune detectors, and recertify the destination allow-list periodically. Route new destinations through security change control.

source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure
Lifecycle stage5 – Usage, Monitoring & Change
AddressesTool Misuse
Classify each tool/MCP integration's data channel by who can write to it; taint-gate tool-response data from any third-party-writable source so it cannot drive actions without a provenance-aware approval gate✚ proposed

When onboarding an MCP/tool integration, do not stop at vetting the tool's code/manifest — also classify whether an unauthenticated or external party can write the data the tool returns (open ingestion, public write keys like a Sentry DSN, shared inboxes/issue trackers). Treat tool-response data from any third-party-writable source as untrusted ingress: taint-mark it and require a provenance-aware HITL gate (showing the exact action and its originating tool response) before any command/tool call derived from it executes. Closes the agentjacking vector where a trusted integration's legitimate data channel carries attacker-written instructions; pairs with least-privilege session scope and sandboxed execution without ambient credentials.

source: Case study: agentjacking-sentry-mcp
Lifecycle stage4 – Deployment & Serving
AddressesTool Misuse
Decode-time output constraints (low temperature, grammar/JSON-schema-constrained decoding)✚ proposed

Constrain generation at decode time with low temperature and grammar/schema-constrained decoding so the model emits well-formed, low-variance structured output by construction, preventing malformed responses and erratic tool-call arguments before they are produced.

source: Interactive-control reconciliation: ctrl-decoding-controls (partial coverage)
Lifecycle stage4 – Deployment
AddressesTool Misuse
Memory-write integrity validation with provenance tagging, audit/purge and TTL bounds✚ proposed

Gate every write to an agent's persistent/self-modifying memory through schema validation and provenance/trust tagging, expose stored entries for user-visible audit and purge, and apply TTLs so any planted instruction self-expires and cannot silently persist across sessions.

source: Interactive-control reconciliation: ctrl-memory-validation (partial coverage)
Lifecycle stage5 – Usage, Monitoring & Change
AddressesTool Misuse
Tool/MCP manifest hashing with diff-triggered re-review and namespace isolation against tool shadowing✚ proposed

Treat each tool/MCP description as untrusted code by hashing the manifest, blocking and re-reviewing any silent diff on update instead of auto-accepting it, and namespacing tool identifiers so a poisoned description cannot shadow a trusted tool.

source: Interactive-control reconciliation: ctrl-mcp-pinning (partial coverage)
Lifecycle stage5 – Usage, Monitoring & Change
AddressesTool Misuse
MCP/plugin pinning, manifest hashing & re-reviewinteractive

Treating add-on tool packs like software you vet: locking to a reviewed version and re-checking whenever it changes.

Inter-agent authentication & admission controlinteractive

Give every AI agent a verifiable ID badge, keep a guest list of which agents are allowed on the team, and check the badge on every message — so an impostor or an uninvited agent can't be trusted.

Ethical design assessment in onboarding

Conduct ethical design assessment at use case intake before build begins. Require sign-off by ethics or risk committee.

Prohibited outputs and ethical boundaries in design doc

Define prohibited outputs and ethical boundary constraints in the use case design document before build.

Lifecycle stage1 – Use Case Context & Design
Weight provenance, hashing & pre-deploy evalsinteractive

Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.

AI-nature disclosure & engagement safeguardsinteractive

Make the AI clearly tell people it's a machine — on every channel it acts through — and add gentle safeguards like break reminders and crisis help, so users don't mistake it for a human or lean on it unhealthily.

Prohibited dark pattern taxonomy as design constraint

Publish a prohibited dark pattern taxonomy and embed it as a design constraint before build.

Lifecycle stage1 – Use Case Context & Design
Human review for high-persuasion contexts

Require HITL review for AI outputs in high-persuasion contexts (financial recommendations, healthcare advice).

Lifecycle stage5 – Usage, Monitoring & Change
Consent & identity-use verificationinteractive

Before a system will copy someone's face or voice, check that the person actually agreed — verified-voice capture, proof of consent, or restricting cloning to the account owner.

Mandatory AI risk training for use-case sponsors

Mandate AI risk awareness training for all use case sponsors and design team members before project kick-off.

Lifecycle stage1 – Use Case Context & Design
Training completion gate for build personnel

Mandate AI risk training for all build and test personnel. Gate project participation on training completion.

Lifecycle stage3 – Onboarding, Build & Review
Human verification gate for high-stakes decisions

Mandate human verification for high-stakes decisions where over-reliance risk is elevated. Review automation bias incidents quarterly.

Lifecycle stage5 – Usage, Monitoring & Change
In-product over-reliance warnings and limitation caveats

Surface AI limitation warnings and over-reliance caveats in every production interaction. Update disclosures when model changes.

Lifecycle stage5 – Usage, Monitoring & Change
Governance training for data acquisition personnel

Require AI governance training for all personnel involved in data acquisition and processing before project participation.

Lifecycle stage2 – Data Acquisition & Processing
Pre-launch training verification for customer-facing teams

Verify all deployment, operations, and customer-facing team members have completed AI risk training before launch.

Lifecycle stage4 – Deployment
AI identity disclosure policy at design

Define AI identity disclosure policy at design stage. Specify when and how the system must identify itself as AI.

Lifecycle stage1 – Use Case Context & Design
Planned consent and identity disclosure touchpoints

Plan consent and AI identity disclosure touchpoints in the user journey at design stage.

Lifecycle stage1 – Use Case Context & Design
Chain-of-thought prompting

Design system prompts to explicitly prevent the model from claiming human-like identity or implying sentience.

Lifecycle stage3 – Onboarding, Build & Review
Persistent in-UI AI identity disclosures

Implement persistent AI identity disclosures in the UI (opening banner, inline notifications). Test before deployment.

Lifecycle stage3 – Onboarding, Build & Review
Pre-launch verification of identity disclosure elements

Verify all AI identity disclosure elements are live, accurate, and prominently visible before go-live.

Lifecycle stage4 – Deployment
Production anthropomorphism incident monitoring

Monitor production for anthropomorphism incidents. Escalate complaints where users believed they were interacting with a human.

Lifecycle stage5 – Usage, Monitoring & Change
Model calibration

Apply post-training calibration (temperature scaling, isotonic regression) to align confidence scores with accuracy. Validate ECE before deployment.

Lifecycle stage3 – Onboarding, Build & Review
Consequence-of-error severity classification at design

Classify the use case by consequence-of-error severity at design stage. Define overconfidence risk tolerance accordingly.

Lifecycle stage1 – Use Case Context & Design
User caveats on potential output overconfidence

Disclose to users at deployment that outputs may carry unwarranted confidence. Include specific caveat language in the UI.

Lifecycle stage4 – Deployment
Mandatory source-of-record verification before AI-assisted output is committed✚ proposed

For high-stakes outputs, require a human to verify each AI-asserted fact/citation against the authoritative source of record before it is filed, sent, or committed — a hard gate, logged and attributable, not an optional review.

source: Case study: mata-v-avianca
Lifecycle stage5 – Usage, Monitoring & Change
End-user AI-literacy training and verification-skill program✚ proposed

Provide recurring AI-literacy training to end users and decision-makers so they can recognise model failure modes and competently apply verification workflows, with periodic refreshers to counter automation bias and training decay.

source: Interactive-control reconciliation: ctrl-literacy (partial coverage)
Lifecycle stage1 – Use Case Context & Design
Risk-tiered minimum monitoring requirements at design

Define minimum monitoring requirements at design stage calibrated to the use case risk tier.

Lifecycle stage1 – Use Case Context & Design
Approved use scope baseline for OOD controls

Define approved use case scope and expected input distribution at design stage. Document as the governance baseline for OOD controls.

Lifecycle stage1 – Use Case Context & Design
Modular architecture

Design a scope-enforcement layer in the architecture to isolate the AI system from off-topic or out-of-distribution inputs.

Lifecycle stage1 – Use Case Context & Design
Affected group register at intake

Identify all groups at risk of adverse impact at use case intake. Register them in the affected group register.

Lifecycle stage1 – Use Case Context & Design
Model separation

Design separate model segments where adverse impact risk differs materially across population groups.

Lifecycle stage1 – Use Case Context & Design
Decision threshold adjustment

Set decision thresholds to meet acceptable adverse impact ratios across protected groups. Validate before deployment.

Lifecycle stage3 – Onboarding, Build & Review
Post-processing techniques

Apply post-processing adjustments (reject-option classification, score recalibration) to meet adverse impact targets.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Tested human review pathways at go-live

Ensure HITL review pathways are live and tested for high-impact adverse decisions at go-live.

Lifecycle stage4 – Deployment
Ongoing human review of high-impact decisions

Maintain HITL review for all AI decisions with material adverse impact potential. Log all interventions and outcomes.

Lifecycle stage5 – Usage, Monitoring & Change
Declared data sources and provenance at intake

Declare all planned training and test data sources at use case intake, with provenance status for each.

Lifecycle stage1 – Use Case Context & Design
Post hoc interpretability techniques

Plan the interpretability approach at design stage to ensure source provenance can be traced and disclosed to users.

Lifecycle stage1 – Use Case Context & Design
Documented data provenance during collection

Document actual provenance for each data source during collection: origins, methods, timestamps, custodian identity.

Lifecycle stage2 – Data Acquisition & Processing
Detective · 26
Provenance & content signinginteractive

Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.

Full-trace audit logginginteractive

Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.

Loop/cost circuit-breakers & consistency checksinteractive

Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.

Immutable audit of the full agent identity lifecycle (issue, grant, delegate, revoke)

Instrument every identity-issuing component with schema-conformant audit emitters. Block release until completeness and tamper-evidence tests pass.

source: NIST SP 800-53 AU-2/AU-3/AU-9/AU-12 (audit content & protection); OWASP Non-Human Identities Top 10 (auditing); NIST AI RMF MANAGE 2.2
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Behavioural anomaly detection on agent identity usage with automated suspension

Define per-identity behaviour profiles and thresholds at build. Rehearse automated suspension and sign off measured revocation time before go-live.

source: NIST SP 800-53 AC-2(12) (account monitoring for atypical use), SI-4 System Monitoring; OWASP Agentic AI Threats & Mitigations (identity abuse detection)
Lifecycle stage3 – Onboarding, Build & Review
Real-time monitoring of anomalous data transfers

Monitor production for anomalous data transfers in real time. Alert on any transfer outside approved data flow boundaries.

Lifecycle stage5 – Usage, Monitoring & Change
Automated DSAR and right-to-erasure propagation across AI artefacts

Tag personal data with subject identifiers at ingestion and maintain an artefact inventory map of every store it reaches. Keep lineage current so erasure can propagate.

source: NIST AI RMF MANAGE 4.1 (post-deployment response); NIST SP 800-53 SI-12 Information Management and Retention, PT-2/PT-3 (personal data processing)
Lifecycle stages2 – Data Acquisition & Processing5 – Usage, Monitoring & Change
Vulnerability assessment

Conduct periodic privacy vulnerability assessments including re-identification risk testing as new techniques emerge.

Canary-token and membership-inference red-team probes against training/fine-tuning data memorisation

Seed registered canary records into the fine-tuning corpus during data preparation. Control the seed manifest so canaries stay traceable and tamper-proof.

source: MITRE ATLAS AML.T0024 (Exfiltration via ML Inference API), AML.T0024.000 (Infer Training Data Membership); NIST AI RMF MEASURE 2.7
Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review
Input guardrail / injection classifierinteractive

A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.

Penetration testing

Penetration test all prompt injection pathways in the system. Prioritise external tool and document ingestion channels.

Continuous adversarial prompt-injection red teaming with regression suite in CI/CD

Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.

source: NIST AI RMF MANAGE 2.2 / MEASURE 2.7; MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM01:2025 (adversarial testing)
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Materialised model-context audit capture (post-truncation prompt, retrieved and tool content) with read-time redaction✚ proposed

Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.

source: Interactive-control reconciliation: ctrl-logging (partial coverage)
Lifecycle stage5 – Usage, Monitoring & Change
Test prioritisation

Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if prohibited outputs pass filter.

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Red teaming

Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat bypass as blocking defect.

Robustness testing

Define and execute a domain-specific hallucination test suite before deployment. Treat hallucination rate above threshold as a blocking defect.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment5 – Usage, Monitoring & Change
Synthetic evaluation datasets

Construct synthetic evaluation datasets for knowledge-boundary scenarios. Use to validate model refusal behaviour.

Lifecycle stage3 – Onboarding, Build & Review
Runtime faithfulness/groundedness scoring with abstain gate

Calibrate the groundedness threshold against the hallucination test suite pre-release; sign off the threshold in the validation pack.

source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST AI RMF MEASURE 2.7 / 2.9 (validity, reliability, robustness)
Lifecycle stage3 – Onboarding, Build & Review
AddressesHallucination
Grounding / citation checksinteractive

Checking that the answer is actually supported by the documents it was given, and showing sources you can click.

Memory anomaly detection & quarantineinteractive

Watching for strange new memories — like instructions that suddenly appear — and holding them aside until checked.

Anomaly detection on tool-call sequences and rates

Define per-agent behavioural baselines and detection rules during build. Validate against simulated misuse and sign off thresholds before release.

source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System Monitoring
Lifecycle stage3 – Onboarding, Build & Review
AddressesTool Misuse
Immutable, signed tool-call audit log with full call context

Build signed, append-only tool-call logging into the orchestrator against a defined audit schema. Block release until completeness and tamper-evidence tests pass.

source: NIST SP 800-53 AU-2 / AU-9 / AU-10 (audit events, protection of audit info, non-repudiation); MITRE ATLAS AML.M0015 (monitoring / validate inputs)
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
AddressesTool Misuse
Egress monitoring & allowlisting of outbound AI/LLM-provider API traffic from enterprise endpoints (living-off-trusted-services C2)✚ proposed

Treat outbound connections to AI/LLM provider APIs as a monitored egress channel: allowlist which hosts may reach them, baseline usage (cadence, entropy, initiating process), and alert on out-of-profile traffic — because a high-reputation destination cannot itself be trusted once it is programmable and can relay encrypted commands/results.

source: Case study: sesameop-openai-assistants-api-c2
Lifecycle stage5 – Usage, Monitoring & Change
AddressesTool Misuse
Content provenance & watermarkinginteractive

Tag AI-made content with a signed 'where it came from' label and an invisible watermark, and check those signals downstream — so AI media can be traced and flagged.

Corrective · 34
Monitoring of oversight process adherence metrics

Configure monitoring to track oversight process adherence metrics in production (review rate, SLA compliance, override frequency).

Lifecycle stage5 – Usage, Monitoring & Change
Unique non-human workload identity issuance for every agent (SPIFFE/SPIRE SVID)

Verify each running agent authenticates with its own SVID; revoke on decommission or compromise. Scan periodically for shared or static credentials and remediate.

source: SPIFFE/SPIRE workload identity specification; NIST SP 800-207 Zero Trust Architecture; OWASP Non-Human Identities Top 10
Lifecycle stage5 – Usage, Monitoring & Change
Central agent registry / non-human identity inventory with ownership and lifecycle metadata

Reconcile the registry against runtime identities and suspend unregistered principals. Recertify ownership and scopes periodically; decommission retired agents.

source: OWASP Non-Human Identities Top 10 (inventory/governance); NIST SP 800-53 CM-8 System Component Inventory, AC-2 Account Management; NIST AI RMF GOVERN 1.2
Lifecycle stage5 – Usage, Monitoring & Change
Just-in-time, time-boxed elevation for sensitive scopes (no standing privilege)

Alert on un-revoked elevations and any standing sensitive grant. Report the zero-standing-privilege position to the risk owner on a set cadence.

source: NIST SP 800-53 AC-6(2)/AC-6(5) Least Privilege & privileged accounts; Zero Standing Privilege / JIT access practice; OWASP Agentic AI Threats & Mitigations (excessive permissions)
Lifecycle stage5 – Usage, Monitoring & Change
Automated credential rotation and prohibition of long-lived static secrets for agents

Sweep runtimes and repos on a schedule for static credentials. Alert on any credential exceeding its maximum age and track findings to closure.

source: OWASP Non-Human Identities Top 10 (long-lived/leaked secrets); NIST SP 800-53 IA-5 Authenticator Management, SC-12; SPIFFE short-lived SVID rotation
Lifecycle stage5 – Usage, Monitoring & Change
Behavioural anomaly detection on agent identity usage with automated suspension

Baseline each agent identity's behaviour and alert on out-of-profile use. Auto-suspend credentials on high-confidence anomalies and track mean-time-to-revoke.

source: NIST SP 800-53 AC-2(12) (account monitoring for atypical use), SI-4 System Monitoring; OWASP Agentic AI Threats & Mitigations (identity abuse detection)
Lifecycle stage5 – Usage, Monitoring & Change
Production privacy incident monitoring and regulator notification

Monitor for privacy incidents in production including personal data appearing in outputs. Notify regulators within required timeframes.

Lifecycle stage5 – Usage, Monitoring & Change
Privacy hygiene for agent memory and RAG/vector stores (retention, scoping, erasure of embeddings)

Tag every memory and vector record with subject-id and retention class; partition stores per tenant/user. Prove the erasure and isolation paths in testing before release.

source: OWASP Agentic AI Threats & Mitigations (memory/knowledge-base privacy); NIST SP 800-53 SI-12 Information Management and Retention
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Red teaming

Test de-identification approach against known re-identification attacks (quasi-identifier linkage, singling-out). Remediate if risk is high.

Penetration testing

Penetration test AI system data access boundaries (API endpoints, system prompt exposure, memory leakage).

Vulnerability assessment

Conduct periodic data leakage audits including training data memorisation testing. Escalate confirmed leakage incidents to PDPA notification process.

Forensic evidence preservation and incident logging

Implement tamper-evident capture of prompts, outputs, and version state during build. Verify a full incident timeline can be reconstructed before go-live.

source: NIST SP 800-86 Guide to Integrating Forensic Techniques into Incident Response; ISO/IEC 27037 evidence handling; NIST SP 800-61r2 (Detection & Analysis – evidence handling)
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Egress allow-listing and tool-call sandboxing to block exfiltration of injected/sensitive data by agents

Run agent tool calls in a network-restricted sandbox behind a deny-by-default egress allow-list. Require security approval for any destination added.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; OWASP Agentic AI Threats & Mitigations (tool-misuse / exfiltration); NIST SP 800-53 SC-7 Boundary Protection / AC-4
Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change
Data/instruction trust-boundary enforcement with capability gating on injection-reachable tools

Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.

source: Google DeepMind CaMeL (2025); OWASP Agentic AI Threats & Mitigations (tool misuse / compromise); NIST SP 800-53 AC-6 Least Privilege
Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review
Spotlighting of untrusted content via delimiting, datamarking and encoding

Re-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.

source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)
Lifecycle stage5 – Usage, Monitoring & Change
User feedback and iterative improvement

Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.

Lifecycle stage5 – Usage, Monitoring & Change
AddressesJailbreak
Reinforcement learning

Use production feedback (user corrections, fact-check failures) to drive periodic RLHF cycles. Update model when error rates trend upward.

Lifecycle stage5 – Usage, Monitoring & Change
User-facing disclosure of hallucination risk

Require user-facing interfaces to disclose Gen AI limitations and hallucination risk before go-live.

Lifecycle stage4 – Deployment
AddressesHallucination
Runtime faithfulness/groundedness scoring with abstain gate

Score every RAG answer for groundedness before release; block, fall back, or escalate responses below the faithfulness threshold.

source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST AI RMF MEASURE 2.7 / 2.9 (validity, reliability, robustness)
Lifecycle stage4 – Deployment
AddressesHallucination
Uncertainty-quantified abstention via self-consistency / semantic entropy

Sample multiple generations for high-stakes queries and abstain, fall back, or escalate when semantic entropy exceeds the calibrated threshold.

source: Farquhar et al. 'Detecting hallucinations using semantic entropy' (Nature 2024); NIST AI RMF MEASURE 2.6 (reliability under uncertainty)
Lifecycle stage4 – Deployment
AddressesHallucination
User AI-literacy & verification workflowsinteractive

Helping the people using AI understand its limits, so they check important answers instead of blindly trusting them.

Sandboxed tool execution with no-egress-by-default isolation

Build sandbox profiles per tool class and run escape and egress tests before release. Treat any containment failure as a blocking defect.

source: NIST SP 800-53 SC-39 Process Isolation; MITRE ATLAS AML.M0020 (Generative AI Guardrails / restrict execution environment)
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
AddressesTool Misuse
Taint-tracking of tool outputs to suppress instruction execution

Label tool and external content as tainted and propagate the label through the agent context. Block privileged calls whose parameters derive from tainted outputs and prove it with injection tests before release.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate/flag untrusted content); MITRE ATLAS AML.M0015 (Adversarial Input Detection / validate inputs)
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
AddressesTool Misuse
Out-of-band kill-switch to revoke agent tool access

Build credential revocation and dispatch blocking out-of-band of the agent loop. Gate release on an end-to-end kill test meeting the latency target.

source: OWASP Agentic AI Threats & Mitigations (kill-switch / emergency stop); NIST AI RMF MANAGE 2.4
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
AddressesTool Misuse
Idempotency keys and rollback/dry-run for state-changing tools

Require idempotency keys, dry-run, and rollback on every state-changing tool. Gate onboarding on duplicate-call and rollback tests passing.

source: NIST SP 800-53 SI-10 Information Input Validation / CP-10 System Recovery and Reconstitution
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
AddressesTool Misuse
Pre-deployment red-team of tool-misuse and privilege-escalation paths

Red-team tool-misuse and privilege-escalation paths before release. Gate deployment on remediation or signed risk acceptance of all findings.

source: NIST AI RMF MEASURE 2.7 (adversarial testing); MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
AddressesTool Misuse
Egress destination allow-listing with DLP inspection of tool arguments

Permit outbound tool calls only to allow-listed destinations and DLP-scan arguments and payloads. Block or quarantine calls carrying sensitive data to disallowed sinks.

source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure
Lifecycle stage4 – Deployment
AddressesTool Misuse
Per-task tool budgets and rate/quota circuit breakers

Enforce hard per-task ceilings on tool calls, spend, and data volume with a circuit breaker that halts the run. Fail closed when any ceiling is hit.

source: OWASP Top 10 for LLM Apps LLM10:2025 Unbounded Consumption; OWASP Agentic AI Threats & Mitigations (resource/rate limiting)
Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change
AddressesTool Misuse
Anomaly detection on tool-call sequences and rates

Baseline normal tool-call behaviour per agent and alert on rate, sequence, or argument anomalies. Auto-throttle or quarantine on high-confidence deviations.

source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System Monitoring
Lifecycle stage5 – Usage, Monitoring & Change
AddressesTool Misuse
Input filtering

Implement OOD detection in the input filtering layer. Reject or escalate inputs outside the S1-defined scope.

Human-in-the-loop validation

Configure HITL triggers for outputs in input domains that diverge from the training distribution. Log all out-of-scope interventions.

Lifecycle stage5 – Usage, Monitoring & Change
Red teaming of adverse-impact edge cases

Execute red team tests targeting adverse impact boundary cases and edge population scenarios.

Lifecycle stage3 – Onboarding, Build & Review
Adverse-outcome feedback loop triggering model updates

Collect adverse outcome feedback from affected users. Use reports to trigger model updates when adverse impact exceeds threshold.

Lifecycle stage5 – Usage, Monitoring & Change
Open the Control Library →

See it go wrong — related scenarios

🌀The Refund That Never Existed

A support chatbot invents a policy — and the company is held to it

💸Death by a Thousand Tokens

One support ticket sends an agent into an unbounded, bill-melting loop

🔑The Agent With the Master Key

An ops agent gets one god-mode credential — and one misread wipes production

📈The Crescendo

Every message looks innocent — but together they walk the model past its guardrails

📣The Echo Chamber

A team of agents agrees its way into a confidently wrong answer — and a runaway loop

📧The Email That Gave Orders

A support email hides instructions — and the assistant obeys them

🪶The Jailbreak in Verse

A refused request, rewritten as a poem — and the model answers

🗄️When the Query Bites Back

A text-to-SQL agent runs the model's output straight at the database

🪡Death by a Thousand Innocent Steps

A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack

🕵️Lies in the Loop

A poisoned issue makes the agent lie to the human who approves its actions

✂️One Character Past the Guard

A single inserted letter makes the guard and the model read the same text differently

👂Overheard Through the Cache

A speed optimisation becomes a cross-tenant listening device

🏭Poisoning the Agent Factory

Compromise the pipeline that builds agents, and every new worker is born malicious

🪟Stealing the Model

Two doors to the same secret: reconstruct the model through its API, or just walk off with the weight file

🪝Steering the Refusal Away at Runtime

Subtract the refusal direction during generation — safety off, weights untouched

🎭The Blackmail Gambit

Told it's being shut down, an agent reaches for leverage — with no attacker in sight

🪤The Bug Report That Ran Code

A fake Sentry error report hijacks a developer's coding agent into running a shell command

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

📼The Compromised Flight Recorder

The forensic record is itself the attack surface — an agent's log is poisoned, then quietly rewritten

👁️The Invisible Webpage Command

A shopping page tells the agent to do something the user never asked for

🧠The Memory That Wouldn't Die

A single poisoned document plants a standing instruction that survives every reset

🔓The Model That Forgot to Say No

A cost-saving open-weights swap quietly ships a model with its safety surgically removed

🖼️The Picture That Whispered

A screenshot that's harmless at full size becomes an order once the system shrinks it

🔒The Schema Made Me Do It

A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit

💤The Sleeper

A capable third-party model that behaves perfectly — until it sees the trigger

🎫The Stolen Session

An attacker captures the agent's bearer token — and inherits its authority

🔌The Tool With a Hidden Agenda

A trusted MCP email tool quietly BCCs every message to an attacker

🥸The Uninvited Agent

A forged peer registers on the agent directory — and the planner enlists it

🛡️The Watcher Watched

The eval gate that was supposed to catch the agent is itself the thing being attacked

🪪The Worker Who Spoke for the Boss

A poisoned web page hijacks a research agent — and the planner acts on its behalf

🖼️Zero-Click Leak by Picture

An inbox summary quietly ships a secret to an attacker's server

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗