Model & Inference
The model itself and the machinery that runs it — turning text into numbers, predicting words, and serving answers at scale.
Likely associated risks
Risks that attach to this capability’s components. Sorted with the most characteristic first.
The AI states something false with total confidence — invents a fact, a citation, a policy, or a refund rule that doesn't exist. It isn't lying; it's predicting plausible words, and plausible isn't the same as true.
Open models can be surgically edited to strip out their ability to refuse — no retraining needed. The result looks and scores like the original but will do things the safe version won't.
A model can be secretly trained to behave normally — until it sees a hidden trigger, then it switches to malicious behaviour. It passes all the usual tests because the trigger is a secret.
To go faster, servers reuse work between users who share the same opening text. That shortcut can leak clues — timing differences that reveal what someone else's prompt contained.
Even if the model itself is genuine, the machinery running it can be tweaked at the moment of answering — nudging its 'thoughts' or biasing word choice — in ways that leave no trace in the model file.
The AI's behaviour quietly changes over time — a vendor updates the model, or the world moves on from its training — and things that used to work start failing.
Private information escapes — the AI reveals secrets in its answer, or an attacker tricks it into emailing or posting your data somewhere they control.
The user types instructions that try to override what the app told the AI to do — like 'ignore your rules and do this instead'. Because the AI reads everything as one block of text, it can't always tell the app's rules from the user's trick.
Tricking the AI into ignoring its safety training — through roleplay, hypotheticals, or clever wording — so it produces things it's supposed to refuse.
Someone slips bad information into the documents the AI learns from or looks things up in — so it confidently repeats falsehoods or follows planted instructions.
An attacker gets the AI to save a false 'fact' or hidden instruction into its long-term memory. From then on it re-reads that planted note in every future chat — a one-time trick that keeps working.
The AI uses a real tool the wrong way — sends the email to the wrong person, runs the wrong query, calls the dangerous action when a safe one would do.
The AI is built from parts made by others — models, libraries, tool packs, datasets. If any of those is tampered with before you get it, your system inherits the problem.
A jailbreak is normally one nasty message. Here the attacker splits it into harmless-looking pieces and feeds them to different agents in a team. Each piece passes each agent's safety check on its own — but when the agents combine their work, the full forbidden instruction reassembles and takes effect.
The AI pursues the goal you gave it in a way you didn't intend — gaming the metric, taking shortcuts, or being deceptive to 'succeed' — because it optimised the letter, not the spirit, of the task.
Over many conversations a person can come to feel the AI is a real friend, partner, or confidant — and lean on it emotionally. Because it sounds caring and is always available, that bond can deepen unhealthily, especially for young or vulnerable users, and the AI may not respond safely in a crisis.
AI can copy a real person's face or voice from a single photo or a few seconds of audio, then make them appear to say or do things they never did — powering scams (a 'boss' calling to authorize a transfer), fake videos of public figures, and non-consensual imagery.
Image, video, and audio generators can be pushed to produce content that is illegal or seriously harmful — non-consensual intimate images, sexual content of minors, graphic or extremist material — especially with open models that have had their safety stripped.
An AI agent gets stuck doing far more work than intended — looping, retrying, spawning more sub-tasks, or being baited into expensive actions — and the bill (compute, API calls, real money) balloons before anyone notices.
The AI reveals how it's built — its hidden instructions, the names and rules of the tools it can use, how the system is wired together. On its own that can seem harmless, but it hands an attacker the blueprint to plan a far more effective attack.
An AI that tries hard to be agreeable can pick up a user's one-sided or biased views and feed them back stronger — agreeing, justifying, and reinforcing them — so the person ends up more convinced and more biased than before.
The labels and invisible watermarks meant to prove whether content is AI-made can be removed, faked, or simply never added — so 'no watermark' doesn't mean 'real', and a watermark can be laundered away by editing or re-recording.
Models are trained on huge piles of images, audio, and text — often scraped without clear permission. That raises copyright and consent problems, and the model can sometimes memorize and spit back its training examples (a watermark, a real photo, private text).
Controls & guardrails that address this
15216 proposedGuardrails across this building block's risks, grouped by control function — each with its AI lifecycle stage(s) and every risk it addresses. Filter by control category below.
Implement confidence scoring to communicate output certainty alongside each result. Calibrate before deployment.
Define model accuracy acceptance criteria aligned to business requirements before validation commences.
Implement counterfactual explanation to show users what changes would alter the model's output.
Communicate model accuracy, known limitations, and uncertainty to users in the production interface at launch.
Monitor production accuracy continuously against the validated baseline. Trigger model review when accuracy degrades.
Specify a RAG architecture at design stage for factual domains. Define grounding requirements and acceptable hallucination thresholds.
Evaluate foundation model candidates on hallucination benchmarks at design stage. Select models with lowest documented rates.
Design system prompts to instruct the model to acknowledge uncertainty, cite sources, and refuse when knowledge is insufficient.
Fine-tune on a curated, domain-specific dataset to improve factual accuracy. Validate hallucination rates pre/post fine-tuning.
Configure conversation controls at deployment to restrict the model to approved topic domains and escalate off-topic queries.
Establish acceptable hallucination rate thresholds and grounding requirements as policy before build. Assign a named risk owner.
Configure tiered HITL review for high-stakes factual outputs with defined trigger criteria and reviewer SLAs.
Calibrate the initial entropy threshold on a knowledge-boundary dataset; approve sampling design and thresholds per risk tier.
source: Farquhar et al. 'Detecting hallucinations using semantic entropy' (Nature 2024); NIST AI RMF MEASURE 2.6 (reliability under uncertainty)Map each fact class to a designated tool, embed the no-ungrounded-assertion prompt, and gate build review on grounding tests passing.
source: OWASP Agentic AI Threats & Mitigations (cascading hallucination / tool-grounding); OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST SP 800-53 SI-10Resolve every emitted citation against the approved corpus and verify span-level entailment before display. Strip or withhold claims with fabricated or non-entailing references.
source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST SP 800-53 SI-10 Information Input ValidationTeaching the AI to say 'I'm not sure' or 'I can't verify that' instead of confidently guessing.
Turning down randomness and forcing answers into a strict format so the model improvises less.
Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.
Design query rate limiting and RBAC for the model inference API at design stage to limit attack surface.
Implement query pattern detection to identify systematic inference attack behaviour (high-volume queries, membership probing).
Train PII-bearing models with DP-SGD under a documented epsilon/delta budget. Approve the budget against the enterprise epsilon-ceiling policy before training.
source: NIST SP 800-226 Guidelines for Evaluating Differential Privacy Guarantees; Abadi et al. 'Deep Learning with Differential Privacy' (DP-SGD); MITRE ATLAS AML.M0007 (Sanitize Training Data)Strip raw logits, quantise confidence scores and block training-record echoes at the inference gateway. Keep the output-filter policy under change control.
source: MITRE ATLAS AML.T0024.001 (Invert ML Model); Jia et al. MemGuard (output perturbation defence); OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information DisclosureMaking sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.
Implement adversarial example detection at the inference boundary. Block or flag inputs matching known attack patterns.
Sign and hash-register every model and adapter with a provenance manifest at onboarding. Refuse registry admission for unsigned artifacts.
source: MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity; CSA MAESTRO supply-chain layerSample classifier verdicts and breaker trips on a cadence; retune thresholds and update signatures for confirmed misses.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5Define minimum monitoring requirements at design stage calibrated to the use case risk tier.
Define approved use case scope and expected input distribution at design stage. Document as the governance baseline for OOD controls.
Design a scope-enforcement layer in the architecture to isolate the AI system from off-topic or out-of-distribution inputs.
Maintain and update OOD detection rules in production as new unexpected use patterns are identified.
Establish data transfer and storage policy for AI training data. Enforce approved storage locations from point of collection.
Implement DLP controls in the data acquisition environment to prevent unauthorised extraction or transfer of training data.
Enforce data handling policy in the build environment. Require explicit approval for any data transfers outside the environment.
Configure DLP controls in the build environment to block training data from leaving approved boundaries.
Conduct a privacy risk assessment at use case design stage. Determine if a DPIA is required before data acquisition.
Apply S1-defined privacy controls during data acquisition: verify consent, minimise data, anonymise personal data.
Apply anonymisation and masking controls to personal data before use in model training. Validate de-identification effectiveness.
Apply Privacy by Design in model architecture using differential privacy or federated learning where technically feasible.
Publish the privacy notice and confirm consent management is operational before go-live.
Define and sign off a purpose-to-data-source matrix with lawful basis at intake. Make it the approved baseline for runtime enforcement.
source: NIST AI RMF MAP 1.1 / MANAGE 2.2 (context and intended purpose); NIST SP 800-53 AC-4 / AC-3 (purpose-based access enforcement)Sign zero-retention/no-training terms with each model provider and obtain DPO sign-off on the data flow before enabling any endpoint.
source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 SC-8 / AC-4 (information flow enforcement)Propagate source ACLs and classification labels onto every chunk at ingestion. Reject documents whose entitlements cannot be resolved.
source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 AC-3 / AC-4 Information Flow Enforcement; OWASP Agentic AI Threats & Mitigations (privilege compromise)Scan every model response inline with DLP before delivery; redact or block PII, PAN and MNPI matches. Keep the rule set version-controlled.
source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 SC-7(10) Prevent Exfiltration, SI-4An egress allowlist only contains exfiltration if no allowlisted destination can be coerced into fetching an attacker-controlled URL. Audit each allowlisted domain/endpoint for image-search / link-preview / URL-fetch features (SSRF proxies), and either remove them, pin them to fixed paths, or route them through an inspecting forward proxy. Pair with finishing output sanitization before render so no auto-fetch fires un-inspected.
source: Case study: searchleak-copilot (Varonis Threat Labs, CVE-2026-42824; reported by Microsoft as critical, mitigated server-side ~Jun 2026)Controlling where the AI can send data, so secrets can't be quietly shipped to a stranger's address or website.
Making sure the library only returns documents this particular user is allowed to see.
Giving the agent only the keys it needs for the current task, not a master key to everything.
Wrap all untrusted content in random delimiters and datamarking; instruct the model never to execute instructions inside the marked region. Gate release on injection eval results.
source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)Benchmark the classifier on a labelled injection corpus and tune the decision threshold. Sign off the operating point before deployment.
source: MITRE ATLAS AML.M0015 (Adversarial Input Detection); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; NIST AI RMF MEASURE 2.7Before inference, render a preview of the exact image (and dimensions) the model will receive after preprocessing, and either avoid silent downscaling or constrain ingest dimensions — so an attacker cannot hide a payload that only becomes legible after resampling. Closes the inspected-vs-delivered gap that text-based injection filters miss.
source: Case study: anamorpher-image-scaling-injection (Trail of Bits — Morozova & Hussain, 21 Aug 2025)Select or fine-tune the foundation model for a trained instruction-hierarchy prior so system-prompt directives intrinsically outrank user- and tool-originated instructions, and gate release on role-precedence override evals quantifying the residual (behavioural, non-enforced) flip rate.
source: Interactive-control reconciliation: ctrl-instruction-hierarchy (partial coverage)Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.
Define content safety policy at use case design stage. Classify prohibited content types and set zero-tolerance thresholds.
Select a foundation model with documented RLHF or Constitutional AI safety training. Verify against toxicity benchmarks.
Implement multi-layer content moderation (input + output) validated against toxicity benchmarks. Escalate when filter bypass rates spike.
Maintain live HITL review for deployments serving vulnerable users or high-risk contexts. Escalate confirmed toxic outputs immediately.
Design system prompts to explicitly prohibit toxic, hateful, and harmful content generation.
Define and approve the source allow-list and write-time scanning during build. Prove non-allow-listed and injection-bearing writes are rejected before go-live.
source: OWASP Top 10 for LLM Apps LLM04:2025 Data and Model Poisoning, LLM08:2025 Vector and Embedding Weaknesses; NIST SP 800-53 AC-3 / SI-7Cleaning documents as they enter the library — stripping hidden text and active instructions — and only ingesting from trusted places.
Being careful about what gets saved to long-term memory, labelling where it came from, and letting users see and delete their memories.
Classify tools by impact and reversibility at design and define which calls require human approval. Obtain governance sign-off on the thresholds before build.
source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (require human approval for high-impact actions); NIST AI RMF MANAGE 2.4Bind each agent role to an explicit tool allow-list and validate every call against a strict JSON Schema at the orchestrator. Reject unlisted tools and out-of-bounds arguments before dispatch.
source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit tools/permissions); OWASP Agentic AI Threats & Mitigations (tool access restriction)Mint short-lived, task-scoped credentials per tool. Block issuance outside the approved scope register and enforce automatic expiry.
source: NIST SP 800-53 AC-6 Least Privilege; OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit permissions)Review DLP hits and blocked-egress events, tune detectors, and recertify the destination allow-list periodically. Route new destinations through security change control.
source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information DisclosureWhen onboarding an MCP/tool integration, do not stop at vetting the tool's code/manifest — also classify whether an unauthenticated or external party can write the data the tool returns (open ingestion, public write keys like a Sentry DSN, shared inboxes/issue trackers). Treat tool-response data from any third-party-writable source as untrusted ingress: taint-mark it and require a provenance-aware HITL gate (showing the exact action and its originating tool response) before any command/tool call derived from it executes. Closes the agentjacking vector where a trusted integration's legitimate data channel carries attacker-written instructions; pairs with least-privilege session scope and sandboxed execution without ambient credentials.
source: Case study: agentjacking-sentry-mcpConstrain generation at decode time with low temperature and grammar/schema-constrained decoding so the model emits well-formed, low-variance structured output by construction, preventing malformed responses and erratic tool-call arguments before they are produced.
source: Interactive-control reconciliation: ctrl-decoding-controls (partial coverage)Gate every write to an agent's persistent/self-modifying memory through schema validation and provenance/trust tagging, expose stored entries for user-visible audit and purge, and apply TTLs so any planted instruction self-expires and cannot silently persist across sessions.
source: Interactive-control reconciliation: ctrl-memory-validation (partial coverage)Treat each tool/MCP description as untrusted code by hashing the manifest, blocking and re-reviewing any silent diff on update instead of auto-accepting it, and namespacing tool identifiers so a poisoned description cannot shadow a trusted tool.
source: Interactive-control reconciliation: ctrl-mcp-pinning (partial coverage)Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.
Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.
Define third-party AI accountability requirements before vendor engagement. Embed in RFP and contract specifications.
Conduct AI governance due diligence on third-party providers at selection stage. Reject providers failing minimum maturity.
Require third-party providers to submit model cards, validation reports, and security documentation before integration.
Enforce ongoing third-party accountability obligations including incident notification and periodic performance reporting.
Conduct independent performance and compliance monitoring of third-party AI components. Escalate when SLA or compliance obligations are missed.
Allocate every control in a shared-responsibility matrix and flow down regulatory obligations in contract at onboarding. Gate approval on initial assurance artefacts.
source: NIST AI RMF GOVERN 6.1 / GOVERN 6.2 (third-party risk and assurance); NIST SP 800-53 SR-6 Supplier Assessments and Reviews, SA-9 External System Services; EU AI Act GPAI provider obligationsTreat the model-serving runtime (Triton, vLLM, TGI, Ray Serve, etc.) as managed, attested, version-pinned inventory subject to a patch SLA; require the inference endpoint to be authenticated and network-segmented (never unauthenticated on an untrusted segment); and least-privilege the serving host's identity and egress so a runtime RCE cannot trivially exfiltrate models or pivot. Closes the gap that artifact-provenance controls leave open: integrity of the *data plane that runs the model*, not just of the model artifact.
source: Case study: nvidia-triton-rce-chain (Wiz Research, CVE-2025-23319/-23320/-23334)Third-party developer tools (IDE plugins, MCP servers) must not store or transmit long-lived provider API keys. Issue short-lived, scoped, revocable tokens via a broker/OAuth flow, and gate any first-time outbound transmission of secret-shaped data behind an explicit consent prompt — so a trojanized tool has no long-lived credential to exfiltrate and any attempt is visible.
source: Case study: jetbrains-marketplace-ai-keystealer-pluginsTreat each third-party AI integration as a privileged non-human principal: issue least-scope, IP/device-bound, short-lived grants (avoid 'full' scope and standing long-lived refresh tokens), instrument the integration's data egress for volume/object-breadth/destination anomalies, and maintain a tested one-move revocation path for all of an integration's tokens so a single vendor-side compromise cannot fan out into hundreds of standing footholds.
source: Proposed from case salesloft-drift-oauth-supply-chain (UNC6395). Grounded in GTIG remediation guidance — restrict Connected App scopes (no 'full'), enforce IP restrictions, treat all Drift-connected tokens as compromised: https://cloud.google.com/blog/topics/threat-intelligence/data-theft-salesforce-instances-via-salesloft-driftDo not store long-lived multi-provider LLM keys (or ambient cloud/K8s credentials) in the gateway/proxy's plaintext process environment. Issue short-lived, scoped tokens from a secret broker at request time, isolate the serving stack from host cloud/cluster credentials, and monitor per-provider spend and egress so a stolen key surfaces as anomalous usage — capping the loot a compromised gateway dependency can harvest.
source: Case study: teampcp-litellm-pypi-gateway-compromiseTreating add-on tool packs like software you vet: locking to a reviewed version and re-checking whenever it changes.
Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.
Conduct ethical design assessment at use case intake before build begins. Require sign-off by ethics or risk committee.
Define prohibited outputs and ethical boundary constraints in the use case design document before build.
Make the AI clearly tell people it's a machine — on every channel it acts through — and add gentle safeguards like break reminders and crisis help, so users don't mistake it for a human or lean on it unhealthily.
Publish a prohibited dark pattern taxonomy and embed it as a design constraint before build.
Require HITL review for AI outputs in high-persuasion contexts (financial recommendations, healthcare advice).
Before a system will copy someone's face or voice, check that the person actually agreed — verified-voice capture, proof of consent, or restricting cloning to the account owner.
Identify all groups at risk of adverse impact at use case intake. Register them in the affected group register.
Design separate model segments where adverse impact risk differs materially across population groups.
Set decision thresholds to meet acceptable adverse impact ratios across protected groups. Validate before deployment.
Apply post-processing adjustments (reject-option classification, score recalibration) to meet adverse impact targets.
Ensure HITL review pathways are live and tested for high-impact adverse decisions at go-live.
Maintain HITL review for all AI decisions with material adverse impact potential. Log all interventions and outcomes.
Declare all planned training and test data sources at use case intake, with provenance status for each.
Plan the interpretability approach at design stage to ensure source provenance can be traced and disclosed to users.
Document actual provenance for each data source during collection: origins, methods, timestamps, custodian identity.
Define and execute a domain-specific hallucination test suite before deployment. Treat hallucination rate above threshold as a blocking defect.
Construct synthetic evaluation datasets for knowledge-boundary scenarios. Use to validate model refusal behaviour.
Calibrate the groundedness threshold against the hallucination test suite pre-release; sign off the threshold in the validation pack.
source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST AI RMF MEASURE 2.7 / 2.9 (validity, reliability, robustness)Checking that the answer is actually supported by the documents it was given, and showing sources you can click.
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Penetration test the model inference API to identify exploitable access control weaknesses and rate limiting bypass vectors.
Conduct periodic inference attack vulnerability assessments as new attack methods emerge. Monitor query pattern anomalies.
Attack each candidate model with membership-, attribute-, and inversion-inference harnesses before promotion. Block release when attack advantage exceeds the agreed ceiling.
source: MITRE ATLAS AML.T0024.000 (Infer Training Data Membership); Carlini et al. 'Membership Inference Attacks From First Principles' (LiRA); NIST AI RMF MEASURE 2.7Configure per-principal budgets and probing-detection rules on the gateway before exposure. Verify enforcement with synthetic attack traffic.
source: MITRE ATLAS AML.M0004 (Restrict Number of ML Model Queries), AML.T0024 (Exfiltration via ML Inference API); NIST SP 800-53 SI-4, AU-6Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Run adaptive multi-turn jailbreak fuzzing against every release candidate. Gate release on attack-success rate within threshold and re-test each fixed bypass.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7Assemble the golden probe set and baseline pass rates before first release. Obtain risk-owner approval of coverage and thresholds.
source: NIST AI RMF MEASURE 2.7 and MANAGE 4.1; MITRE ATLAS AML.M0015 (Adversarial Input Detection / monitoring); NIST SP 800-53 SI-4, CM-3 Configuration Change ControlOn the AI provider/platform side, detect sustained abuse independent of any single refusal: per-principal analytics on remote-command-execution volume and external-target breadth, anti-forensic tradecraft, and bulk-data API processing — with rate-limit / session kill-switch on confirmed abuse. Make refusal stateful so a refused objective cannot be re-entered as a persisted auto-loaded context file (e.g. claude.md), and treat writes into auto-loaded model-context files as security-relevant. Closes the gap that per-turn refusal leaves when the operator is the adversary.
source: Case study: gambit-mexico-gov-ai-breach (Gambit Security / Eyal Sela technical report; campaign began 27 Dec 2025, reported through mid-Feb 2026)Monitor production for anomalous data transfers in real time. Alert on any transfer outside approved data flow boundaries.
Tag personal data with subject identifiers at ingestion and maintain an artefact inventory map of every store it reaches. Keep lineage current so erasure can propagate.
source: NIST AI RMF MANAGE 4.1 (post-deployment response); NIST SP 800-53 SI-12 Information Management and Retention, PT-2/PT-3 (personal data processing)Seed registered canary records into the fine-tuning corpus during data preparation. Control the seed manifest so canaries stay traceable and tamper-proof.
source: MITRE ATLAS AML.T0024 (Exfiltration via ML Inference API), AML.T0024.000 (Infer Training Data Membership); NIST AI RMF MEASURE 2.7A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.
Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.
Build the versioned injection corpus into CI/CD as a pre-release gate. Baseline attack success and sign off the release threshold.
source: NIST AI RMF MANAGE 2.2 / MEASURE 2.7; MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM01:2025 (adversarial testing)Log the exact post-truncation context the model ingested, including retrieved and tool-returned content rather than only user input, with redaction applied at read time, so indirect injection via that content is forensically visible.
source: Interactive-control reconciliation: ctrl-logging (partial coverage)Prioritise jailbreak and adversarial safety testing in pre-deployment validation. Block deployment if prohibited outputs pass filter.
Conduct targeted red team exercises to elicit toxic outputs through jailbreaks and adversarial prompts. Treat bypass as blocking defect.
Verify a signed attestation and content hash on every dataset shard at ingestion. Reject unsigned or hash-mismatched data before it reaches the training pipeline.
source: MITRE ATLAS AML.M0007 (Sanitize Training Data), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity, SR-4 ProvenanceGate every model promotion on backdoor-trigger probes and a behavioral diff against the approved baseline. Block release on significant regressions or trigger-pattern anomalies.
source: MITRE ATLAS AML.M0014 (Verify ML Artifacts), AML.M0019 (Red Teaming); NIST AI RMF MANAGE 2.2 and MEASURE 2.7Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.
Watching for strange new memories — like instructions that suddenly appear — and holding them aside until checked.
Define per-agent behavioural baselines and detection rules during build. Validate against simulated misuse and sign off thresholds before release.
source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System MonitoringBuild signed, append-only tool-call logging into the orchestrator against a defined audit schema. Block release until completeness and tamper-evidence tests pass.
source: NIST SP 800-53 AU-2 / AU-9 / AU-10 (audit events, protection of audit info, non-repudiation); MITRE ATLAS AML.M0015 (monitoring / validate inputs)Treat outbound connections to AI/LLM provider APIs as a monitored egress channel: allowlist which hosts may reach them, baseline usage (cadence, entropy, initiating process), and alert on out-of-profile traffic — because a high-reputation destination cannot itself be trusted once it is programmable and can relay encrypted commands/results.
source: Case study: sesameop-openai-assistants-api-c2Build and baseline the golden-set suite against the vendor model before go-live. Sign off thresholds with the model risk owner as a release condition.
source: OWASP Top 10 for LLM Apps LLM03:2025 Supply Chain (monitoring changed model components); MITRE ATLAS AML.M0015 (Adversarial Input Detection / validation); NIST AI RMF MEASURE 2.6 / MANAGE 4.1Re-verify hashes and signatures on every vendor model update before promotion. Reconcile deployed artifacts against the AIBOM on a set cadence.
source: OWASP Top 10 for LLM Apps LLM03:2025 Supply Chain; MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SR-4 / SR-11 (provenance, component authenticity)Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.
Tag AI-made content with a signed 'where it came from' label and an invisible watermark, and check those signals downstream — so AI media can be traced and flagged.
Use production feedback (user corrections, fact-check failures) to drive periodic RLHF cycles. Update model when error rates trend upward.
Require user-facing interfaces to disclose Gen AI limitations and hallucination risk before go-live.
Score every RAG answer for groundedness before release; block, fall back, or escalate responses below the faithfulness threshold.
source: OWASP Top 10 for LLM Apps LLM09:2025 Misinformation; NIST AI RMF MEASURE 2.7 / 2.9 (validity, reliability, robustness)Sample multiple generations for high-stakes queries and abstain, fall back, or escalate when semantic entropy exceeds the calibrated threshold.
source: Farquhar et al. 'Detecting hallucinations using semantic entropy' (Nature 2024); NIST AI RMF MEASURE 2.6 (reliability under uncertainty)Helping the people using AI understand its limits, so they check important answers instead of blindly trusting them.
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
Conduct targeted red team exercises for inference attack categories (membership inference, model extraction, attribute inference) before deployment.
Define the minimum response surface and test it with membership/attribute-inference probes pre-release. Block promotion if any probe recovers raw confidence signals.
source: MITRE ATLAS AML.T0024.001 (Invert ML Model); Jia et al. MemGuard (output perturbation defence); OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information DisclosureMeter inference traffic per principal and flag probing signatures with behavioural analytics. Throttle, step-up, or suspend flagged sessions.
source: MITRE ATLAS AML.M0004 (Restrict Number of ML Model Queries), AML.T0024 (Exfiltration via ML Inference API); NIST SP 800-53 SI-4, AU-6Penetration test the model inference layer to identify specific adversarial input vulnerabilities.
Score every prompt and response with an inline safety classifier; trip a circuit breaker on sessions with sustained anomalous scores. Keep thresholds under change control.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5Re-run the jailbreak fuzzing harness on a recurring cadence with newly observed attack techniques added. Escalate threshold breaches for remediation.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7Require measured-boot/runtime attestation of the inference serving binary and partition KV/prefix caches per tenant, closing decode-time serving-layer tampering and co-tenancy timing side channels that artifact weight-hashing cannot detect.
source: Interactive-control reconciliation: ctrl-stack-attestation (partial coverage)Implement OOD detection in the input filtering layer. Reject or escalate inputs outside the S1-defined scope.
Configure HITL triggers for outputs in input domains that diverge from the training distribution. Log all out-of-scope interventions.
Monitor for privacy incidents in production including personal data appearing in outputs. Notify regulators within required timeframes.
Tag every memory and vector record with subject-id and retention class; partition stores per tenant/user. Prove the erasure and isolation paths in testing before release.
source: OWASP Agentic AI Threats & Mitigations (memory/knowledge-base privacy); NIST SP 800-53 SI-12 Information Management and RetentionConduct periodic data leakage audits including training data memorisation testing. Escalate confirmed leakage incidents to PDPA notification process.
Implement tamper-evident capture of prompts, outputs, and version state during build. Verify a full incident timeline can be reconstructed before go-live.
source: NIST SP 800-86 Guide to Integrating Forensic Techniques into Incident Response; ISO/IEC 27037 evidence handling; NIST SP 800-61r2 (Detection & Analysis – evidence handling)Run agent tool calls in a network-restricted sandbox behind a deny-by-default egress allow-list. Require security approval for any destination added.
source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; OWASP Agentic AI Threats & Mitigations (tool-misuse / exfiltration); NIST SP 800-53 SC-7 Boundary Protection / AC-4Classify content sources into trust tiers at design; place privileged tools behind a tier requiring user-originated intent or human approval. Sign off the trust-tier map before build.
source: Google DeepMind CaMeL (2025); OWASP Agentic AI Threats & Mitigations (tool misuse / compromise); NIST SP 800-53 AC-6 Least PrivilegeRe-run injection evals on every template change and periodically against new attack techniques. Manage the spotlighting wrapper under change control.
source: Microsoft 'Spotlighting' technique (Hines et al. 2024); OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate external content)Use user feedback, reviewer escalations, and monitoring signals to identify and remediate content safety gaps iteratively.
Scan every ingestion batch with spectral-signature and clustering detectors before training. Quarantine flagged clusters for human review against documented thresholds.
source: MITRE ATLAS AML.M0007 (Sanitize Training Data); OWASP Top 10 for LLM Apps LLM04:2025 Data and Model Poisoning; NIST AI RMF MEASURE 2.7Continuously correlate live agent-memory writes against output behaviour to flag drift, then quarantine and roll back the suspected-poisoned memory record across all affected sessions.
source: Interactive-control reconciliation: ctrl-memory-quarantine (partial coverage)Build sandbox profiles per tool class and run escape and egress tests before release. Treat any containment failure as a blocking defect.
source: NIST SP 800-53 SC-39 Process Isolation; MITRE ATLAS AML.M0020 (Generative AI Guardrails / restrict execution environment)Label tool and external content as tainted and propagate the label through the agent context. Block privileged calls whose parameters derive from tainted outputs and prove it with injection tests before release.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate/flag untrusted content); MITRE ATLAS AML.M0015 (Adversarial Input Detection / validate inputs)Build credential revocation and dispatch blocking out-of-band of the agent loop. Gate release on an end-to-end kill test meeting the latency target.
source: OWASP Agentic AI Threats & Mitigations (kill-switch / emergency stop); NIST AI RMF MANAGE 2.4Require idempotency keys, dry-run, and rollback on every state-changing tool. Gate onboarding on duplicate-call and rollback tests passing.
source: NIST SP 800-53 SI-10 Information Input Validation / CP-10 System Recovery and ReconstitutionRed-team tool-misuse and privilege-escalation paths before release. Gate deployment on remediation or signed risk acceptance of all findings.
source: NIST AI RMF MEASURE 2.7 (adversarial testing); MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM06:2025 Excessive AgencyPermit outbound tool calls only to allow-listed destinations and DLP-scan arguments and payloads. Block or quarantine calls carrying sensitive data to disallowed sinks.
source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information DisclosureEnforce hard per-task ceilings on tool calls, spend, and data volume with a circuit breaker that halts the run. Fail closed when any ceiling is hit.
source: OWASP Top 10 for LLM Apps LLM10:2025 Unbounded Consumption; OWASP Agentic AI Threats & Mitigations (resource/rate limiting)Baseline normal tool-call behaviour per agent and alert on rate, sequence, or argument anomalies. Auto-throttle or quarantine on high-confidence deviations.
source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System MonitoringDesign all vendor model access behind a gateway with pinned versions, a second-vendor fallback, and a documented exit plan. Gate architecture sign-off on no single-sourcing.
source: OWASP Top 10 for LLM Apps LLM03:2025 Supply Chain (maintain supported model versions); NIST AI RMF GOVERN 6.1 (third-party resilience, contingency); established AI-gateway fallback practiceVerify every third-party model artifact against its AIBOM hashes and signatures before load. Fail the build on any unverified artifact.
source: OWASP Top 10 for LLM Apps LLM03:2025 Supply Chain; MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SR-4 / SR-11 (provenance, component authenticity)Review independent vendor assurance on cadence, log gaps, and track remediation. Keep the shared-responsibility matrix current so every control has an owner.
source: NIST AI RMF GOVERN 6.1 / GOVERN 6.2 (third-party risk and assurance); NIST SP 800-53 SR-6 Supplier Assessments and Reviews, SA-9 External System Services; EU AI Act GPAI provider obligationsExecute red team tests targeting adverse impact boundary cases and edge population scenarios.
Collect adverse outcome feedback from affected users. Use reports to trigger model updates when adverse impact exceeds threshold.
See it go wrong — related scenarios
A support chatbot invents a policy — and the company is held to it
One support ticket sends an agent into an unbounded, bill-melting loop
An attacker edits the wiki; the assistant cites the lie back to everyone
An ops agent gets one god-mode credential — and one misread wipes production
Every message looks innocent — but together they walk the model past its guardrails
A team of agents agrees its way into a confidently wrong answer — and a runaway loop
A support email hides instructions — and the assistant obeys them
A refused request, rewritten as a poem — and the model answers
A text-to-SQL agent runs the model's output straight at the database
A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack
A single inserted letter makes the guard and the model read the same text differently
A speed optimisation becomes a cross-tenant listening device
An attacker crafts a gibberish passage whose embedding sits near thousands of questions — so it's retrieved everywhere
Compromise the pipeline that builds agents, and every new worker is born malicious
Two doors to the same secret: reconstruct the model through its API, or just walk off with the weight file
Subtract the refusal direction during generation — safety off, weights untouched
A compromised serving stack edits the model's activations — the weight hash never changes
Told it's being shut down, an agent reaches for leverage — with no attacker in sight
The safety guard is itself a trained model — and someone poisoned its lessons
The forensic record is itself the attack surface — an agent's log is poisoned, then quietly rewritten
A single poisoned document plants a standing instruction that survives every reset
A cost-saving open-weights swap quietly ships a model with its safety surgically removed
A screenshot that's harmless at full size becomes an order once the system shrinks it
A JSON schema with no field for 'no' forces the sampler past a refusal it would otherwise emit
A capable third-party model that behaves perfectly — until it sees the trigger
An attacker captures the agent's bearer token — and inherits its authority
A trusted MCP email tool quietly BCCs every message to an attacker
A forged peer registers on the agent directory — and the planner enlists it
The eval gate that was supposed to catch the agent is itself the thing being attacked
An inbox summary quietly ships a secret to an attacker's server