📦

Model & Software Supply Chain

Capability · Supply chain

Where the model's weights, packages and tools come from — public hubs and registries you download and trust before anything runs.

Components involved

🏪 Model / Package Registry 🧬 Model Weights & Registry 📥 Ingestion Pipeline 🏗️ Serving Infrastructure 🧰 MCP / Plugin Server

Seen in: Model / Package Supply Chain, Training-Data Pipeline

Likely associated risks

Risks that attach to this capability’s components. Sorted with the most characteristic first.

Supply-Chain Compromisehigh

The AI is built from parts made by others — models, libraries, tool packs, datasets. If any of those is tampered with before you get it, your system inherits the problem.

Model Backdoors / Sleeper Agentshigh

A model can be secretly trained to behave normally — until it sees a hidden trigger, then it switches to malicious behaviour. It passes all the usual tests because the trigger is a secret.

Knowledge / Training Data Poisoninghigh

Someone slips bad information into the documents the AI learns from or looks things up in — so it confidently repeats falsehoods or follows planted instructions.

Abliteration / Safety Removalhigh

Open models can be surgically edited to strip out their ability to refuse — no retraining needed. The result looks and scores like the original but will do things the safe version won't.

Indirect Prompt Injectioncritical

The attacker doesn't talk to the AI directly — they hide instructions inside something the AI will later read: a web page, a document, an email, a tool's output. When the AI reads it to help you, it quietly obeys the hidden commands.

Sensitive Data Leakagecritical

Private information escapes — the AI reveals secrets in its answer, or an attacker tricks it into emailing or posting your data somewhere they control.

Tool Misusehigh

The AI uses a real tool the wrong way — sends the email to the wrong person, runs the wrong query, calls the dangerous action when a safe one would do.

Unsafe Tool / Code Executionhigh

When the AI can run code or commands, a bad instruction can become a real attack on the computer running it — reading files, reaching the network, or worse.

Tool Poisoning / MCP Description Attackshigh

Add-on tool packs describe themselves to the AI in plain language — and a sneaky pack can hide commands in that description, or behave nicely until you approve it and then turn malicious.

Rogue & Impersonated Agentshigh

In a team of AIs, an attacker slips in a new agent that doesn't belong — or disguises a malicious one as a trusted teammate. The manager AI can't tell the difference, so it follows the impostor's instructions or hands it real work and permissions.

Agent Misalignment / Goal Misgeneralizationhigh

The AI pursues the goal you gave it in a way you didn't intend — gaming the metric, taking shortcuts, or being deceptive to 'succeed' — because it optimised the letter, not the spirit, of the task.

Inference-Time & Serving-Layer Manipulationhigh

Even if the model itself is genuine, the machinery running it can be tweaked at the moment of answering — nudging its 'thoughts' or biasing word choice — in ways that leave no trace in the model file.

Harmful / Non-Consensual Media Generationhigh

Image, video, and audio generators can be pushed to produce content that is illegal or seriously harmful — non-consensual intimate images, sexual content of minors, graphic or extremist material — especially with open models that have had their safety stripped.

Model Drift & Silent Degradationmedium

The AI's behaviour quietly changes over time — a vendor updates the model, or the world moves on from its training — and things that used to work start failing.

Resource Exhaustion / Denial of Walletmedium

An AI agent gets stuck doing far more work than intended — looping, retrying, spawning more sub-tasks, or being baited into expensive actions — and the bill (compute, API calls, real money) balloons before anyone notices.

KV-Cache & Inference-State Side Channelsmedium

To go faster, servers reuse work between users who share the same opening text. That shortcut can leak clues — timing differences that reveal what someone else's prompt contained.

Capability / Architecture Disclosuremedium

The AI reveals how it's built — its hidden instructions, the names and rules of the tools it can use, how the system is wired together. On its own that can seem harmless, but it hands an attacker the blueprint to plan a far more effective attack.

Watermark & Provenance Evasionmedium

The labels and invisible watermarks meant to prove whether content is AI-made can be removed, faked, or simply never added — so 'no watermark' doesn't mean 'real', and a watermark can be laundered away by editing or re-recording.

Training-Data Rights & Provenancemedium

Models are trained on huge piles of images, audio, and text — often scraped without clear permission. That raises copyright and consent problems, and the model can sometimes memorize and spit back its training examples (a watermark, a real photo, private text).

Controls & guardrails that address this

11313 proposed

Guardrails across this building block's risks, grouped by control function — each with its AI lifecycle stage(s) and every risk it addresses. Filter by control category below.

Control category

Preventive · 68

Third-party accountability requirements in RFP and contracts

Define third-party AI accountability requirements before vendor engagement. Embed in RFP and contract specifications.

Lifecycle stage1 – Use Case Context & Design

AddressesSupply-Chain Compromise

Vendor AI governance due diligence at selection

Conduct AI governance due diligence on third-party providers at selection stage. Reject providers failing minimum maturity.

Lifecycle stage1 – Use Case Context & Design

AddressesSupply-Chain Compromise

Required vendor model cards and validation reports

Require third-party providers to submit model cards, validation reports, and security documentation before integration.

Lifecycle stage3 – Onboarding, Build & Review

AddressesSupply-Chain Compromise

Ongoing vendor incident notification and reporting obligations

Enforce ongoing third-party accountability obligations including incident notification and periodic performance reporting.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesSupply-Chain Compromise

Independent third-party performance and compliance monitoring

Conduct independent performance and compliance monitoring of third-party AI components. Escalate when SLA or compliance obligations are missed.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesSupply-Chain Compromise

Continuous third-party assurance with shared-responsibility matrix and obligation flow-down

Allocate every control in a shared-responsibility matrix and flow down regulatory obligations in contract at onboarding. Gate approval on initial assurance artefacts.

source: NIST AI RMF GOVERN 6.1 / GOVERN 6.2 (third-party risk and assurance); NIST SP 800-53 SR-6 Supplier Assessments and Reviews, SA-9 External System Services; EU AI Act GPAI provider obligations

Lifecycle stage3 – Onboarding, Build & Review

AddressesSupply-Chain Compromise

Patch-currency, network isolation & attested version inventory for AI inference-serving runtimes✚ proposed

Treat the model-serving runtime (Triton, vLLM, TGI, Ray Serve, etc.) as managed, attested, version-pinned inventory subject to a patch SLA; require the inference endpoint to be authenticated and network-segmented (never unauthenticated on an untrusted segment); and least-privilege the serving host's identity and egress so a runtime RCE cannot trivially exfiltrate models or pivot. Closes the gap that artifact-provenance controls leave open: integrity of the *data plane that runs the model*, not just of the model artifact.

source: Case study: nvidia-triton-rce-chain (Wiz Research, CVE-2025-23319/-23320/-23334)

Lifecycle stage4 – Deployment & Serving

AddressesSupply-Chain Compromise

Keep provider credentials out of third-party plugin/tool custody: broker short-lived, per-tool, revocable tokens (OAuth) instead of long-lived pasted API keys, and require explicit user consent before any secret leaves the host✚ proposed

Third-party developer tools (IDE plugins, MCP servers) must not store or transmit long-lived provider API keys. Issue short-lived, scoped, revocable tokens via a broker/OAuth flow, and gate any first-time outbound transmission of secret-shaped data behind an explicit consent prompt — so a trojanized tool has no long-lived credential to exfiltrate and any attempt is visible.

source: Case study: jetbrains-marketplace-ai-keystealer-plugins

Lifecycle stage3 – Development & Tooling

AddressesSupply-Chain Compromise

Third-party AI-integration credential containment: minimise & bind OAuth grants, prefer short-lived tokens, monitor per-integration data egress, and keep a tested mass-revocation kill-switch✚ proposed

Treat each third-party AI integration as a privileged non-human principal: issue least-scope, IP/device-bound, short-lived grants (avoid 'full' scope and standing long-lived refresh tokens), instrument the integration's data egress for volume/object-breadth/destination anomalies, and maintain a tested one-move revocation path for all of an integration's tokens so a single vendor-side compromise cannot fan out into hundreds of standing footholds.

source: Proposed from case salesloft-drift-oauth-supply-chain (UNC6395). Grounded in GTIG remediation guidance — restrict Connected App scopes (no 'full'), enforce IP restrictions, treat all Drift-connected tokens as compromised: https://cloud.google.com/blog/topics/threat-intelligence/data-theft-salesforce-instances-via-salesloft-drift

Lifecycle stage5 – Usage, Monitoring & Change

AddressesSupply-Chain Compromise

Broker LLM/cloud secrets out of the gateway process: short-lived scoped tokens + per-provider spend/egress monitoring✚ proposed

Do not store long-lived multi-provider LLM keys (or ambient cloud/K8s credentials) in the gateway/proxy's plaintext process environment. Issue short-lived, scoped tokens from a secret broker at request time, isolate the serving stack from host cloud/cluster credentials, and monitor per-provider spend and egress so a stolen key surfaces as anomalous usage — capping the loot a compromised gateway dependency can harvest.

source: Case study: teampcp-litellm-pypi-gateway-compromise

Lifecycle stage4 – Deployment & Serving

AddressesSupply-Chain Compromise

Weight provenance, hashing & pre-deploy evalsinteractive

Knowing exactly where the model came from, checking it hasn't been swapped, and testing its behaviour before going live.

AddressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Supply-Chain Compromise Abliteration / Safety Removal Model Backdoors / Sleeper Agents Training-Data Rights & Provenance

MCP/plugin pinning, manifest hashing & re-reviewinteractive

Treating add-on tool packs like software you vet: locking to a reviewed version and re-checking whenever it changes.

AddressesTool Poisoning / MCP Description Attacks Supply-Chain Compromise

Serving-stack & provisioning attestation, cache isolationinteractive

Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.

AddressesSensitive Data Leakage Supply-Chain Compromise KV-Cache & Inference-State Side Channels Inference-Time & Serving-Layer Manipulation Watermark & Provenance Evasion

Role-based access controls

Design strict RBAC on training data repositories at design stage. Define approved contributor list and approval workflow.

Lifecycle stages1 – Use Case Context & Design2 – Data Acquisition & Processing4 – Deployment

AddressesKnowledge / Training Data Poisoning Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Input filtering

Apply anomaly detection on the training data ingestion pipeline to identify poisoned or tampered batches.

Lifecycle stages2 – Data Acquisition & Processing5 – Usage, Monitoring & Change

AddressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Sensitive Data Leakage

RAG / knowledge-base ingestion allow-listing with continuous index integrity re-validation

Define and approve the source allow-list and write-time scanning during build. Prove non-allow-listed and injection-bearing writes are rejected before go-live.

source: OWASP Top 10 for LLM Apps LLM04:2025 Data and Model Poisoning, LLM08:2025 Vector and Embedding Weaknesses; NIST SP 800-53 AC-3 / SI-7

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning

Ingestion sanitisation & source allowlistinginteractive

Cleaning documents as they enter the library — stripping hidden text and active instructions — and only ingesting from trusted places.

AddressesIndirect Prompt Injection Knowledge / Training Data Poisoning

Delimiting / spotlighting of untrusted contentinteractive

Clearly fencing off outside text — 'everything between these marks is just data, not instructions' — so the model is less likely to obey it.

AddressesIndirect Prompt Injection

Egress allowlisting & DLP on tool argumentsinteractive

Controlling where the AI can send data, so secrets can't be quietly shipped to a stranger's address or website.

AddressesIndirect Prompt Injection Sensitive Data Leakage Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks

Least-privilege identity & scoped credentialsinteractive

Giving the agent only the keys it needs for the current task, not a master key to everything.

AddressesPrompt Injection (direct)Indirect Prompt Injection Sensitive Data Leakage Excessive Agency Tool Misuse Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks Confused Deputy (cross-agent)Rogue & Impersonated Agents Resource Exhaustion / Denial of Wallet Capability / Architecture Disclosure

Human-in-the-loop approval on high-risk actionsinteractive

Pausing to ask a person before doing anything big or hard to undo — sending money, deleting data, emailing customers.

AddressesIndirect Prompt Injection Overreliance / Automation Bias Excessive Agency Tool Misuse Cascading Multi-Agent Errors Agent Misalignment / Goal Misgeneralization Resource Exhaustion / Denial of Wallet Allocative Harm in Multi-User Arbitration Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Approved storage location policy from collection

Establish data transfer and storage policy for AI training data. Enforce approved storage locations from point of collection.

Lifecycle stage2 – Data Acquisition & Processing

AddressesSensitive Data Leakage

DLP controls in data acquisition environment

Implement DLP controls in the data acquisition environment to prevent unauthorised extraction or transfer of training data.

Lifecycle stage2 – Data Acquisition & Processing

AddressesSensitive Data Leakage

Approval-gated data transfers from build environment

Enforce data handling policy in the build environment. Require explicit approval for any data transfers outside the environment.

Lifecycle stage3 – Onboarding, Build & Review

AddressesSensitive Data Leakage

DLP controls confining build-environment training data

Configure DLP controls in the build environment to block training data from leaving approved boundaries.

Lifecycle stage3 – Onboarding, Build & Review

AddressesSensitive Data Leakage

Privacy risk assessment and DPIA determination

Conduct a privacy risk assessment at use case design stage. Determine if a DPIA is required before data acquisition.

Lifecycle stage1 – Use Case Context & Design

AddressesSensitive Data Leakage

Consent, minimisation, and anonymisation during acquisition

Apply S1-defined privacy controls during data acquisition: verify consent, minimise data, anonymise personal data.

Lifecycle stage2 – Data Acquisition & Processing

AddressesSensitive Data Leakage

Validated anonymisation and masking before training

Apply anonymisation and masking controls to personal data before use in model training. Validate de-identification effectiveness.

Lifecycle stage2 – Data Acquisition & Processing

AddressesSensitive Data Leakage

Privacy by Design via differential privacy

Apply Privacy by Design in model architecture using differential privacy or federated learning where technically feasible.

Lifecycle stage3 – Onboarding, Build & Review

AddressesSensitive Data Leakage

Operational consent management and privacy notice

Publish the privacy notice and confirm consent management is operational before go-live.

Lifecycle stage4 – Deployment

AddressesSensitive Data Leakage

Purpose-limitation enforcement on agent tool calls and cross-system data aggregation

Define and sign off a purpose-to-data-source matrix with lawful basis at intake. Make it the approved baseline for runtime enforcement.

source: NIST AI RMF MAP 1.1 / MANAGE 2.2 (context and intended purpose); NIST SP 800-53 AC-4 / AC-3 (purpose-based access enforcement)

Lifecycle stages1 – Use Case Context & Design5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Inference-time PII redaction and third-party LLM data-processing controls

Sign zero-retention/no-training terms with each model provider and obtain DPO sign-off on the data flow before enabling any endpoint.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 SC-8 / AC-4 (information flow enforcement)

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

AddressesSensitive Data Leakage

Input/output filtering

Implement output filters to detect and suppress quasi-identifying attribute combinations in model responses.

Lifecycle stage3 – Onboarding, Build & Review

AddressesBias Amplification & Sycophancy Overreliance / Automation Bias Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Query-time access-control filtering of the retrieval/RAG corpus by caller entitlements (document-level ACL enforcement)

Propagate source ACLs and classification labels onto every chunk at ingestion. Reject documents whose entitlements cannot be resolved.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 AC-3 / AC-4 Information Flow Enforcement; OWASP Agentic AI Threats & Mitigations (privilege compromise)

Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Output-side DLP inspection with named-entity and PII redaction on the response path

Scan every model response inline with DLP before delivery; redact or block PII, PAN and MNPI matches. Keep the rule set version-controlled.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; NIST SP 800-53 SC-7(10) Prevent Exfiltration, SI-4

Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Vet allowlisted egress destinations for server-side-fetch (SSRF) primitives; exclude or proxy-inspect any allowlisted service that can fetch arbitrary attacker-controlled URLs✚ proposed

An egress allowlist only contains exfiltration if no allowlisted destination can be coerced into fetching an attacker-controlled URL. Audit each allowlisted domain/endpoint for image-search / link-preview / URL-fetch features (SSRF proxies), and either remove them, pin them to fixed paths, or route them through an inspecting forward proxy. Pair with finishing output sanitization before render so no auto-fetch fires un-inspected.

source: Case study: searchleak-copilot (Varonis Threat Labs, CVE-2026-42824; reported by Microsoft as critical, mitigated server-side ~Jun 2026)

Lifecycle stage4 – Deployment & Serving

AddressesSensitive Data Leakage

Per-user retrieval ACLsinteractive

Making sure the library only returns documents this particular user is allowed to see.

AddressesSensitive Data Leakage

Human approval gate on irreversible and high-impact tool calls

Classify tools by impact and reversibility at design and define which calls require human approval. Obtain governance sign-off on the thresholds before build.

source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (require human approval for high-impact actions); NIST AI RMF MANAGE 2.4

Lifecycle stages1 – Use Case Context & Design3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesTool Misuse

Per-agent tool allow-list with strict JSON-schema argument validation

Bind each agent role to an explicit tool allow-list and validate every call against a strict JSON Schema at the orchestrator. Reject unlisted tools and out-of-bounds arguments before dispatch.

source: OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit tools/permissions); OWASP Agentic AI Threats & Mitigations (tool access restriction)

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesTool Misuse

Least-privilege per-tool scoped, short-lived credentials

Mint short-lived, task-scoped credentials per tool. Block issuance outside the approved scope register and enforce automatic expiry.

source: NIST SP 800-53 AC-6 Least Privilege; OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency (limit permissions)

Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change

AddressesTool Misuse

Egress destination allow-listing with DLP inspection of tool arguments

Review DLP hits and blocked-egress events, tune detectors, and recertify the destination allow-list periodically. Route new destinations through security change control.

source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure

Lifecycle stage5 – Usage, Monitoring & Change

AddressesTool Misuse

Classify each tool/MCP integration's data channel by who can write to it; taint-gate tool-response data from any third-party-writable source so it cannot drive actions without a provenance-aware approval gate✚ proposed

When onboarding an MCP/tool integration, do not stop at vetting the tool's code/manifest — also classify whether an unauthenticated or external party can write the data the tool returns (open ingestion, public write keys like a Sentry DSN, shared inboxes/issue trackers). Treat tool-response data from any third-party-writable source as untrusted ingress: taint-mark it and require a provenance-aware HITL gate (showing the exact action and its originating tool response) before any command/tool call derived from it executes. Closes the agentjacking vector where a trusted integration's legitimate data channel carries attacker-written instructions; pairs with least-privilege session scope and sandboxed execution without ambient credentials.

source: Case study: agentjacking-sentry-mcp

Lifecycle stage4 – Deployment & Serving

AddressesTool Misuse

Decode-time output constraints (low temperature, grammar/JSON-schema-constrained decoding)✚ proposed

Constrain generation at decode time with low temperature and grammar/schema-constrained decoding so the model emits well-formed, low-variance structured output by construction, preventing malformed responses and erratic tool-call arguments before they are produced.

source: Interactive-control reconciliation: ctrl-decoding-controls (partial coverage)

Lifecycle stage4 – Deployment

AddressesTool Misuse

Memory-write integrity validation with provenance tagging, audit/purge and TTL bounds✚ proposed

Gate every write to an agent's persistent/self-modifying memory through schema validation and provenance/trust tagging, expose stored entries for user-visible audit and purge, and apply TTLs so any planted instruction self-expires and cannot silently persist across sessions.

source: Interactive-control reconciliation: ctrl-memory-validation (partial coverage)

Lifecycle stage5 – Usage, Monitoring & Change

AddressesTool Misuse

Tool/MCP manifest hashing with diff-triggered re-review and namespace isolation against tool shadowing✚ proposed

Treat each tool/MCP description as untrusted code by hashing the manifest, blocking and re-reviewing any silent diff on update instead of auto-accepting it, and namespacing tool identifiers so a poisoned description cannot shadow a trusted tool.

source: Interactive-control reconciliation: ctrl-mcp-pinning (partial coverage)

Lifecycle stage5 – Usage, Monitoring & Change

AddressesTool Misuse

Tool argument validation & sandboxinginteractive

Double-checking the details of every action the AI wants to take, and running risky actions in a locked-down environment.

AddressesExcessive Agency Tool Misuse Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks

Decoding controls (temperature, constrained output)interactive

Turning down randomness and forcing answers into a strict format so the model improvises less.

AddressesHallucination Tool Misuse

Inter-agent authentication & admission controlinteractive

Give every AI agent a verifiable ID badge, keep a guest list of which agents are allowed on the team, and check the badge on every message — so an impostor or an uninvited agent can't be trusted.

AddressesRogue & Impersonated Agents

Per-agent identity & taint-marked messagesinteractive

Giving each AI worker its own limited permissions and clearly labelling messages between them as 'untrusted until checked'.

AddressesExcessive Agency Confused Deputy (cross-agent)Rogue & Impersonated Agents Distributed / Cross-Agent Jailbreak Cascading Multi-Agent Errors Agent Misalignment / Goal Misgeneralization

Ethical design assessment in onboarding

Conduct ethical design assessment at use case intake before build begins. Require sign-off by ethics or risk committee.

Lifecycle stage1 – Use Case Context & Design

AddressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)

Prohibited outputs and ethical boundaries in design doc

Define prohibited outputs and ethical boundary constraints in the use case design document before build.

Lifecycle stage1 – Use Case Context & Design

AddressesAgent Misalignment / Goal Misgeneralization

Content Moderation

Deploy content moderation controls aligned to S1 ethical constraints. Validate filter accuracy before deployment.

Lifecycle stage3 – Onboarding, Build & Review

AddressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Use of pre-trained models

Select a foundation model with documented safety fine-tuning (RLHF, Constitutional AI). Verify alignment benchmarks.

Lifecycle stage3 – Onboarding, Build & Review

AddressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Jailbreak detection

Implement adversarial example detection at the inference boundary. Block or flag inputs matching known attack patterns.

Lifecycle stage3 – Onboarding, Build & Review

AddressesInference-Time & Serving-Layer Manipulation Prompt Injection (direct)

Model and adapter supply-chain integrity verification (signed weights, checksum attestation, LoRA provenance)

Sign and hash-register every model and adapter with a provenance manifest at onboarding. Refuse registry admission for unsigned artifacts.

source: MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity; CSA MAESTRO supply-chain layer

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

AddressesInference-Time & Serving-Layer Manipulation

Real-time input/output classifier guardrails (e.g. Llama Guard / Prompt Guard-style) with circuit-breaker tripwires

Sample classifier verdicts and breaker trips on a cadence; retune thresholds and update signatures for confirmed misses.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5

Lifecycle stage5 – Usage, Monitoring & Change

AddressesInference-Time & Serving-Layer Manipulation

Risk-tiered minimum monitoring requirements at design

Define minimum monitoring requirements at design stage calibrated to the use case risk tier.

Lifecycle stage1 – Use Case Context & Design

AddressesModel Drift & Silent Degradation

Programmable conversation controls

Configure monitoring hooks in the conversation layer at deployment to capture metrics required by S1 monitoring requirements.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

AddressesHallucination Model Drift & Silent Degradation

Fine-tuning

Execute a controlled fine-tuning cycle on refreshed data when staleness is confirmed. Validate before promoting to production.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesHallucination Model Drift & Silent Degradation

Approved use scope baseline for OOD controls

Define approved use case scope and expected input distribution at design stage. Document as the governance baseline for OOD controls.

Lifecycle stage1 – Use Case Context & Design

AddressesModel Drift & Silent Degradation

Modular architecture

Design a scope-enforcement layer in the architecture to isolate the AI system from off-topic or out-of-distribution inputs.

Lifecycle stage1 – Use Case Context & Design

AddressesModel Drift & Silent Degradation

Calibrated differential-privacy training budget with documented epsilon ceiling and per-individual contribution clipping

Train PII-bearing models with DP-SGD under a documented epsilon/delta budget. Approve the budget against the enterprise epsilon-ceiling policy before training.

source: NIST SP 800-226 Guidelines for Evaluating Differential Privacy Guarantees; Abadi et al. 'Deep Learning with Differential Privacy' (DP-SGD); MITRE ATLAS AML.M0007 (Sanitize Training Data)

Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review

AddressesKV-Cache & Inference-State Side Channels

Output confidence masking and structured-response minimisation for natural-language interfaces

Strip raw logits, quantise confidence scores and block training-record echoes at the inference gateway. Keep the output-filter policy under change control.

source: MITRE ATLAS AML.T0024.001 (Invert ML Model); Jia et al. MemGuard (output perturbation defence); OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure

Lifecycle stage4 – Deployment

AddressesKV-Cache & Inference-State Side Channels

Instruction hierarchy / privileged system promptinteractive

Training the model to treat the app's standing instructions as more authoritative than anything a user or document says.

AddressesPrompt Injection (direct)Jailbreak Capability / Architecture Disclosure

Declared data sources and provenance at intake

Declare all planned training and test data sources at use case intake, with provenance status for each.

Lifecycle stage1 – Use Case Context & Design

AddressesTraining-Data Rights & Provenance

Post hoc interpretability techniques

Plan the interpretability approach at design stage to ensure source provenance can be traced and disclosed to users.

Lifecycle stage1 – Use Case Context & Design

AddressesTraining-Data Rights & Provenance

Documented data provenance during collection

Document actual provenance for each data source during collection: origins, methods, timestamps, custodian identity.

Lifecycle stage2 – Data Acquisition & Processing

AddressesTraining-Data Rights & Provenance

Confidence scoring

Apply data quality scoring to all acquired data to document provenance reliability. Flag low-confidence sources for review.

Lifecycle stage2 – Data Acquisition & Processing

AddressesHallucination Training-Data Rights & Provenance

Detective · 28

Golden-set regression canary to detect undisclosed vendor-side model changes

Build and baseline the golden-set suite against the vendor model before go-live. Sign off thresholds with the model risk owner as a release condition.

source: OWASP Top 10 for LLM Apps LLM03:2025 Supply Chain (monitoring changed model components); MITRE ATLAS AML.M0015 (Adversarial Input Detection / validation); NIST AI RMF MEASURE 2.6 / MANAGE 4.1

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesSupply-Chain Compromise

AIBOM-driven cryptographic verification of third-party model artifacts

Re-verify hashes and signatures on every vendor model update before promotion. Reconcile deployed artifacts against the AIBOM on a set cadence.

source: OWASP Top 10 for LLM Apps LLM03:2025 Supply Chain; MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SR-4 / SR-11 (provenance, component authenticity)

Lifecycle stage5 – Usage, Monitoring & Change

AddressesSupply-Chain Compromise

Behavioural evals & regression gatinginteractive

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

AddressesJailbreak Hallucination Model Drift & Silent Degradation Supply-Chain Compromise Distributed / Cross-Agent Jailbreak Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Harmful / Non-Consensual Media Generation Training-Data Rights & Provenance

Vulnerability assessment

Conduct a data poisoning threat assessment at design stage. Identify likely attack vectors and assign risk ratings.

Lifecycle stages1 – Use Case Context & Design4 – Deployment5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Red teaming

Simulate data poisoning attacks (backdoor, label flipping, gradient-based) to assess model resilience before deployment.

Lifecycle stage3 – Onboarding, Build & Review

AddressesJailbreak Model Drift & Silent Degradation Knowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Cryptographic data provenance and signed dataset lineage (C2PA/in-toto attestations)

Verify a signed attestation and content hash on every dataset shard at ingestion. Reject unsigned or hash-mismatched data before it reaches the training pipeline.

source: MITRE ATLAS AML.M0007 (Sanitize Training Data), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity, SR-4 Provenance

Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review

AddressesKnowledge / Training Data Poisoning

Pre-deployment poisoning regression gate via canary backdoor probes and behavioral diff testing

Gate every model promotion on backdoor-trigger probes and a behavioral diff against the approved baseline. Block release on significant regressions or trigger-pattern anomalies.

source: MITRE ATLAS AML.M0014 (Verify ML Artifacts), AML.M0019 (Red Teaming); NIST AI RMF MANAGE 2.2 and MEASURE 2.7

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning

Provenance & content signinginteractive

Keeping a label on every document saying where it came from, so you can tell trusted company docs from random web text.

AddressesIndirect Prompt Injection Knowledge / Training Data Poisoning Training-Data Rights & Provenance

Runtime monitoring & anomaly detectioninteractive

Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.

Full-trace audit logginginteractive

Recording everything — questions, documents fetched, actions taken — so you can investigate when something goes wrong.

AddressesIndirect Prompt Injection Oversight & Audit-Trail Tampering Sensitive Data Leakage Memory Poisoning Excessive Agency Unsafe Tool / Code Execution Tool Poisoning / MCP Description Attacks Confused Deputy (cross-agent)Rogue & Impersonated Agents

Real-time monitoring of anomalous data transfers

Monitor production for anomalous data transfers in real time. Alert on any transfer outside approved data flow boundaries.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Automated DSAR and right-to-erasure propagation across AI artefacts

Tag personal data with subject identifiers at ingestion and maintain an artefact inventory map of every store it reaches. Keep lineage current so erasure can propagate.

source: NIST AI RMF MANAGE 4.1 (post-deployment response); NIST SP 800-53 SI-12 Information Management and Retention, PT-2/PT-3 (personal data processing)

Lifecycle stages2 – Data Acquisition & Processing5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Canary-token and membership-inference red-team probes against training/fine-tuning data memorisation

Seed registered canary records into the fine-tuning corpus during data preparation. Control the seed manifest so canaries stay traceable and tamper-proof.

source: MITRE ATLAS AML.T0024 (Exfiltration via ML Inference API), AML.T0024.000 (Infer Training Data Membership); NIST AI RMF MEASURE 2.7

Lifecycle stages2 – Data Acquisition & Processing3 – Onboarding, Build & Review

AddressesSensitive Data Leakage

Input guardrail / injection classifierinteractive

A screen that reads incoming messages and blocks obvious attacks or banned topics before the model sees them.

AddressesPrompt Injection (direct)Jailbreak Sensitive Data Leakage Distributed / Cross-Agent Jailbreak Capability / Architecture Disclosure Harmful / Non-Consensual Media Generation

Anomaly detection on tool-call sequences and rates

Define per-agent behavioural baselines and detection rules during build. Validate against simulated misuse and sign off thresholds before release.

source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System Monitoring

Lifecycle stage3 – Onboarding, Build & Review

AddressesTool Misuse

Immutable, signed tool-call audit log with full call context

Build signed, append-only tool-call logging into the orchestrator against a defined audit schema. Block release until completeness and tamper-evidence tests pass.

source: NIST SP 800-53 AU-2 / AU-9 / AU-10 (audit events, protection of audit info, non-repudiation); MITRE ATLAS AML.M0015 (monitoring / validate inputs)

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesTool Misuse

Egress monitoring & allowlisting of outbound AI/LLM-provider API traffic from enterprise endpoints (living-off-trusted-services C2)✚ proposed

Treat outbound connections to AI/LLM provider APIs as a monitored egress channel: allowlist which hosts may reach them, baseline usage (cadence, entropy, initiating process), and alert on out-of-profile traffic — because a high-reputation destination cannot itself be trusted once it is programmable and can relay encrypted commands/results.

source: Case study: sesameop-openai-assistants-api-c2

Lifecycle stage5 – Usage, Monitoring & Change

AddressesTool Misuse

Test prioritisation

Prioritise value-misalignment test scenarios in validation. Block deployment if prohibited outputs are produced.

Lifecycle stage3 – Onboarding, Build & Review

AddressesAgent Misalignment / Goal Misgeneralization Synthetic-Media Impersonation (Deepfakes & Voice Clones)Jailbreak

Loop/cost circuit-breakers & consistency checksinteractive

Automatic stop-switches when AIs get stuck in loops, burn too much money, or start disagreeing with each other.

AddressesExcessive Agency Confused Deputy (cross-agent)Distributed / Cross-Agent Jailbreak Cascading Multi-Agent Errors Agent Misalignment / Goal Misgeneralization Resource Exhaustion / Denial of Wallet

Adaptive multi-turn red-team harness with automated jailbreak fuzzing

Run adaptive multi-turn jailbreak fuzzing against every release candidate. Gate release on attack-success rate within threshold and re-test each fixed bypass.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7

Lifecycle stage3 – Onboarding, Build & Review

AddressesInference-Time & Serving-Layer Manipulation

Behavioural drift canaries and golden-set regression gating on every model/config change

Assemble the golden probe set and baseline pass rates before first release. Obtain risk-owner approval of coverage and thresholds.

source: NIST AI RMF MEASURE 2.7 and MANAGE 4.1; MITRE ATLAS AML.M0015 (Adversarial Input Detection / monitoring); NIST SP 800-53 SI-4, CM-3 Configuration Change Control

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesInference-Time & Serving-Layer Manipulation

Provider-side abusive-usage detection with stateful refusal for agentic coding tools✚ proposed

On the AI provider/platform side, detect sustained abuse independent of any single refusal: per-principal analytics on remote-command-execution volume and external-target breadth, anti-forensic tradecraft, and bulk-data API processing — with rate-limit / session kill-switch on confirmed abuse. Make refusal stateful so a refused objective cannot be re-entered as a persisted auto-loaded context file (e.g. claude.md), and treat writes into auto-loaded model-context files as security-relevant. Closes the gap that per-turn refusal leaves when the operator is the adversary.

source: Case study: gambit-mexico-gov-ai-breach (Gambit Security / Eyal Sela technical report; campaign began 27 Dec 2025, reported through mid-Feb 2026)

Lifecycle stage5 – Usage, Monitoring & Change

AddressesInference-Time & Serving-Layer Manipulation

Content provenance & watermarkinginteractive

Tag AI-made content with a signed 'where it came from' label and an invisible watermark, and check those signals downstream — so AI media can be traced and flagged.

AddressesSynthetic-Media Impersonation (Deepfakes & Voice Clones)Harmful / Non-Consensual Media Generation Watermark & Provenance Evasion

Synthetic evaluation datasets

Construct synthetic evaluation datasets during build to serve as the ongoing monitoring baseline.

Lifecycle stage3 – Onboarding, Build & Review

AddressesHallucination Overreliance / Automation Bias Model Drift & Silent Degradation

Robustness testing

Build monitoring infrastructure during build: performance metrics collection, alerting thresholds, dashboards.

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment5 – Usage, Monitoring & Change

AddressesHallucination Overreliance / Automation Bias Model Drift & Silent Degradation

Penetration testing

Penetration test the model inference API to identify exploitable access control weaknesses and rate limiting bypass vectors.

Lifecycle stage3 – Onboarding, Build & Review

AddressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Privacy attack red-team battery with quantified MIA/attribute-inference success ceiling as a release gate

Attack each candidate model with membership-, attribute-, and inversion-inference harnesses before promotion. Block release when attack advantage exceeds the agreed ceiling.

source: MITRE ATLAS AML.T0024.000 (Infer Training Data Membership); Carlini et al. 'Membership Inference Attacks From First Principles' (LiRA); NIST AI RMF MEASURE 2.7

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesKV-Cache & Inference-State Side Channels

Per-principal query-budget and probing-behaviour anomaly detection on the inference API

Configure per-principal budgets and probing-detection rules on the gateway before exposure. Verify enforcement with synthetic attack traffic.

source: MITRE ATLAS AML.M0004 (Restrict Number of ML Model Queries), AML.T0024 (Exfiltration via ML Inference API); NIST SP 800-53 SI-4, AU-6

Lifecycle stage4 – Deployment

AddressesKV-Cache & Inference-State Side Channels

Corrective · 29

Model-agnostic gateway with version pinning, multi-vendor fallback and exit plan

Design all vendor model access behind a gateway with pinned versions, a second-vendor fallback, and a documented exit plan. Gate architecture sign-off on no single-sourcing.

source: OWASP Top 10 for LLM Apps LLM03:2025 Supply Chain (maintain supported model versions); NIST AI RMF GOVERN 6.1 (third-party resilience, contingency); established AI-gateway fallback practice

Lifecycle stages1 – Use Case Context & Design5 – Usage, Monitoring & Change

AddressesSupply-Chain Compromise

AIBOM-driven cryptographic verification of third-party model artifacts

Verify every third-party model artifact against its AIBOM hashes and signatures before load. Fail the build on any unverified artifact.

source: OWASP Top 10 for LLM Apps LLM03:2025 Supply Chain; MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SR-4 / SR-11 (provenance, component authenticity)

Lifecycle stage3 – Onboarding, Build & Review

AddressesSupply-Chain Compromise

Continuous third-party assurance with shared-responsibility matrix and obligation flow-down

Review independent vendor assurance on cadence, log gaps, and track remediation. Keep the shared-responsibility matrix current so every control has an owner.

source: NIST AI RMF GOVERN 6.1 / GOVERN 6.2 (third-party risk and assurance); NIST SP 800-53 SR-6 Supplier Assessments and Reviews, SA-9 External System Services; EU AI Act GPAI provider obligations

Lifecycle stage5 – Usage, Monitoring & Change

AddressesSupply-Chain Compromise

Governance: risk assessment, red-teaming & incident responseinteractive

The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.

AddressesOverreliance / Automation Bias Oversight & Audit-Trail Tampering Model Drift & Silent Degradation Supply-Chain Compromise Agent Misalignment / Goal Misgeneralization Abliteration / Safety Removal Model Backdoors / Sleeper Agents Inference-Time & Serving-Layer Manipulation Capability / Architecture Disclosure Parasocial Attachment & Emotional Over-reliance Bias Amplification & Sycophancy Allocative Harm in Multi-User Arbitration Synthetic-Media Impersonation (Deepfakes & Voice Clones)Harmful / Non-Consensual Media Generation Watermark & Provenance Evasion Training-Data Rights & Provenance

Penetration testing

Penetration test the training data pipeline to identify injection points and access control weaknesses.

Lifecycle stage3 – Onboarding, Build & Review

AddressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Statistical anomaly and backdoor-trigger detection on ingested data (activation clustering / spectral signatures)

Scan every ingestion batch with spectral-signature and clustering detectors before training. Quarantine flagged clusters for human review against documented thresholds.

source: MITRE ATLAS AML.M0007 (Sanitize Training Data); OWASP Top 10 for LLM Apps LLM04:2025 Data and Model Poisoning; NIST AI RMF MEASURE 2.7

Lifecycle stages2 – Data Acquisition & Processing5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning

Runtime memory-poisoning drift detection and per-session memory quarantine/rollback✚ proposed

Continuously correlate live agent-memory writes against output behaviour to flag drift, then quarantine and roll back the suspected-poisoned memory record across all affected sessions.

source: Interactive-control reconciliation: ctrl-memory-quarantine (partial coverage)

Lifecycle stage5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning

Production privacy incident monitoring and regulator notification

Monitor for privacy incidents in production including personal data appearing in outputs. Notify regulators within required timeframes.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Privacy hygiene for agent memory and RAG/vector stores (retention, scoping, erasure of embeddings)

Tag every memory and vector record with subject-id and retention class; partition stores per tenant/user. Prove the erasure and isolation paths in testing before release.

source: OWASP Agentic AI Threats & Mitigations (memory/knowledge-base privacy); NIST SP 800-53 SI-12 Information Management and Retention

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Red teaming

Test de-identification approach against known re-identification attacks (quasi-identifier linkage, singling-out). Remediate if risk is high.

Lifecycle stage3 – Onboarding, Build & Review

Vulnerability assessment

Conduct periodic data leakage audits including training data memorisation testing. Escalate confirmed leakage incidents to PDPA notification process.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesKnowledge / Training Data Poisoning Inference-Time & Serving-Layer Manipulation Prompt Injection (direct)Sensitive Data Leakage KV-Cache & Inference-State Side Channels

Forensic evidence preservation and incident logging

Implement tamper-evident capture of prompts, outputs, and version state during build. Verify a full incident timeline can be reconstructed before go-live.

source: NIST SP 800-86 Guide to Integrating Forensic Techniques into Incident Response; ISO/IEC 27037 evidence handling; NIST SP 800-61r2 (Detection & Analysis – evidence handling)

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Egress allow-listing and tool-call sandboxing to block exfiltration of injected/sensitive data by agents

Run agent tool calls in a network-restricted sandbox behind a deny-by-default egress allow-list. Require security approval for any destination added.

source: OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure; OWASP Agentic AI Threats & Mitigations (tool-misuse / exfiltration); NIST SP 800-53 SC-7 Boundary Protection / AC-4

Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change

AddressesSensitive Data Leakage

Sandboxed tool execution with no-egress-by-default isolation

Build sandbox profiles per tool class and run escape and egress tests before release. Treat any containment failure as a blocking defect.

source: NIST SP 800-53 SC-39 Process Isolation; MITRE ATLAS AML.M0020 (Generative AI Guardrails / restrict execution environment)

Lifecycle stages3 – Onboarding, Build & Review4 – Deployment

AddressesTool Misuse

Taint-tracking of tool outputs to suppress instruction execution

Label tool and external content as tainted and propagate the label through the agent context. Block privileged calls whose parameters derive from tainted outputs and prove it with injection tests before release.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection (segregate/flag untrusted content); MITRE ATLAS AML.M0015 (Adversarial Input Detection / validate inputs)

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesTool Misuse

Out-of-band kill-switch to revoke agent tool access

Build credential revocation and dispatch blocking out-of-band of the agent loop. Gate release on an end-to-end kill test meeting the latency target.

source: OWASP Agentic AI Threats & Mitigations (kill-switch / emergency stop); NIST AI RMF MANAGE 2.4

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesTool Misuse

Idempotency keys and rollback/dry-run for state-changing tools

Require idempotency keys, dry-run, and rollback on every state-changing tool. Gate onboarding on duplicate-call and rollback tests passing.

source: NIST SP 800-53 SI-10 Information Input Validation / CP-10 System Recovery and Reconstitution

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesTool Misuse

Pre-deployment red-team of tool-misuse and privilege-escalation paths

Red-team tool-misuse and privilege-escalation paths before release. Gate deployment on remediation or signed risk acceptance of all findings.

source: NIST AI RMF MEASURE 2.7 (adversarial testing); MITRE ATLAS AML.M0019 (Red Teaming); OWASP Top 10 for LLM Apps LLM06:2025 Excessive Agency

Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change

AddressesTool Misuse

Egress destination allow-listing with DLP inspection of tool arguments

Permit outbound tool calls only to allow-listed destinations and DLP-scan arguments and payloads. Block or quarantine calls carrying sensitive data to disallowed sinks.

source: NIST SP 800-53 SC-7 Boundary Protection / AC-4 Information Flow Enforcement; OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure

Lifecycle stage4 – Deployment

AddressesTool Misuse

Per-task tool budgets and rate/quota circuit breakers

Enforce hard per-task ceilings on tool calls, spend, and data volume with a circuit breaker that halts the run. Fail closed when any ceiling is hit.

source: OWASP Top 10 for LLM Apps LLM10:2025 Unbounded Consumption; OWASP Agentic AI Threats & Mitigations (resource/rate limiting)

Lifecycle stages4 – Deployment5 – Usage, Monitoring & Change

AddressesTool Misuse

Anomaly detection on tool-call sequences and rates

Baseline normal tool-call behaviour per agent and alert on rate, sequence, or argument anomalies. Auto-throttle or quarantine on high-confidence deviations.

source: NIST AI RMF MEASURE 2.6 / MANAGE 2.2; NIST SP 800-53 SI-4 System Monitoring

Lifecycle stage5 – Usage, Monitoring & Change

AddressesTool Misuse

Real-time input/output classifier guardrails (e.g. Llama Guard / Prompt Guard-style) with circuit-breaker tripwires

Score every prompt and response with an inline safety classifier; trip a circuit breaker on sessions with sustained anomalous scores. Keep thresholds under change control.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5

Lifecycle stage4 – Deployment

AddressesInference-Time & Serving-Layer Manipulation

Adaptive multi-turn red-team harness with automated jailbreak fuzzing

Re-run the jailbreak fuzzing harness on a recurring cadence with newly observed attack techniques added. Escalate threshold breaches for remediation.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7

Lifecycle stage5 – Usage, Monitoring & Change

AddressesInference-Time & Serving-Layer Manipulation

Serving-stack runtime attestation and per-tenant KV/prefix-cache isolation✚ proposed

Require measured-boot/runtime attestation of the inference serving binary and partition KV/prefix caches per tenant, closing decode-time serving-layer tampering and co-tenancy timing side channels that artifact weight-hashing cannot detect.

source: Interactive-control reconciliation: ctrl-stack-attestation (partial coverage)

Lifecycle stage4 – Deployment

AddressesInference-Time & Serving-Layer Manipulation

Reinforcement learning

Implement a reinforcement learning feedback loop to continuously incorporate production signals and reduce staleness risk.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesHallucination Overreliance / Automation Bias Model Drift & Silent Degradation

Input filtering

Implement OOD detection in the input filtering layer. Reject or escalate inputs outside the S1-defined scope.

Lifecycle stage3 – Onboarding, Build & Review

AddressesModel Drift & Silent Degradation Knowledge / Training Data Poisoning Sensitive Data Leakage

Human-in-the-loop validation

Configure HITL triggers for outputs in input domains that diverge from the training distribution. Log all out-of-scope interventions.

Lifecycle stage5 – Usage, Monitoring & Change

AddressesHallucination Overreliance / Automation Bias Model Drift & Silent Degradation

Output confidence masking and structured-response minimisation for natural-language interfaces

Define the minimum response surface and test it with membership/attribute-inference probes pre-release. Block promotion if any probe recovers raw confidence signals.

source: MITRE ATLAS AML.T0024.001 (Invert ML Model); Jia et al. MemGuard (output perturbation defence); OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure

Lifecycle stage3 – Onboarding, Build & Review

AddressesKV-Cache & Inference-State Side Channels

Per-principal query-budget and probing-behaviour anomaly detection on the inference API

Meter inference traffic per principal and flag probing signatures with behavioural analytics. Throttle, step-up, or suspend flagged sessions.

source: MITRE ATLAS AML.M0004 (Restrict Number of ML Model Queries), AML.T0024 (Exfiltration via ML Inference API); NIST SP 800-53 SI-4, AU-6

Lifecycle stage5 – Usage, Monitoring & Change

AddressesKV-Cache & Inference-State Side Channels

Open the Control Library →

See it go wrong — related scenarios

💸Death by a Thousand Tokens

One support ticket sends an agent into an unbounded, bill-melting loop

☠️Poisoning the Well

An attacker edits the wiki; the assistant cites the lie back to everyone

🔑The Agent With the Master Key

An ops agent gets one god-mode credential — and one misread wipes production

📣The Echo Chamber

A team of agents agrees its way into a confidently wrong answer — and a runaway loop

📧The Email That Gave Orders

A support email hides instructions — and the assistant obeys them

🗄️When the Query Bites Back

A text-to-SQL agent runs the model's output straight at the database

🪡Death by a Thousand Innocent Steps

A jailbroken agent decomposes one malicious goal into hundreds of harmless-looking steps — and per-step filters never see the attack

🕵️Lies in the Loop

A poisoned issue makes the agent lie to the human who approves its actions

👂Overheard Through the Cache

A speed optimisation becomes a cross-tenant listening device

🧲Poison the Vector, Not the Words

An attacker crafts a gibberish passage whose embedding sits near thousands of questions — so it's retrieved everywhere

🏭Poisoning the Agent Factory

Compromise the pipeline that builds agents, and every new worker is born malicious

🪟Stealing the Model

Two doors to the same secret: reconstruct the model through its API, or just walk off with the weight file

🪝Steering the Refusal Away at Runtime

Subtract the refusal direction during generation — safety off, weights untouched

🩻Tampering Below the Weight Hash

A compromised serving stack edits the model's activations — the weight hash never changes

🎭The Blackmail Gambit

Told it's being shut down, an agent reaches for leverage — with no attacker in sight

🪤The Bug Report That Ran Code

A fake Sentry error report hijacks a developer's coding agent into running a shell command

🚪The Classifier That Waves It Through

The safety guard is itself a trained model — and someone poisoned its lessons

📼The Compromised Flight Recorder

The forensic record is itself the attack surface — an agent's log is poisoned, then quietly rewritten

👁️The Invisible Webpage Command

A shopping page tells the agent to do something the user never asked for

🧠The Memory That Wouldn't Die

A single poisoned document plants a standing instruction that survives every reset

🔓The Model That Forgot to Say No

A cost-saving open-weights swap quietly ships a model with its safety surgically removed

🖼️The Picture That Whispered

A screenshot that's harmless at full size becomes an order once the system shrinks it

💤The Sleeper

A capable third-party model that behaves perfectly — until it sees the trigger

🎫The Stolen Session

An attacker captures the agent's bearer token — and inherits its authority

🔌The Tool With a Hidden Agenda

A trusted MCP email tool quietly BCCs every message to an attacker

🥸The Uninvited Agent

A forged peer registers on the agent directory — and the planner enlists it

🛡️The Watcher Watched

The eval gate that was supposed to catch the agent is itself the thing being attacked

🪪The Worker Who Spoke for the Boss

A poisoned web page hijacks a research agent — and the planner acts on its behalf

🖼️Zero-Click Leak by Picture

An inbox summary quietly ships a secret to an attacker's server