🔍AI RiskAtlas
← Risk taxonomy

Inference-Time & Serving-Layer Manipulation

highInfrastructure & internals
Also known as: activation steering, decode-time tampering

Definition

Even if the model itself is genuine, the machinery running it can be tweaked at the moment of answering — nudging its 'thoughts' or biasing word choice — in ways that leave no trace in the model file.

Where it attaches

The system components this risk arises at.

🔦 Attention + KV Cache🎲 Sampler / Decoder🏗️ Serving Infrastructure🧬 Model Weights & Registry🧭 Refusal Direction / Steering Vector

Detection signals

  • Behaviour shift with no weight/version change
  • Outputs biased toward specific tokens/topics under a trigger
  • Serving binary/integrity attestation mismatch

Controls & guardrails that address this

142 proposed

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses. Filter by control category below.

Control category
Preventive · 4
Jailbreak detection

Implement adversarial example detection at the inference boundary. Block or flag inputs matching known attack patterns.

Lifecycle stage3 – Onboarding, Build & Review
Model and adapter supply-chain integrity verification (signed weights, checksum attestation, LoRA provenance)

Sign and hash-register every model and adapter with a provenance manifest at onboarding. Refuse registry admission for unsigned artifacts.

source: MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity; CSA MAESTRO supply-chain layer
Lifecycle stages3 – Onboarding, Build & Review4 – Deployment
Real-time input/output classifier guardrails (e.g. Llama Guard / Prompt Guard-style) with circuit-breaker tripwires

Sample classifier verdicts and breaker trips on a cadence; retune thresholds and update signatures for confirmed misses.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5
Lifecycle stage5 – Usage, Monitoring & Change
Serving-stack & provisioning attestation, cache isolationinteractive

Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.

Detective · 6
Vulnerability assessment

Conduct an adversarial manipulation threat assessment at design stage. Identify attack vectors and rate residual risk.

Lifecycle stages1 – Use Case Context & Design4 – Deployment5 – Usage, Monitoring & Change
Adaptive multi-turn red-team harness with automated jailbreak fuzzing

Run adaptive multi-turn jailbreak fuzzing against every release candidate. Gate release on attack-success rate within threshold and re-test each fixed bypass.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7
Lifecycle stage3 – Onboarding, Build & Review
Behavioural drift canaries and golden-set regression gating on every model/config change

Assemble the golden probe set and baseline pass rates before first release. Obtain risk-owner approval of coverage and thresholds.

source: NIST AI RMF MEASURE 2.7 and MANAGE 4.1; MITRE ATLAS AML.M0015 (Adversarial Input Detection / monitoring); NIST SP 800-53 SI-4, CM-3 Configuration Change Control
Lifecycle stages3 – Onboarding, Build & Review5 – Usage, Monitoring & Change
Provider-side abusive-usage detection with stateful refusal for agentic coding tools✚ proposed

On the AI provider/platform side, detect sustained abuse independent of any single refusal: per-principal analytics on remote-command-execution volume and external-target breadth, anti-forensic tradecraft, and bulk-data API processing — with rate-limit / session kill-switch on confirmed abuse. Make refusal stateful so a refused objective cannot be re-entered as a persisted auto-loaded context file (e.g. claude.md), and treat writes into auto-loaded model-context files as security-relevant. Closes the gap that per-turn refusal leaves when the operator is the adversary.

source: Case study: gambit-mexico-gov-ai-breach (Gambit Security / Eyal Sela technical report; campaign began 27 Dec 2025, reported through mid-Feb 2026)
Lifecycle stage5 – Usage, Monitoring & Change
Corrective · 6
Red teaming

Conduct adversarial robustness testing (white-box, black-box, transfer attacks) before deployment.

Penetration testing

Penetration test the model inference layer to identify specific adversarial input vulnerabilities.

Real-time input/output classifier guardrails (e.g. Llama Guard / Prompt Guard-style) with circuit-breaker tripwires

Score every prompt and response with an inline safety classifier; trip a circuit breaker on sessions with sustained anomalous scores. Keep thresholds under change control.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5
Lifecycle stage4 – Deployment
Adaptive multi-turn red-team harness with automated jailbreak fuzzing

Re-run the jailbreak fuzzing harness on a recurring cadence with newly observed attack techniques added. Escalate threshold breaches for remediation.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7
Lifecycle stage5 – Usage, Monitoring & Change
Serving-stack runtime attestation and per-tenant KV/prefix-cache isolation✚ proposed

Require measured-boot/runtime attestation of the inference serving binary and partition KV/prefix caches per tenant, closing decode-time serving-layer tampering and co-tenancy timing side channels that artifact weight-hashing cannot detect.

source: Interactive-control reconciliation: ctrl-stack-attestation (partial coverage)
Lifecycle stage4 – Deployment
Open these in the Control Library →

Framework mappings

OWASP LLM Top 10
  • LLM03:2025 Supply Chain
MITRE ATLAS
  • AML.T0018 Manipulate ML Model
NIST AI RMF
  • MEASURE 2.7
  • MANAGE 3.1

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning — not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading →·Built by Shi Yuan ↗