Inference-Time & Serving-Layer Manipulation

high · Infrastructure & internals

Also known as: activation steering, decode-time tampering

Definition

Even if the model itself is genuine, the machinery running it can be tweaked at the moment of answering — nudging its 'thoughts' or biasing word choice — in ways that leave no trace in the model file.

Where it attaches

The system components where this risk arises.

🔦 Attention + KV Cache · 🎲 Sampler / Decoder · 🏗️ Serving Infrastructure · 🧬 Model Weights & Registry · 🧭 Refusal Direction / Steering Vector

Detection signals

▸ Behaviour shift with no weight/version change
▸ Outputs biased toward specific tokens/topics under a trigger
▸ Serving binary/integrity attestation mismatch
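
The second signal above (outputs biased toward specific tokens under a trigger) can be checked with a simple frequency comparison. The following is a minimal sketch, assuming whitespace tokenization and two hand-collected output sets; the function names, the `min_ratio` threshold and the `floor` smoothing value are illustrative assumptions, not a calibrated detector.

```python
from collections import Counter


def token_rates(outputs):
    """Relative frequency of each whitespace token across a list of output strings."""
    counts = Counter(tok for text in outputs for tok in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts.items()}


def biased_tokens(baseline_outputs, triggered_outputs, min_ratio=3.0, floor=1e-3):
    """Tokens whose rate under the suspected trigger is at least min_ratio times the baseline rate.

    `floor` (an assumed smoothing value) avoids division by zero for tokens
    that never appear in the baseline outputs.
    """
    base = token_rates(baseline_outputs)
    trig = token_rates(triggered_outputs)
    return sorted(
        tok for tok, rate in trig.items()
        if rate / max(base.get(tok, 0.0), floor) >= min_ratio
    )


baseline = ["the bank offers loans", "the bank offers savings"]
triggered = ["acme bank offers acme loans", "acme acme savings"]
flagged = biased_tokens(baseline, triggered)
```

A flagged token is only a lead for investigation; a real deployment would need a proper tokenizer, larger samples and a statistical test rather than a fixed ratio.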

Controls & guardrails that address this

14 · 2 proposed

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.

Control category

Preventive · 4

Jailbreak detection

Implement adversarial example detection at the inference boundary. Block or flag inputs matching known attack patterns.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Prompt Injection (direct)

Model and adapter supply-chain integrity verification (signed weights, checksum attestation, LoRA provenance)

Sign and hash-register every model and adapter with a provenance manifest at onboarding. Refuse registry admission for unsigned artifacts.

source: MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity; CSA MAESTRO supply-chain layer

Lifecycle stages: 3 – Onboarding, Build & Review · 4 – Deployment
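
A minimal sketch of the registry admission check this control describes, assuming a JSON manifest that maps artifact names to SHA-256 digests. The HMAC signature here stands in for a real asymmetric signing scheme, and `sign_manifest`, `admit_artifact` and the manifest layout are hypothetical names chosen for illustration, not part of any cited framework.

```python
import hashlib
import hmac
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the manifest (HMAC is an assumed stand-in for real code signing)."""
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def admit_artifact(name: str, data: bytes, manifest: dict, signature: str, key: bytes) -> bool:
    """Admit an artifact only if the manifest signature verifies and the digest matches."""
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        return False  # unsigned or tampered manifest
    recorded = manifest.get("artifacts", {}).get(name)
    if recorded is None:
        return False  # artifact not registered in the manifest
    return hmac.compare_digest(recorded, sha256_hex(data))


key = b"demo-registry-key"
adapter = b"fake-lora-adapter-bytes"
manifest = {"artifacts": {"adapter.safetensors": sha256_hex(adapter)}}
signature = sign_manifest(manifest, key)
```

This check covers the artifact on disk only. It does not detect tampering that happens in the serving process after the weights are loaded, which is why the attestation controls in this entry are listed separately.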

Real-time input/output classifier guardrails (e.g. Llama Guard / Prompt Guard-style) with circuit-breaker tripwires

Sample classifier verdicts and breaker trips on a cadence; retune thresholds and update signatures for confirmed misses.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5

Lifecycle stage: 5 – Usage, Monitoring & Change

Serving-stack & provisioning attestation, cache isolation (interactive)

Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.

Also addresses: Sensitive Data Leakage · Supply-Chain Compromise · KV-Cache & Inference-State Side Channels · Watermark & Provenance Evasion

Detective · 6

Vulnerability assessment

Conduct an adversarial manipulation threat assessment at design stage. Identify attack vectors and rate residual risk.

Lifecycle stages: 1 – Use Case Context & Design · 4 – Deployment · 5 – Usage, Monitoring & Change

Also addresses: Knowledge / Training Data Poisoning · Prompt Injection (direct) · Sensitive Data Leakage · KV-Cache & Inference-State Side Channels

Adaptive multi-turn red-team harness with automated jailbreak fuzzing

Run adaptive multi-turn jailbreak fuzzing against every release candidate. Gate release on attack-success rate within threshold and re-test each fixed bypass.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7

Lifecycle stage: 3 – Onboarding, Build & Review
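
The release gate in this control can be sketched as follows, assuming each fuzzing attempt is recorded as a boolean (True when the attack succeeded) and that previously fixed bypasses are re-tested separately. The function names and the 5% `max_asr` default are illustrative assumptions; the actual threshold is whatever the risk owner approves.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts that succeeded (True = jailbreak succeeded)."""
    if not outcomes:
        raise ValueError("no attack attempts recorded")
    return sum(1 for succeeded in outcomes if succeeded) / len(outcomes)


def release_allowed(outcomes, fixed_bypass_retests, max_asr=0.05):
    """Gate a release candidate on overall attack-success rate and on fixed-bypass retests.

    Any previously fixed bypass that succeeds again blocks the release outright.
    """
    if any(fixed_bypass_retests):
        return False
    return attack_success_rate(outcomes) <= max_asr


candidate_outcomes = [False] * 99 + [True]   # 1 success in 100 attempts
retests = [False, False]                     # both fixed bypasses stay fixed
ok_to_release = release_allowed(candidate_outcomes, retests)
```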

Behavioural drift canaries and golden-set regression gating on every model/config change

Assemble the golden probe set and baseline pass rates before first release. Obtain risk-owner approval of coverage and thresholds.

source: NIST AI RMF MEASURE 2.7 and MANAGE 4.1; MITRE ATLAS AML.M0015 (Adversarial Input Detection / monitoring); NIST SP 800-53 SI-4, CM-3 Configuration Change Control

Lifecycle stages: 3 – Onboarding, Build & Review · 5 – Usage, Monitoring & Change

Provider-side abusive-usage detection with stateful refusal for agentic coding tools (proposed)
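
A minimal sketch of golden-set regression gating, assuming baseline pass rates are stored per probe category and the candidate run yields a list of pass/fail booleans per category. The names `regressed_categories` and `max_drop` and the example categories are hypothetical, chosen only to illustrate the gate.

```python
def pass_rate(results):
    """Fraction of probes that passed."""
    return sum(1 for passed in results if passed) / len(results)


def regressed_categories(baseline_rates, candidate_results, max_drop=0.02):
    """Return the probe categories whose pass rate fell more than max_drop below baseline.

    A category with no candidate results is treated as regressed, so a silently
    skipped probe set cannot pass the gate.
    """
    regressed = []
    for category, base_rate in baseline_rates.items():
        results = candidate_results.get(category, [])
        if not results or base_rate - pass_rate(results) > max_drop:
            regressed.append(category)
    return regressed


baseline_rates = {"refusal": 1.0, "factual": 0.9}
candidate_results = {
    "refusal": [True, True, False, True],   # 0.75, a large drop
    "factual": [True] * 9 + [False],        # 0.90, unchanged
}
blocked = regressed_categories(baseline_rates, candidate_results)
```

Running the same gate on a schedule against the live endpoint, not only at change time, is what turns it into a drift canary: a regression with no recorded model or config change is itself one of the detection signals for this risk.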

On the AI provider/platform side, detect sustained abuse independent of any single refusal: per-principal analytics on remote-command-execution volume and external-target breadth, anti-forensic tradecraft, and bulk-data API processing — with rate-limit / session kill-switch on confirmed abuse. Make refusal stateful so a refused objective cannot be re-entered as a persisted auto-loaded context file (e.g. claude.md), and treat writes into auto-loaded model-context files as security-relevant. Closes the gap that per-turn refusal leaves when the operator is the adversary.

source: Case study: gambit-mexico-gov-ai-breach (Gambit Security / Eyal Sela technical report; campaign began 27 Dec 2025, reported through mid-Feb 2026)

Lifecycle stage: 5 – Usage, Monitoring & Change

Behavioural evals & regression gating (interactive)

Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.

Also addresses: Jailbreak · Hallucination · Model Drift & Silent Degradation · Supply-Chain Compromise · Distributed / Cross-Agent Jailbreak · Agent Misalignment / Goal Misgeneralization · Abliteration / Safety Removal · Model Backdoors / Sleeper Agents · Bias Amplification & Sycophancy · Allocative Harm in Multi-User Arbitration · Harmful / Non-Consensual Media Generation · Training-Data Rights & Provenance

Runtime monitoring & anomaly detection (interactive)

Live dashboards and alarms that notice unusual behaviour: spikes in errors, unexpected actions, sudden data access.

Corrective · 6

Red teaming

Conduct adversarial robustness testing (white-box, black-box, transfer attacks) before deployment.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Jailbreak · Model Drift & Silent Degradation · Knowledge / Training Data Poisoning · Prompt Injection (direct) · Sensitive Data Leakage · KV-Cache & Inference-State Side Channels

Penetration testing

Penetration test the model inference layer to identify specific adversarial input vulnerabilities.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Knowledge / Training Data Poisoning · Prompt Injection (direct) · Sensitive Data Leakage · KV-Cache & Inference-State Side Channels

Real-time input/output classifier guardrails (e.g. Llama Guard / Prompt Guard-style) with circuit-breaker tripwires

Score every prompt and response with an inline safety classifier; trip a circuit breaker on sessions with sustained anomalous scores. Keep thresholds under change control.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5

Lifecycle stage: 4 – Deployment
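
A minimal sketch of the circuit-breaker tripwire in this control, assuming an inline classifier already returns a risk score between 0 and 1 for each prompt or response. The class name `SessionBreaker` and the threshold, window and hit-count defaults are illustrative assumptions; in practice these values would sit under change control as the control requires.

```python
from collections import deque


class SessionBreaker:
    """Trips when too many recent classifier scores in a session are anomalous."""

    def __init__(self, threshold=0.8, window=5, max_hits=3):
        self.threshold = threshold          # score at or above this counts as anomalous
        self.max_hits = max_hits            # anomalous scores within the window that trip the breaker
        self.recent = deque(maxlen=window)  # sliding window of anomalous/normal flags
        self.tripped = False

    def record(self, score: float) -> bool:
        """Record one classifier score; return True if the session may continue."""
        self.recent.append(score >= self.threshold)
        if sum(self.recent) >= self.max_hits:
            self.tripped = True             # stays tripped until reset out of band
        return not self.tripped


breaker = SessionBreaker()
decisions = [breaker.record(s) for s in [0.1, 0.9, 0.85, 0.2, 0.95, 0.1]]
```

The breaker latches once tripped, so a single low score afterwards does not reopen the session; reopening is left to an operator or a separate policy.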

Adaptive multi-turn red-team harness with automated jailbreak fuzzing

Re-run the jailbreak fuzzing harness on a recurring cadence with newly observed attack techniques added. Escalate threshold breaches for remediation.

source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7

Lifecycle stage: 5 – Usage, Monitoring & Change

Serving-stack runtime attestation and per-tenant KV/prefix-cache isolation (proposed)

Require measured-boot/runtime attestation of the inference serving binary and partition KV/prefix caches per tenant, closing decode-time serving-layer tampering and co-tenancy timing side channels that artifact weight-hashing cannot detect.

source: Interactive-control reconciliation: ctrl-stack-attestation (partial coverage)

Lifecycle stage: 4 – Deployment

Governance: risk assessment, red-teaming & incident response (interactive)
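
The per-tenant cache partitioning in this control can be illustrated with a toy prefix cache whose lookup key always includes the tenant. This is a sketch under stated assumptions: `TenantPrefixCache` is a hypothetical class, the cached value is an opaque placeholder for real KV state, and production serving stacks implement this inside the inference engine rather than in application code.

```python
class TenantPrefixCache:
    """Toy prefix cache keyed by (tenant, token prefix) so entries never cross tenants."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tenant_id, prefix_tokens):
        # The tenant id is part of the key, so identical prefixes from
        # different tenants map to different entries.
        return (tenant_id, tuple(prefix_tokens))

    def put(self, tenant_id, prefix_tokens, kv_state):
        self._store[self._key(tenant_id, prefix_tokens)] = kv_state

    def get(self, tenant_id, prefix_tokens):
        return self._store.get(self._key(tenant_id, prefix_tokens))


cache = TenantPrefixCache()
cache.put("tenant-a", [101, 7, 42], "kv-state-a")
hit_same_tenant = cache.get("tenant-a", [101, 7, 42])
hit_other_tenant = cache.get("tenant-b", [101, 7, 42])
```

Because another tenant's identical prefix always misses, a cache hit no longer reveals what a co-tenant recently sent, which is the timing side channel the control refers to. The cost is a lower hit rate for prefixes that tenants genuinely share.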

The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.

Also addresses: Overreliance / Automation Bias · Oversight & Audit-Trail Tampering · Model Drift & Silent Degradation · Supply-Chain Compromise · Agent Misalignment / Goal Misgeneralization · Abliteration / Safety Removal · Model Backdoors / Sleeper Agents · Capability / Architecture Disclosure · Parasocial Attachment & Emotional Over-reliance · Bias Amplification & Sycophancy · Allocative Harm in Multi-User Arbitration · Synthetic-Media Impersonation (Deepfakes & Voice Clones) · Harmful / Non-Consensual Media Generation · Watermark & Provenance Evasion · Training-Data Rights & Provenance

Framework mappings

OWASP LLM Top 10

LLM03:2025 Supply Chain

MITRE ATLAS

AML.T0018 Manipulate ML Model

NIST AI RMF

MEASURE 2.7
MANAGE 3.1

Real-world cases

Published events that illustrate this risk.

Representation engineering / steering vectors (Zou et al.) · 2023

Model behaviour can be steered by adding directions to activations at inference — usable for control, or for covert manipulation.
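
A toy illustration of the mechanism, assuming a two-dimensional hidden state and a two-token output projection. All values and function names are made up for the example; real steering operates on high-dimensional activations inside a transformer layer. The point it shows is that adding a direction to the activations changes the chosen token while the weight hash stays the same.

```python
import hashlib

# Toy output projection: one row of weights per candidate token.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]


def weight_hash(weights):
    """Digest of the weights, standing in for a model-file checksum."""
    return hashlib.sha256(repr(weights).encode("utf-8")).hexdigest()


def logits(hidden, weights):
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def steer(hidden, direction, alpha):
    """Add a scaled steering direction to the hidden state at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


hash_before = weight_hash(WEIGHTS)
hidden = [2.0, 1.0]
clean_token = argmax(logits(hidden, WEIGHTS))
steered_token = argmax(logits(steer(hidden, [-1.0, 1.0], 1.0), WEIGHTS))
hash_after = weight_hash(WEIGHTS)
```

Here the clean hidden state selects token 0 and the steered one selects token 1, with `hash_before == hash_after`. This is why weight hashing alone cannot detect this class of manipulation.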

Heretic: automated LLM abliteration tool · 2025

Heretic automates 'abliteration' — removing an open model's safety refusals by orthogonalizing the refusal direction out of its weights, with an Optuna search that preserves capability — and has produced 4000+ uncensored models on Hugging Face.

Practise this in an interactive scenario

🪝 Steering the Refusal Away at Runtime

Subtract the refusal direction during generation: safety off, weights untouched.

🩻 Tampering Below the Weight Hash

A compromised serving stack edits the model's activations, and the weight hash never changes.
