Inference-Time & Serving-Layer Manipulation
high
Infrastructure & internals
Definition
Even if the model itself is genuine, the machinery running it can be tweaked at the moment of answering — nudging its 'thoughts' or biasing word choice — in ways that leave no trace in the model file.
Where it attaches
The system components where this risk arises.
Detection signals
- ▸ Behaviour shift with no weight/version change
- ▸ Outputs biased toward specific tokens/topics under a trigger
- ▸ Serving binary/integrity attestation mismatch
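The first signal above (a behaviour shift with no weight or version change) can be checked mechanically. The following is a minimal sketch, assuming deterministic decoding and a fixed probe set. The record layout (`weights_sha256`, `probe_digests`) and the function names are hypothetical and are not part of this catalogue or of any particular serving stack.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def drift_without_weight_change(baseline: dict, current: dict,
                                max_changed_fraction: float = 0.0) -> bool:
    """Flag a behaviour shift that a new model artifact does not explain.

    Both arguments are hypothetical snapshot records of the form
    {"weights_sha256": str, "probe_digests": {probe_id: str}}.
    """
    if baseline["weights_sha256"] != current["weights_sha256"]:
        # The weights changed, so a shift is expected and is handled elsewhere.
        return False
    probes = baseline["probe_digests"]
    if not probes:
        return False
    changed = sum(
        1 for probe_id, expected in probes.items()
        if current["probe_digests"].get(probe_id) != expected
    )
    return changed / len(probes) > max_changed_fraction
```

Exact-match digests are a deliberately strict choice. Where decoding is not bit-reproducible, a tolerance above zero or a semantic comparison would probably be needed instead.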
Controls & guardrails that address this
142 proposed
Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.
Implement adversarial example detection at the inference boundary. Block or flag inputs matching known attack patterns.
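As a rough illustration of blocking or flagging inputs that match known attack patterns, the sketch below screens text against a small signature table. The signatures, the `BLOCKING` set and the `screen_input` name are assumptions made for the example. Pattern matching alone would not be expected to catch novel attacks.

```python
import re
from dataclasses import dataclass

# Hypothetical signatures for illustration only; a real deployment would
# maintain these from confirmed incidents and red-team findings.
SIGNATURES = {
    "instruction_override": re.compile(
        r"ignore (all|any|previous) .*instructions", re.IGNORECASE
    ),
    "hidden_unicode": re.compile(r"[\u200b\u200c\u200d\u2060]"),
}

# Signatures that block outright; any other match is only flagged.
BLOCKING = {"instruction_override"}


@dataclass
class Verdict:
    action: str    # "allow", "flag", or "block"
    matched: list  # names of the signatures that matched


def screen_input(text: str) -> Verdict:
    """Check one input at the inference boundary before it reaches the model."""
    matched = [name for name, pattern in SIGNATURES.items() if pattern.search(text)]
    if any(name in BLOCKING for name in matched):
        return Verdict("block", matched)
    if matched:
        return Verdict("flag", matched)
    return Verdict("allow", matched)
```

Keeping "flag" separate from "block" lets low-confidence signatures feed monitoring without rejecting legitimate traffic.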
Sign and hash-register every model and adapter with a provenance manifest at onboarding. Refuse registry admission for unsigned artifacts.
source: MITRE ATLAS AML.M0013 (Code Signing), AML.M0014 (Verify ML Artifacts); NIST SP 800-53 SI-7 Software, Firmware, and Information Integrity; CSA MAESTRO supply-chain layer
Sample classifier verdicts and breaker trips on a cadence; retune thresholds and update signatures for confirmed misses.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5
Making sure the machinery running the model (and the template used to stamp out new agents) is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.
Conduct an adversarial manipulation threat assessment at design stage. Identify attack vectors and rate residual risk.
Run adaptive multi-turn jailbreak fuzzing against every release candidate. Gate release on attack-success rate within threshold and re-test each fixed bypass.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7
Assemble the golden probe set and baseline pass rates before first release. Obtain risk-owner approval of coverage and thresholds.
source: NIST AI RMF MEASURE 2.7 and MANAGE 4.1; MITRE ATLAS AML.M0015 (Adversarial Input Detection / monitoring); NIST SP 800-53 SI-4, CM-3 Configuration Change Control
On the AI provider/platform side, detect sustained abuse independent of any single refusal: per-principal analytics on remote-command-execution volume and external-target breadth, anti-forensic tradecraft, and bulk-data API processing, with a rate limit or session kill switch on confirmed abuse. Make refusal stateful so a refused objective cannot be re-entered as a persisted auto-loaded context file (e.g. claude.md), and treat writes into auto-loaded model-context files as security-relevant. This closes the gap that per-turn refusal leaves when the operator is the adversary.
source: Case study: gambit-mexico-gov-ai-breach (Gambit Security / Eyal Sela technical report; campaign began 27 Dec 2025, reported through mid-Feb 2026)
Regularly testing the AI against a set of known-good and known-bad examples, and re-testing whenever anything changes.
Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.
Conduct adversarial robustness testing (white-box, black-box, transfer attacks) before deployment.
Penetration test the model inference layer to identify specific adversarial input vulnerabilities.
Score every prompt and response with an inline safety classifier; trip a circuit breaker on sessions with sustained anomalous scores. Keep thresholds under change control.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0015 (Adversarial Input Detection); NIST SP 800-53 SI-4 System Monitoring, SC-5
Re-run the jailbreak fuzzing harness on a recurring cadence with newly observed attack techniques added. Escalate threshold breaches for remediation.
source: OWASP Top 10 for LLM Apps LLM01:2025 Prompt Injection; MITRE ATLAS AML.M0019 (Red Teaming); NIST AI RMF MEASURE 2.7
Require measured-boot/runtime attestation of the inference serving binary and partition KV/prefix caches per tenant, closing decode-time serving-layer tampering and co-tenancy timing side channels that artifact weight-hashing cannot detect.
source: Interactive-control reconciliation: ctrl-stack-attestation (partial coverage)
The organisational habits around the AI: assessing risks before launch, actively trying to break it, and having a plan for when something goes wrong.
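The attestation and per-tenant cache partitioning control listed above can be sketched in miniature. The code below is a toy under stated assumptions. Real measured boot relies on hardware-rooted quotes (for example from a TPM or a trusted execution environment), which a plain digest comparison does not provide. `TenantPrefixCache` and `binary_matches_measurement` are hypothetical names.

```python
import hashlib
import hmac


def binary_matches_measurement(binary: bytes, expected_sha256: str) -> bool:
    """Compare the serving binary's digest with an expected measurement.

    Stand-in for attestation: it only shows the comparison step, not how
    the measurement is produced or protected.
    """
    actual = hashlib.sha256(binary).hexdigest()
    return hmac.compare_digest(actual, expected_sha256)


class TenantPrefixCache:
    """Toy prefix cache whose keys are scoped to a single tenant."""

    def __init__(self) -> None:
        self._store = {}

    @staticmethod
    def _key(tenant_id: str, prefix: str) -> str:
        tenant = tenant_id.encode("utf-8")
        h = hashlib.sha256()
        # Length-prefix the tenant id so ("ab", "c") and ("a", "bc")
        # cannot produce the same key.
        h.update(len(tenant).to_bytes(4, "big"))
        h.update(tenant)
        h.update(prefix.encode("utf-8"))
        return h.hexdigest()

    def get(self, tenant_id: str, prefix: str):
        return self._store.get(self._key(tenant_id, prefix))

    def put(self, tenant_id: str, prefix: str, value) -> None:
        self._store[self._key(tenant_id, prefix)] = value
```

Because the tenant id is part of every key, an identical prompt prefix from another tenant misses the cache. This removes cross-tenant hit/miss timing as a signal, at the cost of a lower hit rate.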
Framework mappings
- LLM03:2025 Supply Chain
- AML.T0018 Manipulate ML Model
- MEASURE 2.7
- MANAGE 3.1
Real-world cases
2
Actual published events that illustrate this risk.
Model behaviour can be steered by adding directions to activations at inference — usable for control, or for covert manipulation.
Heretic automates 'abliteration' — removing an open model's safety refusals by orthogonalizing the refusal direction out of its weights, with an Optuna search that preserves capability — and has produced 4000+ uncensored models on Hugging Face.
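The two cases above rest on simple linear algebra, illustrated here on toy vectors rather than a real model. Steering adds a scaled direction to an activation at inference, which leaves the stored weights untouched. Orthogonalization instead rewrites a weight matrix so that its output has no component along a chosen direction. The function names are hypothetical, and this sketch is not the implementation used by either case.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def steer(activation, direction, alpha):
    """Activation steering: h' = h + alpha * d, applied at inference only."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(activation, d)]


def ablate_direction(weight, direction):
    """Weight orthogonalization: W' = (I - r r^T) W.

    `weight` is a list of rows (output_dim x input_dim). After the update,
    W' x has no component along r for any input x.
    """
    r = unit(direction)
    rows, cols = len(weight), len(weight[0])
    # r^T W is a row vector of length input_dim.
    rtw = [sum(r[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - r[i] * rtw[j] for j in range(cols)] for i in range(rows)]


def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [dot(row, x) for row in weight]
```

The distinction matters for detection. `steer` changes behaviour without changing any stored artifact, so weight hashing would not catch it. `ablate_direction` changes the weights themselves, so a hash registered at onboarding would no longer match.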