KV-Cache & Inference-State Side Channels

medium · Infrastructure & internals

Also known as: prefix-cache leakage, cross-tenant cache attacks

Definition

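To go faster, servers reuse cached work (the attention KV cache) between users whose prompts share the same opening text. That shortcut can leak clues: a request that hits the cache returns measurably sooner, so timing differences can reveal what someone else's prompt contained.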
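
The leak can be illustrated with a toy model. This is a minimal sketch, not a real serving stack: `ToyPrefixCache` is a hypothetical class, and "cost" is counted in tokens computed from scratch as a stand-in for response time. The example prompts are invented.

```python
class ToyPrefixCache:
    """Toy shared prefix cache; cost is counted in tokens, not seconds."""

    def __init__(self):
        self._prefixes = set()  # every token prefix seen so far, from any user

    def process(self, tokens):
        """Return how many tokens had to be computed from scratch."""
        tokens = tuple(tokens)
        # Find the longest prefix of this request that is already cached.
        reused = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._prefixes:
                reused = n
                break
        # Store every prefix of this request for later requests to reuse.
        for n in range(1, len(tokens) + 1):
            self._prefixes.add(tokens[:n])
        return len(tokens) - reused


cache = ToyPrefixCache()
cache.process("my account number is 4417".split())  # victim warms the cache

# The attacker probes two guesses and compares the work each one needed.
cost_right = cache.process("my account number is".split())   # 0 tokens computed
cost_wrong = cache.process("my favourite colour is".split())  # 3 tokens computed
```

The correct guess costs less than the wrong one, and that gap is the signal an attacker would read from response timing. A real server still does some work on a full cache hit, so the difference is smaller and noisier than in this toy.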

Where it attaches

The system components where this risk arises.

🔦 Attention + KV Cache · 🏗️ Serving Infrastructure · 🪟 Context Window · 🎲 Sampler / Decoder

Detection signals

▸ Measurable timing differences correlated with cache hits
▸ State unexpectedly shared across sessions/tenants
▸ Anomalous latency patterns under adversarial probing
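
The first signal can be checked from serving logs. The sketch below assumes each log record carries a latency in milliseconds and a cache-hit flag; the record shape, the function names, and the 5 ms threshold are all hypothetical and would need tuning per deployment.

```python
from statistics import median


def cache_timing_gap(records):
    """records: iterable of (latency_ms, cache_hit) pairs from serving logs."""
    records = list(records)
    hits = [lat for lat, hit in records if hit]
    misses = [lat for lat, hit in records if not hit]
    if not hits or not misses:
        return None  # not enough data to compare the two groups
    # Medians are less sensitive to stragglers than means.
    return median(misses) - median(hits)


def is_observable(gap_ms, threshold_ms=5.0):
    """True when the hit/miss gap is large enough for a client to notice."""
    return gap_ms is not None and gap_ms > threshold_ms


sample = [(120.0, False), (118.0, False), (40.0, True), (42.0, True), (125.0, False)]
gap = cache_timing_gap(sample)  # 120.0 - 41.0 = 79.0
```

A gap that stays well above network jitter suggests the cache state is visible from outside the service.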

Controls & guardrails that address this

Grouped by control function, with the AI lifecycle stage(s) to apply each and the other risks it addresses.

Preventive · 5

Role-based access controls

Design query rate limiting and RBAC for the model inference API at design stage to limit attack surface.

Lifecycle stages: 1 – Use Case Context & Design; 4 – Deployment

Also addresses: Knowledge / Training Data Poisoning; Prompt Injection (direct); Sensitive Data Leakage

Input/output filtering

Implement query pattern detection to identify systematic inference attack behaviour (high-volume queries, membership probing).

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Bias Amplification & Sycophancy; Overreliance / Automation Bias; Sensitive Data Leakage

Calibrated differential-privacy training budget with documented epsilon ceiling and per-individual contribution clipping

Train PII-bearing models with DP-SGD under a documented epsilon/delta budget. Approve the budget against the enterprise epsilon-ceiling policy before training.

source: NIST SP 800-226 Guidelines for Evaluating Differential Privacy Guarantees; Abadi et al. 'Deep Learning with Differential Privacy' (DP-SGD); MITRE ATLAS AML.M0007 (Sanitize Training Data)

Lifecycle stages: 2 – Data Acquisition & Processing; 3 – Onboarding, Build & Review
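
The clipping and noising step of DP-SGD can be sketched as follows. This is a minimal illustration in plain Python, assuming per-example gradients are already available as lists of floats. The function name and parameters are hypothetical, and the epsilon/delta accounting that the budget relies on is a separate step not shown here.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng=random):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


grads = [[3.0, 4.0], [0.3, 0.4]]
# Noise disabled here only to make the effect of clipping visible.
clipped_only = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0)
# With noise enabled, as it would be in training (values vary per run).
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1)
```

Clipping bounds how much any one record can move the model, which is what makes the added noise meaningful for privacy.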

Output confidence masking and structured-response minimisation for natural-language interfaces

Strip raw logits, quantise confidence scores and block training-record echoes at the inference gateway. Keep the output-filter policy under change control.

source: MITRE ATLAS AML.T0024.001 (Invert ML Model); Jia et al. MemGuard (output perturbation defence); OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure

Lifecycle stage: 4 – Deployment
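
A gateway-side filter of this kind might look like the sketch below. The response field names (`text`, `logits`, `confidence`), the allow-list, and the 0.25 bucket size are assumptions for illustration, not the schema of any particular inference API.

```python
def minimise_response(raw, step=0.25):
    """Return only allow-listed fields, with confidence rounded to coarse buckets."""
    allowed = {"text"}  # everything else, including raw logits, is dropped
    out = {key: value for key, value in raw.items() if key in allowed}
    if "confidence" in raw:
        # Snap the score to the nearest multiple of `step`, kept within [0, 1].
        buckets = round(raw["confidence"] / step)
        out["confidence"] = min(1.0, max(0.0, buckets * step))
    return out


raw = {"text": "ok", "logits": [1.2, -0.3], "confidence": 0.8731}
safe = minimise_response(raw)  # {"text": "ok", "confidence": 0.75}
```

Coarse buckets leave the client a usable signal while removing the fine-grained values that inference attacks depend on.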

Serving-stack & provisioning attestation, cache isolation

Making sure the machinery running the model — and the template used to stamp out new agents — is the real, unmodified version, and that one user's data can't leak into another's through shared shortcuts.

Also addresses: Sensitive Data Leakage; Supply-Chain Compromise; Inference-Time & Serving-Layer Manipulation; Watermark & Provenance Evasion
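
One way to approach cache isolation is to scope cache keys to a tenant, so that identical prefixes from different tenants never share an entry. The sketch below assumes the serving layer looks up cached state by a string key and that each tenant has its own secret. `prefix_cache_key` is a hypothetical helper, not an API of any serving framework.

```python
import hashlib
import hmac


def prefix_cache_key(tenant_secret: bytes, token_ids) -> str:
    """Derive a cache key that can only match within a single tenant."""
    payload = ",".join(str(t) for t in token_ids).encode("utf-8")
    # Keying the hash with a per-tenant secret separates tenants' key spaces.
    return hmac.new(tenant_secret, payload, hashlib.sha256).hexdigest()


prefix = [101, 2023, 2003, 1037]
key_a = prefix_cache_key(b"tenant-a-secret", prefix)
key_b = prefix_cache_key(b"tenant-b-secret", prefix)
# key_a != key_b, so tenant B's request cannot hit tenant A's cached prefix.
```

The trade-off is a lower cache hit rate, since shared prefixes are reused only within a tenant.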

Detective · 5

Penetration testing

Penetration test the model inference API to identify exploitable access control weaknesses and rate limiting bypass vectors.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Knowledge / Training Data Poisoning; Inference-Time & Serving-Layer Manipulation; Prompt Injection (direct); Sensitive Data Leakage

Vulnerability assessment

Conduct periodic inference attack vulnerability assessments as new attack methods emerge. Monitor query pattern anomalies.

Lifecycle stage: 5 – Usage, Monitoring & Change

Also addresses: Knowledge / Training Data Poisoning; Inference-Time & Serving-Layer Manipulation; Prompt Injection (direct); Sensitive Data Leakage

Privacy attack red-team battery with quantified MIA/attribute-inference success ceiling as a release gate

Attack each candidate model with membership-, attribute-, and inversion-inference harnesses before promotion. Block release when attack advantage exceeds the agreed ceiling.

source: MITRE ATLAS AML.T0024.000 (Infer Training Data Membership); Carlini et al. 'Membership Inference Attacks From First Principles' (LiRA); NIST AI RMF MEASURE 2.7

Lifecycle stages: 3 – Onboarding, Build & Review; 5 – Usage, Monitoring & Change
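
A release gate of this kind can be sketched with one common definition of attack advantage: true-positive rate minus false-positive rate at a fixed decision threshold. The attack scores, the threshold, and the 0.10 ceiling below are invented for illustration. A real battery would use scores from the attack harnesses and a ceiling agreed in policy.

```python
def attack_advantage(member_scores, nonmember_scores, threshold):
    """TPR minus FPR for an attack that says 'member' when score >= threshold."""
    tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
    fpr = sum(s >= threshold for s in nonmember_scores) / len(nonmember_scores)
    return tpr - fpr


def release_allowed(advantage, ceiling):
    """Block promotion when the measured advantage exceeds the agreed ceiling."""
    return advantage <= ceiling


members = [0.9, 0.8, 0.4, 0.7]      # attack scores on known training records
nonmembers = [0.2, 0.6, 0.1, 0.3]   # attack scores on held-out records
adv = attack_advantage(members, nonmembers, threshold=0.5)  # 0.75 - 0.25 = 0.5
ok = release_allowed(adv, ceiling=0.10)                     # False: release blocked
```

An advantage near zero means the attack does little better than guessing. Other metrics, such as true-positive rate at a low false-positive rate, can be gated the same way.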

Per-principal query-budget and probing-behaviour anomaly detection on the inference API

Configure per-principal budgets and probing-detection rules on the gateway before exposure. Verify enforcement with synthetic attack traffic.

source: MITRE ATLAS AML.M0004 (Restrict Number of ML Model Queries), AML.T0024 (Exfiltration via ML Inference API); NIST SP 800-53 SI-4, AU-6

Lifecycle stage: 4 – Deployment
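
A per-principal budget can be sketched as a sliding-window counter. `QueryBudget` is a hypothetical class, timestamps are passed in explicitly so that the check can be driven with synthetic traffic, and the limits shown are placeholders.

```python
from collections import defaultdict, deque


class QueryBudget:
    """Allow at most max_queries per principal within a sliding time window."""

    def __init__(self, max_queries, window_s):
        self.max_queries = max_queries
        self.window_s = window_s
        self._seen = defaultdict(deque)  # principal -> timestamps of allowed queries

    def allow(self, principal, now):
        recent = self._seen[principal]
        # Forget queries that have aged out of the window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.max_queries:
            return False  # budget exhausted: reject or escalate
        recent.append(now)
        return True


budget = QueryBudget(max_queries=2, window_s=60.0)
# Synthetic burst from one principal: the third query inside the window is refused.
burst = [budget.allow("tenant-a", t) for t in (0.0, 1.0, 2.0)]  # [True, True, False]
other = budget.allow("tenant-b", 2.0)    # True: budgets are tracked per principal
later = budget.allow("tenant-a", 61.0)   # True: earlier queries have expired
```

Replaying a burst like this against the gateway is one way to confirm that enforcement works before exposure.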

Runtime monitoring & anomaly detection

Live dashboards and alarms that notice unusual behaviour — spikes in errors, weird actions, sudden data access.

Corrective · 3

Red teaming

Conduct targeted red team exercises for inference attack categories (membership inference, model extraction, attribute inference) before deployment.

Lifecycle stage: 3 – Onboarding, Build & Review

Also addresses: Jailbreak; Model Drift & Silent Degradation; Knowledge / Training Data Poisoning; Inference-Time & Serving-Layer Manipulation; Prompt Injection (direct); Sensitive Data Leakage

Output confidence masking and structured-response minimisation for natural-language interfaces

Define the minimum response surface and test it with membership/attribute-inference probes pre-release. Block promotion if any probe recovers raw confidence signals.

source: MITRE ATLAS AML.T0024.001 (Invert ML Model); Jia et al. MemGuard (output perturbation defence); OWASP Top 10 for LLM Apps LLM02:2025 Sensitive Information Disclosure

Lifecycle stage: 3 – Onboarding, Build & Review

Per-principal query-budget and probing-behaviour anomaly detection on the inference API

Meter inference traffic per principal and flag probing signatures with behavioural analytics. Throttle, step-up, or suspend flagged sessions.

source: MITRE ATLAS AML.M0004 (Restrict Number of ML Model Queries), AML.T0024 (Exfiltration via ML Inference API); NIST SP 800-53 SI-4, AU-6

Lifecycle stage: 5 – Usage, Monitoring & Change


Framework mappings

OWASP LLM Top 10

LLM02:2025 Sensitive Information Disclosure

MITRE ATLAS

—

NIST AI RMF

MEASURE 2.10

Real-world cases

Published events that illustrate this risk.

Prefix/KV-cache timing side channels (e.g. InputSnatch) · 2025

Shared prefix/KV caching in LLM serving leaks information about other users' inputs via response-timing side channels.


Practise this in an interactive scenario

👂 Overheard Through the Cache

A speed optimisation becomes a cross-tenant listening device
