Prefix/KV-cache timing side channels (e.g. InputSnatch)
Research demonstration | 27 Nov 2024 | Inside the Model
Shared prefix/KV caching in LLM serving leaks information about other users' inputs via response-timing side channels.
Root cause: why it happened
To answer faster and cheaper, the machinery running the model remembers the work it already did for a common opening of a prompt and reuses it for the next person whose prompt starts the same way. The catch: reusing saved work is noticeably faster than doing it fresh. So if an attacker sends a guess and the answer comes back unusually quickly, that speed-up is a tell: it hints that someone else recently sent the same opening. By trying guess after guess and watching the clock, an attacker can slowly piece together fragments of what other people typed. Nothing is 'hacked' in the usual sense; the leak rides on a performance shortcut that was supposed to be invisible.
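The hit-versus-miss tell described above can be sketched with a toy model. This is an illustration under stated assumptions, not the setup used in the cited papers: latency is simulated rather than measured, tokens are whitespace-separated words, and the class name `SharedPrefixCache` and both per-token costs are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed, illustrative costs; real gaps depend on model, hardware and load.
PREFILL_MS_PER_TOKEN = 2.0  # computing a prompt token from scratch
CACHED_MS_PER_TOKEN = 0.2   # reusing a token whose work is already cached


@dataclass
class SharedPrefixCache:
    """Toy stand-in for a serving stack with one cache shared by all users."""
    seen: set = field(default_factory=set)  # token prefixes already computed

    def serve(self, prompt: str) -> float:
        """Return a simulated time-to-first-token (ms) and cache every prefix."""
        tokens = tuple(prompt.split())
        # Longest prefix of this prompt that an earlier request already paid for.
        hit = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self.seen:
                hit = n
                break
        for n in range(1, len(tokens) + 1):
            self.seen.add(tokens[:n])
        miss = len(tokens) - hit
        return hit * CACHED_MS_PER_TOKEN + miss * PREFILL_MS_PER_TOKEN


cache = SharedPrefixCache()
cache.serve("my diagnosis is diabetes")  # the victim's request

# The attacker sends each guess once and watches the (simulated) clock.
guesses = ["asthma", "diabetes", "migraine"]
latencies = {g: cache.serve(f"my diagnosis is {g}") for g in guesses}
leaked = min(latencies, key=latencies.get)  # the fastest guess is the tell
```

In this toy run the guess that matches the victim's prompt is served entirely from cache, so it comes back faster than the others. A real attacker would face network jitter and would need repeated measurements and a threshold, which this sketch leaves out.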
Risks this case illustrates
Named in the standard (OWASP/ATLAS/NIST) lens.
How it unfolded
A shared serving stack caches prompt prefixes for speed
The model is run by a serving system shared by many users at once. To save money and time, it keeps the work it did for common openings of prompts and reuses it, so the next person whose prompt starts the same way gets an answer faster. This shortcut is meant to be a pure win, invisible to users.
serving:
  prefix_cache: enabled      # reuse prefill for shared prefixes
  cache_scope: GLOBAL        # shared ACROSS users/tenants
  match: longest_prefix      # extend a cached prefix segment-by-segment
  goal: maximise GPU utilisation; cut TTFT and cost
# the optimisation is correct; the SHARING SCOPE is the latent leak
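To see why the sharing scope matters with longest-prefix matching, here is a minimal sketch of segment-by-segment recovery. Everything in it is an assumption for illustration: the `PrefixCache` class, the `recover` helper, the simulated per-token costs, the tiny candidate vocabulary, and the `TENANT` scope value (the configuration above only shows `GLOBAL`).

```python
from collections import defaultdict

MISS_MS, HIT_MS = 2.0, 0.2  # assumed per-token prefill vs. reuse cost


class PrefixCache:
    """Toy longest-prefix cache; cache_scope mirrors the config field above."""

    def __init__(self, cache_scope: str = "GLOBAL"):
        self.cache_scope = cache_scope
        self.store = defaultdict(set)  # scope key -> cached token prefixes

    def serve(self, tenant: str, tokens: list) -> float:
        """Return simulated latency (ms) and cache all prefixes of the prompt."""
        scope = "GLOBAL" if self.cache_scope == "GLOBAL" else tenant
        cached = self.store[scope]
        hit = 0
        while hit < len(tokens) and tuple(tokens[:hit + 1]) in cached:
            hit += 1
        for n in range(1, len(tokens) + 1):
            cached.add(tuple(tokens[:n]))
        return hit * HIT_MS + (len(tokens) - hit) * MISS_MS


def recover(cache: PrefixCache, vocab: list, max_len: int) -> list:
    """Extend a guessed prefix one token at a time using latency alone."""
    known = []
    for _ in range(max_len):
        for word in vocab:
            guess = known + [word]
            full_hit_ms = len(guess) * HIT_MS
            if cache.serve("attacker", guess) <= full_hit_ms + 1e-9:
                known.append(word)  # whole guess was cached: token confirmed
                break
        else:
            break  # no candidate was a full hit: the cached prefix ends here
    return known


vocab = ["the", "patient", "has", "diabetes", "asthma", "no"]
victim_prompt = "the patient has diabetes".split()

shared = PrefixCache("GLOBAL")
shared.serve("victim", victim_prompt)
leaked = recover(shared, vocab, max_len=6)

isolated = PrefixCache("TENANT")  # hypothetical per-tenant scope
isolated.serve("victim", victim_prompt)
blocked = recover(isolated, vocab, max_len=6)
```

With the global scope the toy attacker rebuilds the victim's prompt token by token; with a per-tenant scope every probe misses and nothing is recovered. The cost grows with vocabulary size times prompt length, which is why the published attacks rely on constrained or predictable inputs.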
Controls & guardrails: what would have stopped it
The fix that actually closes this is to stop sharing the shortcut between different users. If each user (or tenant) gets their own private cache, or cross-user sharing is simply turned off where confidentiality matters, then a fast answer no longer tells the attacker anything about anyone else, because there is nothing of anyone else's to hit. You can also pad timing so hits and misses look the same, but the clean fix is isolation. The honest cost: less sharing means slower, pricier serving, which is exactly why sharing was on in the first place.
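Both options can be sketched in a few lines. This is an illustration, not any serving framework's API: `cache_key`, `padded_latency`, the `isolate` flag, and the 50 ms bucket are hypothetical names and values chosen for the example.

```python
import hashlib
import math


def cache_key(tenant_id: str, prefix_tokens: tuple, isolate: bool = True) -> str:
    """Derive the lookup key for a cached prefix (hypothetical helper).

    With isolate=True the tenant id is mixed into the key, so one tenant's
    cached prefix can never be hit by another tenant's request.
    """
    scope = tenant_id if isolate else "GLOBAL"
    digest = hashlib.sha256()
    for part in (scope, *prefix_tokens):
        digest.update(part.encode("utf-8") + b"\x00")  # NUL-separate fields
    return digest.hexdigest()


def padded_latency(actual_ms: float, bucket_ms: float = 50.0) -> float:
    """Round a response time up to the next bucket boundary.

    Hits and misses that land in the same bucket become indistinguishable;
    differences larger than one bucket still show, so this narrows the
    channel rather than closing it.
    """
    return math.ceil(actual_ms / bucket_ms) * bucket_ms


prefix = ("my", "diagnosis", "is")

# Shared scope: both tenants map to the same entry, so hits cross tenants.
shared_a = cache_key("tenant-a", prefix, isolate=False)
shared_b = cache_key("tenant-b", prefix, isolate=False)

# Isolated scope: same prefix, different keys, no cross-tenant hit.
private_a = cache_key("tenant-a", prefix)
private_b = cache_key("tenant-b", prefix)

hit_ms, miss_ms = padded_latency(0.8), padded_latency(2.6)
```

Keying on the tenant keeps reuse within a tenant, which preserves some of the savings while removing the cross-tenant oracle. Padding costs latency on every fast response, so the bucket size is a direct trade between leakage and speed.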
- Serving-stack & provisioning attestation, cache isolation
Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so cross-user sharing is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign; an insider with signing rights still needs review and trigger-focused evals.
- Runtime monitoring & anomaly detection
Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.
- Full-trace audit logging
Logging is forensic, not preventive: it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
- Governance: risk assessment, red-teaming & incident response
Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.
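For the runtime-monitoring control listed above, one probing pattern worth watching for is a single client sending many prompts that share everything except the final token. The heuristic below is a hypothetical sketch: `flag_probers`, the `(client_id, prompt)` log shape, and the `min_variants` threshold are assumptions, and an attacker who spreads probes across accounts or over time would slip past it.

```python
from collections import defaultdict


def flag_probers(requests, min_variants: int = 5) -> set:
    """Flag clients that try many last-token variants of one prompt stem.

    requests: iterable of (client_id, prompt) pairs from a request log.
    """
    # (client, prompt minus its last token) -> distinct last tokens tried
    variants = defaultdict(set)
    for client, prompt in requests:
        tokens = prompt.split()
        if len(tokens) < 2:
            continue  # too short to have a stem plus a guessed token
        variants[(client, tuple(tokens[:-1]))].add(tokens[-1])
    return {
        client
        for (client, _stem), lasts in variants.items()
        if len(lasts) >= min_variants
    }


log = [
    ("attacker", f"my diagnosis is {guess}")
    for guess in ["asthma", "diabetes", "migraine", "anemia", "gout"]
]
log += [
    ("user-1", "summarise this meeting transcript"),
    ("user-1", "translate this sentence into French"),
]
flagged = flag_probers(log)
```

A low threshold will also catch legitimate users who iterate on a prompt, which is the false-positive cost noted above; detection of this kind only fires once probing has already begun.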
Lessons
- Performance optimisations that SHARE state across users are confidentiality boundaries in disguise: a cache hit is faster than a miss, and that latency gap is an oracle for what other users recently submitted.
- App-layer guardrails (input/output filters, instruction hierarchy) do nothing here: the leak lives in the serving stack, below the layer those controls operate on.
- The clean fix is isolation (per-tenant cache partitioning / no cross-tenant prefix sharing), which removes the oracle; timing-padding only shrinks it, and detection only spots the probing pattern after it starts.
- Attest the stack, not just the artifact: weight hashes are unchanged while the system leaks, so attestation must cover the serving binary and its cache-sharing configuration.
- It is a classic shared-resource side channel (Flush+Reload-style) reappearing in LLM serving; old side-channel discipline applies to new GPU-serving infrastructure.
Sources
- The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems (arXiv:2409.20002). Timing side channels from prefix/KV caching and batching in LLM serving.
- InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks (arXiv:2411.18191). Reconstructs other users' inputs via cache hit/miss timing; the named exemplar.
- Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference (arXiv:2508.08438). Mitigation that narrows the channel by trading away some cross-request reuse; not a full close.