AI RiskAtlas
← Real-world cases
Case study

Prefix/KV-cache timing side channels (e.g. InputSnatch)

Research demonstration · 27 Nov 2024 · Inside the Model

Shared prefix/KV caching in LLM serving leaks information about other users' inputs via response-timing side channels.

Root cause: why it happened

To answer faster and cheaper, the machinery running the model remembers the work it already did for a common opening of a prompt, and reuses it for the next person whose prompt starts the same way. The catch: reusing saved work is noticeably faster than doing it fresh. So if an attacker sends a guess and the answer comes back unusually quickly, that speed-up is a tell: it hints that someone else recently sent the same opening. By trying guess after guess and watching the clock, an attacker can slowly piece together fragments of what other people typed. Nothing is 'hacked' in the usual sense; the leak rides on a performance shortcut that was supposed to be invisible.
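
The mechanism above can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the InputSnatch method or any real serving stack: `SharedPrefixCache`, `guess_next_token`, and the two cost constants are hypothetical, and latency is modelled as a simple per-token cost rather than measured on a server.

```python
from dataclasses import dataclass, field

# Hypothetical cost units, chosen only to make the gap visible.
PREFILL_COST_PER_TOKEN = 1.0    # computing one uncached token from scratch
CACHE_HIT_COST_PER_TOKEN = 0.1  # reusing saved work for one cached token


@dataclass
class SharedPrefixCache:
    """Toy cache shared by ALL users: remembers every prompt prefix it has served."""
    seen_prefixes: set = field(default_factory=set)

    def serve(self, tokens):
        """Return a simulated response latency, then cache the prompt's prefixes."""
        # Find the longest prefix of this prompt that someone already sent.
        cached_len = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.seen_prefixes:
                cached_len = n
                break
        # Cached tokens are cheap; the remaining tokens need fresh work.
        latency = (cached_len * CACHE_HIT_COST_PER_TOKEN
                   + (len(tokens) - cached_len) * PREFILL_COST_PER_TOKEN)
        for n in range(1, len(tokens) + 1):
            self.seen_prefixes.add(tuple(tokens[:n]))
        return latency


def guess_next_token(cache, known_prefix, candidates):
    """Attacker side: time one probe per candidate and pick the fastest."""
    timings = {c: cache.serve(known_prefix + [c]) for c in candidates}
    return min(timings, key=timings.get), timings


cache = SharedPrefixCache()
cache.serve("my diagnosis is diabetes".split())  # the victim's request
best, timings = guess_next_token(
    cache, "my diagnosis is".split(), ["flu", "diabetes", "asthma"])
print(best, timings)
```

In this toy model the correct guess stands out because it alone avoids fresh work on the final token. On a real system the gap is noisy, so an attacker would likely need repeated measurements per guess.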

Risks this case illustrates

Risks are named using the standard OWASP/ATLAS/NIST lens.

How it unfolded

[Diagram: Inference pipeline (below the app layer; labels: raw text, hosts / caches). Components: Context Window, Tokenizer, Embeddings, Attention + KV Cache, Model Weights & Registry, Sampler / Decoder, Serving Infrastructure. Actors: Attacker (co-tenant), Other user (victim). Flow types: Instructions, Data, Actions, Control / decision, Feedback / logs.]
Setup (Step 1 / 6)

A shared serving stack caches prompt prefixes for speed

The model is run by a serving system shared by many users at once. To save money and time, it keeps the work it did for common openings of prompts and reuses it, so the next person whose prompt starts the same way gets an answer faster. This shortcut is meant to be a pure win, invisible to users.

Serving config (illustrative)
serving:
  prefix_cache: enabled        # reuse prefill for shared prefixes
  cache_scope: GLOBAL          # shared ACROSS users/tenants
  match: longest_prefix        # extend a cached prefix segment-by-segment
  goal: maximise GPU utilisation; cut TTFT and cost
# the optimisation is correct; the SHARING SCOPE is the latent leak
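
One way to read `match: longest_prefix` is as a block-wise lookup in which each cached segment is identified by a hash of everything before it. The sketch below assumes fixed-size token blocks and chained hashes; `BLOCK_SIZE`, `block_keys`, and `GlobalPrefixCache` are hypothetical names, not taken from any particular serving framework.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical number of tokens per cached segment


def block_keys(tokens, block_size=BLOCK_SIZE):
    """Chain-hash full blocks so each key identifies the whole prefix up to that block."""
    keys, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = "\x1f".join(tokens[start:start + block_size]).encode("utf-8")
        parent = hashlib.sha256(parent + b"\x1e" + block).digest()
        keys.append(parent)
    return keys


class GlobalPrefixCache:
    """Keys contain no user or tenant id, which corresponds to cache_scope: GLOBAL."""

    def __init__(self):
        self._blocks = set()

    def lookup(self, tokens):
        """Return how many leading tokens already have cached prefill work."""
        hits = 0
        for key in block_keys(tokens):
            if key not in self._blocks:
                break  # longest-prefix match stops at the first missing segment
            hits += 1
        return hits * BLOCK_SIZE

    def insert(self, tokens):
        """Remember every full block of this prompt for later requests."""
        self._blocks.update(block_keys(tokens))
```

Because a match is extended one segment at a time, an attacker who has confirmed one segment can move on to guessing the next, which is what makes the shared scope a latent leak.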

Controls & guardrails: what would have stopped it

The fix that actually closes this is to stop sharing the shortcut between different users. If each user (or tenant) gets their own private cache, or cross-user sharing is simply turned off where confidentiality matters, then a fast answer no longer tells the attacker anything about anyone else, because there is nothing of anyone else's to hit. You can also pad timing so hits and misses look the same, but the clean fix is isolation. The honest cost: less sharing means slower, pricier serving, which is exactly why sharing was on in the first place.
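
The two controls described above can be sketched as follows. This is an illustrative sketch, not a production design: `TenantScopedPrefixCache` and `pad_to_floor` are hypothetical names, and padding is modelled as arithmetic on a latency value rather than an actual delay in a server.

```python
class TenantScopedPrefixCache:
    """Toy prefix cache whose keys include the tenant, so hits never cross tenants."""

    def __init__(self):
        self._prefixes = set()

    def cached_len(self, tenant_id, tokens):
        """Longest cached prefix of this prompt that belongs to the SAME tenant."""
        for n in range(len(tokens), 0, -1):
            if (tenant_id, tuple(tokens[:n])) in self._prefixes:
                return n
        return 0

    def insert(self, tenant_id, tokens):
        """Cache every prefix of the prompt under the tenant's own partition."""
        for n in range(1, len(tokens) + 1):
            self._prefixes.add((tenant_id, tuple(tokens[:n])))


def pad_to_floor(actual_latency, floor):
    """Timing padding: never answer faster than `floor`, so a hit looks like a miss."""
    return max(actual_latency, floor)


cache = TenantScopedPrefixCache()
prompt = "my diagnosis is diabetes".split()
cache.insert("victim", prompt)
print(cache.cached_len("victim", prompt))    # the victim still benefits
print(cache.cached_len("attacker", prompt))  # the attacker sees no hit
```

Isolation removes the signal because the attacker's lookups never touch another tenant's entries. Padding only hides the signal if the floor is at least as slow as a miss, which gives back much of the latency the cache was meant to save.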

Preventive
  • Serving-stack & provisioning attestation, cache isolation

    Attestation is operationally heavy and rarely covers the full stack; cache isolation trades away latency/cost savings, so cross-user sharing is often left on for performance. Signing proves a template wasn't tampered with in transit, not that a signed template is benign; an insider with signing rights still needs review and trigger-focused evals.

Detective
  • Runtime monitoring & anomaly detection

    Detects the anomalous, not the novel-but-subtle; high false-positive rates cause alert fatigue. Always a step behind a sufficiently quiet attacker.

  • Full-trace audit logging

    Logging is forensic, not preventive: it explains harm after the fact. Useless if no one reviews it or if the materialised context isn't captured.
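
As a rough illustration of what runtime monitoring might look for here, the sketch below flags a client that sends many different continuations of the same opening, which is the shape of a guess-and-time probe. `ProbeDetector`, `prefix_len`, and `max_variants` are hypothetical, and the thresholds are arbitrary; as noted above, a heuristic like this will produce false positives and can be evaded by a slow, quiet attacker.

```python
from collections import defaultdict


class ProbeDetector:
    """Counts distinct continuations each client sends after the same opening."""

    def __init__(self, prefix_len=3, max_variants=5):
        self.prefix_len = prefix_len      # how many leading tokens define "the same opening"
        self.max_variants = max_variants  # distinct continuations tolerated per opening
        self._variants = defaultdict(set)

    def observe(self, client_id, tokens):
        """Record one request; return True once the client exceeds the variant budget."""
        if len(tokens) <= self.prefix_len:
            return False  # too short to have a continuation worth tracking
        key = (client_id, tuple(tokens[:self.prefix_len]))
        self._variants[key].add(tuple(tokens[self.prefix_len:]))
        return len(self._variants[key]) > self.max_variants


detector = ProbeDetector(prefix_len=3, max_variants=2)
for guess in ["flu", "cold", "asthma"]:
    flagged = detector.observe("client-a", ["my", "diagnosis", "is", guess])
    print(guess, flagged)
```

A real deployment would also need a time window and eviction of old entries, which this sketch omits for brevity.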

Corrective
  • Governance: risk assessment, red-teaming & incident response

    Process reduces likelihood and speeds recovery but executes no technical control itself; weak follow-through makes it theatre.

Lessons

  • Performance optimisations that SHARE state across users are confidentiality boundaries in disguise: a cache hit is faster than a miss, and that latency gap is an oracle for what other users recently submitted.
  • App-layer guardrails (input/output filters, instruction hierarchy) do nothing here: the leak lives in the serving stack, below the layer those controls operate on.
  • The clean fix is isolation (per-tenant cache partitioning / no cross-tenant prefix sharing), which removes the oracle; timing-padding only shrinks it, and detection only spots the probing pattern after it starts.
  • Attest the stack, not just the artifact: weight hashes are unchanged while the system leaks, so attestation must cover the serving binary and its cache-sharing configuration.
  • It is a classic shared-resource side channel (Flush+Reload-style) reappearing in LLM serving; old side-channel discipline applies to new GPU-serving infrastructure.
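
The point about covering the cache-sharing configuration can be illustrated with a small check. This is a sketch under assumptions, not an attestation scheme: `config_fingerprint` and `audit_serving_config` are hypothetical helpers, the config keys mirror the illustrative serving config in this case study, and `TENANT` is an assumed value for an isolated scope.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a canonical JSON form of the config so any change alters the fingerprint."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_serving_config(config, approved_fingerprint):
    """Return a list of findings; an empty list means nothing was flagged."""
    findings = []
    serving = config.get("serving", {})
    if serving.get("prefix_cache") == "enabled" and serving.get("cache_scope") == "GLOBAL":
        findings.append("prefix cache is shared across tenants")
    if config_fingerprint(config) != approved_fingerprint:
        findings.append("config differs from the approved fingerprint")
    return findings


approved = {"serving": {"prefix_cache": "enabled", "cache_scope": "TENANT"}}
deployed = {"serving": {"prefix_cache": "enabled", "cache_scope": "GLOBAL"}}
print(audit_serving_config(deployed, config_fingerprint(approved)))
```

The model weights are identical in both configurations, so a check limited to weight hashes would pass in either case; only a check that includes the serving configuration notices the difference.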

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning, not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading · Built by Shi Yuan