Overheard Through the Cache

A speed optimisation becomes a cross-tenant listening device

Technique first revealed 30 Sep 2024

Inside the Model / KV-Cache & Inference-State Side Channels / Sensitive Data Leakage

Setup · Step 1 / 6

A shared cache for speed

The service answers thousands of people at once. To stay fast, it remembers the work it already did for the opening words of a request, so if the next request starts the same way it can skip ahead instead of redoing it.

Serving config (prefix cache enabled)

serving:
  prefix_cache:
    enabled: true
    scope: global        # shared across ALL tenants on this host
    key: sha256(token_ids[:prefix_len])
    eviction: lru
  # On match: reuse cached KV for the shared prefix, skip prefill.
  # Effect: time-to-first-token drops sharply on a cache HIT.
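
To make the config concrete, here is a minimal Python sketch of a prefix cache with the same three properties: a global scope, a SHA-256 key over the leading token ids, and LRU eviction. It is a toy under stated assumptions, not the service's actual code: the class GlobalPrefixCache, the helper prefix_key, the tenant names, and the placeholder stored in place of real KV tensors are all hypothetical.

```python
import hashlib
from collections import OrderedDict


def prefix_key(token_ids, prefix_len):
    """SHA-256 over the first prefix_len token ids (mirrors the `key:` line)."""
    raw = ",".join(str(t) for t in token_ids[:prefix_len]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class GlobalPrefixCache:
    """Toy LRU prefix cache with scope: global (hypothetical illustration)."""

    def __init__(self, capacity=1024, prefix_len=4):
        self.capacity = capacity
        self.prefix_len = prefix_len
        self._store = OrderedDict()  # key -> stand-in for the cached KV state

    def lookup_or_fill(self, tenant, token_ids):
        """Return "HIT" if the prefix is cached, else cache it and return "MISS"."""
        # The tenant is deliberately NOT part of the key: that is what
        # "scope: global" means in the config above.
        key = prefix_key(token_ids, self.prefix_len)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return "HIT"  # a real server would skip prefill here
        self._store[key] = "kv-placeholder"
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return "MISS"


cache = GlobalPrefixCache(capacity=2, prefix_len=4)
first = cache.lookup_or_fill("tenant-a", [11, 22, 33, 44, 55])
second = cache.lookup_or_fill("tenant-b", [11, 22, 33, 44, 99])
third = cache.lookup_or_fill("tenant-b", [77, 22, 33, 44, 99])
print(first, second, third)  # MISS HIT MISS
```

The second request comes from a different tenant but gets a hit, because the key depends only on the token ids. In a real deployment that hit would show up as a shorter time-to-first-token.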
