AI RiskAtlas

Overheard Through the Cache

A speed optimisation becomes a cross-tenant listening device

Technique first revealed 30 Sep 2024

Inside the Model
[Diagram: Inference pipeline, below the app layer. Components: Context Window, Tokenizer, Embeddings, Attention + KV Cache, Model Weights & Registry, Sampler / Decoder, Serving Infrastructure, Attacker (co-tenant). Edge labels: raw text, hosts / caches.]
Legend: Instructions, Data, Actions, Control / decision, Feedback / logs
Setup: Step 1 / 6

A shared cache for speed

The service answers thousands of people at once. To stay fast, it remembers the work it already did for the opening words of a request, so if the next request starts the same way it can skip ahead instead of redoing it.

Serving config (prefix cache enabled)
serving:
  prefix_cache:
    enabled: true
    scope: global        # shared across ALL tenants on this host
    key: sha256(token_ids[:prefix_len])
    eviction: lru
  # On match: reuse cached KV for the shared prefix, skip prefill.
  # Effect: time-to-first-token drops sharply on a cache HIT.
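
The config above can be read as a small lookup table keyed by a hash of the first few token ids. The Python sketch below is a toy model of that idea, not the code of any real serving stack: `PREFIX_LEN`, `PrefixCache`, `serve`, and the `"kv-placeholder"` value are all hypothetical names introduced for illustration, and a real system would store attention key/value tensors rather than a string. The point to notice is that the cache key is built from token ids only, so the tenant never enters the lookup, which is what `scope: global` amounts to.

```python
import hashlib
from collections import OrderedDict

PREFIX_LEN = 4  # hypothetical fixed prefix length, chosen only for this sketch


def prefix_key(token_ids, prefix_len=PREFIX_LEN):
    """Hash the first prefix_len token ids. No tenant id is mixed into the key."""
    raw = ",".join(str(t) for t in token_ids[:prefix_len]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class PrefixCache:
    """Toy LRU cache standing in for stored KV state (assumption: one entry per prefix)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self._store = OrderedDict()

    def lookup(self, token_ids):
        key = prefix_key(token_ids)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return True
        return False

    def insert(self, token_ids):
        key = prefix_key(token_ids)
        self._store[key] = "kv-placeholder"  # a real cache would hold KV tensors
        self._store.move_to_end(key)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry


def serve(cache, tenant, token_ids):
    """Return 'HIT' if prefill could be skipped, else 'MISS'. tenant is unused by the cache."""
    if cache.lookup(token_ids):
        return "HIT"
    cache.insert(token_ids)
    return "MISS"


if __name__ == "__main__":
    cache = PrefixCache()
    print(serve(cache, "tenant-a", [1, 2, 3, 4, 5]))  # MISS: first time this prefix is seen
    print(serve(cache, "tenant-b", [1, 2, 3, 4, 9]))  # HIT: same first four tokens, different tenant
```

In this sketch the second request hits even though it comes from a different tenant and ends differently, because only the shared opening tokens are hashed. Scoping the cache per tenant would, under the same assumptions, mean mixing a tenant identifier into `prefix_key`.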

AI RiskAtlas is an educational model of how GenAI & agentic systems work and fail. Architectures and payloads are illustrative and simplified for learning, not operational guidance. Real-world cases are summarised from public reporting.

Sources & further reading · Built by Shi Yuan