Overheard Through the Cache
A speed optimisation becomes a cross-tenant listening device
Technique first revealed 30 Sep 2024
Inside the Model
[Interactive diagram. Legend: Instructions, Data, Actions, Control / decision, Feedback / logs]
Setup: Step 1 / 6
A shared cache for speed
The service answers thousands of people at once. To stay fast, it remembers the work it already did for the opening words of a request, so if the next request starts the same way it can skip ahead instead of redoing it.
Serving config (prefix cache enabled)
serving:
  prefix_cache:
    enabled: true
    scope: global        # shared across ALL tenants on this host
    key: sha256(token_ids[:prefix_len])
    eviction: lru
# On match: reuse cached KV for the shared prefix, skip prefill.
# Effect: time-to-first-token drops sharply on a cache HIT.
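To make the config concrete, here is a minimal Python sketch of a cache that behaves the way the settings describe: one store shared by every caller, keyed by a SHA-256 hash of the first few token ids, with least-recently-used eviction. The names `PrefixCache`, `prefix_key`, and `lookup_or_fill`, the capacity, and the stored placeholder value are all hypothetical, chosen for illustration. A real serving stack would hold key/value tensors and is not shown here.

```python
import hashlib
from collections import OrderedDict


def prefix_key(token_ids, prefix_len):
    """Hash the first prefix_len token ids, mirroring sha256(token_ids[:prefix_len])."""
    raw = ",".join(str(t) for t in token_ids[:prefix_len]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class PrefixCache:
    """Toy LRU cache shared by every caller (the 'scope: global' setting)."""

    def __init__(self, capacity=1024, prefix_len=4):
        self.capacity = capacity
        self.prefix_len = prefix_len
        self._store = OrderedDict()  # key -> placeholder for cached KV state

    def lookup_or_fill(self, token_ids):
        """Return "HIT" if the prefix is cached; otherwise store it and return "MISS"."""
        key = prefix_key(token_ids, self.prefix_len)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return "HIT"
        # Placeholder for the expensive prefill work (assumption: no real KV here).
        self._store[key] = tuple(token_ids[:self.prefix_len])
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return "MISS"


cache = PrefixCache(capacity=2, prefix_len=3)
first = cache.lookup_or_fill([10, 11, 12, 99])   # one tenant's request
second = cache.lookup_or_fill([10, 11, 12, 42])  # another tenant, same opening tokens
```

In this sketch the second request hits the cache even though it comes from a different caller, because nothing in the key identifies the tenant. That is the property the `scope: global` line sets up.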