
Gahow Wang 10636b1ab1 KV cache lifecycle design + eviction loss analysis

Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between
turns by cold-start prefills (66% of loss). Inter-turn gap is only 2
requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session
across 14-21 concurrent sessions.

Three approaches designed:
  A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
  B. Two-tier KV cache: GPU + DRAM offload via Mooncake
  C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)

Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds,
then implement Approach A (lowest effort, immediate benchmark).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 01:27:22 +08:00


KV Cache Lifecycle Management for Agentic Workloads

Date: 2026-05-22
Status: Design, pending implementation + experiments
Context: PD separation's real issue is not P-D compute interference (cache-aware routing solves that), but KV cache eviction destroying multi-turn session state. This design addresses the root cause directly.

1. Problem: Multi-Turn KV Eviction

Our eviction analysis on 1000 sampled requests shows:

  Infinite cache APC:   53.4%
  LRU cache APC:        43.3%
  Gap:                  10.1 pp (3.2M tokens lost)
  
  Loss breakdown:
    Multi-turn (prior turn evicted): 66% of loss (6.7pp)
    Cross-session (shared prefix):   34% of loss (3.4pp)

The mechanism: a multi-turn session completes turn N (KV = ~93 blocks p50, ~47k tokens). Before turn N+1 arrives (gap = 2 requests p50, 12 requests p90), cold-start requests fill the LRU cache and evict turn N's KV. Turn N+1 then finds no cached prefix and must re-prefill the entire context.
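The mechanism can be reproduced with a minimal LRU sketch. The 550-block capacity and 93-block session come from the analysis above; the cold-start size passed in is a hypothetical parameter, and the block-level LRU here is a simplification rather than vLLM's actual prefix cache.

```python
from collections import OrderedDict


def surviving_session_blocks(capacity, session_blocks, cold_requests, cold_blocks_each):
    """Count how many of a session's blocks survive cold-start prefills under LRU."""
    cache = OrderedDict()  # block key -> None, ordered oldest-first

    def touch(key):
        # Refresh a cached block, or insert it and evict the LRU block on overflow.
        if key in cache:
            cache.move_to_end(key)
            return
        if len(cache) >= capacity:
            cache.popitem(last=False)
        cache[key] = None

    # Start from a cache full of unrelated blocks, then complete turn N.
    for i in range(capacity):
        touch(("filler", i))
    for i in range(session_blocks):
        touch(("session", i))

    # Cold-start requests arrive during the inter-turn gap.
    for r in range(cold_requests):
        for i in range(cold_blocks_each):
            touch(("cold", r, i))

    return sum(1 for key in cache if key[0] == "session")


# Cold-start size of 93 blocks is an assumption, not a measured value.
survive_gap_2 = surviving_session_blocks(550, 93, cold_requests=2, cold_blocks_each=93)
survive_gap_12 = surviving_session_blocks(550, 93, cold_requests=12, cold_blocks_each=93)
```

With these assumed sizes the session survives a gap of 2 cold starts intact but is fully evicted after 12. Where eviction actually sets in depends on the real cold-start sizes and on how many sessions share the instance.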

Key workload parameters:

Metric	Value
Multi-turn sessions	9% of sessions, but 66% of eviction loss
Inter-turn gap	p50=2 req, p90=12 req (very short)
KV to protect per session	p50=93 blocks (47k tokens)
Concurrent sessions needing protection	p50=14, max=21
Total protection budget needed	p50=3515 blocks (6.4x single-instance capacity)
Per-instance capacity	550 blocks

The challenge: protecting all concurrent multi-turn sessions' KV requires 3515 blocks at p50, but each instance has only 550. Even pooling all 8 instances (4400 blocks total) covers the p50 demand but falls short at peak (4927 blocks needed).
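The budget arithmetic behind this, as a short check using only the figures from the table and the paragraph above (the variable names are illustrative):

```python
BLOCKS_PER_INSTANCE = 550
INSTANCES = 8
P50_DEMAND_BLOCKS = 3515
PEAK_DEMAND_BLOCKS = 4927

# Total KV blocks if all instances are pooled.
fleet_capacity = BLOCKS_PER_INSTANCE * INSTANCES

# p50 protection demand relative to one instance and to the whole fleet.
p50_vs_instance = P50_DEMAND_BLOCKS / BLOCKS_PER_INSTANCE
p50_vs_fleet = P50_DEMAND_BLOCKS / fleet_capacity

# Blocks missing at peak even with every block in the fleet used for protection.
peak_shortfall = PEAK_DEMAND_BLOCKS - fleet_capacity

print(f"fleet capacity: {fleet_capacity} blocks")
print(f"p50 demand: {p50_vs_instance:.1f}x one instance, {p50_vs_fleet:.0%} of fleet")
print(f"peak shortfall: {peak_shortfall} blocks")
```

At p50 the protected KV alone would occupy about 80% of the fleet's cache, leaving little for cold starts, and at peak the fleet is 527 blocks short.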

2. Three Design Approaches

Approach A: Session-Sticky Routing with KV Reservation

Idea: Route all turns of a multi-turn session to the same instance. Reserve a fraction of each instance's KV cache for "protected" multi-turn sessions.

Instance KV layout (550 blocks):
  ┌──────────────────────────────────────────┐
  │  Protected zone (200 blocks)             │  ← Multi-turn session KV
  │  LRU eviction disabled here              │  ← Pinned by session affinity
  ├──────────────────────────────────────────┤
  │  Evictable zone (350 blocks)             │  ← Cold-start + overflow
  │  Normal LRU eviction                     │
  └──────────────────────────────────────────┘

Routing: Cache-aware + session-sticky. Multi-turn turn 2+ goes to the instance that served turn 1. Load-balance new sessions across instances.

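KV protection: This needs no vLLM change; it is implemented at the routing level. By concentrating a session's turns on one instance and ensuring that instance has enough cache headroom, the session's KV stays warm naturally (inter-turn gap is only 2 requests p50). The protected zone in the layout above is therefore a budget the router accounts for, not a region that vLLM pins.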

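Budget: 21 concurrent multi-turn sessions / 8 instances ≈ 3 sessions per instance. At 93 blocks/session (p50), that's ~280 blocks protected, leaving ~270 blocks for cold starts. Note that this exceeds the 200-block protected zone in the layout sketch above; the split needs to be tuned to the concurrency actually observed.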
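A minimal sketch of the routing rule, assuming the proxy can see a session ID and an estimate of each session's KV size. `StickyRouter`, `protected_budget`, and the method names are hypothetical and are not taken from the existing proxy code.

```python
from dataclasses import dataclass, field


@dataclass
class StickyRouter:
    num_instances: int
    protected_budget: int  # hypothetical per-instance block budget for multi-turn KV
    session_to_instance: dict = field(default_factory=dict)
    session_blocks: dict = field(default_factory=dict)

    def reserved(self, instance):
        # Blocks currently attributed to multi-turn sessions on this instance.
        return sum(blocks for sid, blocks in self.session_blocks.items()
                   if self.session_to_instance[sid] == instance)

    def route(self, session_id, kv_blocks):
        # Turn 2+ sticks to the instance that served turn 1.
        if session_id in self.session_to_instance:
            self.session_blocks[session_id] = kv_blocks  # context grows per turn
            return self.session_to_instance[session_id]
        # New session: place on the instance with the least reserved KV.
        instance = min(range(self.num_instances), key=self.reserved)
        self.session_to_instance[session_id] = instance
        self.session_blocks[session_id] = kv_blocks
        return instance

    def over_budget(self, instance):
        return self.reserved(instance) > self.protected_budget

    def end_session(self, session_id):
        # Release the reservation so protected blocks are not wasted.
        self.session_to_instance.pop(session_id, None)
        self.session_blocks.pop(session_id, None)


router = StickyRouter(num_instances=8, protected_budget=280)
placements = [router.route(f"s{i}", 93) for i in range(21)]
```

Placing 21 sessions of 93 blocks this way puts at most 3 sessions (279 blocks) on any instance. Once one of those sessions grows, that instance goes over the 280-block budget, which is the load-imbalance case the router would have to handle.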

Pros: No vLLM modification. Pure routing optimization.
Cons: Instance load imbalance if multi-turn sessions cluster. Protected blocks may waste cache if a session ends unexpectedly.

Experiment: Compare the current setup (PD-combined with cache-aware routing) against PD-combined with aggressive session-sticky routing, in which multi-turn sessions are balanced across instances by their KV size.

Approach B: Two-Tier KV Cache (GPU + DRAM Offload)

Idea: When a multi-turn session's turn completes, offload its KV from GPU to DRAM. When the next turn arrives, reload from DRAM (faster than re-prefill). GPU cache is freed for cold starts.

  Turn N completes:
    GPU KV (hot) ──offload──> DRAM KV pool (warm)
    GPU cache freed for cold-start requests
  
  Turn N+1 arrives:
    DRAM KV pool ──reload──> GPU KV (hot)
    Skip prefill, go directly to decode
    
  Latency: DRAM reload ~1-10ms (PCIe/RDMA) vs re-prefill ~3-10s (compute)

Implementation: Use Mooncake's DRAM pool as a KV cache extension. Each instance runs with kv_role=kv_both. When the scheduler detects a turn completion for a multi-turn session, it triggers KV offload to DRAM. On next turn arrival, the scheduler triggers KV reload.
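The offload/reload lifecycle can be sketched as bookkeeping over two tiers. This is a toy model only: `TwoTierKVCache` and its methods are hypothetical, it tracks block counts rather than tensors, and it does not use or imitate any Mooncake or vLLM API.

```python
class TwoTierKVCache:
    """Toy bookkeeping for GPU-resident vs DRAM-parked session KV."""

    def __init__(self):
        self.gpu = {}   # session_id -> blocks resident on GPU
        self.dram = {}  # session_id -> blocks parked in DRAM

    def gpu_used(self):
        return sum(self.gpu.values())

    def on_turn_complete(self, session_id):
        # Offload: move the session's KV to DRAM, freeing its GPU blocks.
        blocks = self.gpu.pop(session_id, 0)
        if blocks:
            self.dram[session_id] = blocks

    def on_turn_arrive(self, session_id, context_blocks):
        # Reload whatever is parked in DRAM; only the remainder needs prefill.
        reloaded = min(self.dram.pop(session_id, 0), context_blocks)
        prefilled = context_blocks - reloaded
        self.gpu[session_id] = context_blocks
        return reloaded, prefilled


cache = TwoTierKVCache()
first_turn = cache.on_turn_arrive("s", 93)    # cold start: everything is prefilled
cache.on_turn_complete("s")                   # KV leaves the GPU between turns
gpu_between_turns = cache.gpu_used()
second_turn = cache.on_turn_arrive("s", 100)  # prior context reloaded, new tokens prefilled
```

In this model the second turn prefills only the 7 new blocks, and the GPU holds none of the session's blocks between turns.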

Budget: DRAM is much larger than GPU HBM. Each H20 has ~512GB system DRAM. 21 sessions × 93 blocks × 512 tokens × 48 layers × 2(K+V) × 128 dim × 2 bytes ≈ 24GB in DRAM — easily fits.
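The estimate can be checked by multiplying out the formula exactly as written. The factor names below are labels for the terms in that formula; the formula counts one 128-wide K and one V vector per layer, so a model with several KV heads would scale the result accordingly.

```python
SESSIONS = 21
BLOCKS_PER_SESSION = 93
TOKENS_PER_BLOCK = 512
LAYERS = 48
KV_TENSORS = 2          # K and V
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2   # 16-bit values

# KV bytes stored per token across all layers.
bytes_per_token = LAYERS * KV_TENSORS * HEAD_DIM * BYTES_PER_ELEMENT

# Footprint of one p50 session and of all 21 concurrent sessions.
session_bytes = BLOCKS_PER_SESSION * TOKENS_PER_BLOCK * bytes_per_token
total_bytes = SESSIONS * session_bytes

print(f"per token:   {bytes_per_token / 1024:.0f} KiB")
print(f"per session: {session_bytes / 1e9:.2f} GB")
print(f"total:       {total_bytes / 1e9:.1f} GB")
```

This gives 24 KiB per token, about 1.17 GB per session, and about 24.6 GB in total, consistent with the ≈24GB figure.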

Pros: Decouples KV cache capacity from GPU HBM. DRAM reload is roughly 300-10,000x faster than re-prefill (from the ~1-10 ms vs ~3-10 s estimates above).
Cons: Requires Mooncake integration. Offload/reload adds latency (but much less than re-prefill). vLLM changes needed for proactive offload trigger.

Experiment: Hard to implement quickly in vLLM. Can simulate the benefit by comparing: (a) current APC with eviction vs (b) APC if multi-turn sessions always hit cache (simulated infinite cache for multi-turn only).

Approach C: Prefill-Aware Eviction Policy

Idea: Replace LRU with a policy that considers session lifecycle. Blocks belonging to active multi-turn sessions are evicted last.

  Standard LRU: evict oldest accessed block
  Session-aware: evict oldest accessed block THAT IS NOT part of an active session
  
  Active session: session with turn completed in last T seconds (or N requests)

Implementation: Modify vLLM's prefix cache eviction in third_party/vllm/. The eviction policy checks if a block's hash belongs to a known active session before evicting it.

The problem: vLLM's prefix cache uses block hashes, not session IDs. There's no direct mapping from block → session. We'd need to maintain a mapping at the scheduler level.
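A minimal sketch of the session-aware rule with an explicit block → session map, written as a standalone class. `SessionAwareLRU` and its fields are hypothetical; this is not vLLM's eviction code, and a real patch would need to keep the map in sync with the scheduler.

```python
from collections import OrderedDict


class SessionAwareLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block_hash -> None, oldest access first
        self.block_to_session = {}    # mapping maintained at the scheduler level
        self.active_sessions = set()  # sessions with a recently completed turn

    def access(self, block_hash, session_id=None):
        """Touch a block; return True on a cache hit."""
        hit = block_hash in self.blocks
        if hit:
            self.blocks.move_to_end(block_hash)
        else:
            if len(self.blocks) >= self.capacity:
                self._evict_one()
            self.blocks[block_hash] = None
        if session_id is not None:
            self.block_to_session[block_hash] = session_id
        return hit

    def _evict_one(self):
        # Evict the oldest block that does not belong to an active session.
        for block_hash in self.blocks:
            if self.block_to_session.get(block_hash) not in self.active_sessions:
                victim = block_hash
                break
        else:
            # Every block is protected: fall back to plain LRU.
            victim = next(iter(self.blocks))
        del self.blocks[victim]
        self.block_to_session.pop(victim, None)


cache = SessionAwareLRU(capacity=4)
cache.active_sessions.add("s1")
for h in ("a", "b"):
    cache.access(h, session_id="s1")  # multi-turn session blocks
for h in ("c", "d", "e"):
    cache.access(h)                   # cold-start blocks; "e" forces an eviction
```

Here inserting "e" evicts the cold block "c" rather than the older session block "a". Once "s1" is removed from `active_sessions`, its blocks age out like any others.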

Alternative: Simpler proxy — just use block access frequency instead of pure LRU. Blocks that belong to system prompts (accessed by many requests) and multi-turn sessions (accessed repeatedly) naturally have higher frequency and survive eviction. This is LFU (Least Frequently Used) or ARC (Adaptive Replacement Cache).

Pros: Directly solves eviction at the cache layer. No routing changes needed.
Cons: Requires vLLM source modification. Cache policy changes are subtle and may have side effects.

Experiment: Simulate LFU vs LRU on the trace to estimate APC improvement before implementing.
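A sketch of what such a simulation could look like, run on a synthetic toy trace rather than the real one. The LFU variant here keeps access counts across evictions and breaks ties by age; both choices are assumptions, and the numbers it produces say nothing about the gain on the actual workload.

```python
from collections import Counter, OrderedDict


def hit_rate(trace, capacity, policy):
    """Replay a block-access trace and return the fraction of hits."""
    cache = OrderedDict()  # block -> None, oldest access first
    freq = Counter()       # access counts, kept even after eviction
    hits = 0
    for block in trace:
        freq[block] += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)
            continue
        if len(cache) >= capacity:
            if policy == "lru":
                victim = next(iter(cache))
            else:
                # "lfu": lowest count; the oldest access wins ties.
                victim = min(cache, key=lambda b: freq[b])
            del cache[victim]
        cache[block] = None
    return hits / len(trace)


# Toy trace: two session blocks reused every round, separated by three
# one-off cold-start blocks (all sizes are made up for illustration).
trace = []
for r in range(10):
    trace += ["s0", "s1"] + [f"c{r}_{i}" for i in range(3)]

lru_rate = hit_rate(trace, capacity=4, policy="lru")
lfu_rate = hit_rate(trace, capacity=4, policy="lfu")
print(f"LRU hit rate: {lru_rate:.2f}, LFU hit rate: {lfu_rate:.2f}")
```

On this toy trace LRU never hits, because the cold blocks always push the session blocks out before they are reused, while LFU retains them after two rounds. The same harness could replay the sampled trace to estimate the real gap.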

3. Feasibility and Experiment Priority

Approach	Implementation Effort	vLLM Changes	Expected APC Gain	Experiment
A: Session-sticky	Low (proxy only)	None	+3-5pp (multi-turn stays warm)	Run immediately
B: DRAM offload	High (Mooncake)	Medium	+6-7pp (all multi-turn recovered)	Simulate first
C: Eviction policy	Medium (vLLM patch)	Yes	+5-10pp (both MT and cross-session)	Simulate LFU vs LRU first

Recommended experiment order:

1. Simulate: LRU vs LFU vs "infinite-for-MT" on the trace → quantify upper bound
2. Approach A: Session-sticky routing with KV-size-balanced placement → real benchmark
3. Approach C: If simulation shows LFU helps, patch vLLM eviction policy → real benchmark
4. Approach B: If DRAM offload shows large benefit in simulation, implement with Mooncake

4. Relationship to PD Separation

These approaches are orthogonal to PD separation. They address KV cache lifecycle, not P-D compute interference:

Approach A works in combined mode (no PD-Sep needed)
Approach B could complement PD-Sep (offload from D to DRAM between turns)
Approach C works in any mode

The key insight: for agentic workloads, KV cache management is a more impactful optimization axis than P-D compute separation. The 10.1pp APC gap from eviction translates to ~3.2M extra tokens of re-prefill per 1000 requests — far more overhead than P-D interference.

5. Combined Architecture Vision

The endgame combines all insights:

  ┌──────────────────────────────────────────────┐
  │            Global Scheduler                   │
  │  - Cache-aware + token-level LB               │
  │  - Session-sticky for multi-turn              │
  │  - KV-size-aware placement                    │
  └──────────────┬───────────────────────────────┘
                 │
  ┌──────────────┴───────────────────────────────┐
  │  8× PD-Combined Instances (TP=1)              │
  │                                               │
  │  Per-instance KV cache:                       │
  │    [Session-protected zone] [LFU evictable]   │
  │                                               │
  │  DRAM KV pool (Mooncake):                     │
  │    - Offloaded between-turn KV                │
  │    - Shared prefix blocks (system prompt)     │
  │    - Overflow buffer                          │
  └───────────────────────────────────────────────┘

All 8 GPUs do both P and D. The scheduler, cache policy, and DRAM pool work together to maximize APC and minimize prefill work — which is the real bottleneck for agentic workloads.
