agentic-kvc/analysis/kv_lifecycle_design.md
Gahow Wang 10636b1ab1 KV cache lifecycle design + eviction loss analysis
Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between
turns by cold-start prefills (66% of loss). Inter-turn gap is only 2
requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session
across 14-21 concurrent sessions.

Three approaches designed:
  A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
  B. Two-tier KV cache: GPU + DRAM offload via Mooncake
  C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)

Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds,
then implement Approach A (lowest effort, immediate benchmark).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 01:27:22 +08:00


# KV Cache Lifecycle Management for Agentic Workloads
**Date**: 2026-05-22
**Status**: Design, pending implementation + experiments
**Context**: The real issue behind PD (prefill/decode) separation is not P-D compute interference, which cache-aware routing already solves, but KV cache eviction destroying multi-turn session state. This design addresses that root cause directly.
---
## 1. Problem: Multi-Turn KV Eviction
Our eviction analysis on 1000 sampled requests gives the following APC (prefix cache hit rate) figures:
```
Infinite cache APC: 53.4%
LRU cache APC:      43.3%
Gap:                10.1 pp (3.2M tokens lost)
Loss breakdown:
  Multi-turn (prior turn evicted): 66% of loss (6.7pp)
  Cross-session (shared prefix):   34% of loss (3.4pp)
```
The mechanism: a multi-turn session completes turn N, leaving ~93 blocks (p50, ~47k tokens) of KV in the cache. Before turn N+1 arrives (gap = 2 requests p50, 12 requests p90), cold-start requests fill the LRU cache and evict turn N's KV. When turn N+1 arrives, it finds no cached prefix and must re-prefill the entire context.
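The mechanism can be reproduced with a toy LRU block cache. This is a minimal sketch, not vLLM's allocator: the class and function names are hypothetical, and the cold-start request sizes are made-up inputs chosen only to show how quickly a 550-block cache turns over.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy block cache with plain LRU eviction (not vLLM's implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> True, oldest first

    def touch(self, block_id):
        """Insert or refresh a block. Returns True on a cache hit."""
        hit = block_id in self.blocks
        if hit:
            self.blocks.move_to_end(block_id)
        else:
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        return hit


def surviving_session_blocks(capacity, session_blocks, cold_request_sizes):
    """Blocks of one session still cached after the given cold-start requests."""
    cache = LRUBlockCache(capacity)
    session = [("mt", i) for i in range(session_blocks)]
    for block in session:
        cache.touch(block)
    for req, size in enumerate(cold_request_sizes):
        for i in range(size):
            cache.touch((f"cold{req}", i))
    return sum(block in cache.blocks for block in session)


# Hypothetical cold-start sizes (in blocks) arriving between two turns.
for sizes in ([100, 100], [250, 250], [300, 300]):
    print(sizes, surviving_session_blocks(550, 93, sizes))
```

In this toy model the oldest session blocks are evicted first, so even a partial eviction removes the start of the prefix and the next turn loses its prefix match.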
Key workload parameters:
| Metric | Value |
|--------|-------|
| Multi-turn sessions | 9% of sessions, but 66% of eviction loss |
| Inter-turn gap | p50=2 req, p90=12 req (very short) |
| KV to protect per session | p50=93 blocks (47k tokens) |
| Concurrent sessions needing protection | p50=14, max=21 |
| Total protection budget needed | p50=3515 blocks (6.4x single-instance capacity) |
| Per-instance capacity | 550 blocks |
The challenge: protecting all concurrent multi-turn sessions' KV requires 3515 blocks at p50, but each instance only has 550. Spreading across 8 instances gives 4400 blocks in total, which covers the p50 demand but falls 527 blocks short of the peak (4927 blocks needed).
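The capacity figures above reduce to a few lines of arithmetic. The sketch below only restates numbers from the table; the variable names are illustrative.

```python
PER_INSTANCE_BLOCKS = 550
NUM_INSTANCES = 8
P50_BUDGET_BLOCKS = 3515   # p50 of total blocks needing protection
PEAK_BUDGET_BLOCKS = 4927  # peak of total blocks needing protection

pool_blocks = PER_INSTANCE_BLOCKS * NUM_INSTANCES
p50_vs_one_instance = P50_BUDGET_BLOCKS / PER_INSTANCE_BLOCKS
p50_pool_share = P50_BUDGET_BLOCKS / pool_blocks
peak_shortfall = PEAK_BUDGET_BLOCKS - pool_blocks

print(f"pool: {pool_blocks} blocks")
print(f"p50 budget = {p50_vs_one_instance:.1f}x one instance")
print(f"p50 budget uses {p50_pool_share:.0%} of the pool")
print(f"peak shortfall: {peak_shortfall} blocks")
```

At p50 the protection budget alone would take about 80% of the whole pool, so little is left for cold starts even before the peak is considered.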
## 2. Three Design Approaches
### Approach A: Session-Sticky Routing with KV Reservation
**Idea**: Route all turns of a multi-turn session to the same instance. Reserve a fraction of each instance's KV cache for "protected" multi-turn sessions.
```
Instance KV layout (550 blocks):
┌──────────────────────────────────────────┐
│ Protected zone (200 blocks)              │ ← Multi-turn session KV
│ LRU eviction disabled here               │ ← Pinned by session affinity
├──────────────────────────────────────────┤
│ Evictable zone (350 blocks)              │ ← Cold-start + overflow
│ Normal LRU eviction                      │
└──────────────────────────────────────────┘
```
**Routing**: Cache-aware + session-sticky. Turn 2 and later of a multi-turn session go to the instance that served turn 1. New sessions are load-balanced across instances.
**KV protection**: This requires no vLLM change; it is implemented at the routing level. Concentrating a session's turns on one instance and keeping enough cache headroom there lets the session's KV stay warm naturally, since the inter-turn gap is only 2 requests at p50. The "protected zone" is therefore a budget the router maintains, not a region that vLLM pins.
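**Budget**: 21 concurrent multi-turn sessions / 8 instances ≈ 3 sessions per instance. At 93 blocks/session (p50), that's 3 × 93 = 279 blocks protected, leaving about 270 blocks for cold starts. This is more than the 200-block protected zone sketched above, and it uses the median session size: the aggregate p50 budget from Section 1 (3515 blocks) works out to roughly 440 blocks per instance.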
**Pros**: No vLLM modification. Pure routing optimization.
**Cons**: Instance load imbalance if multi-turn sessions cluster. Protected blocks may waste cache if a session ends unexpectedly.
**Experiment**: Compare the current combined-mode cache-aware routing against combined mode with aggressive session-sticky routing, where multi-turn sessions are balanced across instances by KV size.
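The routing rule for Approach A can be sketched as a small proxy-side class. This is a minimal illustration under stated assumptions, not the actual proxy: `StickyRouter`, its methods, and the 280-block budget are hypothetical, and the sketch assumes the caller knows each session's KV size in blocks and pins every session on its first request.

```python
class StickyRouter:
    """Pins sessions to instances and balances them by reserved KV blocks."""

    def __init__(self, num_instances, protected_budget):
        self.budget = protected_budget        # max reserved blocks per instance
        self.reserved = [0] * num_instances   # reserved blocks per instance
        self.pinned = {}                      # session_id -> (instance, blocks)

    def route(self, session_id, kv_blocks):
        """Return the instance index for this turn."""
        if session_id in self.pinned:
            # Later turns: stay on the same instance, update the reservation.
            inst, old_blocks = self.pinned[session_id]
            self.reserved[inst] += kv_blocks - old_blocks
            self.pinned[session_id] = (inst, kv_blocks)
            return inst
        # First turn: pick the instance with the fewest reserved blocks.
        inst = min(range(len(self.reserved)), key=self.reserved.__getitem__)
        if self.reserved[inst] + kv_blocks <= self.budget:
            self.pinned[session_id] = (inst, kv_blocks)
            self.reserved[inst] += kv_blocks
        # Over budget: still served, but not pinned (treated as unprotected).
        return inst

    def end_session(self, session_id):
        """Release the reservation when a session finishes."""
        inst, blocks = self.pinned.pop(session_id, (None, 0))
        if inst is not None:
            self.reserved[inst] -= blocks


router = StickyRouter(num_instances=8, protected_budget=280)
placement = {sid: router.route(sid, 93) for sid in range(21)}
print(router.reserved)
```

With 21 sessions of 93 blocks each, no instance holds more than three sessions (279 reserved blocks), which matches the budget estimate above.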
### Approach B: Two-Tier KV Cache (GPU + DRAM Offload)
**Idea**: When a multi-turn session's turn completes, offload its KV from GPU to DRAM. When the next turn arrives, reload from DRAM (faster than re-prefill). GPU cache is freed for cold starts.
```
Turn N completes:
  GPU KV (hot) ──offload──> DRAM KV pool (warm)
  GPU cache freed for cold-start requests
Turn N+1 arrives:
  DRAM KV pool ──reload──> GPU KV (hot)
  Skip re-prefill of the cached context; prefill only the new turn's tokens
Latency: DRAM reload ~1-10ms (PCIe/RDMA) vs re-prefill ~3-10s (compute)
```
**Implementation**: Use Mooncake's DRAM pool as a KV cache extension. Each instance runs with `kv_role=kv_both`. When the scheduler detects a turn completion for a multi-turn session, it triggers KV offload to DRAM. On next turn arrival, the scheduler triggers KV reload.
**Budget**: DRAM is much larger than GPU HBM. Each H20 has ~512GB system DRAM. 21 sessions × 93 blocks × 512 tokens × 48 layers × 2 (K+V) × 128 dim × 2 bytes ≈ 24.6 GB in DRAM, which easily fits.
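The budget estimate can be checked step by step. The sketch below uses only the factors from the estimate above, plus one labeled assumption: `ASSUMED_LINK_BYTES_PER_S` is a placeholder bandwidth, not a measurement. Note that the estimate treats 128 as the full K/V width per token per layer; for a model with several KV heads that factor would be heads × head dimension.

```python
# Factors from the budget estimate.
SESSIONS = 21
BLOCKS_PER_SESSION = 93
TOKENS_PER_BLOCK = 512
LAYERS = 48
KV_TENSORS = 2        # K and V
KV_DIM = 128          # K/V width per token per layer, as used in the estimate
BYTES_PER_VALUE = 2   # 2-byte values

# Assumption, not a measurement: effective host-to-GPU bandwidth in bytes/s.
ASSUMED_LINK_BYTES_PER_S = 25e9

bytes_per_token = LAYERS * KV_TENSORS * KV_DIM * BYTES_PER_VALUE
bytes_per_session = BLOCKS_PER_SESSION * TOKENS_PER_BLOCK * bytes_per_token
total_bytes = SESSIONS * bytes_per_session
reload_ms = bytes_per_session / ASSUMED_LINK_BYTES_PER_S * 1e3

print(f"per token:    {bytes_per_token / 1024:.0f} KiB")
print(f"per session:  {bytes_per_session / 1e9:.2f} GB")
print(f"all sessions: {total_bytes / 1e9:.1f} GB")
print(f"reload at assumed bandwidth: {reload_ms:.0f} ms")
```

With the placeholder bandwidth, a p50 session reloads in tens of milliseconds, which is worth checking against the 1-10 ms estimate before relying on it. Either way it stays well below a multi-second re-prefill.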
**Pros**: Decouples KV cache capacity from GPU HBM. DRAM reload is orders of magnitude faster than re-prefill (at least ~300x by the latency estimates above).
**Cons**: Requires Mooncake integration. Offload/reload adds latency (but much less than re-prefill). vLLM changes needed for proactive offload trigger.
**Experiment**: Hard to implement quickly in vLLM. Can simulate the benefit by comparing: (a) current APC with eviction vs (b) APC if multi-turn sessions always hit cache (simulated infinite cache for multi-turn only).
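The "infinite cache for multi-turn only" simulation can be sketched as follows. This is a minimal model under stated assumptions: the trace format (a list of `(is_multi_turn, block_ids)` pairs), the function name `apc`, and the prefix-contiguous hit rule are illustrative choices, and all blocks are assumed to hold the same number of tokens.

```python
from collections import OrderedDict


def apc(trace, capacity, protect_multi_turn=False):
    """Prefix-cache hit rate over a trace of (is_multi_turn, block_ids) requests."""
    lru = OrderedDict()   # evictable blocks, oldest first
    pinned = set()        # multi-turn blocks, never evicted (upper-bound model)
    hit = total = 0
    for is_mt, blocks in trace:
        total += len(blocks)
        # A prefix hit is the leading run of blocks that are already cached.
        for b in blocks:
            if b in pinned or b in lru:
                hit += 1
            else:
                break
        # Admit or refresh this request's blocks.
        for b in blocks:
            if protect_multi_turn and is_mt:
                lru.pop(b, None)
                pinned.add(b)            # does not count against capacity
            elif b in pinned:
                continue
            elif b in lru:
                lru.move_to_end(b)
            else:
                lru[b] = True
                if len(lru) > capacity:
                    lru.popitem(last=False)
    return hit / total


# Tiny synthetic trace: a multi-turn session interrupted by one cold start.
trace = [
    (True, ["m0", "m1"]),
    (False, ["c0", "c1", "c2", "c3"]),
    (True, ["m0", "m1", "m2"]),
]
print(apc(trace, capacity=4), apc(trace, capacity=4, protect_multi_turn=True))
```

Running both modes on the real trace would give case (a) and case (b) directly; the difference is the upper bound on what Approach B can recover.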
### Approach C: Prefill-Aware Eviction Policy
**Idea**: Replace LRU with a policy that considers session lifecycle. Blocks belonging to active multi-turn sessions are deprioritized for eviction, so they are evicted last.
```
Standard LRU:   evict the least recently accessed block
Session-aware:  evict the least recently accessed block THAT IS NOT part of an active session
Active session: a session with a turn completed in the last T seconds (or N requests)
```
**Implementation**: Modify vLLM's prefix cache eviction in `third_party/vllm/`. The eviction policy checks if a block's hash belongs to a known active session before evicting it.
**The problem**: vLLM's prefix cache uses block hashes, not session IDs. There's no direct mapping from block → session. We'd need to maintain a mapping at the scheduler level.
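A session-aware policy with a scheduler-level block → session mapping could look like the sketch below. It is a standalone toy, not a vLLM patch: `SessionAwareLRU` and its fields are hypothetical names, and the mapping is simply stored alongside each cached block hash.

```python
from collections import OrderedDict


class SessionAwareLRU:
    """LRU that skips blocks owned by active sessions when choosing a victim."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_hash -> session_id (or None)
        self.active = set()          # session ids currently considered active

    def touch(self, block_hash, session_id=None):
        """Insert or refresh a block. Returns True on a cache hit."""
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)
            return True
        if len(self.blocks) >= self.capacity:
            self._evict_one()
        self.blocks[block_hash] = session_id
        return False

    def _evict_one(self):
        # Oldest block that does not belong to an active session.
        victim = next(
            (h for h, sid in self.blocks.items() if sid not in self.active),
            None,
        )
        if victim is None:
            # Everything is protected: fall back to plain LRU.
            self.blocks.popitem(last=False)
        else:
            del self.blocks[victim]


cache = SessionAwareLRU(capacity=4)
cache.active.add("s1")
for h in ("a1", "a2"):
    cache.touch(h, session_id="s1")
for h in ("c1", "c2", "c3"):
    cache.touch(h)  # cold-start blocks, no session
print(list(cache.blocks))
```

The fallback to plain LRU matters: without it, a cache filled entirely with protected blocks could not admit anything new.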
**Alternative**: A simpler heuristic is to use **block access frequency** instead of pure recency. Blocks that belong to system prompts (accessed by many requests) and multi-turn sessions (accessed repeatedly) naturally accumulate higher frequency and survive eviction. This is what **LFU (Least Frequently Used)** does directly, and what **ARC (Adaptive Replacement Cache)** approximates by balancing recency against frequency.
**Pros**: Directly solves eviction at the cache layer. No routing changes needed.
**Cons**: Requires vLLM source modification. Cache policy changes are subtle and may have side effects.
**Experiment**: Simulate LFU vs LRU on the trace to estimate APC improvement before implementing.
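The LFU vs LRU comparison can be prototyped in a few lines before touching vLLM. This is a minimal sketch on a synthetic block trace: the function names are illustrative, and this LFU variant breaks ties by recency and forgets a block's count once it is evicted.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Number of hits under plain LRU."""
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[block] = True
    return hits


def lfu_hits(trace, capacity):
    """Number of hits under LFU, ties broken by least recent use."""
    freq, last_used, hits = {}, {}, 0
    for t, block in enumerate(trace):
        if block in freq:
            hits += 1
            freq[block] += 1
        else:
            if len(freq) >= capacity:
                victim = min(freq, key=lambda k: (freq[k], last_used[k]))
                del freq[victim]
                del last_used[victim]
            freq[block] = 1
        last_used[block] = t
    return hits


# A reused block "S" interleaved with one-off cold blocks.
trace = ["S", "S", "c1", "c2", "c3", "S", "c4", "c5", "c6", "S"]
print("LRU hits:", lru_hits(trace, 3), "LFU hits:", lfu_hits(trace, 3))
```

On this trace the reused block survives the cold scan under LFU but not under LRU. Whether the same holds on the production trace is exactly what the simulation should establish.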
## 3. Feasibility and Experiment Priority
| Approach | Implementation Effort | vLLM Changes | Expected APC Gain | Experiment |
|----------|----------------------|-------------|-------------------|------------|
| **A: Session-sticky** | Low (proxy only) | None | +3-5pp (multi-turn stays warm) | Run immediately |
| **B: DRAM offload** | High (Mooncake) | Medium | Up to +6.7pp (all multi-turn loss recovered) | Simulate first |
| **C: Eviction policy** | Medium (vLLM patch) | Yes | +5-10pp (both MT and cross-session) | Simulate LFU vs LRU first |
### Recommended experiment order:
1. **Simulate**: LRU vs LFU vs "infinite-for-MT" on the trace → quantify upper bound
2. **Approach A**: Session-sticky routing with KV-size-balanced placement → real benchmark
3. **Approach C**: If simulation shows LFU helps, patch vLLM eviction policy → real benchmark
4. **Approach B**: If DRAM offload shows large benefit in simulation, implement with Mooncake
## 4. Relationship to PD Separation
These approaches are **orthogonal to PD separation**. They address KV cache lifecycle, not P-D compute interference:
- **Approach A** works in combined mode (no PD-Sep needed)
- **Approach B** could complement PD-Sep (offload from D to DRAM between turns)
- **Approach C** works in any mode
The key insight: **for agentic workloads, KV cache management is a more impactful optimization axis than P-D compute separation.** The 10.1pp APC gap from eviction translates to ~3.2M extra tokens of re-prefill per 1000 requests — far more overhead than P-D interference.
## 5. Combined Architecture Vision
The endgame combines all insights:
```
┌──────────────────────────────────────────────┐
│               Global Scheduler               │
│  - Cache-aware + token-level LB              │
│  - Session-sticky for multi-turn             │
│  - KV-size-aware placement                   │
└──────────────┬───────────────────────────────┘
┌──────────────┴───────────────────────────────┐
│       8× PD-Combined Instances (TP=1)        │
│                                              │
│  Per-instance KV cache:                      │
│    [Session-protected zone] [LFU evictable]  │
│                                              │
│  DRAM KV pool (Mooncake):                    │
│    - Offloaded between-turn KV               │
│    - Shared prefix blocks (system prompt)    │
│    - Overflow buffer                         │
└──────────────────────────────────────────────┘
```
All 8 GPUs do both P and D. The scheduler, cache policy, and DRAM pool work together to maximize APC and minimize prefill work — which is the real bottleneck for agentic workloads.