KV cache lifecycle design + eviction loss analysis
Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between turns by cold-start prefills (66% of loss). Inter-turn gap is only 2 requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session across 14-21 concurrent sessions.

Three approaches designed:
A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
B. Two-tier KV cache: GPU + DRAM offload via Mooncake
C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)

Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds, then implement Approach A (lowest effort, immediate benchmark).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
163
analysis/kv_lifecycle_design.md
Normal file
@@ -0,0 +1,163 @@
# KV Cache Lifecycle Management for Agentic Workloads

**Date**: 2026-05-22
**Status**: Design, pending implementation + experiments
**Context**: PD separation's real issue is not P-D compute interference (cache-aware routing solves that), but KV cache eviction destroying multi-turn session state. This design addresses the root cause directly.

---

## 1. Problem: Multi-Turn KV Eviction

Our eviction analysis on 1000 sampled requests shows the following prefix-cache (APC) hit rates:

```
Infinite cache APC:  53.4%
LRU cache APC:       43.3%
Gap:                 10.1 pp (3.2M tokens lost)

Loss breakdown:
  Multi-turn (prior turn evicted):  66% of loss (6.7pp)
  Cross-session (shared prefix):    34% of loss (3.4pp)
```

The mechanism: a multi-turn session completes turn N, leaving ~93 blocks of KV at p50 (~47k tokens). Before turn N+1 arrives (gap of 2 requests at p50, 12 at p90), cold-start requests fill the LRU cache and evict turn N's KV. Turn N+1 then finds no cached prefix and must re-prefill the entire context.
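
The mechanism can be reproduced with a toy LRU replay. This is an illustrative sketch, not the analysis script: the 550-block capacity and 93-block session come from the numbers above, while the cold-start request size (`COLD_BLOCKS = 60`) and the function name are assumptions made for the example.

```python
from collections import OrderedDict

CAPACITY = 550        # blocks per instance (from the analysis)
SESSION_BLOCKS = 93   # p50 KV footprint of one multi-turn session
COLD_BLOCKS = 60      # hypothetical size of one cold-start request


def surviving_session_blocks(gap_requests: int) -> int:
    """Fill the cache, run turn N of one session, then replay `gap_requests`
    cold-start requests; return how many session blocks are still cached."""
    cache = OrderedDict()

    def touch(key):
        if key in cache:
            cache.move_to_end(key)
        else:
            cache[key] = True
            while len(cache) > CAPACITY:
                cache.popitem(last=False)  # evict the least recently used

    # Steady state: the cache is already full of older blocks.
    for b in range(CAPACITY):
        touch(("old", b))
    # Turn N of the session.
    for b in range(SESSION_BLOCKS):
        touch(("session", b))
    # Cold-start requests that arrive before turn N+1.
    for req in range(gap_requests):
        for b in range(COLD_BLOCKS):
            touch(("cold", req, b))
    return sum(1 for key in cache if key[0] == "session")


for gap in (2, 8, 12):
    print(gap, surviving_session_blocks(gap))
```

In this toy setup a single session survives a 2-request gap intact, loses part of its KV at 8 requests, and is fully evicted at 12. On the real trace, other concurrent sessions compete for the same 550 blocks, so eviction sets in earlier.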

Key workload parameters:

| Metric | Value |
|--------|-------|
| Multi-turn sessions | 9% of sessions, but 66% of eviction loss |
| Inter-turn gap | p50=2 req, p90=12 req (very short) |
| KV to protect per session | p50=93 blocks (47k tokens) |
| Concurrent sessions needing protection | p50=14, max=21 |
| Total protection budget needed | p50=3515 blocks (6.4x single-instance capacity) |
| Per-instance capacity | 550 blocks |

The challenge: protecting all concurrent multi-turn sessions' KV requires 3515 blocks at p50, but each instance has only 550. Even spread across 8 instances (4400 blocks total), the p50 demand consumes about 80% of the fleet, and the peak demand of 4927 blocks exceeds it.
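
The budget arithmetic can be checked directly. A minimal sketch using only the figures quoted in the table and paragraph above; the variable and function names are made up for the example.

```python
BLOCKS_PER_INSTANCE = 550
N_INSTANCES = 8
DEMAND_P50 = 3515    # blocks needed to protect all concurrent sessions (p50)
DEMAND_PEAK = 4927   # blocks needed at peak

fleet_capacity = BLOCKS_PER_INSTANCE * N_INSTANCES


def report(label, demand):
    """Compare a protection demand against one instance and the whole fleet."""
    ratio_single = demand / BLOCKS_PER_INSTANCE
    utilization = demand / fleet_capacity
    shortfall = max(0, demand - fleet_capacity)
    print(f"{label}: {ratio_single:.1f}x one instance, "
          f"{utilization:.0%} of the fleet, shortfall {shortfall} blocks")
    return ratio_single, utilization, shortfall


p50 = report("p50", DEMAND_P50)
peak = report("peak", DEMAND_PEAK)
```

At p50 the demand fits in the fleet with no room to spare for cold starts; at peak it is short by several hundred blocks, which is why protection alone cannot close the gap.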

## 2. Three Design Approaches

### Approach A: Session-Sticky Routing with KV Reservation

**Idea**: Route all turns of a multi-turn session to the same instance. Reserve a fraction of each instance's KV cache for "protected" multi-turn sessions.

```
Instance KV layout (550 blocks):
┌──────────────────────────────────────────┐
│ Protected zone (200 blocks)              │ ← Multi-turn session KV
│ LRU eviction disabled here               │ ← Pinned by session affinity
├──────────────────────────────────────────┤
│ Evictable zone (350 blocks)              │ ← Cold-start + overflow
│ Normal LRU eviction                      │
└──────────────────────────────────────────┘
```

**Routing**: Cache-aware + session-sticky. Turn 2 and later of a multi-turn session go to the instance that served turn 1. New sessions are load-balanced across instances.

**KV protection**: Not a vLLM change; it is implemented at the routing level. The protected zone is therefore a budget the router maintains, not a region that vLLM pins. By concentrating a session's turns on one instance and ensuring that instance has enough cache headroom, the session's KV stays warm naturally (the inter-turn gap is only 2 requests at p50).
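
A minimal sketch of the routing side, assuming a hypothetical proxy-level `StickyRouter` class (not part of vLLM or the existing proxy). It covers only the session-sticky and KV-size-balanced placement; the cache-aware scoring for new sessions is left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class StickyRouter:
    """Hypothetical proxy-side router: sticky for known sessions,
    least-reserved-KV placement for new ones."""
    n_instances: int
    reserved: list = field(default_factory=list)    # reserved blocks per instance
    affinity: dict = field(default_factory=dict)    # session id -> instance
    footprint: dict = field(default_factory=dict)   # session id -> KV blocks

    def __post_init__(self):
        self.reserved = [0] * self.n_instances

    def route(self, session_id: str, kv_blocks: int) -> int:
        inst = self.affinity.get(session_id)
        if inst is None:
            # New session: place it where the least KV is already reserved.
            inst = min(range(self.n_instances), key=lambda i: self.reserved[i])
            self.affinity[session_id] = inst
        # Track the session's latest KV footprint against the instance budget.
        self.reserved[inst] += kv_blocks - self.footprint.get(session_id, 0)
        self.footprint[session_id] = kv_blocks
        return inst

    def end_session(self, session_id: str) -> None:
        """Release the reservation when a session is known to be finished."""
        inst = self.affinity.pop(session_id, None)
        if inst is not None:
            self.reserved[inst] -= self.footprint.pop(session_id, 0)


router = StickyRouter(n_instances=8)
first = router.route("s1", 93)
assert router.route("s1", 110) == first  # later turns stick to the same instance
```

Tracking `reserved` per instance is what lets the router refuse to over-commit an instance's protected budget. How a session's end is detected (timeout, explicit signal) is left open here.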

**Budget**: 21 concurrent multi-turn sessions / 8 instances ≈ 3 sessions per instance (2.6, rounded up). At 93 blocks/session, that's ~280 blocks protected, leaving ~270 blocks for cold starts. At peak, the 200/350 split in the layout above would therefore need to shift to roughly 280/270.

**Pros**: No vLLM modification. Pure routing optimization.
**Cons**: Instance load imbalance if multi-turn sessions cluster. Protected blocks may waste cache if a session ends unexpectedly.

**Experiment**: Compare the current combined-mode cache-aware routing against combined mode with aggressive session-sticky routing, in which multi-turn sessions are balanced across instances by their KV size.

### Approach B: Two-Tier KV Cache (GPU + DRAM Offload)

**Idea**: When a multi-turn session's turn completes, offload its KV from GPU to DRAM. When the next turn arrives, reload from DRAM (faster than re-prefill). GPU cache is freed for cold starts.

```
Turn N completes:
  GPU KV (hot) ──offload──> DRAM KV pool (warm)
  GPU cache freed for cold-start requests

Turn N+1 arrives:
  DRAM KV pool ──reload──> GPU KV (hot)
  Skip re-prefill of the prior context; prefill only the new tokens, then decode

Latency: DRAM reload ~1-10ms (PCIe/RDMA) vs re-prefill ~3-10s (compute)
```

**Implementation**: Use Mooncake's DRAM pool as a KV cache extension. Each instance runs with `kv_role=kv_both`. When the scheduler detects a turn completion for a multi-turn session, it triggers KV offload to DRAM. When the next turn arrives, the scheduler triggers KV reload.

**Budget**: DRAM is much larger than GPU HBM. Each H20 has ~512GB system DRAM. 21 sessions × 93 blocks × 512 tokens × 48 layers × 2 (K+V) × 128 dim × 2 bytes ≈ 24.6 GB in DRAM (about 1.2 GB per session), which fits easily.
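
The budget formula can be evaluated step by step. A sketch under stated assumptions: the model dimensions are taken verbatim from the formula above (treating 128 as the full per-token KV width per layer), and the 25 GB/s host-to-GPU bandwidth is a hypothetical figure chosen for illustration, not a measurement.

```python
SESSIONS = 21
BLOCKS_PER_SESSION = 93
TOKENS_PER_BLOCK = 512
LAYERS = 48
KV_TENSORS = 2          # K and V
KV_DIM = 128            # per-token KV width per layer, as in the formula above
BYTES_PER_VALUE = 2     # 16-bit values

ASSUMED_LINK_GB_PER_S = 25.0   # hypothetical host-to-GPU bandwidth

bytes_per_token = LAYERS * KV_TENSORS * KV_DIM * BYTES_PER_VALUE
bytes_per_session = BLOCKS_PER_SESSION * TOKENS_PER_BLOCK * bytes_per_token
total_bytes = SESSIONS * bytes_per_session
reload_ms = bytes_per_session / (ASSUMED_LINK_GB_PER_S * 1e9) * 1e3

print(f"per token:   {bytes_per_token} bytes")
print(f"per session: {bytes_per_session / 1e9:.2f} GB")
print(f"total:       {total_bytes / 1e9:.1f} GB")
print(f"reload:      {reload_ms:.0f} ms at {ASSUMED_LINK_GB_PER_S} GB/s")
```

Under this bandwidth assumption, reloading a full 93-block session takes tens of milliseconds, somewhat above the ~1-10ms quoted in the diagram but still far below a multi-second re-prefill. If the model has several KV heads per layer, `KV_DIM` and every figure derived from it scale accordingly.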

**Pros**: Decouples KV cache capacity from GPU HBM. DRAM reload is roughly 100-1000x faster than re-prefill.
**Cons**: Requires Mooncake integration. Offload/reload adds latency (but much less than re-prefill). vLLM changes are needed for the proactive offload trigger.

**Experiment**: Hard to implement quickly in vLLM. The benefit can be simulated by comparing (a) current APC with eviction against (b) APC if multi-turn sessions always hit cache (a simulated infinite cache for multi-turn sessions only).

### Approach C: Prefill-Aware Eviction Policy

**Idea**: Replace LRU with a policy that considers session lifecycle. Blocks belonging to active multi-turn sessions are protected and evicted last.

```
Standard LRU:   evict least recently accessed block
Session-aware:  evict least recently accessed block THAT IS NOT part of an active session

Active session: session with turn completed in last T seconds (or N requests)
```

**Implementation**: Modify vLLM's prefix cache eviction in `third_party/vllm/`. The eviction policy checks whether a block's hash belongs to a known active session before evicting it.

**The problem**: vLLM's prefix cache uses block hashes, not session IDs. There's no direct mapping from block → session. We'd need to maintain that mapping at the scheduler level.
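
A toy sketch of what such a scheduler-level mapping and the session-aware policy could look like. This is not vLLM code: the `SessionAwareLRU` class, its parameters, and the `multi_turn` flag are all hypothetical, and "active" is defined by the "N requests" variant of the rule above.

```python
from collections import OrderedDict


class SessionAwareLRU:
    """Toy LRU over block hashes that skips blocks owned by active
    multi-turn sessions when choosing an eviction victim."""

    def __init__(self, capacity: int, active_window: int):
        self.capacity = capacity
        self.active_window = active_window   # "N requests" in the definition
        self.blocks = OrderedDict()          # block hash -> owning session id
        self.last_turn = {}                  # multi-turn session id -> request index
        self.clock = 0                       # counts requests seen so far

    def _is_active(self, session_id) -> bool:
        last = self.last_turn.get(session_id)
        return last is not None and self.clock - last <= self.active_window

    def on_request(self, session_id, block_hashes, multi_turn: bool) -> None:
        self.clock += 1
        if multi_turn:
            self.last_turn[session_id] = self.clock
        for h in block_hashes:
            if h in self.blocks:
                self.blocks.move_to_end(h)
            self.blocks[h] = session_id      # last toucher owns a shared block
        while len(self.blocks) > self.capacity:
            self._evict_one()

    def _evict_one(self) -> None:
        # Prefer the least recently used block of an inactive session.
        victim = next(
            (h for h, sid in self.blocks.items() if not self._is_active(sid)),
            None,
        )
        if victim is not None:
            del self.blocks[victim]
        else:
            # Every block is protected: fall back to plain LRU.
            self.blocks.popitem(last=False)


cache = SessionAwareLRU(capacity=4, active_window=5)
cache.on_request("mt", ["a", "b"], multi_turn=True)
cache.on_request("c1", ["x", "y"], multi_turn=False)
cache.on_request("c2", ["p", "q"], multi_turn=False)
print(sorted(cache.blocks))
```

Plain LRU would have evicted `a` and `b` here. The sketch evicts the cold-start blocks `x` and `y` instead, at the cost of an O(n) scan per eviction and a single-owner simplification for shared prefix blocks.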

**Alternative**: A simpler proxy is to use **block access frequency** instead of pure recency. Blocks that belong to system prompts (accessed by many requests) and to multi-turn sessions (accessed repeatedly) naturally have higher frequency and survive eviction. This is the idea behind **LFU (Least Frequently Used)**; **ARC (Adaptive Replacement Cache)** goes further and adapts between recency and frequency.
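
To illustrate why frequency helps, here is a toy LFU cache run against a scan of cold blocks. It is a sketch for intuition only: the `LFUCache` class and the block names are invented for the example, eviction is O(n), and there is no aging of stale counts, which a real policy would need.

```python
from collections import OrderedDict


class LFUCache:
    """Toy LFU over block hashes; ties are broken by least recent use.
    Assumes capacity >= 1."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.freq = OrderedDict()   # block hash -> access count, in recency order

    def access(self, key) -> bool:
        hit = key in self.freq
        if hit:
            self.freq[key] += 1
            self.freq.move_to_end(key)
        else:
            if len(self.freq) >= self.capacity:
                # min() returns the first minimal key in iteration order, i.e.
                # the least recently used among the least frequently used.
                victim = min(self.freq, key=self.freq.__getitem__)
                del self.freq[victim]
            self.freq[key] = 1
        return hit


cache = LFUCache(capacity=4)
hot = ["sys0", "sys1"]               # e.g. shared system-prompt blocks
for _ in range(3):
    for h in hot:
        cache.access(h)
for i in range(10):                  # cold-start scan larger than the cache
    cache.access(f"cold{i}")
hot_survived = all(cache.access(h) for h in hot)
print(hot_survived)
```

With a 4-block LRU the same scan would have pushed out both hot blocks. Here the cold blocks only displace each other because their access count never exceeds 1.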

**Pros**: Directly solves eviction at the cache layer. No routing changes needed.
**Cons**: Requires vLLM source modification. Cache policy changes are subtle and may have side effects.

**Experiment**: Simulate LFU vs LRU on the trace to estimate APC improvement before implementing.

## 3. Feasibility and Experiment Priority

| Approach | Implementation Effort | vLLM Changes | Expected APC Gain | Experiment |
|----------|----------------------|-------------|-------------------|------------|
| **A: Session-sticky** | Low (proxy only) | None | +3-5pp (multi-turn stays warm) | Run immediately |
| **B: DRAM offload** | High (Mooncake) | Medium | +6-7pp (all multi-turn recovered) | Simulate first |
| **C: Eviction policy** | Medium (vLLM patch) | Yes | +5-10pp (both MT and cross-session) | Simulate LFU vs LRU first |

### Recommended experiment order:

1. **Simulate**: LRU vs LFU vs "infinite-for-MT" on the trace → quantify upper bound
2. **Approach A**: Session-sticky routing with KV-size-balanced placement → real benchmark
3. **Approach C**: If simulation shows LFU helps, patch vLLM eviction policy → real benchmark
4. **Approach B**: If DRAM offload shows large benefit in simulation, implement with Mooncake

## 4. Relationship to PD Separation

These approaches are **orthogonal to PD separation**. They address KV cache lifecycle, not P-D compute interference:

- **Approach A** works in combined mode (no PD-Sep needed)
- **Approach B** could complement PD-Sep (offload from D to DRAM between turns)
- **Approach C** works in any mode

The key insight: **for agentic workloads, KV cache management is a more impactful optimization axis than P-D compute separation.** The 10.1pp APC gap from eviction translates to ~3.2M extra tokens of re-prefill per 1000 requests, far more overhead than P-D interference.

## 5. Combined Architecture Vision

The endgame combines all insights:

```
┌──────────────────────────────────────────────┐
│ Global Scheduler                             │
│ - Cache-aware + token-level LB               │
│ - Session-sticky for multi-turn              │
│ - KV-size-aware placement                    │
└──────────────┬───────────────────────────────┘
               │
┌──────────────┴───────────────────────────────┐
│ 8× PD-Combined Instances (TP=1)              │
│                                              │
│ Per-instance KV cache:                       │
│   [Session-protected zone] [LFU evictable]   │
│                                              │
│ DRAM KV pool (Mooncake):                     │
│ - Offloaded between-turn KV                  │
│ - Shared prefix blocks (system prompt)       │
│ - Overflow buffer                            │
└──────────────────────────────────────────────┘
```

All 8 GPUs do both P and D. The scheduler, cache policy, and DRAM pool work together to maximize APC and minimize prefill work, which is the real bottleneck for agentic workloads.
184
scripts/analyze_eviction.py
Normal file
@@ -0,0 +1,184 @@
"""Analyze the 10pp APC gap: what gets evicted and why."""
import json
from collections import OrderedDict

# Load the sampled trace and replay it in arrival order.
with open("traces/sampled_1000req_seed42.jsonl") as f:
    rows = [json.loads(line) for line in f]
rows.sort(key=lambda r: float(r["timestamp"]))

BLOCK_SIZE = 512           # tokens per KV block
KV_CAPACITY_BLOCKS = 550   # KV blocks per instance
N_INSTANCES = 8

class LRUCache:
    """Fixed-capacity LRU set of block hashes with an eviction counter."""

    def __init__(self, cap):
        self.cap = cap
        self.cache = OrderedDict()
        self.evictions = 0

    def peek(self, k):
        """Return True if k is cached, without updating recency."""
        return k in self.cache

    def access(self, k):
        """Touch k; insert it on a miss and evict LRU entries. True on hit."""
        if k in self.cache:
            self.cache.move_to_end(k)
            return True
        self.cache[k] = True
        while len(self.cache) > self.cap:
            self.cache.popitem(last=False)
            self.evictions += 1
        return False

inf_seen = [set() for _ in range(N_INSTANCES)]  # infinite-cache baseline
lru_caches = [LRUCache(KV_CAPACITY_BLOCKS) for _ in range(N_INSTANCES)]
session_aff = {}      # session id -> instance index
chat_to_session = {}  # chat id -> session id

loss_intra = 0  # multi-turn: prior turn evicted
loss_cross = 0  # single-turn: shared prefix evicted
total_loss = 0
total_inf_hits = 0
total_lru_hits = 0
total_tokens = 0
per_req = []

for idx, r in enumerate(rows):
    il = r["input_length"]
    hids = r.get("hash_ids", [])
    cid = r["chat_id"]
    pid = r["parent_chat_id"]
    # Root chats (pid < 0) start a session; children inherit the parent's.
    # Normalize to str so that sid[:8] below works for numeric session ids.
    sid = str(r.get("session_id", cid if pid < 0 else chat_to_session.get(pid, pid)))
    chat_to_session[cid] = sid
    is_mt = pid >= 0

    # Sticky routing: reuse the session's instance; otherwise pick the instance
    # with the most hits among the first 10 blocks. Ties and zero-hit requests
    # fall back to the lowest instance index.
    if sid in session_aff:
        inst = session_aff[sid]
    else:
        best_inst, best_h = 0, 0
        for j in range(N_INSTANCES):
            h = sum(1 for hid in hids[:10] if hid in lru_caches[j].cache)
            if h > best_h:
                best_h = h
                best_inst = j
        inst = best_inst
        session_aff[sid] = inst

    # Infinite cache: length of the leading run of blocks ever seen here.
    inf_h = 0
    for hid in hids:
        if hid in inf_seen[inst]:
            inf_h += 1
        else:
            break
    for hid in hids:
        inf_seen[inst].add(hid)

    # LRU cache: same prefix match, but only still-resident blocks count.
    lru_h = 0
    for hid in hids:
        if lru_caches[inst].peek(hid):
            lru_caches[inst].access(hid)
            lru_h += 1
        else:
            break
    for hid in hids:
        lru_caches[inst].access(hid)

    inf_tok = inf_h * BLOCK_SIZE
    lru_tok = lru_h * BLOCK_SIZE
    loss = inf_tok - lru_tok

    total_inf_hits += inf_tok
    total_lru_hits += lru_tok
    total_tokens += il

    if loss > 0:
        total_loss += loss
        if is_mt:
            loss_intra += loss
        else:
            loss_cross += loss
        per_req.append({
            "idx": idx, "input": il, "inf_hit": inf_h, "lru_hit": lru_h,
            "loss_blocks": inf_h - lru_h, "loss_tok": loss,
            "mt": is_mt, "sid": sid, "turn": r.get("turn", 1),
            "n_blocks": len(hids),
        })

sep = "=" * 70
print(sep)
print(" EVICTION LOSS ANALYSIS")
print(sep)
print()
print(" Infinite APC: %.1f%%" % (total_inf_hits / total_tokens * 100))
print(" LRU APC: %.1f%%" % (total_lru_hits / total_tokens * 100))
print(" Gap: %.1f pp (%s tokens lost)" % (
    (total_inf_hits - total_lru_hits) / total_tokens * 100,
    "{:,}".format(total_loss)))
print()
print(" Loss by type:")
print(" Multi-turn (prior turn KV evicted): %s tok (%.0f%%)" % (
    "{:,}".format(loss_intra), loss_intra * 100 / max(total_loss, 1)))
print(" Single-turn (shared prefix evicted): %s tok (%.0f%%)" % (
    "{:,}".format(loss_cross), loss_cross * 100 / max(total_loss, 1)))
print()
print(" Requests with loss: %d / %d" % (len(per_req), len(rows)))

print()
print(" Top-15 by loss:")
print(" %4s %7s %5s %5s %5s %7s %3s %8s %4s" % (
    "#", "input", "inf_h", "lru_h", "loss", "tok", "mt", "session", "turn"))
for r in sorted(per_req, key=lambda x: -x["loss_tok"])[:15]:
    print(" %4d %7d %5d %5d %5d %7d %3s %8s %4d" % (
        r["idx"], r["input"], r["inf_hit"], r["lru_hit"],
        r["loss_blocks"], r["loss_tok"],
        "Y" if r["mt"] else "N", r["sid"][:8], r["turn"]))

# Instance-level analysis
print()
print(" Per-instance:")
for i in range(N_INSTANCES):
    n = len(inf_seen[i])
    e = lru_caches[i].evictions
    overflow = n / KV_CAPACITY_BLOCKS
    print(" inst_%d: %5d unique blocks, overflow=%.1fx, evictions=%d" % (
        i, n, overflow, e))

# Reuse-distance analysis: for every block that is accessed again, how many
# requests elapsed since its previous access?
print()
print(" Temporal analysis of evicted blocks:")
# Track when each block was last accessed, per instance.
block_deposit_time = [{} for _ in range(N_INSTANCES)]
gaps = []

# Re-scan the trace.
session_aff2 = {}
chat_to_session2 = {}
for idx, r in enumerate(rows):
    hids = r.get("hash_ids", [])
    cid = r["chat_id"]
    pid = r["parent_chat_id"]
    sid = str(r.get("session_id", cid if pid < 0 else chat_to_session2.get(pid, pid)))
    chat_to_session2[cid] = sid
    if sid in session_aff2:
        inst = session_aff2[sid]
    else:
        # Simplified: every session maps to instance 0, so distances are
        # measured in global request order rather than per routed instance.
        inst = 0
        session_aff2[sid] = inst

    for hid in hids:
        if hid in block_deposit_time[inst]:
            gap = idx - block_deposit_time[inst][hid]
            gaps.append(gap)
        block_deposit_time[inst][hid] = idx

if gaps:
    gaps.sort()

    def p(q):
        """Return the q-quantile of the sorted gaps (index-based, no interpolation)."""
        return gaps[min(int(q * len(gaps)), len(gaps) - 1)]

    print(" Block reuse distance (requests between deposit and reaccess):")
    print(" p10=%d p50=%d p90=%d max=%d" % (p(.1), p(.5), p(.9), max(gaps)))
    short = sum(1 for g in gaps if g <= 10)
    medium = sum(1 for g in gaps if 10 < g <= 100)
    long_ = sum(1 for g in gaps if g > 100)
    print(" <=10 req: %d (%.0f%%) 10-100: %d (%.0f%%) >100: %d (%.0f%%)" % (
        short, short * 100 / len(gaps),
        medium, medium * 100 / len(gaps),
        long_, long_ * 100 / len(gaps)))