From 10636b1ab1616075ee8bf3387c58c50019af467a Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Fri, 22 May 2026 01:27:22 +0800
Subject: [PATCH] KV cache lifecycle design + eviction loss analysis

Root cause of the 10.1pp APC gap: multi-turn sessions' KV is evicted
between turns by cold-start prefills (66% of loss). The inter-turn gap
is only 2 requests at p50, but the LRU cache (550 blocks) can't protect
93 blocks/session across 14-21 concurrent sessions.

Three approaches designed:
A. Session-sticky routing with KV reservation (proxy-only, no vLLM change)
B. Two-tier KV cache: GPU + DRAM offload via Mooncake
C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch)

Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds,
then implement Approach A (lowest effort, immediate benchmark).

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 analysis/kv_lifecycle_design.md | 163 ++++++++++++++++++++++++++++
 scripts/analyze_eviction.py     | 184 ++++++++++++++++++++++++++++++++
 2 files changed, 347 insertions(+)
 create mode 100644 analysis/kv_lifecycle_design.md
 create mode 100644 scripts/analyze_eviction.py

diff --git a/analysis/kv_lifecycle_design.md b/analysis/kv_lifecycle_design.md
new file mode 100644
index 0000000..6099473
--- /dev/null
+++ b/analysis/kv_lifecycle_design.md
@@ -0,0 +1,163 @@
+# KV Cache Lifecycle Management for Agentic Workloads
+
+**Date**: 2026-05-22
+**Status**: Design, pending implementation + experiments
+**Context**: The real problem behind PD separation is not P-D compute interference (cache-aware routing solves that), but KV cache eviction destroying multi-turn session state. This design addresses that root cause directly.
+
+---
+
+## 1. Problem: Multi-Turn KV Eviction
+
+Our eviction analysis on 1000 sampled requests shows:
+
+```
+  Infinite cache APC:  53.4%
+  LRU cache APC:       43.3%
+  Gap:                 10.1 pp (3.2M tokens lost)
+
+  Loss breakdown:
+    Multi-turn (prior turn evicted):  66% of loss (6.7pp)
+    Cross-session (shared prefix):    34% of loss (3.4pp)
+```
+
+The mechanism: a multi-turn session completes turn N (KV = ~93 blocks p50, ~47k tokens). Before turn N+1 arrives (gap = 2 requests p50, 12 requests p90), cold-start requests fill the LRU cache and evict turn N's KV. Turn N+1 then finds no cached prefix and must re-prefill the entire context.
+
+Key workload parameters:
+
+| Metric | Value |
+|--------|-------|
+| Multi-turn sessions | 9% of sessions, but 66% of eviction loss |
+| Inter-turn gap | p50=2 req, p90=12 req (very short) |
+| KV to protect per session | p50=93 blocks (47k tokens) |
+| Concurrent sessions needing protection | p50=14, max=21 |
+| Total protection budget needed | p50=3515 blocks (6.4x single-instance capacity) |
+| Per-instance capacity | 550 blocks |
+
+The challenge: protecting all concurrent multi-turn sessions' KV requires 3515 blocks at p50, but each instance has only 550. Even pooled across 8 instances (4400 blocks total), the p50 demand would take ~80% of all cache, and the peak demand (4927 blocks) exceeds it.
+
+## 2. Three Design Approaches
+
+### Approach A: Session-Sticky Routing with KV Reservation
+
+**Idea**: Route all turns of a multi-turn session to the same instance. Reserve a fraction of each instance's KV cache for "protected" multi-turn sessions.
+
+```
+Instance KV layout (550 blocks):
+  ┌──────────────────────────────────────────┐
+  │  Protected zone (200 blocks)             │ ← Multi-turn session KV
+  │  LRU eviction disabled here              │ ← Pinned by session affinity
+  ├──────────────────────────────────────────┤
+  │  Evictable zone (350 blocks)             │ ← Cold-start + overflow
+  │  Normal LRU eviction                     │
+  └──────────────────────────────────────────┘
+```
+
+**Routing**: Cache-aware + session-sticky.
Turns 2+ of a multi-turn session go to the instance that served turn 1. New sessions are load-balanced across instances.
+
+**KV protection**: Not a vLLM change; it is implemented at the routing level. By concentrating a session's turns on one instance and ensuring the instance has enough cache headroom, the session's KV stays warm naturally (inter-turn gap is only 2 requests p50).
+
+**Budget**: 21 concurrent multi-turn sessions / 8 instances ≈ 3 sessions per instance. At 93 blocks/session, that's ~280 blocks protected (above the 200-block zone in the layout sketch), leaving ~270 blocks for cold starts.
+
+**Pros**: No vLLM modification. Pure routing optimization.
+**Cons**: Instance load imbalance if multi-turn sessions cluster. Protected blocks may waste cache if a session ends unexpectedly.
+
+**Experiment**: Compare the current combined-mode cache-aware routing against combined mode with aggressive session-sticky routing, where multi-turn sessions are balanced across instances by their KV size.
+
+### Approach B: Two-Tier KV Cache (GPU + DRAM Offload)
+
+**Idea**: When a multi-turn session's turn completes, offload its KV from GPU to DRAM. When the next turn arrives, reload from DRAM (faster than re-prefill). GPU cache is freed for cold starts.
+
+```
+  Turn N completes:
+    GPU KV (hot) ──offload──> DRAM KV pool (warm)
+    GPU cache freed for cold-start requests
+
+  Turn N+1 arrives:
+    DRAM KV pool ──reload──> GPU KV (hot)
+    Skip prefill, go directly to decode
+
+  Latency: DRAM reload ~1-10ms (PCIe/RDMA) vs re-prefill ~3-10s (compute)
+```
+
+**Implementation**: Use Mooncake's DRAM pool as a KV cache extension. Each instance runs with `kv_role=kv_both`. When the scheduler detects a turn completion for a multi-turn session, it triggers KV offload to DRAM. On next turn arrival, the scheduler triggers KV reload.
+
+**Budget**: DRAM is much larger than GPU HBM. Each H20 has ~512GB system DRAM. 21 sessions × 93 blocks × 512 tokens × 48 layers × 2 (K+V) × 128 dim × 2 bytes ≈ 24GB in DRAM, which fits easily.
+
+**Pros**: Decouples KV cache capacity from GPU HBM.
DRAM reload is roughly 100-1000x faster than re-prefill.
+**Cons**: Requires Mooncake integration. Offload/reload adds latency (but much less than re-prefill). vLLM changes are needed for a proactive offload trigger.
+
+**Experiment**: Hard to implement quickly in vLLM. The benefit can be simulated by comparing: (a) current APC with eviction vs (b) APC if multi-turn sessions always hit cache (simulated infinite cache for multi-turn only).
+
+### Approach C: Prefill-Aware Eviction Policy
+
+**Idea**: Replace LRU with a policy that considers session lifecycle. Blocks belonging to active multi-turn sessions are evicted last.
+
+```
+  Standard LRU:   evict least recently accessed block
+  Session-aware:  evict least recently accessed block THAT IS NOT part of an active session
+
+  Active session: session with turn completed in last T seconds (or N requests)
+```
+
+**Implementation**: Modify vLLM's prefix cache eviction in `third_party/vllm/`. The eviction policy checks whether a block's hash belongs to a known active session before evicting it.
+
+**The problem**: vLLM's prefix cache uses block hashes, not session IDs. There's no direct mapping from block → session. We'd need to maintain a mapping at the scheduler level.
+
+**Alternative**: A simpler proxy is to use **block access frequency** instead of pure LRU. Blocks that belong to system prompts (accessed by many requests) and multi-turn sessions (accessed repeatedly) naturally have higher frequency and survive eviction. This is the idea behind **LFU (Least Frequently Used)** and, combined with recency, **ARC (Adaptive Replacement Cache)**.
+
+**Pros**: Directly solves eviction at the cache layer. No routing changes needed.
+**Cons**: Requires vLLM source modification. Cache policy changes are subtle and may have side effects.
+
+**Experiment**: Simulate LFU vs LRU on the trace to estimate APC improvement before implementing.
+
+## 3. Feasibility and Experiment Priority
+
+| Approach | Implementation Effort | vLLM Changes | Expected APC Gain | Experiment |
+|----------|----------------------|-------------|-------------------|------------|
+| **A: Session-sticky** | Low (proxy only) | None | +3-5pp (multi-turn stays warm) | Run immediately |
+| **B: DRAM offload** | High (Mooncake) | Medium | +6-7pp (all multi-turn recovered) | Simulate first |
+| **C: Eviction policy** | Medium (vLLM patch) | Yes | +5-10pp (both MT and cross-session) | Simulate LFU vs LRU first |
+
+### Recommended experiment order:
+
+1. **Simulate**: LRU vs LFU vs "infinite-for-MT" on the trace → quantify upper bound
+2. **Approach A**: Session-sticky routing with KV-size-balanced placement → real benchmark
+3. **Approach C**: If simulation shows LFU helps, patch vLLM eviction policy → real benchmark
+4. **Approach B**: If the "infinite-for-MT" simulation shows a large benefit, implement DRAM offload with Mooncake
+
+## 4. Relationship to PD Separation
+
+These approaches are **orthogonal to PD separation**. They address KV cache lifecycle, not P-D compute interference:
+
+- **Approach A** works in combined mode (no PD-Sep needed)
+- **Approach B** could complement PD-Sep (offload from D to DRAM between turns)
+- **Approach C** works in any mode
+
+The key insight: **for agentic workloads, KV cache management is a more impactful optimization axis than P-D compute separation.** The 10.1pp APC gap from eviction translates to ~3.2M extra tokens of re-prefill per 1000 requests, far more overhead than P-D interference.
+
+## 5. Combined Architecture Vision
+
+The endgame combines all insights:
+
+```
+  ┌──────────────────────────────────────────────┐
+  │  Global Scheduler                            │
+  │  - Cache-aware + token-level LB              │
+  │  - Session-sticky for multi-turn             │
+  │  - KV-size-aware placement                   │
+  └──────────────┬───────────────────────────────┘
+                 │
+  ┌──────────────┴───────────────────────────────┐
+  │  8× PD-Combined Instances (TP=1)             │
+  │                                              │
+  │  Per-instance KV cache:                      │
+  │  [Session-protected zone] [LFU evictable]    │
+  │                                              │
+  │  DRAM KV pool (Mooncake):                    │
+  │  - Offloaded between-turn KV                 │
+  │  - Shared prefix blocks (system prompt)      │
+  │  - Overflow buffer                           │
+  └──────────────────────────────────────────────┘
+```
+
+All 8 GPUs do both P and D. The scheduler, cache policy, and DRAM pool work together to maximize APC and minimize prefill work, which is the real bottleneck for agentic workloads.
diff --git a/scripts/analyze_eviction.py b/scripts/analyze_eviction.py
new file mode 100644
index 0000000..f2d6bf0
--- /dev/null
+++ b/scripts/analyze_eviction.py
@@ -0,0 +1,184 @@
+"""Analyze the 10pp APC gap: what gets evicted and why."""
+import json
+from collections import OrderedDict
+
+rows = [json.loads(line) for line in open("traces/sampled_1000req_seed42.jsonl")]
+rows.sort(key=lambda r: float(r["timestamp"]))
+
+BLOCK_SIZE = 512
+KV_CAPACITY_BLOCKS = 550
+N_INSTANCES = 8
+
+class LRUCache:
+    def __init__(self, cap):
+        self.cap = cap
+        self.cache = OrderedDict()
+        self.evictions = 0
+    def peek(self, k):
+        return k in self.cache
+    def access(self, k):
+        if k in self.cache:
+            self.cache.move_to_end(k)
+            return True
+        self.cache[k] = True
+        while len(self.cache) > self.cap:
+            self.cache.popitem(last=False)
+            self.evictions += 1
+        return False
+
+inf_seen = [set() for _ in range(N_INSTANCES)]
+lru_caches = [LRUCache(KV_CAPACITY_BLOCKS) for _ in range(N_INSTANCES)]
+session_aff = {}
+chat_to_session = {}
+
+loss_intra = 0  # multi-turn: prior turn evicted
+loss_cross = 0  # single-turn: shared prefix evicted
+total_loss = 0
+total_inf_hits = 0
+total_lru_hits = 0
+total_tokens = 0
+per_req = []
+
+for idx, r in enumerate(rows):
+    il = r["input_length"]
+    hids = r.get("hash_ids", [])
+    cid = r["chat_id"]
+    pid = r["parent_chat_id"]
+    sid = str(r.get("session_id", str(cid) if pid < 0 else chat_to_session.get(pid, str(pid))))  # always a str, so sid[:8] below is safe
+    chat_to_session[cid] = sid
+    is_mt = pid >= 0
+
+    if sid in session_aff:
+        inst = session_aff[sid]
+    else:
+        best_inst, best_h = 0, 0  # ties (including zero hits) fall back to instance 0
+        for j in range(N_INSTANCES):
+            h = sum(1 for hid in hids[:10] if hid in lru_caches[j].cache)
+            if h > best_h:
+                best_h = h
+                best_inst = j
+        inst = best_inst
+        session_aff[sid] = inst
+
+    # Infinite cache: count the contiguous cached prefix, then record every block
+    inf_h = 0
+    for hid in hids:
+        if hid in inf_seen[inst]:
+            inf_h += 1
+        else:
+            break
+    for hid in hids:
+        inf_seen[inst].add(hid)
+
+    # LRU cache: same prefix walk, then insert every block (may evict older ones)
+    lru_h = 0
+    for hid in hids:
+        if lru_caches[inst].peek(hid):
+            lru_caches[inst].access(hid)
+            lru_h += 1
+        else:
+            break
+    for hid in hids:
+        lru_caches[inst].access(hid)
+
+    inf_tok = inf_h * BLOCK_SIZE
+    lru_tok = lru_h * BLOCK_SIZE
+    loss = inf_tok - lru_tok
+
+    total_inf_hits += inf_tok
+    total_lru_hits += lru_tok
+    total_tokens += il
+
+    if loss > 0:
+        total_loss += loss
+        if is_mt:
+            loss_intra += loss
+        else:
+            loss_cross += loss
+        per_req.append({
+            "idx": idx, "input": il, "inf_hit": inf_h, "lru_hit": lru_h,
+            "loss_blocks": inf_h - lru_h, "loss_tok": loss,
+            "mt": is_mt, "sid": sid, "turn": r.get("turn", 1),
+            "n_blocks": len(hids),
+        })
+
+sep = "=" * 70
+print(sep)
+print(" EVICTION LOSS ANALYSIS")
+print(sep)
+print()
+print(" Infinite APC: %.1f%%" % (total_inf_hits / total_tokens * 100))
+print(" LRU APC: %.1f%%" % (total_lru_hits / total_tokens * 100))
+print(" Gap: %.1f pp (%s tokens lost)" % (
+    (total_inf_hits - total_lru_hits) / total_tokens * 100,
+    "{:,}".format(total_loss)))
+print()
+print(" Loss by type:")
+print(" Multi-turn (prior turn KV evicted): %s tok (%.0f%%)" % (
+    "{:,}".format(loss_intra), loss_intra * 100 / max(total_loss, 1)))
+print(" Single-turn (shared prefix evicted): %s tok (%.0f%%)" % (
+    "{:,}".format(loss_cross), loss_cross * 100 / max(total_loss, 1)))
+print()
+print(" Requests with loss: %d / %d" % (len(per_req), len(rows)))
+
+print()
+print(" Top-15 by loss:")
+print(" %4s %7s %5s %5s %5s %7s %3s %8s %4s" % (
+    "#", "input", "inf_h", "lru_h", "loss", "tok", "mt", "session", "turn"))
+for r in sorted(per_req, key=lambda x: -x["loss_tok"])[:15]:
+    print(" %4d %7d %5d %5d %5d %7d %3s %8s %4d" % (
+        r["idx"], r["input"], r["inf_hit"], r["lru_hit"],
+        r["loss_blocks"], r["loss_tok"],
+        "Y" if r["mt"] else "N", r["sid"][:8], r["turn"]))
+
+# Instance-level analysis
+print()
+print(" Per-instance:")
+for i in range(N_INSTANCES):
+    n = len(inf_seen[i])
+    e = lru_caches[i].evictions
+    overflow = n / KV_CAPACITY_BLOCKS
+    print(" inst_%d: %5d unique blocks, overflow=%.1fx, evictions=%d" % (
+        i, n, overflow, e))
+
+# Time gap analysis: for every re-accessed block (not only evicted ones),
+# how many requests pass between its last deposit and its next access?
+print()
+print(" Temporal analysis of evicted blocks:")
+# Track when each block was last inserted, per instance
+block_deposit_time = [{} for _ in range(N_INSTANCES)]
+gaps = []
+
+# Re-scan
+session_aff2 = {}
+chat_to_session2 = {}
+for idx, r in enumerate(rows):
+    hids = r.get("hash_ids", [])
+    cid = r["chat_id"]
+    pid = r["parent_chat_id"]
+    sid = str(r.get("session_id", str(cid) if pid < 0 else chat_to_session2.get(pid, str(pid))))
+    chat_to_session2[cid] = sid
+    if sid in session_aff2:
+        inst = session_aff2[sid]
+    else:
+        inst = 0  # simplified: every session maps to instance 0, so distances are global
+        session_aff2[sid] = inst
+
+    for hid in hids:
+        if hid in block_deposit_time[inst]:
+            gap = idx - block_deposit_time[inst][hid]
+            gaps.append(gap)
+        block_deposit_time[inst][hid] = idx
+
+if gaps:
+    gaps.sort()
+    p = lambda q: gaps[min(int(q * len(gaps)), len(gaps) - 1)]
+    print(" Block reuse distance (requests between deposit and reaccess):")
+    print(" p10=%d p50=%d p90=%d max=%d" % (p(.1), p(.5), p(.9), max(gaps)))
+    short = sum(1 for g in gaps if g <= 10)
+    medium = sum(1 for g in gaps if 10 < g <= 100)
+    long_ = sum(1 for g in gaps if g > 100)
+    print(" <=10 req: %d (%.0f%%) 10-100: %d (%.0f%%) >100: %d (%.0f%%)" % (
+        short, short * 100 / len(gaps),
+        medium, medium * 100 / len(gaps),
+        long_, long_ * 100 / len(gaps)))
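
Follow-up note (not part of the patch): the commit message lists an LRU vs LFU simulation as the next step. A minimal sketch of what an LFU counterpart to `LRUCache` might look like is below, assuming the same `peek`/`access` interface so it could be swapped into the per-instance cache list. `LFUCache` and `prefix_hits` are hypothetical names, and breaking frequency ties by recency is an assumption, not something the design doc specifies.

```python
from collections import OrderedDict


class LFUCache:
    """Hypothetical LFU cache with the same peek/access interface as LRUCache."""

    def __init__(self, cap):
        self.cap = cap
        # key -> access count; insertion order doubles as recency order
        self.freq = OrderedDict()
        self.evictions = 0

    def peek(self, k):
        return k in self.freq

    def access(self, k):
        """Record an access. Returns True on a hit, False on a miss (after inserting)."""
        if k in self.freq:
            self.freq[k] += 1
            self.freq.move_to_end(k)  # most recently used goes last
            return True
        # Evict before inserting so a new block never evicts itself.
        while self.freq and len(self.freq) >= self.cap:
            # min() returns the first lowest-count key in iteration order,
            # i.e. the least recently used among the least frequently used.
            victim = min(self.freq, key=self.freq.get)
            del self.freq[victim]
            self.evictions += 1
        self.freq[k] = 1
        return False


def prefix_hits(cache, hids):
    """Count the contiguous cached prefix, then touch every block once."""
    hits = 0
    for hid in hids:
        if not cache.peek(hid):
            break
        hits += 1
    for hid in hids:
        cache.access(hid)
    return hits


if __name__ == "__main__":
    cache = LFUCache(3)
    prefix_hits(cache, ["a", "b"])  # turn 1 of a session
    prefix_hits(cache, ["a", "b"])  # repeated access raises the counts
    prefix_hits(cache, ["x", "y"])  # cold-start request forces an eviction
    print("turn-2 prefix hits:", prefix_hits(cache, ["a", "b", "c"]))
    print("evictions:", cache.evictions)
```

Unlike the LRU loop in the script, `prefix_hits` touches each block once per request so that hits are not double-counted in the frequency table. The linear `min()` scan is acceptable at 550 blocks per instance; a production policy would want a bucketed structure instead.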