agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	e45f00eb68	Cache policy simulation: routing quality dominates, not eviction policy With balanced session-sticky routing: LRU APC = 49.2% (only 1.8pp below infinite 51.0%) LFU APC = 43.5% (worse than LRU!) SessionProtLRU = 49.0% (no improvement) The previous 10.1pp gap was from routing imbalance (all traffic to inst_0), not from cache eviction policy. Balanced routing recovers 5.9pp of the gap. Multi-turn sessions get 80.1% APC with simple LRU + session-sticky routing because inter-turn gap is only 2 requests (LRU naturally keeps it warm). Conclusion: fix routing balance, not cache policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:28:53 +08:00
Gahow Wang	10636b1ab1	KV cache lifecycle design + eviction loss analysis Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between turns by cold-start prefills (66% of loss). Inter-turn gap is only 2 requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session across 14-21 concurrent sessions. Three approaches designed: A. Session-sticky routing with KV reservation (proxy-only, no vLLM change) B. Two-tier KV cache: GPU + DRAM offload via Mooncake C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch) Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds, then implement Approach A (lowest effort, immediate benchmark). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:27:22 +08:00
Gahow Wang	d11d9f5cb9	Adaptive prefill offload v1: implementation + experiment Added --heavy-threshold to cache_aware_proxy.py. HEAVY requests (new tokens >= threshold) route to instance with least decode load; WARM/MEDIUM route by cache-hit + token-level LB as before. Result: no significant difference vs baseline on single-machine combined mode. TTFT: +1.2%, TPOT: -1.5%, E2E: -0.3% (all within noise) Per-class TTFT breakdown shows the optimization target: WARM (75 req): p50=0.198s (cache hit, nearly free) MEDIUM (72 req): p50=1.356s HEAVY (54 req): p50=7.124s (36x slower than WARM) Conclusion: single-machine combined mode already distributes load well enough that adaptive routing adds no benefit. True isolation of HEAVY prefills requires cross-machine offload (v2 with Mooncake or multi-node). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:00:10 +08:00
Gahow Wang	d6e47d3742	Design doc: Adaptive Prefill Offload All 8 GPUs stay PD-combined. Global scheduler classifies requests as WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache. Only HEAVY requests (20%, cold start >20k new tokens) get offloaded; 80% of requests are co-located with zero KV transfer. This avoids the KV cache memory wall (no decode concentration) while isolating heavy prefills from decode when needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:44:22 +08:00
Gahow Wang	445e491123	Add vLLM v0.18.1 source tree with KV transfer abort fix third_party/vllm/ now tracked in git for direct patch management. Based on vLLM v0.18.1 release with one patch applied: vllm/v1/core/sched/scheduler.py: Replace fatal assert with graceful skip when KV transfer callback arrives for an already-aborted request during PD disaggregated serving. Future vLLM modifications should be made directly in third_party/vllm/ and committed normally. The patches/ directory is kept as documentation of what changed from upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:30:38 +08:00
Gahow Wang	b6591950bc	Add vLLM patches directory for version-controlled patch management patches/0001-fix-kv-transfer-abort-race.patch: Fix scheduler assert crash when KV transfer callback arrives after request abort in PD-disaggregated serving. patches/README.md: How to apply patches to source tree or installed package. Per-patch description with problem/fix/impact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:26:14 +08:00
Gahow Wang	efa70f05b5	Consolidate analysis into single report with appendix Merged roofline_analysis.md into pd_separation_analysis.md. Restructured as a self-contained research report: 1. TL;DR with key finding (KV cache memory wall) 2. Workload characterization (trace stats + cache reuse) 3. Experiment setup (hardware, software, configs, scripts) 4. Results (main comparison, GPU util, breakdown, ablations) 5. Analysis (DistServe assumptions, roofline, root cause) 6. Conclusions 7. Appendix: all experiment artifacts, data paths, reproducing steps One document to read, with pointers to data for deeper analysis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:23:23 +08:00
Gahow Wang	ce616f46d1	Add per-request breakdown profiling, identify KV cache memory bottleneck Breakdown profiling at proxy level captures: t_proxy_recv → t_prefill_sent → t_prefill_done → t_decode_sent → t_first_token Key finding: 87.7% of TTFT is spent in kv+decode phase, NOT prefill. Root cause: decode instance KV cache memory saturation (97.1% usage). With 6P+2D config, 2 decode GPUs have only ~56GB total KV cache. Large agentic requests (avg 33.6k tokens) fill this quickly. Small requests (49 tokens, prefill=0.044s) wait 114s for KV cache to be freed by large requests completing decode. vLLM log confirms: Running=0, Waiting=6, KV cache=97.1% GPU is idle but requests queue for KV cache memory, not compute. This is the fundamental bottleneck of single-machine PD separation for long-context agentic workloads: concentrating decode onto fewer GPUs creates a KV cache memory wall. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:13:50 +08:00
Gahow Wang	c7afdc5074	Ablation 2: fire-and-forget vs await-prefill scheduling Added --fire-and-forget flag to cache_aware_proxy.py for async prefill dispatch. Results on 6P+2D config: Await: TTFT=1.48s TPOT=0.066s E2E=5.95s 94% success FnF: TTFT=5.32s TPOT=0.037s E2E=11.9s 85% success Fire-and-forget improves TPOT by 44% (pipeline overlap) but degrades TTFT by 260% (decode internally waits for KV, less efficiently than proxy-level await) and increases errors from KV race conditions. Full 4-way ablation summary in analyze_ablations.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:02:42 +08:00
Gahow Wang	9dee25907b	Add P/D ratio ablation: 6P+2D vs 4P+4D vs Combined 6P+2D gives more GPUs to prefill, fewer to decode: - Decode util: 7.8% (4D) -> 19.0% (2D), less waste - TTFT: 1.99s (4P) -> 1.48s (6P), -26% from less prefill queuing - But Combined (30.5% util, TTFT 1.01s) still best overall Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 22:42:20 +08:00
Gahow Wang	67149130be	Add GPU utilization A/B test and fix cache-aware proxy bugs - GPU monitor: 5s interval nvidia-smi sampling during benchmarks - A/B test script: clean restart + monitor + benchmark for Combined vs PD-Sep - Fixed proxy: await bootstrap init (race condition), normalized LB scoring - Fixed port conflicts: proxy 9090 to avoid bootstrap 9000 clash Key finding: PD-Sep GPU utilization is 40% of Combined (12.4% vs 30.5%) - Decode GPUs: mean=7.8%, max=47% (memory-bound, compute wasted) - Prefill GPUs: active only 17% of samples (bursty, idle between requests) - Combined: 8 GPUs flexibly used, mean=30.5%, active=64% Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 22:13:38 +08:00
Gahow Wang	05592e6adc	Agentic workload PD separation analysis with trace-driven benchmarks Systematic study of prefill-decode disaggregation for agentic LLM workloads using production GLM-5.1 coder trace (2.1M requests, 71B input tokens). Key findings: - Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7% without PD separation, matching PD-Sep's decode isolation benefit - PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain when using the same cache-aware scheduler - Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x vs decode AI <2), but absolute FLOPs drop 71% from cache hits - For agentic MoE workloads, cache-aware routing > PD separation Infrastructure: - Trace sampler preserving session structure + hash_ids for prefix sharing - Async trace replayer with streaming TTFT/TPOT/E2E measurement - Unified cache-aware + token-level load-balanced global scheduler proxy supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes - vLLM 0.18.1 scheduler patch for KV transfer abort race condition - Roofline analysis tool for prefill/decode compute characterization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:21:57 +08:00
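
The WARM/MEDIUM/HEAVY routing described in the "Design doc: Adaptive Prefill Offload" and "Adaptive prefill offload v1" commits can be sketched roughly as below. This is an illustrative sketch only, not the code in cache_aware_proxy.py. The 20k HEAVY threshold comes from the design-doc commit; the WARM/MEDIUM boundary, the `Instance` fields, and the scoring formula are hypothetical names and values introduced here for illustration.

```python
from dataclasses import dataclass

HEAVY_THRESHOLD = 20_000  # from the design-doc commit: cold start > 20k new tokens
WARM_THRESHOLD = 1_000    # hypothetical WARM/MEDIUM boundary, not stated in the log


@dataclass
class Instance:
    name: str
    decode_load: int     # hypothetical metric: tokens currently being decoded
    pending_tokens: int  # hypothetical metric: prefill tokens already queued
    cached_tokens: int   # prefix tokens this instance already holds for the request


def classify(new_tokens: int) -> str:
    """Label a request by its estimated new (uncached) prompt tokens."""
    if new_tokens >= HEAVY_THRESHOLD:
        return "HEAVY"
    if new_tokens < WARM_THRESHOLD:
        return "WARM"
    return "MEDIUM"


def route(prompt_tokens: int, instances: list[Instance]) -> tuple[str, Instance]:
    """Pick a target instance for one request."""
    # Estimate new tokens against the best prefix-cache hit on any instance.
    best_hit = max(i.cached_tokens for i in instances)
    label = classify(prompt_tokens - best_hit)
    if label == "HEAVY":
        # HEAVY: send to the instance with the least decode load.
        target = min(instances, key=lambda i: i.decode_load)
    else:
        # WARM/MEDIUM: reward cache hits, penalize queued tokens (assumed scoring).
        target = max(instances, key=lambda i: i.cached_tokens - i.pending_tokens)
    return label, target
```

For example, `route(25_000, [Instance("a", 500, 0, 0), Instance("b", 100, 0, 0)])` labels the request HEAVY and picks instance "b", the one with the lower decode load.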

12 Commits
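
The kind of measurement reported in the "Cache policy simulation" commit (APC under LRU with session-sticky routing) can be approximated with a small simulator like the one below. This is a minimal sketch under stated assumptions, not the repository's simulation code: `LRUBlockCache`, `simulate_apc`, and the modulo-based sticky routing are hypothetical, and every block is counted as an independent hit or miss, whereas a real prefix cache only reuses a contiguous prefix.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Fixed-capacity LRU cache over KV block ids (hypothetical helper)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block_id: int) -> bool:
        """Return True on a hit; on a miss, insert and evict the oldest block."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return True
        self.blocks[block_id] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # drop the least recently used block
        return False


def simulate_apc(requests, num_instances: int, capacity: int) -> float:
    """Replay (session_id, block_ids) requests and return the block hit rate.

    Simplification: blocks are scored independently, not as a contiguous prefix.
    """
    caches = [LRUBlockCache(capacity) for _ in range(num_instances)]
    hits = total = 0
    for session_id, block_ids in requests:
        # Session-sticky routing: a session always lands on the same instance.
        cache = caches[session_id % num_instances]
        for block_id in block_ids:
            hits += cache.access(block_id)
            total += 1
    return hits / total if total else 0.0
```

Replaying `[(0, [1, 2, 3]), (1, [10, 11]), (0, [1, 2, 3, 4])]` on two instances with ample capacity gives 3 hits out of 9 block accesses, because the second turn of session 0 finds its first three blocks still cached. Shrinking the capacity to 2 blocks drops the hit rate to zero, since each turn evicts the blocks the next turn needs.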