agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	42bcd31976	TP=2 DP=4 + hybrid routing: best TTFT at cost of TPOT TP=2 DP=4 with hybrid routing achieves TTFT p50=0.611s (-43% vs TP=1), the best TTFT across all tested configurations. But TPOT p90=0.109s (+51% vs TP=1) due to cross-GPU all-reduce in decode. Full comparison across 7 configurations shows two Pareto-optimal points: TP=1 DP=8 hybrid: best TPOT (0.072s), good TTFT (1.064s) TP=2 DP=4 hybrid: best TTFT (0.611s), acceptable TPOT (0.109s) The choice depends on SLO: TTFT-sensitive (interactive) -> TP=2 DP=4 TPOT-sensitive (streaming) -> TP=1 DP=8 All PD-Sep configurations are strictly dominated by one of these two. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:35:18 +08:00
Gahow Wang	a65ec42467	Update report: adaptive v2 confirms no KV transfer helps single-machine All PD/offload schemes tested are worse than PD-combined + hybrid routing: Combined hybrid: TTFT=0.737 TPOT90=0.072 APC=49.4% (BEST) PD-Sep 4P+4D: TTFT=1.994 TPOT90=0.075 APC=40.2% Adaptive v2 offload: TTFT=1.462 TPOT90=0.077 APC=~45% Definitive: single-machine agentic serving = PD-combined + smart routing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:15:08 +08:00
Gahow Wang	2fee355626	Adaptive v2 (selective Mooncake offload): worse than baseline Implemented --offload mode: HEAVY requests (>20k new tokens) get P on least-loaded instance, KV via Mooncake RDMA, D on session-sticky instance. WARM/MEDIUM stay co-located (no KV transfer). All 8 instances run kv_both. Result (200 req, same instances, fresh restart): Baseline (no offload): TTFT=1.073 TPOT90=0.074 E2E=5.086 Offload HEAVY: TTFT=1.462 TPOT90=0.077 E2E=6.847 Delta: +36% +4% +35% Conclusion: even selective KV transfer (only 44% of requests) adds more overhead than the isolation benefit provides. On single-machine 8 GPU, PD-combined with hybrid routing is strictly optimal. No form of KV transfer — full PD-sep, selective offload, or otherwise — improves over co-located serving for this workload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:14:10 +08:00
Gahow Wang	4bf0b999ff	Final GPU comparison: hybrid routing matches baseline latency with better APC Complete 200-req comparison with GPU monitoring (columns: TTFT50 / TPOT90 / E2E50 / GPU% / Active / APC): Combined (old cache-aware): 1.012 / 0.073 / 5.101 / 30.5% / 64% / 44.7%; Combined (hybrid routing): 1.064 / 0.072 / 5.131 / 27.7% / 60% / 49.4%; PD-Sep 4P+4D: 1.994 / 0.075 / 7.112 / 12.4% / 24% / 40.2%; PD-Sep 6P+2D: 1.481 / 0.077 / 5.949 / 16.9% / 28% / ~37%. Hybrid routing: +4.7pp APC with comparable latency and GPU utilization. PD-Sep: significantly worse on all dimensions for single-machine agentic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 03:14:05 +08:00
Gahow Wang	795edc6c66	Overnight work report: routing optimization achieves +4.7pp APC Summary of overnight autonomous session: - Analyzed agentic workload patterns (91% KV reuse is intra-session) - Simulated cache policies (LRU near-optimal, routing is the bottleneck) - Implemented hybrid routing (session-sticky + load-aware override) - Result: APC 44.7% -> 49.4% with zero latency regression Key insight: routing quality > cache policy > PD separation for single-machine agentic workloads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 02:54:48 +08:00
Gahow Wang	012d73f596	Hybrid routing: session-sticky + load-aware override achieves best results Session affinity for KV reuse, with load-aware override when pinned instance has ongoing_tokens > 2x average. Combines APC of sticky routing with latency of load-based routing. Results (1000 req, TP=1 DP=8 combined; columns: TTFT50 / TPOT90 / E2E50 / APC): Old cache-aware: 0.731 / 0.073 / 4.480 / 44.7%; Balanced session-sticky: 0.953 / 0.079 / 5.520 / 48.7%; Hybrid (sticky+load-aware): 0.737 / 0.072 / 4.487 / 49.4% <- BEST. Hybrid achieves +4.7pp APC improvement with zero latency regression. Session-sticky provides KV reuse; load-aware override prevents hotspots. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 02:53:44 +08:00
Gahow Wang	efe984477a	Balanced routing result: APC +4pp but latency +23% (cache-load tradeoff) Balanced session-sticky routing improves APC from 44.7% to 48.7% (+4pp, close to simulated 49.2%) but TTFT worsens by 30% and E2E by 23%. Root cause: session-sticky creates load hotspots — some instances get multiple heavy concurrent sessions, causing queue delays, despite higher per-instance APC. Key finding: APC optimization and latency optimization are in tension. - Cache affinity (sticky) -> higher APC, worse load balance -> worse latency - Load-based routing (old) -> lower APC, better load balance -> better latency The optimal design must balance both dimensions, not optimize one at the expense of the other. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 02:13:15 +08:00
Gahow Wang	32f09d32cd	Balanced session-sticky routing + agentic workload pattern analysis Routing fix: new sessions placed by cumulative token load (greedy bin packing) with cache-hit tiebreak. Session affinity for turn 2+. Replayer now sends X-Session-Id header for proper session tracking. Agentic workload core patterns (GLM-5.1 trace): - 91% of reusable KV is intra-session (not cross-session) - Session-sticky routing is THE critical optimization - 36% warm requests (1.3k new tokens), 64% cold (17k+) - After cache: effective prefill/decode ratio drops from 61.5x to 28.7x - Cross-session sharing (system prompt) is only 4.8% of tokens Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:50:27 +08:00
Gahow Wang	e45f00eb68	Cache policy simulation: routing quality dominates, not eviction policy With balanced session-sticky routing: LRU APC = 49.2% (only 1.8pp below infinite 51.0%) LFU APC = 43.5% (worse than LRU!) SessionProtLRU = 49.0% (no improvement) The previous 10.1pp gap was from routing imbalance (all traffic to inst_0), not from cache eviction policy. Balanced routing recovers 5.9pp of the gap. Multi-turn sessions get 80.1% APC with simple LRU + session-sticky routing because inter-turn gap is only 2 requests (LRU naturally keeps it warm). Conclusion: fix routing balance, not cache policy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:28:53 +08:00
Gahow Wang	10636b1ab1	KV cache lifecycle design + eviction loss analysis Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between turns by cold-start prefills (66% of loss). Inter-turn gap is only 2 requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session across 14-21 concurrent sessions. Three approaches designed: A. Session-sticky routing with KV reservation (proxy-only, no vLLM change) B. Two-tier KV cache: GPU + DRAM offload via Mooncake C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch) Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds, then implement Approach A (lowest effort, immediate benchmark). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:27:22 +08:00
Gahow Wang	d11d9f5cb9	Adaptive prefill offload v1: implementation + experiment Added --heavy-threshold to cache_aware_proxy.py. HEAVY requests (new tokens >= threshold) route to instance with least decode load; WARM/MEDIUM route by cache-hit + token-level LB as before. Result: no significant difference vs baseline on single-machine combined mode. TTFT: +1.2%, TPOT: -1.5%, E2E: -0.3% (all within noise) Per-class TTFT breakdown shows the optimization target: WARM (75 req): p50=0.198s (cache hit, nearly free) MEDIUM (72 req): p50=1.356s HEAVY (54 req): p50=7.124s (36x slower than WARM) Conclusion: single-machine combined mode already distributes load well enough that adaptive routing adds no benefit. True isolation of HEAVY prefills requires cross-machine offload (v2 with Mooncake or multi-node). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:00:10 +08:00
Gahow Wang	d6e47d3742	Design doc: Adaptive Prefill Offload All 8 GPUs stay PD-combined. Global scheduler classifies requests as WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache. Only HEAVY requests (20%, cold start >20k new tokens) get offloaded; 80% of requests are co-located with zero KV transfer. This avoids the KV cache memory wall (no decode concentration) while isolating heavy prefills from decode when needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:44:22 +08:00
Gahow Wang	445e491123	Add vLLM v0.18.1 source tree with KV transfer abort fix third_party/vllm/ now tracked in git for direct patch management. Based on vLLM v0.18.1 release with one patch applied: vllm/v1/core/sched/scheduler.py: Replace fatal assert with graceful skip when KV transfer callback arrives for an already-aborted request during PD disaggregated serving. Future vLLM modifications should be made directly in third_party/vllm/ and committed normally. The patches/ directory is kept as documentation of what changed from upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:30:38 +08:00
Gahow Wang	b6591950bc	Add vLLM patches directory for version-controlled patch management patches/0001-fix-kv-transfer-abort-race.patch: Fix scheduler assert crash when KV transfer callback arrives after request abort in PD-disaggregated serving. patches/README.md: How to apply patches to source tree or installed package. Per-patch description with problem/fix/impact. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:26:14 +08:00
Gahow Wang	efa70f05b5	Consolidate analysis into single report with appendix Merged roofline_analysis.md into pd_separation_analysis.md. Restructured as a self-contained research report: 1. TL;DR with key finding (KV cache memory wall) 2. Workload characterization (trace stats + cache reuse) 3. Experiment setup (hardware, software, configs, scripts) 4. Results (main comparison, GPU util, breakdown, ablations) 5. Analysis (DistServe assumptions, roofline, root cause) 6. Conclusions 7. Appendix: all experiment artifacts, data paths, reproducing steps One document to read, with pointers to data for deeper analysis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:23:23 +08:00
Gahow Wang	ce616f46d1	Add per-request breakdown profiling, identify KV cache memory bottleneck Breakdown profiling at proxy level captures: t_proxy_recv → t_prefill_sent → t_prefill_done → t_decode_sent → t_first_token Key finding: 87.7% of TTFT is spent in kv+decode phase, NOT prefill. Root cause: decode instance KV cache memory saturation (97.1% usage). With 6P+2D config, 2 decode GPUs have only ~56GB total KV cache. Large agentic requests (avg 33.6k tokens) fill this quickly. Small requests (49 tokens, prefill=0.044s) wait 114s for KV cache to be freed by large requests completing decode. vLLM log confirms: Running=0, Waiting=6, KV cache=97.1% GPU is idle but requests queue for KV cache memory, not compute. This is the fundamental bottleneck of single-machine PD separation for long-context agentic workloads: concentrating decode onto fewer GPUs creates a KV cache memory wall. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:13:50 +08:00
Gahow Wang	c7afdc5074	Ablation 2: fire-and-forget vs await-prefill scheduling Added --fire-and-forget flag to cache_aware_proxy.py for async prefill dispatch. Results on 6P+2D config: Await: TTFT=1.48s TPOT=0.066s E2E=5.95s 94% success FnF: TTFT=5.32s TPOT=0.037s E2E=11.9s 85% success Fire-and-forget improves TPOT by 44% (pipeline overlap) but degrades TTFT by 260% (decode internally waits for KV, less efficiently than proxy-level await) and increases errors from KV race conditions. Full 4-way ablation summary in analyze_ablations.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 23:02:42 +08:00
Gahow Wang	9dee25907b	Add P/D ratio ablation: 6P+2D vs 4P+4D vs Combined 6P+2D gives more GPUs to prefill, fewer to decode: - Decode util: 7.8% (4D) -> 19.0% (2D), less waste - TTFT: 1.99s (4P) -> 1.48s (6P), -26% from less prefill queuing - But Combined (30.5% util, TTFT 1.01s) still best overall Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 22:42:20 +08:00
Gahow Wang	67149130be	Add GPU utilization A/B test and fix cache-aware proxy bugs - GPU monitor: 5s interval nvidia-smi sampling during benchmarks - A/B test script: clean restart + monitor + benchmark for Combined vs PD-Sep - Fixed proxy: await bootstrap init (race condition), normalized LB scoring - Fixed port conflicts: proxy 9090 to avoid bootstrap 9000 clash Key finding: PD-Sep GPU utilization is 40% of Combined (12.4% vs 30.5%) - Decode GPUs: mean=7.8%, max=47% (memory-bound, compute wasted) - Prefill GPUs: active only 17% of samples (bursty, idle between requests) - Combined: 8 GPUs flexibly used, mean=30.5%, active=64% Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 22:13:38 +08:00
Gahow Wang	05592e6adc	Agentic workload PD separation analysis with trace-driven benchmarks Systematic study of prefill-decode disaggregation for agentic LLM workloads using production GLM-5.1 coder trace (2.1M requests, 71B input tokens). Key findings: - Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7% without PD separation, matching PD-Sep's decode isolation benefit - PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain when using the same cache-aware scheduler - Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x vs decode AI <2), but absolute FLOPs drop 71% from cache hits - For agentic MoE workloads, cache-aware routing > PD separation Infrastructure: - Trace sampler preserving session structure + hash_ids for prefix sharing - Async trace replayer with streaming TTFT/TPOT/E2E measurement - Unified cache-aware + token-level load-balanced global scheduler proxy supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes - vLLM 0.18.1 scheduler patch for KV transfer abort race condition - Roofline analysis tool for prefill/decode compute characterization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:21:57 +08:00
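
The hybrid routing rule from commit 012d73f596 (keep a session on its pinned instance, but override when that instance has ongoing_tokens > 2x the average) can be sketched roughly as below. This is an illustration under stated assumptions, not the code in cache_aware_proxy.py: the `HybridRouter` class, its method names, the fallback to the least-loaded instance, and the choice to re-pin the session after an override are all hypothetical. Only the sticky-plus-override idea and the 2x threshold come from the log.

```python
from dataclasses import dataclass, field


@dataclass
class HybridRouter:
    """Session-sticky routing with a load-aware override (illustrative sketch)."""
    num_instances: int
    overload_factor: float = 2.0  # from the log: override when load > 2x average
    ongoing_tokens: list = field(default_factory=list)  # per-instance in-flight tokens
    pinned: dict = field(default_factory=dict)          # session_id -> instance index

    def __post_init__(self):
        if not self.ongoing_tokens:
            self.ongoing_tokens = [0] * self.num_instances

    def _least_loaded(self) -> int:
        # Assumption: ties go to the lowest instance index.
        return min(range(self.num_instances), key=lambda i: self.ongoing_tokens[i])

    def route(self, session_id: str, new_tokens: int) -> int:
        avg = sum(self.ongoing_tokens) / self.num_instances
        target = self.pinned.get(session_id)
        # New session, or pinned instance is a hotspot: fall back to load-based placement.
        if target is None or self.ongoing_tokens[target] > self.overload_factor * avg:
            target = self._least_loaded()
            self.pinned[session_id] = target  # assumption: re-pin after an override
        self.ongoing_tokens[target] += new_tokens
        return target

    def finish(self, instance: int, tokens: int) -> None:
        # Call when a request completes so its tokens stop counting as load.
        self.ongoing_tokens[instance] = max(0, self.ongoing_tokens[instance] - tokens)


router = HybridRouter(num_instances=4)
first = [router.route(s, 100) for s in ("a", "b", "c", "d")]  # one session per instance
sticky = router.route("a", 100)    # pinned instance is not overloaded: stay put
router.route("a", 1000)            # a heavy turn makes instance 0 a hotspot
override = router.route("a", 10)   # 1200 > 2 x 375, so the session moves
```

With few instances the override can never fire, since one instance's load exceeds 2x the average only when it holds more than 2/N of all in-flight tokens; the sketch therefore uses four instances.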

20 Commits
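
The dominance claim in commit 42bcd31976 (all PD-Sep configurations are dominated by one of the two hybrid points) can be checked with a small Pareto filter over (TTFT p50, TPOT p90), both lower-is-better. This is a sketch, not a script from the repository: the function names are hypothetical, the hybrid numbers are taken from 42bcd31976 and the PD-Sep numbers from the 200-req comparison in 4bf0b999ff, and only those four configurations are included.

```python
def dominates(a, b):
    """True if a is no worse than b on both metrics and strictly better on at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(configs):
    """Return the names of configurations that no other configuration dominates."""
    return [
        name
        for name, metrics in configs.items()
        if not any(dominates(other, metrics)
                   for other_name, other in configs.items()
                   if other_name != name)
    ]


# (TTFT p50 in seconds, TPOT p90 in seconds), as reported in the commit messages.
configs = {
    "TP=1 DP=8 hybrid": (1.064, 0.072),
    "TP=2 DP=4 hybrid": (0.611, 0.109),
    "PD-Sep 4P+4D": (1.994, 0.075),
    "PD-Sep 6P+2D": (1.481, 0.077),
}

front = pareto_front(configs)
print(front)
```

On these four points the filter keeps only the two hybrid configurations; both PD-Sep points are dominated by TP=1 DP=8 hybrid.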