agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	8e0c6e78b0	Add comprehensive research findings document Synthesizes all experiments into a paper-ready analysis: - Agentic workload characteristics vs chatbot/API - Why PD-Sep, LMetric, elastic RDMA, chunk-size tuning don't work - Why cache-aware session-sticky routing IS the key optimization (-60% TTFT, +24pp APC vs round-robin) - System-level insights: prefill-decode interference threshold, Mooncake limitations, effective request weight after cache - GPU balance → HEAVY TTFT -10.5% (demonstrated) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 07:16:31 +08:00
Gahow Wang	baf7ffb08c	16-session contention: TPOT +45% from prefill-decode interference Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades from 0.073 to 0.106 (+45%), with MEDIUM TPOT at 0.197 (+149%). This is the first time we've reproduced real prefill-decode interference in controlled experiments. Elastic RDMA at 16 sessions doesn't help: only 13/500 offloaded (cache-gate correct for cold turn-1), kv_both adds ~16% TPOT overhead at high concurrency. Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show ~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not arrival rate. Updated elastic_hypotheses.md with H8, H9, and comprehensive final analysis. The real bottleneck is vLLM's chunked prefill scheduling, not routing or PD disaggregation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 05:51:47 +08:00
Gahow Wang	85b230455e	H7 OVERLOAD_FACTOR sweep: negative result + H4 GPU profiling H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU imbalance (~3.5-4x across all settings). Root cause: imbalance is from workload skew at session placement (turn 1), not from routing at turn 2+. H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x), and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded requests have bimodal transfer times (0.6s or 18-31s) that negate the routing benefit. Updated elastic_hypotheses.md with H7 results and next directions: higher load experiments where contention amplifies routing differences. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 03:04:02 +08:00
Gahow Wang	098d86385a	Add elastic hypotheses tracking doc with H1-H6 analysis Tracks all hypotheses tested during elastic PD disaggregation research: - H1 (kv_both overhead): REJECTED — zero overhead at idle - H2 (PS cold prefill): REJECTED — PS slower than cached C - H3 (C_s+flexD): PARTIALLY VALIDATED — E2E -9% but HEAVY p90 +117% - H4 (cache-aware offload): TODO — only offload high-cache-hit HEAVY - H5 (RDMA overhead): TODO — Mooncake lacks layerwise transfer - H6 (session migration): TODO — verify D's APC after migration Key insight: offload decision should be cache-aware (new_tokens), not size-based (total_input). 80k request with 90% cache = 8k prefill. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 01:17:12 +08:00
Gahow Wang	fc92410ec9	Invalidate prior A/B results + add proper experiment harness Prior cross-machine comparison (commit `1e86285`) was invalid: dash0 baseline used warm instances with residual KV cache, inflating TTFT by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start requests; WARM TTFT p90=3.3s vs fresh=0.26s. Fair same-machine comparison (both fresh restart on dash0): Baseline: TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200 Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200 Elastic is WORSE due to Mooncake kv_both memory overhead. Changes: - REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata - pd_separation_analysis.md: update elastic TL;DR with correct numbers - cache_aware_proxy.py: fix double-decrement bugs in offload path, add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK) - bench.sh: standardized experiment harness with guaranteed GPU cleanup and fresh-state verification (nvidia-smi check before start) - run_elastic_stability_test.sh: two-phase elastic vs baseline test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:54:21 +08:00
Gahow Wang	2b0ac70ee7	Phase 1 milestone: system-level analysis + reproducible report - REPORT.md: self-contained milestone report covering baseline vs elastic setup, exact launch commands, benchmark params, results, log locations, and repo structure — sufficient for anyone to reproduce - analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown (KV cache hit ratio, per-class TTFT, GPU util paradox explanation) - scripts/cache_aware_proxy.py: round-robin P-instance selection replacing argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x) - scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 16:17:41 +08:00
Gahow Wang	1d2eeb4925	Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080) Design: offload HEAVY prefill only when P instance is less loaded than D AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D for future KV reuse. External KV correctly registered in prefix cache. Result (67/200 processed, 75% success): TTFT p50: 0.551s (-49% vs baseline 1.080s) TTFT p90: 4.135s (vs baseline 9.410s, -56%) TPOT p90: 0.074s (same as baseline) E2E p50: 2.938s (-45% vs baseline 5.306s) 25% error rate from ReadTimeout on very large HEAVY requests queuing on P. Needs stricter elastic gate or higher timeout. But successful requests show significant improvement over both baseline and previous P2P. Also: added external_prefix_cache metrics tracking to replayer summary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 13:50:25 +08:00
Gahow Wang	a65ec42467	Update report: adaptive v2 confirms no KV transfer helps single-machine All PD/offload schemes tested are worse than PD-combined + hybrid routing: Combined hybrid: TTFT=0.737 TPOT90=0.072 APC=49.4% (BEST) PD-Sep 4P+4D: TTFT=1.994 TPOT90=0.075 APC=40.2% Adaptive v2 offload: TTFT=1.462 TPOT90=0.077 APC=~45% Definitive: single-machine agentic serving = PD-combined + smart routing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 10:15:08 +08:00
Gahow Wang	795edc6c66	Overnight work report: routing optimization achieves +4.7pp APC Summary of overnight autonomous session: - Analyzed agentic workload patterns (91% KV reuse is intra-session) - Simulated cache policies (LRU near-optimal, routing is the bottleneck) - Implemented hybrid routing (session-sticky + load-aware override) - Result: APC 44.7% -> 49.4% with zero latency regression Key insight: routing quality > cache policy > PD separation for single-machine agentic workloads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 02:54:48 +08:00
Gahow Wang	10636b1ab1	KV cache lifecycle design + eviction loss analysis Root cause of 10.1pp APC gap: multi-turn sessions' KV evicted between turns by cold-start prefills (66% of loss). Inter-turn gap is only 2 requests p50, but LRU cache (550 blocks) can't protect 93 blocks/session across 14-21 concurrent sessions. Three approaches designed: A. Session-sticky routing with KV reservation (proxy-only, no vLLM change) B. Two-tier KV cache: GPU + DRAM offload via Mooncake C. Prefill-aware eviction (LFU/ARC instead of LRU, vLLM patch) Next: simulate LRU vs LFU vs "infinite-for-MT" to quantify upper bounds, then implement Approach A (lowest effort, immediate benchmark). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:27:22 +08:00
Gahow Wang	d6e47d3742	Design doc: Adaptive Prefill Offload All 8 GPUs stay PD-combined. Global scheduler classifies requests as WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache. Only HEAVY requests (20%, cold start >20k new tokens) get offloaded; 80% of requests are co-located with zero KV transfer. This avoids the KV cache memory wall (no decode concentration) while isolating heavy prefills from decode when needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:44:22 +08:00
Gahow Wang	efa70f05b5	Consolidate analysis into single report with appendix Merged roofline_analysis.md into pd_separation_analysis.md. Restructured as a self-contained research report: 1. TL;DR with key finding (KV cache memory wall) 2. Workload characterization (trace stats + cache reuse) 3. Experiment setup (hardware, software, configs, scripts) 4. Results (main comparison, GPU util, breakdown, ablations) 5. Analysis (DistServe assumptions, roofline, root cause) 6. Conclusions 7. Appendix: all experiment artifacts, data paths, reproducing steps One document to read, with pointers to data for deeper analysis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:23:23 +08:00
Gahow Wang	05592e6adc	Agentic workload PD separation analysis with trace-driven benchmarks Systematic study of prefill-decode disaggregation for agentic LLM workloads using production GLM-5.1 coder trace (2.1M requests, 71B input tokens). Key findings: - Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7% without PD separation, matching PD-Sep's decode isolation benefit - PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain when using the same cache-aware scheduler - Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x vs decode AI <2), but absolute FLOPs drop 71% from cache hits - For agentic MoE workloads, cache-aware routing > PD separation Infrastructure: - Trace sampler preserving session structure + hash_ids for prefix sharing - Async trace replayer with streaming TTFT/TPOT/E2E measurement - Unified cache-aware + token-level load-balanced global scheduler proxy supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes - vLLM 0.18.1 scheduler patch for KV transfer abort race condition - Roofline analysis tool for prefill/decode compute characterization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:21:57 +08:00

13 Commits
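
The cache-aware classification and elastic offload gate described in commits d6e47d3742, 098d86385a, and 1d2eeb4925 can be illustrated with a minimal sketch. It is not the code in scripts/cache_aware_proxy.py. The 20k HEAVY threshold and the 1.5x overload factor come from the commit messages; the WARM/MEDIUM boundary, the function names, the load units, and the reading of "avg" as the mean load across instances are assumptions made for illustration.

```python
# Illustrative sketch only; names and the MEDIUM threshold are hypothetical.

HEAVY_NEW_TOKENS = 20_000   # from the log: HEAVY = more than 20k new tokens after prefix cache
MEDIUM_NEW_TOKENS = 2_000   # ASSUMPTION: the log does not state the WARM/MEDIUM boundary
P_OVERLOAD_FACTOR = 1.5     # from the log: P must be below 1.5x the average load


def new_tokens(total_input: int, cached_tokens: int) -> int:
    """Tokens that still need prefill once the prefix-cache hit is subtracted."""
    return max(total_input - cached_tokens, 0)


def classify(total_input: int, cached_tokens: int) -> str:
    """Classify by estimated new tokens, not by total input size."""
    n = new_tokens(total_input, cached_tokens)
    if n > HEAVY_NEW_TOKENS:
        return "HEAVY"
    if n > MEDIUM_NEW_TOKENS:
        return "MEDIUM"
    return "WARM"


def should_offload(request_class: str, p_load: float, d_load: float, avg_load: float) -> bool:
    """Offload only HEAVY prefills, and only when the candidate P instance is
    less loaded than the session's D instance and not itself overloaded.
    Loads are assumed to be in the same unit (for example, ongoing tokens)."""
    return (
        request_class == "HEAVY"
        and p_load < d_load
        and p_load < P_OVERLOAD_FACTOR * avg_load
    )


if __name__ == "__main__":
    # The example from commit 098d86385a: an 80k request with 90% cache needs 8k of prefill.
    print(new_tokens(80_000, 72_000), classify(80_000, 72_000))
    print(classify(80_000, 0), should_offload("HEAVY", p_load=10, d_load=20, avg_load=12))
```

Under these assumptions, the 80k-token request with a 90% cache hit is not classified as HEAVY and stays co-located, which is the behaviour the "cache-aware, not size-based" insight argues for.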