agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	a2111b6e18	PD-disagg docs: annotated corrections for `e13391e` contamination Adds dated, non-destructive correction notes to the contaminated PD-vs-colo artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on `finished_sending`, deployed over the "fresh" pip vLLM by `scripts/deploy_vllm_patches.sh`) was found and gated behind `VLLM_EVICT_SENT_BLOCKS` (default off). PD_DISAGG_RESULTS.md top CORRECTION banner + §6 RETRACTED marker. §6 (session-affinity hot-pin) was an `e13391e` artifact under controlled concurrency; §3 RR, §4 TPOT win, §5 D-pool ceiling, §5.1 consumer crash stand. RESULTS_SUMMARY.md §4 confirm+refine note: clean ablation confirms the D-pool capacity thesis and adds regime-dependence. pd_separation_analysis.md scoped caution: thesis confirmed; flags only reuse-dependent figures for cross-check (this study used a different stack). figs/mb5/CORRECTION.md flags mb5_producer_hotspot.png as retracted; §3 RR and §5 D-pool figures stand.	2026-05-31 20:14:14 +08:00
Gahow Wang	fc92410ec9	Invalidate prior A/B results + add proper experiment harness Prior cross-machine comparison (commit `1e86285`) was invalid: dash0 baseline used warm instances with residual KV cache, inflating TTFT by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start requests; WARM TTFT p90=3.3s vs fresh=0.26s. Fair same-machine comparison (both fresh restart on dash0): Baseline: TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200 Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200 Elastic is WORSE due to Mooncake kv_both memory overhead. Changes: - REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata - pd_separation_analysis.md: update elastic TL;DR with correct numbers - cache_aware_proxy.py: fix double-decrement bugs in offload path, add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK) - bench.sh: standardized experiment harness with guaranteed GPU cleanup and fresh-state verification (nvidia-smi check before start) - run_elastic_stability_test.sh: two-phase elastic vs baseline test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:54:21 +08:00
Gahow Wang	2b0ac70ee7	Phase 1 milestone: system-level analysis + reproducible report - REPORT.md: self-contained milestone report covering baseline vs elastic setup, exact launch commands, benchmark params, results, log locations, and repo structure — sufficient for anyone to reproduce - analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown (KV cache hit ratio, per-class TTFT, GPU util paradox explanation) - scripts/cache_aware_proxy.py: round-robin P-instance selection replacing argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x) - scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 16:17:41 +08:00
Gahow Wang	efa70f05b5	Consolidate analysis into single report with appendix Merged roofline_analysis.md into pd_separation_analysis.md. Restructured as a self-contained research report: 1. TL;DR with key finding (KV cache memory wall) 2. Workload characterization (trace stats + cache reuse) 3. Experiment setup (hardware, software, configs, scripts) 4. Results (main comparison, GPU util, breakdown, ablations) 5. Analysis (DistServe assumptions, roofline, root cause) 6. Conclusions 7. Appendix: all experiment artifacts, data paths, reproducing steps One document to read, with pointers to data for deeper analysis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:23:23 +08:00
Gahow Wang	05592e6adc	Agentic workload PD separation analysis with trace-driven benchmarks Systematic study of prefill-decode disaggregation for agentic LLM workloads using production GLM-5.1 coder trace (2.1M requests, 71B input tokens). Key findings: - Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7% without PD separation, matching PD-Sep's decode isolation benefit - PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain when using the same cache-aware scheduler - Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x vs decode AI <2), but absolute FLOPs drop 71% from cache hits - For agentic MoE workloads, cache-aware routing > PD separation Infrastructure: - Trace sampler preserving session structure + hash_ids for prefix sharing - Async trace replayer with streaming TTFT/TPOT/E2E measurement - Unified cache-aware + token-level load-balanced global scheduler proxy supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes - vLLM 0.18.1 scheduler patch for KV transfer abort race condition - Roofline analysis tool for prefill/decode compute characterization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:21:57 +08:00

5 Commits