agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	a2111b6e18	PD-disagg docs: annotated corrections for `e13391e` contamination Adds dated, non-destructive correction notes to the contaminated PD-vs-colo artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on `finished_sending`, deployed over the "fresh" pip vLLM by `scripts/deploy_vllm_patches.sh`) was found and gated behind `VLLM_EVICT_SENT_BLOCKS` (default off). PD_DISAGG_RESULTS.md top CORRECTION banner + §6 RETRACTED marker. §6 (session-affinity hot-pin) was an `e13391e` artifact under controlled concurrency; §3 RR, §4 TPOT win, §5 D-pool ceiling, §5.1 consumer crash stand. RESULTS_SUMMARY.md §4 confirm+refine note: clean ablation confirms the D-pool capacity thesis and adds regime-dependence. pd_separation_analysis.md scoped caution: thesis confirmed; flags only reuse-dependent figures for cross-check (this study used a different stack). figs/mb5/CORRECTION.md flags mb5_producer_hotspot.png as retracted; §3 RR and §5 D-pool figures stand.	2026-05-31 20:14:14 +08:00
Gahow Wang	a2f2645fda	PD_DISAGG_RESULTS §6.3: producer hot-pinning figure Direct per-producer KV-pool evidence for the session-affinity backfire. At the same 4P+4D ratio: - round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01) - session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25) A 25x jump in producer load imbalance — heavy multi-turn sessions concentrate onto single producers, the same hot-pinning pathology as sticky routing in the colocated §3.3 study. plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs session comparison) — same two-stage pattern as aggregate_mb5.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 00:38:20 +08:00
Gahow Wang	6243b78bba	PD_DISAGG_RESULTS §6: session-affinity routing does not rescue PD Swept session-affinity P routing (MB5_P_ROUTING=session) across all four ratios on the metrics-fixed stack. Findings: - Strictly worse than round-robin at every ratio. 4P+4D: round-robin 100% vs session-affinity 36% completion. - Success DECREASES monotonically as decode capacity grows (6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the "session prefill is faster so it needs more D" hypothesis. - GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster stalls on KV-transfer/admission coordination, not compute. This is the deepest anti-PD argument: paid-for hardware does nothing while requests pile up; colocation keeps every GPU busy. - Mechanism: session-affinity pins heavy multi-turn sessions onto single producers (producer hot-pinning, same pathology as sticky routing in the colocated §3.3 study); fewer producers -> worse concentration -> the monotonic decline. Failed transfers also pin producer KV (kv_load_failure_policy=fail), compounding to deadlock. Verdict: neither ratio tuning nor routing policy rescues static PD-disagg for this agentic workload — the failure is structural. mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 00:25:10 +08:00
Gahow Wang	2e6a369046	PD_DISAGG_RESULTS §5.1: D-pool pressure crashes consumers Document the consumer EngineCore crash chain (D-pool 97% -> 112k-token KV transfer fails -> negative prompt-token counter -> prometheus ValueError -> engine dead -> cliff failure). Explains the round-robin 6P+2D rep variance (100/56/80%) as intermittent consumer death, and notes the counter-clamp patch needed to compare routing arms fairly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 13:02:21 +08:00
Gahow Wang	8596135680	MB5 analysis: per-role KV split proves static-partition mismatch aggregate_mb5.py: - Split the cluster KV timeline by role (P-pool vs D-pool) using a PID->role map parsed from vllm_logs filenames. The cluster average hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool is actually pegged at ~100% while prefill idles at ~30%. - Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced (matplotlib) renders locally. matplotlib import is now lazy. - New plot_role_split figure + p/d peak/steady columns in the CSV. PD_DISAGG_RESULTS.md: consolidated writeup with figures inline. Verdict: no static P:D ratio beats 8C colocation. The binding constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D, P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner, the MB1 phase-isolation benefit is real) but loses TTFT and sheds load. Round-robin P routing also zeroes prefix-cache reuse; a session-affinity re-run of 6P+2D is in flight to test the fix. Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization, mb5_latency_compare + mb5_summary.csv. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 12:05:17 +08:00
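
Commit 8596135680 describes splitting the cluster KV timeline by role with a PID->role map parsed from log filenames. The idea can be sketched roughly as below. This is not the code in `aggregate_mb5.py`: the filename pattern (`producer_<pid>.log` / `consumer_<pid>.log`), the sample tuple layout, and the function names are all assumptions made for illustration, since the log does not show the real ones.

```python
import re
from collections import defaultdict
from statistics import mean

# ASSUMPTION: hypothetical filename convention; the real vllm_logs naming is not shown.
LOG_NAME = re.compile(r"^(?P<role>producer|consumer)_(?P<pid>\d+)\.log$")


def pid_role_map(filenames):
    """Map each PID to 'P' (prefill) or 'D' (decode) from the assumed log filename."""
    roles = {}
    for name in filenames:
        m = LOG_NAME.match(name)
        if m:
            roles[int(m.group("pid"))] = "P" if m.group("role") == "producer" else "D"
    return roles


def split_by_role(samples, roles):
    """Average KV usage per timestamp, separately for each role.

    samples: iterable of (timestamp, pid, kv_usage), kv_usage in [0, 1] (assumed layout).
    Returns {role: {timestamp: mean usage}}; samples from unknown PIDs are skipped.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for ts, pid, usage in samples:
        role = roles.get(pid)
        if role is not None:
            buckets[role][ts].append(usage)
    return {
        role: {ts: mean(vals) for ts, vals in by_ts.items()}
        for role, by_ts in buckets.items()
    }


# Made-up values, not measurements: the cluster-wide mean sits between the two
# pools, so a saturated decode pool is only visible once the roles are separated.
roles = pid_role_map(["producer_101.log", "producer_102.log", "consumer_201.log"])
samples = [(0, 101, 0.2), (0, 102, 0.4), (0, 201, 1.0)]
per_role = split_by_role(samples, roles)
```

Keeping this step free of plotting dependencies mirrors the two-stage pattern the commit mentions, where the reduce step runs on the serving host and rendering happens locally.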

5 Commits
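
Commit a2f2645fda reports producer imbalance as a spread in percentage points and a coefficient of variation (CV); the "25x jump" it quotes appears to be the ratio of the two CVs (0.25 / 0.01). A minimal sketch of those two statistics follows. It assumes spread means max minus min and that CV uses the population standard deviation; the log states neither, and `imbalance` is a hypothetical helper, not a function from `plot_producer_hotspot.py`. Note that this commit's figure was later retracted by a2111b6e18, so the sketch illustrates the metric only.

```python
from statistics import mean, pstdev


def imbalance(kv_pct):
    """Return (spread in percentage points, coefficient of variation).

    kv_pct: per-producer KV-pool utilization in percent.
    ASSUMPTION: spread = max - min, CV = population std / mean.
    """
    spread = max(kv_pct) - min(kv_pct)
    cv = pstdev(kv_pct) / mean(kv_pct)
    return spread, cv


# Made-up utilization values for illustration, not the measured ones.
balanced = imbalance([50.0, 50.0, 50.0, 50.0])
skewed = imbalance([90.0, 60.0, 60.0, 30.0])
```

CV normalizes by the mean, so it stays comparable across runs whose overall utilization differs, while the spread is easier to read straight off a plot.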