5 Commits

Author SHA1 Message Date
a2111b6e18 PD-disagg docs: annotated corrections for e13391e contamination
Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).

  PD_DISAGG_RESULTS.md  top CORRECTION banner + §6 RETRACTED marker.
                        §6 (session-affinity hot-pin) was an `e13391e`
                        artifact under controlled concurrency; §3 RR, §4
                        TPOT win, §5 D-pool ceiling, §5.1 consumer crash
                        stand.
  RESULTS_SUMMARY.md    §4 confirm+refine note: clean ablation confirms
                        the D-pool capacity thesis and adds regime-
                        dependence.
  pd_separation_analysis.md  scoped caution: thesis confirmed; flags
                        only reuse-dependent figures for cross-check
                        (this study used a different stack).
  figs/mb5/CORRECTION.md  flags mb5_producer_hotspot.png as retracted;
                        §3 RR and §5 D-pool figures stand.
2026-05-31 20:14:14 +08:00
a2f2645fda PD_DISAGG_RESULTS §6.3: producer hot-pinning figure
Direct per-producer KV-pool evidence for the session-affinity backfire.
At the same 4P+4D ratio:
- round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01)
- session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25)

A 25x jump in producer load imbalance — heavy multi-turn sessions
concentrate onto single producers, the same hot-pinning pathology as
sticky routing in the colocated §3.3 study.

plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from
snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs
session comparison) — same two-stage pattern as aggregate_mb5.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 00:38:20 +08:00
6243b78bba PD_DISAGG_RESULTS §6: session-affinity routing does not rescue PD
Swept session-affinity P routing (MB5_P_ROUTING=session) across all
four ratios on the metrics-fixed stack. Findings:

- Strictly worse than round-robin at every ratio. 4P+4D: round-robin
  100% vs session-affinity 36% completion.
- Success DECREASES monotonically as decode capacity grows
  (6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the
  "session prefill is faster so it needs more D" hypothesis.
- GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster
  stalls on KV-transfer/admission coordination, not compute. This is
  the deepest anti-PD argument: paid-for hardware does nothing while
  requests pile up; colocation keeps every GPU busy.
- Mechanism: session-affinity pins heavy multi-turn sessions onto
  single producers (producer hot-pinning, same pathology as sticky
  routing in the colocated §3.3 study); fewer producers -> worse
  concentration -> the monotonic decline. Failed transfers also pin
  producer KV (kv_load_failure_policy=fail), compounding to deadlock.

Verdict: neither ratio tuning nor routing policy rescues static
PD-disagg for this agentic workload — the failure is structural.

mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 00:25:10 +08:00
2e6a369046 PD_DISAGG_RESULTS §5.1: D-pool pressure crashes consumers
Document the consumer EngineCore crash chain (D-pool 97% -> 112k-token
KV transfer fails -> negative prompt-token counter -> prometheus
ValueError -> engine dead -> cliff failure). Explains the round-robin
6P+2D rep variance (100/56/80%) as intermittent consumer death, and
notes the counter-clamp patch needed to compare routing arms fairly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:02:21 +08:00
8596135680 MB5 analysis: per-role KV split proves static-partition mismatch
aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a
  PID->role map parsed from vllm_logs filenames. The cluster average
  hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool
  is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving
  host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced
  (matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.

PD_DISAGG_RESULTS.md: consolidated writeup with figures inline.
Verdict: no static P:D ratio beats 8C colocation. The binding
constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D,
P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays
elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner,
the MB1 phase-isolation benefit is real) but loses TTFT and sheds
load. Round-robin P routing also zeroes prefix-cache reuse; a
session-affinity re-run of 6P+2D is in flight to test the fix.

Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization,
mb5_latency_compare + mb5_summary.csv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:05:17 +08:00