Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).
PD_DISAGG_RESULTS.md top CORRECTION banner + §6 RETRACTED marker.
§6 (session-affinity hot-pin) was an `e13391e`
artifact under controlled concurrency; §3 RR, §4
TPOT win, §5 D-pool ceiling, §5.1 consumer crash
stand.
RESULTS_SUMMARY.md §4 confirm+refine note: clean ablation confirms
the D-pool capacity thesis and adds regime-
dependence.
pd_separation_analysis.md scoped caution: thesis confirmed; flags
only reuse-dependent figures for cross-check
(this study used a different stack).
figs/mb5/CORRECTION.md flags mb5_producer_hotspot.png as retracted;
§3 RR and §5 D-pool figures stand.
Direct per-producer KV-pool evidence for the session-affinity backfire.
At the same 4P+4D ratio:
- round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01)
- session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25)
A 25x jump in producer load imbalance — heavy multi-turn sessions
concentrate onto single producers, the same hot-pinning pathology as
sticky routing in the colocated §3.3 study.
plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from
snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs
session comparison) — same two-stage pattern as aggregate_mb5.py.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Swept session-affinity P routing (MB5_P_ROUTING=session) across all
four ratios on the metrics-fixed stack. Findings:
- Strictly worse than round-robin at every ratio. 4P+4D: round-robin
100% vs session-affinity 36% completion.
- Success DECREASES monotonically as decode capacity grows
(6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the
"session prefill is faster so it needs more D" hypothesis.
- GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster
stalls on KV-transfer/admission coordination, not compute. This is
the deepest anti-PD argument: paid-for hardware does nothing while
requests pile up; colocation keeps every GPU busy.
- Mechanism: session-affinity pins heavy multi-turn sessions onto
single producers (producer hot-pinning, same pathology as sticky
routing in the colocated §3.3 study); fewer producers -> worse
concentration -> the monotonic decline. Failed transfers also pin
producer KV (kv_load_failure_policy=fail), compounding to deadlock.
Verdict: neither ratio tuning nor routing policy rescues static
PD-disagg for this agentic workload — the failure is structural.
mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a
PID->role map parsed from vllm_logs filenames. The cluster average
hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool
is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving
host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced
(matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.
PD_DISAGG_RESULTS.md: consolidated writeup with figures inline.
Verdict: no static P:D ratio beats 8C colocation. The binding
constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D,
P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays
elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner,
the MB1 phase-isolation benefit is real) but loses TTFT and sheds
load. Round-robin P routing also zeroes prefix-cache reuse; a
session-affinity re-run of 6P+2D is in flight to test the fix.
Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization,
mb5_latency_compare + mb5_summary.csv.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>