PD-disagg docs: annotated corrections for e13391e contamination

Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).

  PD_DISAGG_RESULTS.md  top CORRECTION banner + §6 RETRACTED marker.
                        §6 (session-affinity hot-pin) was an `e13391e`
                        artifact under controlled concurrency; §3 RR, §4
                        TPOT win, §5 D-pool ceiling, §5.1 consumer crash
                        stand.
  RESULTS_SUMMARY.md    §4 confirm+refine note: clean ablation confirms
                        the D-pool capacity thesis and adds regime-
                        dependence.
  pd_separation_analysis.md  scoped caution: thesis confirmed; flags
                        only reuse-dependent figures for cross-check
                        (this study used a different stack).
  figs/mb5/CORRECTION.md  flags mb5_producer_hotspot.png as retracted;
                        §3 RR and §5 D-pool figures stand.
2026-05-31 20:14:14 +08:00
parent 0b180c191e
commit a2111b6e18
4 changed files with 101 additions and 0 deletions

figs/mb5/CORRECTION.md Normal file

@@ -0,0 +1,19 @@
# ⚠️ Correction notice for figs/mb5/ (2026-05-30)
These figures back `microbench/fresh_setup/PD_DISAGG_RESULTS.md`. A producer-side
contamination was later found in the stack that produced them: commit **`e13391e`**
(deployed over the "fresh" pip vLLM by `scripts/deploy_vllm_patches.sh`) evicts a
producer's prefix-cache blocks on every KV transfer, so a disaggregated producer
could never keep a session's prefix warm. The eviction is now gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off), and all runs were repeated on the clean stack.

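
For orientation, the gate described above can be pictured roughly as follows. This is a minimal sketch, not the patched vLLM code: `ToyPrefixCache`, `evict_sent_blocks_enabled`, and the set of accepted truthy strings are hypothetical. Only the names `VLLM_EVICT_SENT_BLOCKS`, `evict_blocks`, `sent_block_ids`, and `finished_sending` come from the commit.

```python
import os


def evict_sent_blocks_enabled(env=os.environ) -> bool:
    """Return True only when the flag is explicitly set to a truthy value.

    Assumption: "1", "true", "yes", "on" count as truthy; unset means off.
    """
    value = env.get("VLLM_EVICT_SENT_BLOCKS", "0")
    return value.strip().lower() in {"1", "true", "yes", "on"}


class ToyPrefixCache:
    """Hypothetical stand-in for a producer's prefix cache (not a vLLM class)."""

    def __init__(self):
        self.blocks = set()

    def add_blocks(self, block_ids):
        self.blocks.update(block_ids)

    def evict_blocks(self, block_ids):
        self.blocks.difference_update(block_ids)


def on_finished_sending(cache, sent_block_ids, env=os.environ):
    """Hook run after a KV transfer completes; evicts only when the gate is on."""
    if evict_sent_blocks_enabled(env):
        cache.evict_blocks(sent_block_ids)
    return cache.blocks


if __name__ == "__main__":
    cache = ToyPrefixCache()
    cache.add_blocks([1, 2, 3])
    # Flag unset: the sent blocks stay cached, so the prefix remains warm.
    print(sorted(on_finished_sending(cache, [1, 2], env={})))
    # Flag set: the sent blocks are dropped, reproducing the contaminated behaviour.
    print(sorted(on_finished_sending(cache, [1, 2], env={"VLLM_EVICT_SENT_BLOCKS": "1"})))
```

With the flag unset, the producer keeps the blocks it has just sent. That is the condition the clean re-runs rely on.
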
| figure | section | status |
|---|---|---|
| `mb5_producer_hotspot.png` | §6.3 session-affinity hot-pinning | 🛑 **RETRACTED** — pure `e13391e` artifact. On the clean stack, session-routed producers reach APC parity with colo (7182%); there is no 0%-util stall / hot-pin pathology. |
| `mb5_latency_compare.png` | §3 round-robin headline | ✅ stands — RR's ~0% prefix-hit is a *routing* artifact (consecutive turns → different producers), not the eviction bug; reproduced clean. |
| `mb5_kv_timeline.png`, `mb5_role_split.png`, `mb5_peak_utilization.png` | §5 per-role KV pool occupancy | ✅ D-pool capacity-ceiling mechanism stands (decode pegs while prefill strands). P-pool occupancy may read slightly low under eviction; the qualitative split is unaffected. |
| `mb5_summary.csv` | aggregate | mixed — §3/§5 rows valid; any session-affinity rows superseded. |
**Superseded by the corrected three-axis ablation:** [`../mb5_pd_ablation/`](../mb5_pd_ablation/)
(reuse / shape / concurrency), data in [`../../analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/).
Raw §6 data `analysis/mb5/session_prod.json` is contaminated; `analysis/mb5/rr_prod.json` (round-robin) stands.
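
The distinction the table draws between a routing artifact (§3) and the eviction artifact (§6) can be illustrated with a toy model. This is a sketch under strong assumptions, not the benchmark harness: the `simulate` function, its parameters, and the hit rule (a request hits only if the same producer served that session's immediately preceding turn) are all hypothetical, and the numbers it prints are not measurements from this study.

```python
def simulate(n_producers, n_sessions, n_turns, routing, evict_sent=False):
    """Return the fraction of requests that find their session prefix warm.

    Toy hit rule (an assumption): a request hits only when the chosen
    producer still holds the prefix from the session's previous turn.
    """
    # last_turn[p][sid] = most recent turn of session sid cached on producer p
    last_turn = [dict() for _ in range(n_producers)]
    hits = 0
    total = 0
    request_index = 0
    for turn in range(n_turns):
        for sid in range(n_sessions):
            if routing == "round_robin":
                p = request_index % n_producers
            elif routing == "session":
                p = sid % n_producers
            else:
                raise ValueError(f"unknown routing: {routing}")
            if last_turn[p].get(sid) == turn - 1:
                hits += 1
            total += 1
            if evict_sent:
                # Models eviction on every KV transfer: nothing stays warm.
                last_turn[p].pop(sid, None)
            else:
                last_turn[p][sid] = turn
            request_index += 1
    return hits / total


if __name__ == "__main__":
    print("round-robin, clean:     ", simulate(4, 3, 5, "round_robin"))
    print("session-routed, clean:  ", simulate(4, 3, 5, "session"))
    print("session-routed, evicted:", simulate(4, 3, 5, "session", evict_sent=True))
```

In this model, round-robin scores zero hits even without eviction because consecutive turns land on different producers, whereas session routing loses its hits only when eviction is switched on.
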