PD-disagg docs: annotated corrections for e13391e contamination

Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).

  PD_DISAGG_RESULTS.md  top CORRECTION banner + §6 RETRACTED marker.
                        §6 (session-affinity hot-pin) was an `e13391e`
                        artifact under controlled concurrency; §3 RR, §4
                        TPOT win, §5 D-pool ceiling, §5.1 consumer crash
                        stand.
  RESULTS_SUMMARY.md    §4 confirm+refine note: clean ablation confirms
                        the D-pool capacity thesis and adds regime-
                        dependence.
  pd_separation_analysis.md  scoped caution: thesis confirmed; flags
                        only reuse-dependent figures for cross-check
                        (this study used a different stack).
  figs/mb5/CORRECTION.md  flags mb5_producer_hotspot.png as retracted;
                        §3 RR and §5 D-pool figures stand.
2026-05-31 20:14:14 +08:00
parent 0b180c191e
commit a2111b6e18
4 changed files with 101 additions and 0 deletions


@@ -10,6 +10,51 @@ Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instru
---
## ⚠️ CORRECTION (2026-05-30) — read before §6
A contamination was found in the "fresh" vLLM used for the runs below.
`scripts/deploy_vllm_patches.sh` had copied our fork commit **`e13391e`** over the
pip-installed release; that commit calls `evict_blocks(sent_block_ids)` on
`finished_sending`, i.e. it **evicts a producer's prefix-cache blocks on every KV
transfer**. So a disaggregated producer could never keep a session's prefix warm,
*regardless of routing*. We have since gated that behind `VLLM_EVICT_SENT_BLOCKS`
(default off) and re-run everything on the corrected stack.
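A minimal sketch of what such a gate could look like, for readers who have not seen the patch. Only `VLLM_EVICT_SENT_BLOCKS`, `evict_blocks`, `sent_block_ids`, and `finished_sending` come from the text above; the helper names, the hook signature, and the set of accepted truthy strings are assumptions made for illustration and are not taken from the fork.

```python
import os

# Assumption: these spellings count as "on"; the real patch may parse differently.
_TRUTHY = {"1", "true", "yes", "on"}


def evict_sent_blocks_enabled(env=None):
    """Return True only when VLLM_EVICT_SENT_BLOCKS is explicitly switched on."""
    env = os.environ if env is None else env
    # Default off: an unset variable must leave eviction disabled.
    return env.get("VLLM_EVICT_SENT_BLOCKS", "0").strip().lower() in _TRUTHY


def on_finished_sending(sent_block_ids, evict_blocks, env=None):
    """Hypothetical hook invoked once a KV transfer has finished sending.

    `evict_blocks` is passed in as a callable so the sketch stays
    self-contained; it stands in for the producer-side eviction call.
    Returns True if eviction ran, False if the blocks were left cached.
    """
    if evict_sent_blocks_enabled(env):
        evict_blocks(sent_block_ids)
        return True
    return False
```

With the variable unset, the hook leaves the producer's prefix-cache blocks in place, which is the behaviour the corrected runs rely on.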
**Retracted (was a pure artifact of `e13391e`):**
- **All of §6** ("smarter routing does not save PD" / "session-affinity is
*strictly worse*" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit
~0.2%"). On the corrected stack, **session-affinity recovers producer reuse to
full parity with colo (APC 71–82%)**: the collapse was the eviction bug starving
the very cache session-affinity exists to fill, not a routing pathology.
- The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix
*exactly as well as colo* once routing is session-sticky.
**Still stands (independent of `e13391e`):**
- **§3 round-robin** numbers — RR sends consecutive turns to *different* producers,
so its ~0% prefix-hit is a **routing** artifact (not the eviction bug) and is
reproduced on the clean stack; RR PD still loses to 8C.
- **§4** PD wins TPOT (decode isolation) — robust.
- **§5.1** the consumer counter-underflow crash — a real, separate vLLM 0.18.1 bug.
- **§5** the D-pool capacity-ceiling mechanism (decode side pegs while prefill
strands) — real.
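The §3 routing effect can be illustrated with a toy producer selector. This is a sketch under stated assumptions, not the stock proxy's code: the function names and the hash-based sticky policy are hypothetical, and the producer count is arbitrary.

```python
import hashlib
from itertools import count


def round_robin_router(num_producers):
    """Pick producers in rotation, ignoring which session a turn belongs to."""
    counter = count()

    def route(session_id):
        return next(counter) % num_producers

    return route


def session_affinity_router(num_producers):
    """Pin every turn of a session to one producer via a stable hash."""

    def route(session_id):
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_producers

    return route


def producers_hit(route, session_id, turns):
    """List the producer index chosen for each consecutive turn of a session."""
    return [route(session_id) for _ in range(turns)]


if __name__ == "__main__":
    # Round-robin: each turn of the session lands on a different producer.
    print(producers_hit(round_robin_router(4), "session-a", 4))
    # Session-affinity: every turn lands on the same producer.
    print(producers_hit(session_affinity_router(4), "session-a", 4))
```

Under round-robin, each turn reaches a producer that has not seen the session's earlier turns, so it finds no cached prefix whether or not eviction is enabled. Under the sticky policy the same producer sees every turn, which is why its reuse depends on the blocks surviving the KV transfer.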
**Corrected verdict (the real reason PD loses on agentic).** It is *not* "routing
can't help." On the clean stack PD is **regime-dependent**: it *wins* at low
load / decode-heavy / low-reuse, and loses the **agentic corner** (high reuse +
short output + large context + high concurrency) through a structural crossover —
its static P:D split cannot simultaneously provide the prefix-cache capacity
(needs many producers) *and* the decode capacity (needs many decoders) that
agentic demands at once, while colo's elastic pool provides both. See the
three-axis ablation: **reuse** erodes the edge (1.57×→1.10×), **shape** rotates the
best ratio and is catastrophic at the prefill extreme, and **concurrency** tips PD
at N=64 (APC craters 71%→1.4%, TPS 30%) while colo scales cleanly.
→ Figures: [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/) ·
data: [`analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/) ·
the clean re-run of *this exact* w600 experiment (ratio-swept) is the Fig 4 anchor.
---
## TL;DR (verdict)
**No static prefill/decode split beats 8-way colocation (8C) on this agentic
@@ -205,6 +250,15 @@ single failed request, which is required to compare routing arms fairly in §6.
## 6. The routing handicap — and whether smarter routing rescues PD
> 🛑 **RETRACTED (2026-05-30) — this entire section is an artifact of `e13391e`.**
> The session-affinity runs below were starved by the producer-eviction bug, so
> they could never collect prefix-cache reuse. On the corrected stack
> session-affinity reaches **APC parity with colo (71–82%)** and does *not* stall
> at 0% GPU util. The real mechanism is the capacity/concurrency crossover, not a
> routing pathology — see the CORRECTION banner at the top and
> [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/). Text kept below for the
> record only.
Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That
is not fundamental to disaggregation — it is the stock proxy round-robining
the **prefill** side: consecutive turns of one agentic session land on