PD-disagg docs: annotated corrections for e13391e contamination
Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).
PD_DISAGG_RESULTS.md top CORRECTION banner + §6 RETRACTED marker.
§6 (session-affinity hot-pin) was an `e13391e`
artifact under controlled concurrency; §3 RR, §4
TPOT win, §5 D-pool ceiling, §5.1 consumer crash
stand.
RESULTS_SUMMARY.md §4 confirm+refine note: clean ablation confirms
the D-pool capacity thesis and adds regime-
dependence.
pd_separation_analysis.md scoped caution: thesis confirmed; flags
only reuse-dependent figures for cross-check
(this study used a different stack).
figs/mb5/CORRECTION.md flags mb5_producer_hotspot.png as retracted;
§3 RR and §5 D-pool figures stand.
This commit is contained in:
@@ -10,6 +10,51 @@ Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instru
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ CORRECTION (2026-05-30) — read before §6
|
||||
|
||||
A contamination was found in the "fresh" vLLM used for the runs below.
|
||||
`scripts/deploy_vllm_patches.sh` had copied our fork commit **`e13391e`** over the
|
||||
pip-installed release; that commit calls `evict_blocks(sent_block_ids)` on
|
||||
`finished_sending`, i.e. it **evicts a producer's prefix-cache blocks on every KV
|
||||
transfer**. So a disaggregated producer could never keep a session's prefix warm,
|
||||
*regardless of routing*. We have since gated that behind `VLLM_EVICT_SENT_BLOCKS`
|
||||
(default off) and re-run everything on the corrected stack.
|
||||
|
||||
**Retracted (was a pure artifact of `e13391e`):**
|
||||
- **All of §6** ("smarter routing does not save PD" / "session-affinity is
|
||||
*strictly worse*" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit
|
||||
~0.2%"). On the corrected stack, **session-affinity recovers producer reuse to
|
||||
full parity with colo (APC 71–82%)** — the collapse was the eviction bug starving
|
||||
the very cache session-affinity exists to fill, not a routing pathology.
|
||||
- The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix
|
||||
*exactly as well as colo* once routing is session-sticky.
|
||||
|
||||
**Still stands (independent of `e13391e`):**
|
||||
- **§3 round-robin** numbers — RR sends consecutive turns to *different* producers,
|
||||
so its ~0% prefix-hit is a **routing** artifact (not the eviction bug) and is
|
||||
reproduced on the clean stack; RR PD still loses to 8C.
|
||||
- **§4** PD wins TPOT (decode isolation) — robust.
|
||||
- **§5.1** the consumer counter-underflow crash — a real, separate vLLM 0.18.1 bug.
|
||||
- **§5** the D-pool capacity-ceiling mechanism (decode side pegs while prefill
|
||||
strands) — real.
|
||||
|
||||
**Corrected verdict (the real reason PD loses on agentic).** It is *not* "routing
|
||||
can't help." On the clean stack PD is **regime-dependent**: it *wins* at low
|
||||
load / decode-heavy / low-reuse, and loses the **agentic corner** (high reuse +
|
||||
short output + large context + high concurrency) through a structural crossover —
|
||||
its static P:D split cannot simultaneously provide the prefix-cache capacity
|
||||
(needs many producers) *and* the decode capacity (needs many decoders) that
|
||||
agentic demands at once, while colo's elastic pool provides both. See the
|
||||
three-axis ablation: **reuse** erodes the edge (1.57×→1.10×), **shape** rotates the
|
||||
best ratio and is catastrophic at the prefill extreme, and **concurrency** tips PD
|
||||
at N=64 (APC craters 71%→1.4%, TPS −30%) while colo scales cleanly.
|
||||
|
||||
→ Figures: [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/) ·
|
||||
data: [`analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/) ·
|
||||
the clean re-run of *this exact* w600 experiment (ratio-swept) is the Fig 4 anchor.
|
||||
|
||||
---
|
||||
|
||||
## TL;DR (verdict)
|
||||
|
||||
**No static prefill/decode split beats 8-way colocation (8C) on this agentic
|
||||
@@ -205,6 +250,15 @@ single failed request, which is required to compare routing arms fairly in §6.
|
||||
|
||||
## 6. The routing handicap — and whether smarter routing rescues PD
|
||||
|
||||
> 🛑 **RETRACTED (2026-05-30) — this entire section is an artifact of `e13391e`.**
|
||||
> The session-affinity runs below were starved by the producer-eviction bug, so
|
||||
> they could never collect prefix-cache reuse. On the corrected stack
|
||||
> session-affinity reaches **APC parity with colo (71–82%)** and does *not* stall
|
||||
> at 0% GPU util. The real mechanism is the capacity/concurrency crossover, not a
|
||||
> routing pathology — see the CORRECTION banner at the top and
|
||||
> [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/). Text kept below for the
|
||||
> record only.
|
||||
|
||||
Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That
|
||||
is not fundamental to disaggregation — it is the stock proxy round-robining
|
||||
the **prefill** side: consecutive turns of one agentic session land on
|
||||
|
||||
Reference in New Issue
Block a user