From a2111b6e18fed51a26869ce198867c8c38cd37d8 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Sun, 31 May 2026 20:14:14 +0800
Subject: [PATCH] PD-disagg docs: annotated corrections for e13391e contamination
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).

- PD_DISAGG_RESULTS.md: top CORRECTION banner + §6 RETRACTED marker. §6
  (session-affinity hot-pin) was an `e13391e` artifact under controlled
  concurrency; §3 RR, §4 TPOT win, §5 D-pool ceiling, §5.1 consumer crash stand.
- RESULTS_SUMMARY.md: §4 confirm+refine note; the clean ablation confirms the
  D-pool capacity thesis and adds regime-dependence.
- pd_separation_analysis.md: scoped caution; thesis confirmed; flags only
  reuse-dependent figures for cross-check (this study used a different stack).
- figs/mb5/CORRECTION.md: flags mb5_producer_hotspot.png as retracted; §3 RR
  and §5 D-pool figures stand.
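
The gate described in this message can be sketched roughly as follows. This is an illustration under stated assumptions, not the patched vLLM code: `evict_sent_blocks_enabled` and `on_finished_sending` are hypothetical names, `evict_blocks` is passed in as a plain callable instead of being taken from vLLM, and the set of accepted truthy strings is a guess. Only the variable name `VLLM_EVICT_SENT_BLOCKS` and its default-off behavior come from the text above.

```python
import os

# Assumption: which strings count as "on" is not stated in the patch.
_TRUTHY = {"1", "true", "yes", "on"}


def evict_sent_blocks_enabled(env=None):
    """Return True only if VLLM_EVICT_SENT_BLOCKS is explicitly switched on."""
    env = os.environ if env is None else env
    # Default off: an unset variable means sent blocks are kept.
    return env.get("VLLM_EVICT_SENT_BLOCKS", "0").strip().lower() in _TRUTHY


def on_finished_sending(sent_block_ids, evict_blocks, env=None):
    """Hypothetical hook: evict sent blocks only when the gate is on.

    `evict_blocks` is any callable that takes the list of block ids.
    Returns True if eviction ran, False if the blocks were left cached.
    """
    if evict_sent_blocks_enabled(env):
        evict_blocks(sent_block_ids)
        return True
    # Gate off: the producer keeps its prefix-cache blocks warm.
    return False
```

With the gate off, a producer that just finished a KV transfer keeps the sent blocks, which is what lets a later turn of the same session hit the prefix cache.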
---
 RESULTS_SUMMARY.md                          | 12 +++++
 analysis/pd_separation_analysis.md          | 16 ++++++
 figs/mb5/CORRECTION.md                      | 19 ++++++++
 microbench/fresh_setup/PD_DISAGG_RESULTS.md | 54 +++++++++++++++++++++
 4 files changed, 101 insertions(+)
 create mode 100644 figs/mb5/CORRECTION.md

diff --git a/RESULTS_SUMMARY.md b/RESULTS_SUMMARY.md
index e8ec1d9..4d1d8d4 100644
--- a/RESULTS_SUMMARY.md
+++ b/RESULTS_SUMMARY.md
@@ -97,6 +97,18 @@ dash1 GPU 0, single instance (no kv_connector), chunked-prefill enabled by default,
 - On the phase-isolation dimension, the combined MB1 + MB2 cost-benefit favors PD-disagg, **but that win is overwhelmed by the capacity ceiling**.
 - The Paper §3.2 argument should focus on "the D pool does not fit", with the MB1/MB2 data serving as supporting context (quantifying the per-request transfer charge and the phase-isolation benefit) rather than as the main argument.

+> ✅ **2026-05-30 update: the clean-stack three-axis ablation confirms this section and adds a regime refinement.**
+> This section's capacity argument (D-pool capacity ceiling / decode halved) is **confirmed** on the
+> clean stack, after the `e13391e` contamination was fixed: on the concurrency axis PD tips over at N=64,
+> **producer APC collapses from 71% to 1.4% and TPS drops 30%**, while colo scales linearly (Fig 3).
+> **Refinement**: PD does not "fail across the board on agentic"; it **beats** colo in the
+> *low-load / decode-heavy / low-reuse* regime. What actually fails is the agentic corner (high reuse +
+> short output + large context + high concurrency): a static P:D split cannot provide both the producer
+> capacity that reuse needs *and* the decode capacity, whereas colo's elastic pool provides both.
+> **Additional note**: the conclusion in the old MB5 doc (`PD_DISAGG_RESULTS.md` §6) that "session-affinity
+> cannot save PD / PD reuse = 0%" is a **contamination artifact** of `e13391e` (the producer evicts the prefix after every KV transfer) and **has been retracted**; on the clean stack, session-routed producer APC is on par with colo (71–82%).
+> Figures: [`figs/mb5_pd_ablation/`](figs/mb5_pd_ablation/).
+
 ## 5. Empirical status of the EAR design (§4)

 | Pillar | Empirically verified | Pending verification |
diff --git a/analysis/pd_separation_analysis.md b/analysis/pd_separation_analysis.md
index 527497d..dc77f88 100644
--- a/analysis/pd_separation_analysis.md
+++ b/analysis/pd_separation_analysis.md
@@ -23,6 +23,22 @@ Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memo
 > Earlier cross-machine comparison (commit `1e86285`) was invalidated: baseline used warm instances. See REPORT.md §3.5.
 | **Delta** | **-45%** | **-36%** | **-44%** | **+30pp** |

+> ✅⚠️ **2026-05-30: confirmed + refined by the clean MB5 ablation; one caveat.**
+> A producer-side contamination (`e13391e`: evicts a producer's prefix-cache on every
+> KV transfer) was found in the *agentic-kv-fresh* MB5 stack and gated off; everything
+> was re-run clean.
+> - **Confirmed:** this doc's central thesis (PD's failure is a **decode-side KV memory
+>   wall**, not transfer/prefill cost) is reproduced on the clean stack (concurrency
+>   axis: at N=64 the split collapses, APC 71%→1.4%, TPS −30%; colo scales). Fig 3.
+> - **Refined:** "PD separation is net negative" is **regime-dependent**, not universal.
+>   Clean ablation shows PD *wins* at low load / decode-heavy / low-reuse and loses the
+>   **agentic corner** (high reuse + short output + large context + high concurrency).
+> - **Caveat (cross-check):** if this study's runs used the fork vLLM that contains
+>   `e13391e`, any **producer prefix-cache / APC reuse** figures here (e.g. §5.3.1) may be
+>   understated. The decode-memory-wall result is *not* reuse-dependent and is unaffected.
+> On the clean stack, session-routed producers reach APC parity with colo (71–82%).
+> Figures: [`figs/mb5_pd_ablation/`](../figs/mb5_pd_ablation/).
+
 ---

 ## 1. Workload Characterization
diff --git a/figs/mb5/CORRECTION.md b/figs/mb5/CORRECTION.md
new file mode 100644
index 0000000..d4f7bee
--- /dev/null
+++ b/figs/mb5/CORRECTION.md
@@ -0,0 +1,19 @@
+# ⚠️ Correction notice for figs/mb5/ (2026-05-30)
+
+These figures back `microbench/fresh_setup/PD_DISAGG_RESULTS.md`. A producer-side
+contamination was later found in the stack that produced them: commit **`e13391e`**
+(deployed over the "fresh" pip vLLM by `scripts/deploy_vllm_patches.sh`) evicts a
+producer's prefix-cache blocks on every KV transfer, so a disaggregated producer
+could never keep a session's prefix warm.
It is now gated behind
+`VLLM_EVICT_SENT_BLOCKS` (default off) and everything was re-run clean.
+
+| figure | section | status |
+|---|---|---|
+| `mb5_producer_hotspot.png` | §6.3 session-affinity hot-pinning | 🛑 **RETRACTED**: pure `e13391e` artifact. On the clean stack, session-routed producers reach APC parity with colo (71–82%); there is no 0%-util stall / hot-pin pathology. |
+| `mb5_latency_compare.png` | §3 round-robin headline | ✅ stands: RR's ~0% prefix-hit is a *routing* artifact (consecutive turns → different producers), not the eviction bug; reproduced clean. |
+| `mb5_kv_timeline.png`, `mb5_role_split.png`, `mb5_peak_utilization.png` | §5 per-role KV pool occupancy | ✅ D-pool capacity-ceiling mechanism stands (decode pegs while prefill strands). P-pool occupancy may read slightly low under eviction; the qualitative split is unaffected. |
+| `mb5_summary.csv` | aggregate | mixed: §3/§5 rows valid; any session-affinity rows superseded. |
+
+**Superseded by the corrected three-axis ablation:** [`../mb5_pd_ablation/`](../mb5_pd_ablation/)
+(reuse / shape / concurrency), data in [`../../analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/).
+Raw §6 data `analysis/mb5/session_prod.json` is contaminated; `analysis/mb5/rr_prod.json` (round-robin) stands.
diff --git a/microbench/fresh_setup/PD_DISAGG_RESULTS.md b/microbench/fresh_setup/PD_DISAGG_RESULTS.md
index 6a07a21..8b1e510 100644
--- a/microbench/fresh_setup/PD_DISAGG_RESULTS.md
+++ b/microbench/fresh_setup/PD_DISAGG_RESULTS.md
@@ -10,6 +10,51 @@ Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instru

 ---

+## ⚠️ CORRECTION (2026-05-30): read before §6
+
+A contamination was found in the "fresh" vLLM used for the runs below.
+`scripts/deploy_vllm_patches.sh` had copied our fork commit **`e13391e`** over the
+pip-installed release; that commit calls `evict_blocks(sent_block_ids)` on
+`finished_sending`, i.e.
it **evicts a producer's prefix-cache blocks on every KV
+transfer**. So a disaggregated producer could never keep a session's prefix warm,
+*regardless of routing*. We have since gated that behind `VLLM_EVICT_SENT_BLOCKS`
+(default off) and re-run everything on the corrected stack.
+
+**Retracted (was a pure artifact of `e13391e`):**
+- **All of §6** ("smarter routing does not save PD" / "session-affinity is
+  *strictly worse*" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit
+  ~0.2%"). On the corrected stack, **session-affinity recovers producer reuse to
+  full parity with colo (APC 71–82%)**; the collapse was the eviction bug starving
+  the very cache session-affinity exists to fill, not a routing pathology.
+- The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix
+  *exactly as well as colo* once routing is session-sticky.
+
+**Still stands (independent of `e13391e`):**
+- **§3 round-robin** numbers: RR sends consecutive turns to *different* producers,
+  so its ~0% prefix-hit is a **routing** artifact (not the eviction bug) and is
+  reproduced on the clean stack; RR PD still loses to 8C.
+- **§4** PD wins TPOT (decode isolation): robust.
+- **§5.1** the consumer counter-underflow crash: a real, separate vLLM 0.18.1 bug.
+- **§5** the D-pool capacity-ceiling mechanism (decode side pegs while prefill
+  strands): real.
+
+**Corrected verdict (the real reason PD loses on agentic).** It is *not* "routing
+can't help." On the clean stack PD is **regime-dependent**: it *wins* at low
+load / decode-heavy / low-reuse, and loses the **agentic corner** (high reuse +
+short output + large context + high concurrency) through a structural crossover:
+its static P:D split cannot simultaneously provide the prefix-cache capacity
+(needs many producers) *and* the decode capacity (needs many decoders) that
+agentic demands at once, while colo's elastic pool provides both.
See the
+three-axis ablation: **reuse** erodes the edge (1.57×→1.10×), **shape** rotates the
+best ratio and is catastrophic at the prefill extreme, and **concurrency** tips PD
+at N=64 (APC craters 71%→1.4%, TPS −30%) while colo scales cleanly.
+
+→ Figures: [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/) ·
+data: [`analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/) ·
+the clean re-run of *this exact* w600 experiment (ratio-swept) is the Fig 4 anchor.
+
+---
+
 ## TL;DR (verdict)

 **No static prefill/decode split beats 8-way colocation (8C) on this agentic
@@ -205,6 +250,15 @@ single failed request, which is required to compare routing arms fairly in §6.

 ## 6. The routing handicap, and whether smarter routing rescues PD

+> 🛑 **RETRACTED (2026-05-30): this entire section is an artifact of `e13391e`.**
+> The session-affinity runs below were starved by the producer-eviction bug, so
+> they could never collect prefix-cache reuse. On the corrected stack
+> session-affinity reaches **APC parity with colo (71–82%)** and does *not* stall
+> at 0% GPU util. The real mechanism is the capacity/concurrency crossover, not a
+> routing pathology; see the CORRECTION banner at the top and
+> [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/). Text kept below for the
+> record only.
+
 Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That is
 not fundamental to disaggregation; it is the stock proxy round-robining the
 **prefill** side: consecutive turns of one agentic session land on