PD-disagg docs: annotated corrections for e13391e contamination
Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).
- PD_DISAGG_RESULTS.md: top CORRECTION banner + §6 RETRACTED marker. §6 (session-affinity hot-pin) was an `e13391e` artifact under controlled concurrency; §3 RR, §4 TPOT win, §5 D-pool ceiling, §5.1 consumer crash stand.
- RESULTS_SUMMARY.md §4: confirm+refine note. Clean ablation confirms the D-pool capacity thesis and adds regime-dependence.
- pd_separation_analysis.md: scoped caution. Thesis confirmed; flags only reuse-dependent figures for cross-check (this study used a different stack).
- figs/mb5/CORRECTION.md: flags mb5_producer_hotspot.png as retracted; §3 RR and §5 D-pool figures stand.
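
The gating described in the commit message can be pictured with a minimal sketch. Only `VLLM_EVICT_SENT_BLOCKS`, `evict_blocks(sent_block_ids)` and `finished_sending` come from the message; `ToyProducerCache`, `on_finished_sending`, `evict_sent_blocks_enabled` and the set of accepted "on" values are hypothetical illustrations, not the fork's or vLLM's actual code.

```python
import os


def evict_sent_blocks_enabled(env=None):
    """Return True only when VLLM_EVICT_SENT_BLOCKS is explicitly switched on.

    The accepted truthy spellings are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("VLLM_EVICT_SENT_BLOCKS", "0")
    return value.strip().lower() in {"1", "true", "yes", "on"}


class ToyProducerCache:
    """Hypothetical stand-in for a producer's prefix cache (not vLLM code)."""

    def __init__(self):
        self.blocks = set()

    def store(self, block_ids):
        self.blocks.update(block_ids)

    def evict_blocks(self, block_ids):
        self.blocks.difference_update(block_ids)


def on_finished_sending(cache, sent_block_ids, env=None):
    """Hypothetical hook: evict the sent blocks only if the gate is on."""
    if evict_sent_blocks_enabled(env):
        cache.evict_blocks(sent_block_ids)


# Default (variable unset): the prefix blocks stay warm after a transfer.
default_cache = ToyProducerCache()
default_cache.store([1, 2, 3])
on_finished_sending(default_cache, [1, 2], env={})

# Gate on: the old e13391e behaviour, sent blocks are dropped.
gated_cache = ToyProducerCache()
gated_cache.store([1, 2, 3])
on_finished_sending(gated_cache, [1, 2], env={"VLLM_EVICT_SENT_BLOCKS": "1"})
```

With the variable unset the toy cache keeps all three blocks; with it set to `1` only the unsent block survives, which is the starvation the corrections below describe.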
@@ -97,6 +97,18 @@ dash1 GPU 0 single instance (no kv_connector), chunked-prefill enabled by default,
- The combined MB1 + MB2 cost-benefit favors PD-disagg on the phase-isolation axis, **but that win is overwhelmed by the capacity ceiling**.
- The Paper §3.2 argument should focus on "the D pool does not have enough room", using the MB1/MB2 data as supporting context (quantifying the per-request transfer charge and the phase-isolation benefit) rather than as the main argument.

> ✅ **2026-05-30 update: the clean-stack three-axis ablation confirms this section and adds a regime refinement.**
> This section's capacity argument (D-pool capacity ceiling / decode halved) is **confirmed** on the clean stack
> after the `e13391e` contamination was fixed: on the concurrency axis PD tips over at N=64, **producer APC collapses from 71% to 1.4% and TPS drops 30%**,
> while colo scales linearly (Fig 3). **Refinement**: PD does not "fail across the board on agentic workloads"; it
> **beats** colo in the *low-load / decode-heavy / low-reuse* regime. What actually fails is the agentic corner (high reuse +
> short output + large context + high concurrency): a static P:D split cannot simultaneously provide the producer capacity that reuse needs
> *and* the decode capacity, whereas colo's elastic pool provides both.
> **Also note**: the old MB5 doc's (`PD_DISAGG_RESULTS.md` §6) conclusion that "session-affinity cannot save PD / PD reuse = 0%"
> is a **contamination artifact** of `e13391e` (the producer evicts the prefix after every KV transfer) and **has been retracted**;
> on the clean stack, session-routed producers reach APC parity with colo (71–82%).
> Figures: [`figs/mb5_pd_ablation/`](figs/mb5_pd_ablation/).

## 5. Empirical status of the EAR design (§4)

| Pillar | Verified | To be verified |
@@ -23,6 +23,22 @@ Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memo
> Earlier cross-machine comparison (commit `1e86285`) was invalidated: baseline used warm instances. See REPORT.md §3.5.
| **Delta** | **-45%** | **-36%** | **-44%** | **+30pp** |

> ✅⚠️ **2026-05-30: confirmed + refined by the clean MB5 ablation; one caveat.**
> A producer-side contamination (`e13391e`: evicts a producer's prefix-cache on every
> KV transfer) was found in the *agentic-kv-fresh* MB5 stack and gated off; everything
> was re-run clean.
> - **Confirmed:** this doc's central thesis, that PD's failure is a **decode-side KV memory
> wall** rather than transfer/prefill cost, is reproduced on the clean stack (concurrency
> axis: at N=64 the split collapses, APC 71%→1.4%, TPS −30%; colo scales). Fig 3.
> - **Refined:** "PD separation is net negative" is **regime-dependent**, not universal.
> Clean ablation shows PD *wins* at low load / decode-heavy / low-reuse and loses the
> **agentic corner** (high reuse + short output + large context + high concurrency).
> - **Caveat (cross-check):** if this study's runs used the fork vLLM that contains
> `e13391e`, any **producer prefix-cache / APC reuse** figures here (e.g. §5.3.1) may be
> understated. The decode-memory-wall result is *not* reuse-dependent and is unaffected.
> On the clean stack, session-routed producers reach APC parity with colo (71–82%).
> Figures: [`figs/mb5_pd_ablation/`](../figs/mb5_pd_ablation/).

---

## 1. Workload Characterization
19
figs/mb5/CORRECTION.md
Normal file
@@ -0,0 +1,19 @@
# ⚠️ Correction notice for figs/mb5/ (2026-05-30)

These figures back `microbench/fresh_setup/PD_DISAGG_RESULTS.md`. A producer-side
contamination was later found in the stack that produced them: commit **`e13391e`**
(deployed over the "fresh" pip vLLM by `scripts/deploy_vllm_patches.sh`) evicts a
producer's prefix-cache blocks on every KV transfer, so a disaggregated producer
could never keep a session's prefix warm. It is now gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off) and everything was re-run clean.

| figure | section | status |
|---|---|---|
| `mb5_producer_hotspot.png` | §6.3 session-affinity hot-pinning | 🛑 **RETRACTED**: pure `e13391e` artifact. On the clean stack, session-routed producers reach APC parity with colo (71–82%); there is no 0%-util stall / hot-pin pathology. |
| `mb5_latency_compare.png` | §3 round-robin headline | ✅ stands: RR's ~0% prefix-hit is a *routing* artifact (consecutive turns → different producers), not the eviction bug; reproduced clean. |
| `mb5_kv_timeline.png`, `mb5_role_split.png`, `mb5_peak_utilization.png` | §5 per-role KV pool occupancy | ✅ D-pool capacity-ceiling mechanism stands (decode pegs while prefill strands). P-pool occupancy may read slightly low under eviction; the qualitative split is unaffected. |
| `mb5_summary.csv` | aggregate | mixed: §3/§5 rows valid; any session-affinity rows superseded. |

**Superseded by the corrected three-axis ablation:** [`../mb5_pd_ablation/`](../mb5_pd_ablation/)
(reuse / shape / concurrency), data in [`../../analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/).
Raw §6 data `analysis/mb5/session_prod.json` is contaminated; `analysis/mb5/rr_prod.json` (round-robin) stands.
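
The correction notice above attributes round-robin's ~0% prefix-hit to routing (consecutive turns of a session reach different producers) rather than to the eviction bug. The toy simulation below illustrates that mechanism only. Everything in it is an assumption made for illustration: the router functions, the "same producer as the previous turn" hit criterion, the CRC32-based stickiness, and the 5-session / 4-producer / 4-turn setup. It does not model the stock proxy or reproduce any measured number.

```python
import zlib
from itertools import count


def round_robin_router(num_producers):
    """Hypothetical RR proxy: each request goes to the next producer in turn."""
    counter = count()
    return lambda session_id: next(counter) % num_producers


def session_sticky_router(num_producers):
    """Hypothetical session-affinity proxy: a session always maps to one producer."""
    return lambda session_id: zlib.crc32(session_id.encode()) % num_producers


def consecutive_turn_hit_rate(route, num_sessions, num_turns):
    """Fraction of requests served by the producer that served the same
    session's previous turn (a crude proxy for a warm prefix)."""
    last_producer = {}
    hits = 0
    total = 0
    for _turn in range(num_turns):
        # Sessions are interleaved: every session issues one request per turn.
        for s in range(num_sessions):
            session_id = f"session-{s}"
            producer = route(session_id)
            if last_producer.get(session_id) == producer:
                hits += 1
            last_producer[session_id] = producer
            total += 1
    return hits / total


rr_rate = consecutive_turn_hit_rate(
    round_robin_router(4), num_sessions=5, num_turns=4
)
sticky_rate = consecutive_turn_hit_rate(
    session_sticky_router(4), num_sessions=5, num_turns=4
)
print(f"round-robin: {rr_rate:.2f}, session-sticky: {sticky_rate:.2f}")
```

In this setup round-robin never returns a session to the producer that served its previous turn, while sticky routing does so on every turn after the first. The session count is deliberately not a multiple of the producer count; with an exact multiple, this toy round-robin would coincidentally be sticky.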
@@ -10,6 +10,51 @@ Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instru

---

## ⚠️ CORRECTION (2026-05-30): read before §6

A contamination was found in the "fresh" vLLM used for the runs below.
`scripts/deploy_vllm_patches.sh` had copied our fork commit **`e13391e`** over the
pip-installed release; that commit calls `evict_blocks(sent_block_ids)` on
`finished_sending`, i.e. it **evicts a producer's prefix-cache blocks on every KV
transfer**. So a disaggregated producer could never keep a session's prefix warm,
*regardless of routing*. We have since gated that behind `VLLM_EVICT_SENT_BLOCKS`
(default off) and re-run everything on the corrected stack.

**Retracted (was a pure artifact of `e13391e`):**
- **All of §6** ("smarter routing does not save PD" / "session-affinity is
*strictly worse*" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit
~0.2%"). On the corrected stack, **session-affinity recovers producer reuse to
full parity with colo (APC 71–82%)**. The collapse was the eviction bug starving
the very cache session-affinity exists to fill, not a routing pathology.
- The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix
*exactly as well as colo* once routing is session-sticky.

**Still stands (independent of `e13391e`):**
- **§3 round-robin** numbers: RR sends consecutive turns to *different* producers,
so its ~0% prefix-hit is a **routing** artifact (not the eviction bug) and is
reproduced on the clean stack; RR PD still loses to 8C.
- **§4** PD wins TPOT (decode isolation): robust.
- **§5.1** the consumer counter-underflow crash: a real, separate vLLM 0.18.1 bug.
- **§5** the D-pool capacity-ceiling mechanism (decode side pegs while prefill
strands): real.

**Corrected verdict (the real reason PD loses on agentic).** It is *not* "routing
can't help." On the clean stack PD is **regime-dependent**: it *wins* at low
load / decode-heavy / low-reuse, and loses the **agentic corner** (high reuse +
short output + large context + high concurrency) through a structural crossover:
its static P:D split cannot simultaneously provide the prefix-cache capacity
(needs many producers) *and* the decode capacity (needs many decoders) that
agentic demands at once, while colo's elastic pool provides both. See the
three-axis ablation: **reuse** erodes the edge (1.57×→1.10×), **shape** rotates the
best ratio and is catastrophic at the prefill extreme, and **concurrency** tips PD
at N=64 (APC craters 71%→1.4%, TPS −30%) while colo scales cleanly.

→ Figures: [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/) ·
data: [`analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/) ·
the clean re-run of *this exact* w600 experiment (ratio-swept) is the Fig 4 anchor.

---

## TL;DR (verdict)

**No static prefill/decode split beats 8-way colocation (8C) on this agentic
@@ -205,6 +250,15 @@ single failed request, which is required to compare routing arms fairly in §6.

## 6. The routing handicap, and whether smarter routing rescues PD

> 🛑 **RETRACTED (2026-05-30): this entire section is an artifact of `e13391e`.**
> The session-affinity runs below were starved by the producer-eviction bug, so
> they could never collect prefix-cache reuse. On the corrected stack
> session-affinity reaches **APC parity with colo (71–82%)** and does *not* stall
> at 0% GPU util. The real mechanism is the capacity/concurrency crossover, not a
> routing pathology; see the CORRECTION banner at the top and
> [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/). Text kept below for the
> record only.

Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That
is not fundamental to disaggregation; it is the stock proxy round-robining
the **prefill** side: consecutive turns of one agentic session land on