PD-disagg docs: annotated corrections for e13391e contamination

Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).

  PD_DISAGG_RESULTS.md  top CORRECTION banner + §6 RETRACTED marker.
                        §6 (session-affinity hot-pin) was an `e13391e`
                        artifact under controlled concurrency; §3 RR, §4
                        TPOT win, §5 D-pool ceiling, §5.1 consumer crash
                        stand.
  RESULTS_SUMMARY.md    §4 confirm+refine note: clean ablation confirms
                        the D-pool capacity thesis and adds regime-
                        dependence.
  pd_separation_analysis.md  scoped caution: thesis confirmed; flags
                        only reuse-dependent figures for cross-check
                        (this study used a different stack).
  figs/mb5/CORRECTION.md  flags mb5_producer_hotspot.png as retracted;
                        §3 RR and §5 D-pool figures stand.
2026-05-31 20:14:14 +08:00
parent 0b180c191e
commit a2111b6e18
4 changed files with 101 additions and 0 deletions


@@ -97,6 +97,18 @@ dash1 GPU 0, single instance, no kv_connector, chunked-prefill enabled by default,
- Combined MB1 + MB2 cost-benefit: on the phase-isolation axis PD-disagg wins, **but that win is overwhelmed by the capacity ceiling**.
- The Paper §3.2 argument should focus on "the D pool cannot fit"; the MB1/MB2 data serve as supporting context (quantifying the per-request transfer charge and the phase-isolation benefit), not as the main argument.
> ✅ **2026-05-30 update: the clean-stack three-axis ablation confirms this section and adds a regime refinement.**
> The capacity argument of this section (D-pool capacity ceiling / decode halved) is **confirmed** on the clean
> stack, after the `e13391e` contamination was fixed: on the concurrency axis PD tips over at N=64, **producer APC
> collapses from 71% to 1.4%, TPS 30%**, while colo scales linearly (Fig 3). **Refinement:** PD does not "fail
> across the board on agentic". It **beats** colo in the *low-load / decode-heavy / low-reuse* regime; what
> actually fails is the agentic corner (high reuse + short output + large context + high concurrency), where a
> static P:D split cannot provide the producer capacity that reuse needs *and* the decode capacity at the same
> time, whereas colo's elastic pool provides both.
> **Also note:** the conclusion of the old MB5 doc (`PD_DISAGG_RESULTS.md` §6), "session-affinity cannot save PD /
> PD reuse = 0%", is a **contamination artifact of `e13391e` (the producer evicts the prefix after every KV
> transfer) and has been retracted**; on the clean stack, session-routed producers reach APC parity with colo (71–82%).
> Figures: [`figs/mb5_pd_ablation/`](figs/mb5_pd_ablation/).
## 5. Empirical status of the EAR design (§4)
| Pillar | Empirically confirmed | Pending |


@@ -23,6 +23,22 @@ Per-request breakdown shows **87.7% of TTFT** is spent waiting for KV cache memo
> Earlier cross-machine comparison (commit `1e86285`) was invalidated: baseline used warm instances. See REPORT.md §3.5.
| **Delta** | **-45%** | **-36%** | **-44%** | **+30pp** |
> ✅⚠️ **2026-05-30 — confirmed + refined by the clean MB5 ablation; one caveat.**
> A producer-side contamination (`e13391e`: evicts a producer's prefix-cache on every
> KV transfer) was found in the *agentic-kv-fresh* MB5 stack and gated off; everything
> was re-run clean.
> - **Confirmed:** this doc's central thesis — PD's failure is a **decode-side KV memory
> wall**, not transfer/prefill cost — is reproduced on the clean stack (concurrency
> axis: at N=64 the split collapses, APC 71%→1.4%, TPS 30%; colo scales). Fig 3.
> - **Refined:** "PD separation is net negative" is **regime-dependent**, not universal.
> Clean ablation shows PD *wins* at low load / decode-heavy / low-reuse and loses the
> **agentic corner** (high reuse + short output + large context + high concurrency).
> - **Caveat (cross-check):** if this study's runs used the fork vLLM that contains
> `e13391e`, any **producer prefix-cache / APC reuse** figures here (e.g. §5.3.1) may be
> understated. The decode-memory-wall result is *not* reuse-dependent and is unaffected.
> On the clean stack, session-routed producers reach APC parity with colo (71–82%).
> Figures: [`figs/mb5_pd_ablation/`](../figs/mb5_pd_ablation/).
---
## 1. Workload Characterization

figs/mb5/CORRECTION.md

@@ -0,0 +1,19 @@
# ⚠️ Correction notice for figs/mb5/ (2026-05-30)
These figures back `microbench/fresh_setup/PD_DISAGG_RESULTS.md`. A producer-side
contamination was later found in the stack that produced them: commit **`e13391e`**
(deployed over the "fresh" pip vLLM by `scripts/deploy_vllm_patches.sh`) evicts a
producer's prefix-cache blocks on every KV transfer, so a disaggregated producer
could never keep a session's prefix warm. It is now gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off) and everything was re-run clean.
| figure | section | status |
|---|---|---|
| `mb5_producer_hotspot.png` | §6.3 session-affinity hot-pinning | 🛑 **RETRACTED**: pure `e13391e` artifact. On the clean stack, session-routed producers reach APC parity with colo (71–82%); there is no 0%-util stall / hot-pin pathology. |
| `mb5_latency_compare.png` | §3 round-robin headline | ✅ stands — RR's ~0% prefix-hit is a *routing* artifact (consecutive turns → different producers), not the eviction bug; reproduced clean. |
| `mb5_kv_timeline.png`, `mb5_role_split.png`, `mb5_peak_utilization.png` | §5 per-role KV pool occupancy | ✅ D-pool capacity-ceiling mechanism stands (decode pegs while prefill strands). P-pool occupancy may read slightly low under eviction; the qualitative split is unaffected. |
| `mb5_summary.csv` | aggregate | mixed — §3/§5 rows valid; any session-affinity rows superseded. |
**Superseded by the corrected three-axis ablation:** [`../mb5_pd_ablation/`](../mb5_pd_ablation/)
(reuse / shape / concurrency), data in [`../../analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/).
Raw §6 data `analysis/mb5/session_prod.json` is contaminated; `analysis/mb5/rr_prod.json` (round-robin) stands.


@@ -10,6 +10,51 @@ Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instru
---
## ⚠️ CORRECTION (2026-05-30) — read before §6
A contamination was found in the "fresh" vLLM used for the runs below.
`scripts/deploy_vllm_patches.sh` had copied our fork commit **`e13391e`** over the
pip-installed release; that commit calls `evict_blocks(sent_block_ids)` on
`finished_sending`, i.e. it **evicts a producer's prefix-cache blocks on every KV
transfer**. So a disaggregated producer could never keep a session's prefix warm,
*regardless of routing*. We have since gated that behind `VLLM_EVICT_SENT_BLOCKS`
(default off) and re-run everything on the corrected stack.
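
To make the gate concrete, here is a minimal sketch of what "gated behind `VLLM_EVICT_SENT_BLOCKS` (default off)" could look like. It is an illustration under stated assumptions, not the patched vLLM code: the helper names `evict_sent_blocks_enabled` and `on_finished_sending`, the injected `evict_blocks` callable, and the set of accepted truthy strings are all hypothetical. Only the variable name, the default-off behaviour, and the `evict_blocks(sent_block_ids)` call on `finished_sending` come from the description above.

```python
import os

# Assumption: which strings count as "on" is our choice for this sketch;
# the real patch may parse the variable differently.
_TRUTHY = {"1", "true", "yes", "on"}


def evict_sent_blocks_enabled(env=None):
    """Return True only when VLLM_EVICT_SENT_BLOCKS is explicitly switched on."""
    env = os.environ if env is None else env
    value = env.get("VLLM_EVICT_SENT_BLOCKS", "0")  # unset means off
    return value.strip().lower() in _TRUTHY


def on_finished_sending(sent_block_ids, evict_blocks, env=None):
    """Hypothetical hook run when a producer finishes a KV transfer.

    Evicts the transferred blocks only if the gate is on, and returns the
    list of block ids that were actually evicted.
    """
    if not evict_sent_blocks_enabled(env):
        return []  # default: keep the producer's prefix cache warm
    ids = list(sent_block_ids)
    evict_blocks(ids)
    return ids
```

With the variable unset, `on_finished_sending` is a no-op, which is the behaviour the clean re-runs rely on; setting it to `1` restores the old evict-on-send behaviour for anyone who needs to reproduce the contaminated numbers.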
**Retracted (was a pure artifact of `e13391e`):**
- **All of §6** ("smarter routing does not save PD" / "session-affinity is
*strictly worse*" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit
~0.2%"). On the corrected stack, **session-affinity recovers producer reuse to
full parity with colo (APC 71–82%)**: the collapse was the eviction bug starving
the very cache session-affinity exists to fill, not a routing pathology.
- The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix
*exactly as well as colo* once routing is session-sticky.
**Still stands (independent of `e13391e`):**
- **§3 round-robin** numbers — RR sends consecutive turns to *different* producers,
so its ~0% prefix-hit is a **routing** artifact (not the eviction bug) and is
reproduced on the clean stack; RR PD still loses to 8C.
- **§4** PD wins TPOT (decode isolation) — robust.
- **§5.1** the consumer counter-underflow crash — a real, separate vLLM 0.18.1 bug.
- **§5** the D-pool capacity-ceiling mechanism (decode side pegs while prefill
strands) — real.
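
The routing point in the §3 bullet can be illustrated with a toy model. The sketch below is not the proxy used in these runs: the function names, the CRC32-based sticky hash, and the "hit" definition (a follow-up turn lands on the producer that served the session's previous turn, with unlimited cache) are assumptions made for illustration. The session and producer counts (3 and 4) are chosen so that round-robin always moves a session to a different producer.

```python
import zlib


def route_round_robin(requests, n_producers):
    """Assign requests to producers in arrival order, ignoring the session."""
    return [i % n_producers for i in range(len(requests))]


def route_session_sticky(requests, n_producers):
    """Assign every turn of a session to the same producer via a stable hash."""
    return [zlib.crc32(sid.encode()) % n_producers for sid, _turn in requests]


def prefix_hit_rate(requests, assignment):
    """Fraction of follow-up turns served by the producer that handled the
    same session's previous turn (toy stand-in for a prefix-cache hit)."""
    last_producer = {}
    hits = followups = 0
    for (sid, _turn), producer in zip(requests, assignment):
        if sid in last_producer:
            followups += 1
            hits += last_producer[sid] == producer
        last_producer[sid] = producer
    return hits / followups if followups else 0.0


# Three interleaved sessions, four turns each, four producers.
requests = [(f"s{s}", turn) for turn in range(4) for s in range(3)]
rr_rate = prefix_hit_rate(requests, route_round_robin(requests, 4))
sticky_rate = prefix_hit_rate(requests, route_session_sticky(requests, 4))
print(rr_rate, sticky_rate)  # 0.0 1.0
```

In this toy setup round-robin never returns a session to its previous producer, while sticky routing always does. It says nothing about eviction or capacity; it only shows why a ~0% prefix-hit under RR is expected from routing alone.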
**Corrected verdict (the real reason PD loses on agentic).** It is *not* "routing
can't help." On the clean stack PD is **regime-dependent**: it *wins* at low
load / decode-heavy / low-reuse, and loses the **agentic corner** (high reuse +
short output + large context + high concurrency) through a structural crossover —
its static P:D split cannot simultaneously provide the prefix-cache capacity
(needs many producers) *and* the decode capacity (needs many decoders) that
agentic demands at once, while colo's elastic pool provides both. See the
three-axis ablation: **reuse** erodes the edge (1.57×→1.10×), **shape** rotates the
best ratio and is catastrophic at the prefill extreme, and **concurrency** tips PD
at N=64 (APC craters 71%→1.4%, TPS 30%) while colo scales cleanly.
→ Figures: [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/) ·
data: [`analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/) ·
the clean re-run of *this exact* w600 experiment (ratio-swept) is the Fig 4 anchor.
---
## TL;DR (verdict)
**No static prefill/decode split beats 8-way colocation (8C) on this agentic
@@ -205,6 +250,15 @@ single failed request, which is required to compare routing arms fairly in §6.
## 6. The routing handicap — and whether smarter routing rescues PD
> 🛑 **RETRACTED (2026-05-30) — this entire section is an artifact of `e13391e`.**
> The session-affinity runs below were starved by the producer-eviction bug, so
> they could never collect prefix-cache reuse. On the corrected stack
> session-affinity reaches **APC parity with colo (71–82%)** and does *not* stall
> at 0% GPU util. The real mechanism is the capacity/concurrency crossover, not a
> routing pathology — see the CORRECTION banner at the top and
> [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/). Text kept below for the
> record only.
Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That
is not fundamental to disaggregation; it is the stock proxy round-robining
the **prefill** side: consecutive turns of one agentic session land on