Correct PD-disagg cost/benefit framing across repo

The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's maximum phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)", implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per prefill event, summed across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  where D is the number of concurrent decode streams. This follows from
  the chunked-prefill math: each of the L/N chunks stretches one decode
  step on each of the D streams from ~10 ms to t ms, and the stretched
  steps sum to ≈ D × T_prefill.
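
The ≈ D × T_prefill step can be spelled out as below. This is only a sketch under a simplifying assumption that the commit does not state explicitly: every chunk iteration takes the same time t, and each of the D decode streams advances exactly one token per iteration. The symbol t_0 (baseline decode step, the "~10 ms" above, i.e. TPOT_baseline) is a label introduced here; t plays the role of TPOT_during.

```latex
\begin{align*}
T_{\text{prefill}} &= \frac{L}{N}\, t
  && \text{($L/N$ chunks, each taking $t$)} \\
\text{stall per stream per chunk} &= t - t_0 \\
\text{benefit\_per\_prefill}
  &= D \cdot \frac{L}{N}\,(t - t_0)
   = D \cdot T_{\text{prefill}} \left(1 - \frac{t_0}{t}\right) \\
  &\approx D \cdot T_{\text{prefill}}
  && \text{when } t \gg t_0
\end{align*}
```

With the measured D=8 numbers quoted later in this commit (TPOT_baseline ≈ 7.7 ms, TPOT_during = 388 ms for a 32k-token prefill), the correction factor 1 − t_0/t is about 0.98, which is why it is dropped.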

Plugging in the MB1 + MB2 numbers:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
    2k tok     |  0.14 s   |    8 ms    |    1.1 s    |    0.7 %
   33k tok     |  4.5  s   |  320 ms    |   36   s    |    0.9 %
  125k tok     | 57    s   |  1.9 s     |  456   s    |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× (benefit
over cost), the opposite of what the deleted figure showed.
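
The table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the `PrefillPoint` class and both helper functions are hypothetical names, not part of the repo, and the inputs are the T_prefill and T_transfer values copied from the table rather than re-measured. It uses the approximate form benefit ≈ D × T_prefill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefillPoint:
    """One row of the table (hypothetical container, values from the commit)."""
    label: str
    t_prefill_s: float   # MB1: how long the prefill occupies the GPU
    t_transfer_s: float  # MB2: KV transfer time for the same request


def benefit_s(point: PrefillPoint, d_streams: int) -> float:
    """Aggregate decode time lost per prefill event, approximated as D * T_prefill."""
    return d_streams * point.t_prefill_s


def cost_over_benefit(point: PrefillPoint, d_streams: int) -> float:
    """Transfer cost divided by the phase-isolation benefit (dimensionless)."""
    return point.t_transfer_s / benefit_s(point, d_streams)


POINTS = [
    PrefillPoint("2k tok", 0.14, 0.008),
    PrefillPoint("33k tok", 4.5, 0.320),
    PrefillPoint("125k tok", 57.0, 1.9),
]

if __name__ == "__main__":
    for p in POINTS:
        ratio = cost_over_benefit(p, d_streams=8)
        print(f"{p.label:>9} | benefit {benefit_s(p, 8):7.2f} s | "
              f"cost/benefit {100 * ratio:4.1f} % | benefit/cost {1 / ratio:4.0f}x")
```

Keeping D as an explicit argument makes it easy to see how the ratio moves with decode concurrency: the benefit scales linearly with D while the per-request transfer cost does not.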

The actual dominant reason static PD-disagg fails in agentic workloads is
the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99
single-request KV is 11.5 GiB and the per-D-instance pool is 38 GiB, so
4P+4D halves system decode capacity. A colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate dropping from 99.5% to 52%, driven by
pool overflow + queueing, not by transfer latency.
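
The "halves system decode capacity" claim is plain arithmetic on the numbers above. The sketch below makes two assumptions that are mine, not the commit's: a request's KV must fit entirely inside one instance's pool (so slots are counted per instance with a floor), and every request is p99-sized. The function names are hypothetical; the 8-instance colocated baseline comes from the RESULTS_SUMMARY.md table in this commit.

```python
import math


def total_decode_pool_gib(n_decode_instances: int, pool_gib: float) -> float:
    """System-wide KV pool available to decode, in GiB."""
    return n_decode_instances * pool_gib


def max_concurrent_decodes(n_decode_instances: int, pool_gib: float,
                           request_kv_gib: float) -> int:
    """Worst-case concurrent decodes if every request needs request_kv_gib.

    Assumes a request cannot span instances, so each instance holds
    floor(pool / request) requests.
    """
    per_instance = math.floor(pool_gib / request_kv_gib)
    return n_decode_instances * per_instance


if __name__ == "__main__":
    POOL_GIB = 38.0      # per-D-instance KV pool
    P99_KV_GIB = 11.5    # p99 single-request KV
    for name, n in [("8C colo", 8), ("4P+4D", 4)]:
        print(f"{name:>8}: pool {total_decode_pool_gib(n, POOL_GIB):5.0f} GiB, "
              f"max p99-sized decodes {max_concurrent_decodes(n, POOL_GIB, P99_KV_GIB)}")
```

Under these assumptions only three p99-sized requests fit per instance, so removing half the decode-capable instances removes half the slots regardless of how cheap the KV transfer is.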

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:04:49 +08:00
parent abde010b64
commit da39ab6804
6 changed files with 155 additions and 178 deletions


@@ -39,52 +39,63 @@ Production trace = Qwen3-Coder agentic, 1.3 M sessions / 2.1 M reqs / 7200 s.
 Reference figures: `figs/f4a_apc_loss.png`, `figs/f4b_pdsep_kv_wall.png`, `figs/f4c_per_worker_ttft.png`, `figs/f6_e2e_latency_bars.png`, `figs/f6_e2e_latency_full_grid.png`
-## 4. PD-disagg loses under agentic: cost vs benefit (§3.2)
-Nailed down by two independent microbenchmarks (**both on vanilla vLLM 0.18.1 + Mooncake 0.3.11, fresh venv, no patches**).
-### 4.1 MB2: KV transfer cost
-dash1 GPU 0+1 (intra-node) and dash1 ↔ dash2 (inter-node, 200 Gbps RoCE), sweeping 9 sizes × 5 reps.
-| Path | Steady-state bandwidth (≤ 3 GiB) | Transfer time for a p99 agentic request (11.5 GiB) |
+## 4. Why static PD-disagg fails (§3.2): a capacity problem, not a cost-benefit problem
+**2026-05-27 correction**: the previous version of this section argued that "PD-disagg fails because transfer cost > phase-isolation benefit". **That argument was miscalculated.** The correct phase-isolation benefit is the sum over **each prefill event × D concurrent streams** (≈ `D × T_prefill`), not a single request's decode duration. With the correct formula, PD-disagg **beats colo by one to two orders of magnitude** on the phase-isolation axis. The **real root cause** of static PD-disagg failing on agentic is **D-side KV pool capacity**, not this axis.
+### 4.1 The real failure mode: the D-side KV capacity ceiling
+| | 8C colo | 4P+4D PD-disagg |
+|---|---:|---:|
+| Per-D-instance KV pool (0.4 × 96 GiB) | 38 GiB | 38 GiB |
+| Total system decode capacity (number of D instances × pool size) | 8 × 38 = **304 GiB** | 4 × 38 = **152 GiB** |
+| p99 single-request KV = 11.5 GiB → max concurrent decodes | 24 | **12 (halved)** |
+Colleague's 4P+4D measurement: TTFT p50 0.91 s → **62.8 s (62×)**, success rate **99.5% → 52%**. Failure mode: **D-pool overflow + queueing**, not transfer latency.
+Reference figure: `figs/f4b_pdsep_kv_wall.png` (the pdf version is the high-quality paper figure).
+### 4.2 MB2: KV transfer cost (a one-off per-request cost, **not** the dominant cost)
+dash1 GPU 0+1 (intra) and dash1 ↔ dash2 (inter, 200 Gbps RoCE), sweeping 9 sizes × 5 reps.
+| Path | Steady-state bandwidth (≤ 3 GiB) | p99 agentic request, 11.5 GiB transfer |
 |---|---|---|
-| Intra-node | **9.7 GB/s** | p50 **1.9 s** · min 1.5 s · max 10 s |
-| Inter-node | **10.0 GB/s** (difference <3%) | p50 **1.7 s** · min 1.3 s · max 9.2 s |
-**New finding**: intra and inter nearly coincide → **Mooncake `batch_transfer_sync_write` always goes through the RDMA NIC, including intra-node loopback**, not NVLink. The 200 Gbps NIC is the ceiling. **PD-disagg's transfer cost is topology-independent**.
-Reference figures: `figs/mb2_transfer_time_compare.png`, `figs/mb2_transfer_bw_compare.png`; doc `analysis/mb2/README.md`.
-### 4.2 MB1: Phase interference (chunked-prefill on, default baseline)
-dash1 GPU 0, single instance, sweeping D (concurrent decodes) × P (prefill size).
-Results for D=8 (agentic-realistic):
-| Prefill | prefill_ttft | per-stream TPOT during | penalty |
+| Intra-node | **9.7 GB/s** | p50 **1.9 s** · max 10 s |
+| Inter-node | **10.0 GB/s** (difference <3%) | p50 **1.7 s** · max 9.2 s |
+**New finding**: intra and inter nearly coincide → **Mooncake `batch_transfer_sync_write` always goes through the RDMA NIC**, not NVLink. The 200 Gbps NIC is the ceiling. **PD-disagg's transfer cost is topology-independent**.
+Reference figure: `figs/mb2_transfer_time_compare.png`; doc `analysis/mb2/README.md`.
+### 4.3 MB1: Phase interference (upper bound on PD-disagg's potential benefit)
+dash1 GPU 0, single instance, no kv_connector, chunked-prefill on by default. D × P sweep; D=8 results:
+| Prefill | T_prefill | per-stream TPOT during | penalty |
 |---|---:|---:|---:|
 | 2k tok | 143 ms | 32 ms | 4× |
-| 8k | 583 ms | 114 ms | 15× |
-| 32k | 4.5 s | 388 ms | **52×** |
-| 65k | 15.6 s | 757 ms | **99×** |
-| 131k | 57 s | 1419 ms | **183×** |
-baseline TPOT ≈ 7.7 ms. **Decode is basically halted during large prefills**. Chunked-prefill is already on by default; the extra phase isolation PD-disagg can provide on top of it = **the portion of time decode is halted during prefill**.
+| 32k tok | 4.5 s | 388 ms | **52×** |
+| 131k tok | 57 s | 1419 ms | **183×** |
+**Decode is almost completely halted during prefill**: each stream loses ≈ `T_prefill` of time. **Total decode loss per prefill event ≈ `D × T_prefill`**.
 Reference figure: `figs/mb1_interference.png`; doc `analysis/mb1/README.md`.
-### 4.3 Combined conclusion
-| | Per-request |
-|---|---|
-| **Max PD-disagg benefit** (decode time recovered) | ≤ **decode duration = 50–200 ms** (agentic tool-call output) |
-| **PD-disagg cost** (MB2 transfer p50) | 80 MiB → 8 ms · 3 GiB → 320 ms · 11.5 GiB → **1.9 s** (measured p99 worst case 10 s) |
-| Cost / Benefit | **Every request with KV ≥ 80 MiB loses**; the trace-mean KV of 192 MiB already loses |
-**Conclusion**: under agentic, **PD-disaggregation is a structural failure**. Chunked-prefill already does first-order phase isolation inside colocation by default; what PD-disagg can add on top (short decode windows not blocked by prefill) is smaller than what it newly costs (every routed request pays a KV transfer). This conclusion is topology-independent (intra-node and inter-node are the same).
-Reference figure: `figs/pd_cost_vs_benefit.png` (§3.2 headline).
+### 4.4 Combined cost-benefit (per prefill event)
+| Prefill (KV size) | T_prefill | Cost = T_transfer | Benefit = D × T_prefill (D=8) | Cost / Benefit |
+|---:|---:|---:|---:|---:|
+| 2k tok (192 MiB) | 0.14 s | 8 ms | 1.1 s | **0.7%** |
+| 33k tok (3 GiB, trace mean) | 4.5 s | 0.32 s | 36 s | **0.9%** |
+| 125k tok (12 GiB, ~p99) | 57 s | 1.9 s | 456 s | **0.4%** |
+**PD-disagg wins by 100×–250× on the phase-isolation axis**. **This is not the argument §3.2 should use** (because the truly dominant failure in §3.2 is the D-pool capacity ceiling of §4.1, which overturns all of the math above).
+**Summary**:
+- D-side KV capacity ceiling (§4.1) → PD-disagg **fails structurally** under agentic.
+- The combined MB1 + MB2 cost-benefit says PD-disagg wins on the phase-isolation axis, **but this is overwhelmed by the capacity ceiling**.
+- The paper's §3.2 argument should focus on "the D pool cannot hold the requests"; MB1/MB2 data serve as supporting context (quantifying the per-request transfer charge and the phase-isolation benefit), not as the main argument.
 ## 5. Empirical status of the EAR design (§4)
@@ -96,10 +107,12 @@ baseline TPOT ≈ 7.7 ms. **Decode is basically halted during large prefills**. chunke
 ## 6. Paper claims that can already be written (ordered by confidence)
 1. **Agentic and chatbot are different scheduling regimes** (dispatch coupling + sub-second tool-call mass): empirically complete
-2. **PD-disaggregation loses under agentic** (cost > benefit, across topologies): **MB1 + MB2 empirically complete**
-3. **Failure modes of each of the three existing scheduling baselines**: empirically complete
-4. **Affinity-default scheduling (current unified) reaches the APC upper bound**, and its per-worker latency also beats sticky: empirically complete
-5. **Hot-triggered migration fixes sticky's hot pin**: **design complete, e2e pending validation**
+2. **Failure modes of each of the three existing scheduling baselines** (load-balance / static PD-disagg / pure sticky): empirically complete
+3. **The dominant root cause of static PD-disagg failing under agentic is D-side KV capacity** (not phase-isolation cost-benefit): empirically complete (`f4b` + colleague's 4P+4D data)
+4. **Mooncake transfer cost is topology-independent** (intra ≈ inter, ~9.7 GB/s NIC ceiling): empirically complete (MB2)
+5. **Phase-isolation interference remains significant with chunked-prefill on** (per-stream TPOT during prefill is 15×–2000× baseline): empirically complete (MB1). **Note**: this data does not by itself argue that "PD-disagg fails", because with correct accounting PD-disagg actually wins on this axis; its use is to quantify the upper bound on phase-isolation benefit for §3.2
+6. **Affinity-default scheduling (current unified) reaches the APC upper bound**, and its per-worker latency also beats sticky: empirically complete
+7. **Hot-triggered migration fixes sticky's hot pin**: design complete, e2e pending validation
 ## 7. To do


@@ -15,20 +15,28 @@ bottom; the **Summary** block is what gets cited.
 | Effective per-stream TPOT during **8k-token** prefill burst (D=8) | **114 ms (≈15× baseline)** |
 | Effective per-stream TPOT during **32k-token** prefill burst (D=8) | **388 ms (≈52×)** |
 | Effective per-stream TPOT during **131k-token** prefill burst (D=8) | **1419 ms (≈183×)** |
-| Maximum PD-disagg benefit per agentic decode | **≤ 50–200 ms** (= decode duration) |
-**§3.2 headline (cost vs benefit, this run + MB2)**:
-> Under chunked-prefill, every ongoing decode stream is essentially
-> **halted while a prefill chunk is in flight** - per-stream effective
-> TPOT during the burst is 15× to 2000× baseline, scaling with prefill
-> size. PD-disagg can recover this stall, but the recovery is bounded by
-> the **decode duration** of the request being protected. For agentic,
-> decode is 50–200 ms (tool-call output). MB2 shows PD-disagg pays
-> 300 ms – 10 s of KV-transfer cost per request to do that recovery. The
-> cost exceeds the benefit ceiling for any per-request KV ≥ ~80 MiB
-> (~830 tokens) - well below all agentic operating points. The benefit
-> never beats the cost in this workload.
+**What MB1 actually measures**:
+> During a prefill burst, every ongoing decode stream is essentially
+> halted (per-stream effective TPOT is 15×–2000× baseline, scaling with
+> prefill size). The **total decode time lost per prefill event is
+> `D × T_prefill`** (D concurrent decodes each lose ~T_prefill of useful
+> work). For the trace mean (P ≈ 33k tokens, T_prefill ≈ 4.5 s) at D=8
+> that's **~36 seconds of decode-equivalent work lost per request**.
+> This is the **upper bound on what PD-disaggregation's phase isolation
+> could recover** on the decode side.
+**⚠ Correction (2026-05-27)**: an earlier version of this README framed
+the §3.2 PD-disagg argument as "phase-isolation benefit is capped at
+the decode duration of the new request (~50–200 ms), so MB2 transfer
+cost dominates". That framing was wrong. The correct accounting is
+benefit-per-prefill-event = D × T_prefill (aggregate decode time saved
+across all stalled streams), which is **much larger than per-request
+transfer cost**. The actual reason static PD-disagg fails in agentic
+is **D-side KV pool capacity** (`figs/f4b_pdsep_kv_wall.png`), not a
+cost-vs-benefit imbalance on phase isolation. See `RESULTS_SUMMARY.md`
+section 4 for the corrected framing.
 ## Setup
@@ -107,25 +115,30 @@ the cleanest "average over the whole burst window" number).
 halts decode). This is the entire prefill duration's worth of decode
 time that could in principle be recovered.
-Two big caveats for **agentic** application:
-1. **Decode is short** (~50–200 ms for tool-call output). The actual
-   recoverable benefit per request is bounded by the decode duration,
-   not by `prefill_ttft`. If a decode lasts 100 ms and a 5-second prefill
-   collides with it, PD-disagg can save at most 100 ms - not 5 s.
-2. **PD-disagg pays KV-transfer cost** (MB2: 300 ms – 10 s per request
-   for agentic sizes). For any KV ≥ ~80 MiB the cost already exceeds the
-   ~100 ms benefit ceiling. Cost > benefit across the whole agentic
-   distribution.
-## §3.2 cost-vs-benefit figure
-`figs/pd_cost_vs_benefit.png` overlays MB1 benefit ceiling (50–200 ms
-band, capped by decode duration) on top of MB2 transfer cost curve. The
-cost curve crosses the benefit ceiling somewhere around **80 MiB / 830
-tokens** of KV - well below the trace mean (192 MiB / 2k tok ≈ trace
-mean per request KV, and we know agentic averages 33k tokens, p99
-125k). For anything bigger PD-disagg pays more than it can recover.
+**Connecting to the §3.2 PD-disagg argument** (corrected):
+PD-disagg's promised phase-isolation benefit is **per prefill event**,
+not per request. When a new prefill arrives, it stalls every concurrent
+decode stream on the same GPU. The aggregate decode time lost across
+those D streams is `D × T_prefill`. PD-disagg moving prefill off-decode-GPU
+recovers all of it.
+Plugging numbers per prefill event:
+| Prefill size | T_prefill | PD-disagg cost (MB2 T_transfer) | PD-disagg benefit (D=8 × T_prefill) | Ratio |
+|---:|---:|---:|---:|---:|
+| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s | 0.7 % |
+| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s | 0.9 % |
+| 125k tok (~p99) | 57 s | 1.9 s | 456 s | 0.4 % |
+On the **phase-isolation axis alone**, PD-disagg wins by 100×–250×.
+The reason static PD-disagg nonetheless **fails in agentic** is a
+*different* failure mode: the D-side KV pool cannot fit p90+ requests
+(p99 = 11.5 GiB; D-instance pool ≈ 38 GiB; 4P+4D halves system-wide
+decode capacity → TTFT p50 62×, success rate 99.5% → 52% in colleague's
+4P+4D experiment). The structural problem is **capacity** (see
+`figs/f4b_pdsep_kv_wall.png`), not transfer-cost vs phase-isolation
+trade-off.
 ## Reproduction
@@ -174,4 +187,7 @@ ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop
 3 × 5 × 3 sweep. CSV: `analysis/mb1/summary.csv`. Per-config JSONs on
 dash1 at `/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/`.
-Figures: `figs/mb1_interference.png`, `figs/pd_cost_vs_benefit.png`.
+Figure: `figs/mb1_interference.png`. The figure
+`figs/pd_cost_vs_benefit.png` from the original commit `029821c` was
+based on the wrong "benefit ≤ decode duration" accounting; **deleted in
+the correction commit**.


@@ -24,12 +24,25 @@ get cheaper by co-locating P and D on the same node — the ~9.7 GB/s
 ceiling applies regardless. Halving the transfer cost cannot be bought
 back by topology.
-**Headline for the paper §3.2**: at the agentic tail, **pure KV transfer
-takes 1.5 – 10 s**. A median agentic decode is **50 – 200 ms** of tool-call
-output. So **PD-disaggregation adds 8 – 100 × decode-time of transfer on
-top of every routed request**. Phase isolation (the thing PD-disagg
-trades transfer cost for) can only win back at most one decode duration;
-for agentic that's negligible. The arithmetic is one-sided.
+**What MB2 actually measures**: the **per-request charge** that
+PD-disagg pays for every routed request: `T_transfer ≈ KV_size / 9.7
+GB/s`. For agentic this is **8 ms (192 MiB / trace lower) → 1.9 s
+(11.5 GiB / p99)**.
+**⚠ Correction (2026-05-27)**: an earlier version of this README
+framed §3.2 as "transfer cost (1.5–10 s) >> decode duration (50–200 ms),
+so PD-disagg loses on cost-vs-benefit." That accounting was wrong:
+PD-disagg's phase-isolation benefit is **per-prefill-event** and equals
+`D × T_prefill` (aggregate across stalled decode streams), not the
+single-request decode duration. With trace-mean `T_prefill = 4.5 s` and
+D = 8, the benefit is ~36 s, far larger than the ~0.32 s transfer
+cost. PD-disagg's phase-isolation axis is a *win*, not a loss.
+The actual reason static PD-disagg fails in agentic is **D-side KV
+capacity** (`figs/f4b_pdsep_kv_wall.png`), not a cost-vs-benefit
+imbalance. See `RESULTS_SUMMARY.md` section 4 for the corrected
+framing. MB2 still serves as the source of the per-request transfer
+cost number used in that analysis.
 ---
@@ -137,43 +150,44 @@ analysis; not done yet.
 treats them as additional samples (same sizes); the per-size
 aggregates use all of them.
-## Implications for §3.2 PD-disagg cost argument
+## Implications for §3.2 PD-disagg argument
 For each PD-disagg-routed request, transfer wall-time is:
 ```
-T_transfer(KV_size) = max( pure_transfer(KV_size), rx_overhead )
-                    ≈ KV_size / 9.7 GB/s  for KV_size <= 3 GiB
+T_transfer(KV_size) ≈ KV_size / 9.7 GB/s  for KV_size ≤ 3 GiB
                     ≈ 0.3 – 10 s          for KV_size in [3, 12] GiB
 ```
-Agentic decode wall-time is typically 50 – 200 ms (tool-call output of
-a few tens of tokens at ~50 tok/s). So the **transfer/decode ratio**
-under intra-node best-case Mooncake is:
-| KV size | T_transfer @9.7 GB/s | typical decode | T_transfer / T_decode |
-|---|---:|---:|---:|
-| 192 MiB (2k tok) | 20 ms | 100 ms | 0.2× |
-| 768 MiB (8k tok) | 84 ms | 100 ms | 0.8× |
-| 3 GiB (33k tok ≈ trace mean) | 321 ms | 100 ms | **3.2×** |
-| 6 GiB (~p90) | 1900 ms | 100 ms | **19×** |
-| 12 GiB (~p99) | 2800 ms | 100 ms | **28×** (median) – **100×** (p99 variance) |
-PD-disagg's promised payoff is *eliminating prefill–decode interference
-on the decode instance*. The maximum benefit it can buy is bounded
-above by the **decode duration itself** (you cannot recover more time
-than the decode existed). For agentic that's 50 – 200 ms. The cost is
-the table column above: 0.3 – 10 s of transfer per routed request.
-**Cost > Benefit by 5× to 100× across the agentic distribution.** Below
-~3 GiB the ratio is small (≤1×); above 3 GiB the ratio explodes; above
-6 GiB even individual draws can take 10 s for a single transfer.
-This data alone is not the whole §3.2 argument; we still need to
-account for D-side KV capacity (`f4b`, separate axis), cache reuse loss,
-and static-partition mismatch (MB3 / MB4 / MB5). But it nails one
-of the two key cost axes with measured numbers from vanilla mooncake,
-not the dash0 patched build.
+This is the **per-request transfer charge** of PD-disagg. It's a
+real cost, but in the context of phase-isolation accounting it is
+*small* compared to the benefit:
+| Prefill | T_prefill (MB1) | T_transfer (MB2) | Phase-isolation benefit at D=8 = D × T_prefill |
+|---:|---:|---:|---:|
+| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s |
+| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s |
+| 125k tok (~p99) | 57 s | 1.9 s | 456 s |
+On the phase-isolation axis alone, PD-disagg recovers two orders of
+magnitude more decode time than it pays in transfer. **It is NOT this
+axis that defeats static PD-disagg in agentic**; see colleague's
+4P+4D experiment (TTFT p50 62×, success rate 99.5% → 52%) which is
+driven by **D-side KV-pool overflow** on long-context requests
+(`figs/f4b_pdsep_kv_wall.png`), not by transfer latency.
+What MB2 contributes to the paper is therefore:
+- The **per-request transfer cost number** (used as the cost input
+  to the cost-benefit accounting above).
+- The empirical observation that **Mooncake's transfer cost is
+  topology-independent**: intra-node and inter-node both go through
+  the RDMA NIC and hit the same 9.7 GB/s ceiling. PD-disagg's
+  transfer cost does not get cheaper by co-locating P and D.
+The dominant §3.2 failure mode of static PD-disagg in agentic is
+**capacity**, not transfer cost. MB3 / MB4 / MB5 will quantify the
+remaining axes (D-pool occupancy, cache reuse degradation under PD
+routing, static-partition mismatch).
 ## Open questions / next runs

Binary file not shown (size before: 122 KiB; after: 114 KiB).

Binary file not shown (size before: 161 KiB).


@@ -1,20 +1,19 @@
 #!/usr/bin/env python3
-"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.
-Two outputs:
-mb1_interference.png
-    Effective TPOT during prefill vs prefill size, one line per D.
-    Log-log. Annotates typical agentic decode duration (~100 ms) as a
-    horizontal band so reader can spot when decode would be stalled.
-pd_cost_vs_benefit.png
-    The §3.2 headline. X axis: KV size (MiB). Two stacked curves:
-    - benefit ceiling (MB1): at most one decode-duration per request
-      of phase isolation can be recovered. Drawn as a flat 100 ms line.
-    - cost (MB2): Mooncake pure_transfer p50 at that size.
-    Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
-    structurally loses.
+"""Plot MB1 phase-interference data.
+Single output: figs/mb1_interference.png, the effective per-stream TPOT
+during a prefill burst, vs prefill size, one line per concurrent decode
+batch size D.
+Earlier versions of this script also produced figs/pd_cost_vs_benefit.png
+which composed a "max PD-disagg benefit = decode duration (50–200 ms)
+band" against the MB2 transfer-cost curve. That accounting was wrong
+(see RESULTS_SUMMARY.md §4 correction): phase-isolation benefit is
+per-prefill-event, equal to D × T_prefill across stalled streams, not
+capped by a single request's decode duration. That figure has been
+removed; the math it implied was structurally backwards. The dominant
+reason static PD-disagg fails in agentic is **D-side KV capacity**
+(see figs/f4b_pdsep_kv_wall.png), not cost-vs-benefit on phase isolation.
 """
 from __future__ import annotations
@@ -25,21 +24,16 @@ from pathlib import Path
 import matplotlib
 matplotlib.use("Agg")
 import matplotlib.pyplot as plt
-import numpy as np
 def main() -> None:
     p = argparse.ArgumentParser()
     p.add_argument("--mb1", type=Path, required=True)
-    p.add_argument("--mb2-intra", type=Path, required=True)
-    p.add_argument("--mb2-inter", type=Path, default=None)
-    p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
-    p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
+    p.add_argument("--out", type=Path, default=Path("figs/mb1_interference.png"))
     args = p.parse_args()
     mb1 = json.loads(args.mb1.read_text())["summary"]
-    # ---- mb1_interference.png ----
     fig, ax = plt.subplots(figsize=(9, 5.5))
     Ds = sorted({s["decode_batch_size"] for s in mb1})
     colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
@@ -50,79 +44,19 @@ def main() -> None:
         ys = [s["effective_tpot_during_ms"] for s in rows]
         ax.plot(xs, ys, "o-", lw=2, markersize=7,
                 color=colors.get(D, "gray"),
-                label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")
-    for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
-                      (100, "agentic decode (~100 ms)"),
-                      (200, "long agentic decode (~200 ms)")]:
-        ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
-        ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")
+                label=f"D={D} (baseline TPOT {rows[0]['baseline_tpot_ms']:.1f} ms)")
     ax.set_xscale("log"); ax.set_yscale("log")
     ax.set_xlabel("Prefill burst size (tokens, log)")
     ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
     ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
-                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
+                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20). "
+                 "Per-prefill aggregate stall = D × T_prefill.")
     ax.grid(True, which="both", alpha=0.3)
     ax.legend(loc="upper left", fontsize=9)
-    args.out_interf.parent.mkdir(parents=True, exist_ok=True)
-    fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig)
-    print(f"wrote {args.out_interf}")
-    # ---- pd_cost_vs_benefit.png ----
-    mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
-    mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
-    intra_x_mib = [s["kv_mib"] for s in mb2_intra]
-    intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra]
-    fig, ax = plt.subplots(figsize=(9, 5.5))
-    ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
-            markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
-    if args.mb2_inter:
-        mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
-        mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
-        inter_x = [s["kv_mib"] for s in mb2_inter]
-        inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
-        ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
-                alpha=0.7, label="MB2 inter-node (same numbers)")
-    # Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
-    ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
-               label="MB1 max benefit ≤ agentic decode (~100 ms)")
-    ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
-               label="benefit range (50–200 ms decode)")
-    # Mark agentic-tail request sizes
-    for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
-                        (3072, "p90\n(~33k tok)"),
-                        (6144, "p95\n(~65k tok)"),
-                        (11500, "p99\n(11.5 GiB)")]:
-        ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
-        ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
-                ha="center", va="bottom")
-    ax.set_xscale("log"); ax.set_yscale("log")
-    ax.set_xlim(40, 14000)
-    ax.set_ylim(1, 12000)
-    ax.set_xlabel("Per-request KV size (MiB, log)")
-    ax.set_ylabel("Time per request (ms, log)")
-    ax.set_title("§3.2 headline - PD-disagg KV transfer cost vs phase-isolation benefit\n"
-                 "(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
-    ax.grid(True, which="both", alpha=0.3)
-    ax.legend(loc="upper left", fontsize=9)
-    # Add explanatory annotation
-    ax.text(10000, 5000,
-            "Cost > benefit for ANY KV size above\n"
-            "the green band (~80 MiB / ~830 tokens).\n"
-            "Below: cost is marginal (<10 ms) but\n"
-            "benefit is also small (decode is short).",
-            fontsize=9, color="#333",
-            ha="right", va="top",
-            bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))
-    fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig)
-    print(f"wrote {args.out_cb}")
+    args.out.parent.mkdir(parents=True, exist_ok=True)
+    fig.tight_layout(); fig.savefig(args.out, dpi=150); plt.close(fig)
+    print(f"wrote {args.out}")
 if __name__ == "__main__":