Correct PD-disagg cost/benefit framing across repo

The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.
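
The arithmetic behind the table can be re-derived with a short sketch. This is only an illustration, not code from the repo: the function and variable names are hypothetical, the row values are copied from the table above, and the exact-form example uses the MB1 figures for a 32k-token prefill at D=8 (baseline TPOT 7.7 ms, 388 ms during prefill).

```python
# Illustrative sketch (hypothetical names): recompute the per-prefill-event
# cost/benefit table from benefit = D x T_prefill x (1 - TPOT_baseline/TPOT_during).

def phase_isolation_benefit(d, t_prefill_s, tpot_baseline_ms=None, tpot_during_ms=None):
    """Aggregate decode time (seconds) stalled by one prefill event across d streams."""
    if tpot_baseline_ms is None or tpot_during_ms is None:
        # Approximation used in the table: decode is treated as fully halted.
        return d * t_prefill_s
    stall_fraction = 1.0 - tpot_baseline_ms / tpot_during_ms
    return d * t_prefill_s * stall_fraction


# (label, T_prefill in s, T_transfer in s), copied from the table above.
ROWS = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.320),
    ("125k tok", 57.0, 1.9),
]


def cost_benefit_table(d=8):
    """Return (label, benefit_s, cost/benefit ratio) for each row."""
    table = []
    for label, t_prefill_s, t_transfer_s in ROWS:
        benefit_s = phase_isolation_benefit(d, t_prefill_s)
        table.append((label, benefit_s, t_transfer_s / benefit_s))
    return table


if __name__ == "__main__":
    for label, benefit_s, ratio in cost_benefit_table():
        print(f"{label:>9}: benefit = {benefit_s:6.1f} s, cost/benefit = {ratio:.1%}")
    # Exact form for a 32k-token prefill: the stall fraction is about 0.98,
    # so the D x T_prefill approximation overstates the benefit by about 2%.
    print(f"exact 32k benefit = {phase_isolation_benefit(8, 4.5, 7.7, 388.0):.1f} s")
```

Keeping the stall fraction as an optional argument makes it easy to check how much the approximation gives away for small prefills, where decode is slowed rather than fully halted.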

The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:04:49 +08:00
parent abde010b64
commit da39ab6804
6 changed files with 155 additions and 178 deletions

RESULTS_SUMMARY.md

@@ -39,52 +39,63 @@ Production trace = Qwen3-Coder agentic, 1.3 M sessions / 2.1 M reqs / 7200 s.
Reference figures: `figs/f4a_apc_loss.png`, `figs/f4b_pdsep_kv_wall.png`, `figs/f4c_per_worker_ttft.png`, `figs/f6_e2e_latency_bars.png`, `figs/f6_e2e_latency_full_grid.png`
## 4. PD-disagg loses under agentic: cost vs benefit (§3.2)
## 4. Why static PD-disagg fails (§3.2): a capacity problem, not a cost-benefit problem
Nailed down by two independent microbenchmarks (**all on vanilla vLLM 0.18.1 + Mooncake 0.3.11, fresh venv, no patches**):
**Correction (2026-05-27)**: the previous version of this section argued that "PD-disagg fails because transfer cost > phase-isolation benefit". **That argument was miscalculated.** The correct phase-isolation benefit is the sum over **each prefill event × D concurrent streams** (≈ `D × T_prefill`), not a single request's decode duration. With the correct formula, PD-disagg **beats colo by one to two orders of magnitude** on the phase-isolation axis. The **real root cause** of static PD-disagg's failure in agentic is **D-side KV pool capacity**, not this axis.
### 4.1 MB2: KV transfer cost
### 4.1 The real failure mode: the D-side KV capacity ceiling
dash1 GPU 0+1 (intra-node) and dash1 ↔ dash2 (inter-node, 200 Gbps RoCE), sweeping 9 sizes × 5 reps.
| | 8C colo | 4P+4D PD-disagg |
|---|---:|---:|
| Per-D-instance KV pool (0.4 × 96 GiB) | 38 GiB | 38 GiB |
| Total system decode capacity (number of D instances × per-instance pool) | 8 × 38 = **304 GiB** | 4 × 38 = **152 GiB** |
| p99 single-request KV = 11.5 GiB → max concurrent decodes | 24 | **12 (halved)** |
| Path | Steady-state bandwidth (≤ 3 GiB) | Transfer time for a p99 agentic request (11.5 GiB) |
Colleague's 4P+4D measurement: TTFT p50 0.91 s → **62.8 s (62×)**, success rate **99.5% → 52%**. Failure mode: **D-pool overflow + queueing**, not transfer latency.
Reference figure: `figs/f4b_pdsep_kv_wall.png` (the PDF version is the high-quality paper figure).
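The concurrency row of the capacity table can be reproduced with a small sketch. It assumes a request's KV cache must fit entirely inside one decode instance's pool (an assumption made here for illustration, consistent with the 24 and 12 in the table); the function names are hypothetical.

```python
import math

# Illustrative sketch (hypothetical names): capacity arithmetic behind the table.
# Assumption: one request's KV cache cannot be split across decode instances,
# so each instance holds a whole number of p99-sized requests.

def total_decode_pool_gib(num_decode_instances, pool_gib=38.0):
    """System-wide KV pool available for decode, in GiB."""
    return num_decode_instances * pool_gib


def max_concurrent_decodes(num_decode_instances, pool_gib=38.0, request_kv_gib=11.5):
    """Upper bound on simultaneously resident requests of the given KV size."""
    per_instance = math.floor(pool_gib / request_kv_gib)
    return num_decode_instances * per_instance


if __name__ == "__main__":
    for name, instances in [("8C colo", 8), ("4P+4D PD-disagg", 4)]:
        print(name,
              f"pool = {total_decode_pool_gib(instances):.0f} GiB,",
              f"max p99 decodes = {max_concurrent_decodes(instances)}")
```

Because the per-instance pool is unchanged, halving the number of decode-capable instances halves both totals, which is the capacity wall described above.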
### 4.2 MB2: KV transfer cost (a one-off per-request cost, **not** the dominant cost)
dash1 GPU 0+1 (intra) and dash1 ↔ dash2 (inter, 200 Gbps RoCE), sweeping 9 sizes × 5 reps.
| Path | Steady-state bandwidth (≤ 3 GiB) | p99 agentic request (11.5 GiB) transfer |
|---|---|---|
| Intra-node | **9.7 GB/s** | p50 **1.9 s** · min 1.5 s · max 10 s |
| Inter-node | **10.0 GB/s** (difference <3%) | p50 **1.7 s** · min 1.3 s · max 9.2 s |
| Intra-node | **9.7 GB/s** | p50 **1.9 s** · max 10 s |
| Inter-node | **10.0 GB/s** (difference <3%) | p50 **1.7 s** · max 9.2 s |
**New finding**: intra/inter nearly coincide → **Mooncake `batch_transfer_sync_write` always goes through the RDMA NIC, including intra-node loopback**. It does not use NVLink, and the 200 Gbps NIC is the ceiling. **PD-disagg transfer cost is topology-independent.**
**New finding**: intra/inter nearly coincide → **Mooncake `batch_transfer_sync_write` always goes through the RDMA NIC**. It does not use NVLink, and the 200 Gbps NIC is the ceiling. **PD-disagg transfer cost is topology-independent.**
Reference figures: `figs/mb2_transfer_time_compare.png`, `figs/mb2_transfer_bw_compare.png`; doc: `analysis/mb2/README.md`
Reference figure: `figs/mb2_transfer_time_compare.png`; doc: `analysis/mb2/README.md`
### 4.2 MB1: Phase interference (chunked-prefill on, default baseline)
### 4.3 MB1: Phase interference (upper bound on PD-disagg's potential benefit)
dash1 GPU 0 instance; D (concurrent decodes) × P (prefill size) sweep.
dash1 GPU 0 instance, kv_connector, chunked-prefill on by default; D × P sweep. D=8 results:
Results for D=8 (agentic-realistic):
| Prefill | prefill_ttft | per-stream TPOT during | penalty |
| Prefill | T_prefill | per-stream TPOT during | penalty |
|---|---:|---:|---:|
| 2k tok | 143 ms | 32 ms | 4× |
| 8k | 583 ms | 114 ms | 15× |
| 32k | 4.5 s | 388 ms | **52×** |
| 65k | 15.6 s | 757 ms | **99×** |
| 131k | 57 s | 1419 ms | **183×** |
| 32k tok | 4.5 s | 388 ms | **52×** |
| 131k tok | 57 s | 1419 ms | **183×** |
Baseline TPOT ≈ 7.7 ms. **Decode is essentially halted during a large prefill.** Chunked-prefill is already on by default; the extra phase isolation PD-disagg can provide on top of it = **the portion of time during which decode is halted by prefill**.
**Decode is almost completely halted during prefill**: each stream loses `T_prefill` of time. **Per prefill event, the total decode loss is `D × T_prefill`.**
Reference figure: `figs/mb1_interference.png`; doc: `analysis/mb1/README.md`
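One way to read the "halted" claim quantitatively is to compute the share of baseline decode throughput a stream keeps while a prefill is in flight. The sketch below is illustrative only: the names are hypothetical, and the inputs are the D=8 TPOT values and the 7.7 ms baseline quoted in this section.

```python
# Illustrative sketch (hypothetical names): fraction of baseline decode
# throughput that one stream retains while a prefill burst is running.

BASELINE_TPOT_MS = 7.7  # baseline per-stream TPOT quoted in this section

# Per-stream TPOT during prefill at D=8, copied from the MB1 table above.
TPOT_DURING_MS = {"2k": 32.0, "8k": 114.0, "32k": 388.0, "65k": 757.0, "131k": 1419.0}


def retained_decode_fraction(tpot_during_ms, baseline_ms=BASELINE_TPOT_MS):
    """Tokens per second during prefill divided by tokens per second at baseline."""
    return baseline_ms / tpot_during_ms


if __name__ == "__main__":
    for size, tpot in TPOT_DURING_MS.items():
        kept = retained_decode_fraction(tpot)
        print(f"{size:>5} tok prefill: decode keeps {kept:.1%} of baseline throughput")
```

For prefills of 32k tokens and above the retained share is under 2%, which is why `D × T_prefill` is a reasonable approximation of the aggregate stall; for a 2k-token prefill the stream still keeps roughly a quarter of its throughput.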
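### 4.3 Joint conclusion
### 4.4 Joint cost-benefit (per prefill event)
| | Per-request |
|---|---|
| **Max PD-disagg benefit** (decode time recovered) | **decode duration = 50–200 ms** (agentic tool-call output) |
| **PD-disagg cost** (MB2 transfer p50) | 80 MiB → 8 ms · 3 GiB → 320 ms · 11.5 GiB → **1.9 s** (worst measured at p99: 10 s) |
| Cost / Benefit | **Every request with KV ≥ 80 MiB loses**; the trace-average KV of 192 MiB already loses |
| Prefill (KV size) | T_prefill | Cost = T_transfer | Benefit = D × T_prefill (D=8) | Cost / Benefit |
|---:|---:|---:|---:|---:|
| 2k tok (192 MiB) | 0.14 s | 8 ms | 1.1 s | **0.7%** |
| 33k tok (3 GiB, trace mean) | 4.5 s | 0.32 s | 36 s | **0.9%** |
| 125k tok (12 GiB, ~p99) | 57 s | 1.9 s | 456 s | **0.4%** |
**Conclusion**: for agentic, **PD-disaggregation is a structural failure**. Chunked-prefill already provides first-order phase isolation within colocation by default; what PD-disagg can add on top of it (short decode windows not stalled by prefill) is smaller than what it newly costs (every routed request pays a KV transfer). This conclusion is topology-independent (the same for intra-node and inter-node).
**PD-disagg wins by 100×–250× on the phase-isolation axis.** **But this is not the argument §3.2 should use**, because the real dominant failure in §3.2 is the D-pool capacity ceiling of §4.1 (which overturns all of the math above).
Reference figure: `figs/pd_cost_vs_benefit.png` (§3.2 headline).
**Summary**:
- D-side KV capacity ceiling (§4.1) → PD-disagg is a **structural failure** in agentic.
- The combined MB1 + MB2 cost-benefit shows PD-disagg winning on the phase-isolation axis, **but this is overwhelmed by the capacity ceiling**.
- The paper's §3.2 argument should focus on "the D pool cannot hold the requests"; MB1/MB2 data serve as supporting context (quantifying the per-request transfer charge and the phase-isolation benefit), not as the main argument.
## 5. Empirical status of the EAR design (§4)
@@ -96,10 +107,12 @@ Baseline TPOT 7.7 ms. **Decode is essentially halted during a large prefill.** chunke
## 6. Paper claims that can already be written (ordered by confidence)
1. **Agentic vs chatbot are different scheduling regimes** (dispatch coupling + sub-second tool-call mass): empirically complete
2. **PD-disaggregation loses under agentic** (cost > benefit, across topologies): **empirically complete with MB1 + MB2**
3. **Failure modes of each of the three existing scheduling baselines**: empirically complete
4. **Affinity-default scheduling (current unified) reaches the APC upper bound**, and its per-worker latency also beats sticky: empirically complete
5. **Hot-triggered migration fixes sticky's hot pin**: **design complete, e2e pending validation**
2. **Failure modes of each of the three existing scheduling baselines** (load-balance / static PD-disagg / pure sticky): empirically complete
3. **The dominant root cause of static PD-disagg's failure under agentic is D-side KV capacity** (not phase-isolation cost-benefit): empirically complete (`f4b` + colleague's 4P+4D data)
4. **Mooncake transfer cost is topology-independent** (intra ≈ inter, ~9.7 GB/s NIC ceiling): empirically complete (MB2)
5. **Phase-isolation interference remains significant with chunked-prefill on** (per-stream TPOT during prefill is 15×–2000× baseline): empirically complete (MB1). **Note**: this data does not by itself argue that "PD-disagg fails", because with correct accounting PD-disagg actually wins on this axis. Its purpose is to quantify the upper bound on phase-isolation benefit for §3.2.
6. **Affinity-default scheduling (current unified) reaches the APC upper bound**, and its per-worker latency also beats sticky: empirically complete
7. **Hot-triggered migration fixes sticky's hot pin**: design complete, e2e pending validation
## 7. To do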

analysis/mb1/README.md

@@ -15,20 +15,28 @@ bottom; the **Summary** block is what gets cited.
| Effective per-stream TPOT during **8k-token** prefill burst (D=8) | **114 ms (≈15× baseline)** |
| Effective per-stream TPOT during **32k-token** prefill burst (D=8) | **388 ms (≈52×)** |
| Effective per-stream TPOT during **131k-token** prefill burst (D=8) | **1419 ms (≈183×)** |
| Maximum PD-disagg benefit per agentic decode | **≤ 50–200 ms** (= decode duration) |
**§3.2 headline (cost vs benefit, this run + MB2)**:
**What MB1 actually measures**:
> Under chunked-prefill, every ongoing decode stream is essentially
> **halted while a prefill chunk is in flight** — per-stream effective
> TPOT during the burst is 15× to 2000× baseline, scaling with prefill
> size. PD-disagg can recover this stall, but the recovery is bounded by
> the **decode duration** of the request being protected. For agentic,
> decode is 50–200 ms (tool-call output). MB2 shows PD-disagg pays
> 300 ms–10 s of KV-transfer cost per request to do that recovery. The
> cost exceeds the benefit ceiling for any per-request KV ≥ ~80 MiB
> (~830 tokens) — well below all agentic operating points. The benefit
> never beats the cost in this workload.
> During a prefill burst, every ongoing decode stream is essentially
> halted (per-stream effective TPOT is 15×–2000× baseline, scaling with
> prefill size). The **total decode time lost per prefill event is
> `D × T_prefill`** (D concurrent decodes each lose ~T_prefill of useful
> work). For the trace mean (P ≈ 33k tokens, T_prefill ≈ 4.5 s) at D=8
> that's **~36 seconds of decode-equivalent work lost per request**.
> This is the **upper bound on what PD-disaggregation's phase isolation
> could recover** on the decode side.
**⚠ Correction (2026-05-27)**: an earlier version of this README framed
the §3.2 PD-disagg argument as "phase-isolation benefit is capped at
the decode duration of the new request (~50–200 ms), so MB2 transfer
cost dominates". That framing was wrong. The correct accounting is
benefit-per-prefill-event = D × T_prefill (aggregate decode time saved
across all stalled streams), which is **much larger than per-request
transfer cost**. The actual reason static PD-disagg fails in agentic
is **D-side KV pool capacity** (`figs/f4b_pdsep_kv_wall.png`), not a
cost-vs-benefit imbalance on phase isolation. See `RESULTS_SUMMARY.md`
section 4 for the corrected framing.
## Setup
@@ -107,25 +115,30 @@ the cleanest "average over the whole burst window" number).
halts decode). This is the entire prefill duration's worth of decode
time that could in principle be recovered.
Two big caveats for **agentic** application:
**Connecting to the §3.2 PD-disagg argument** (corrected):
1. **Decode is short** (~50–200 ms for tool-call output). The actual
recoverable benefit per request is bounded by the decode duration,
not by `prefill_ttft`. If a decode lasts 100 ms and a 5-second prefill
collides with it, PD-disagg can save at most 100 ms — not 5 s.
2. **PD-disagg pays KV-transfer cost** (MB2: 300 ms–10 s per request
for agentic sizes). For any KV ≥ ~80 MiB the cost already exceeds the
~100 ms benefit ceiling. Cost > benefit across the whole agentic
distribution.
PD-disagg's promised phase-isolation benefit is **per prefill event**,
not per request. When a new prefill arrives, it stalls every concurrent
decode stream on the same GPU. The aggregate decode time lost across
those D streams is `D × T_prefill`. PD-disagg moving prefill off-decode-GPU
recovers all of it.
## §3.2 cost-vs-benefit figure
Plugging numbers per prefill event:
`figs/pd_cost_vs_benefit.png` overlays MB1 benefit ceiling (50–200 ms
band, capped by decode duration) on top of MB2 transfer cost curve. The
cost curve crosses the benefit ceiling somewhere around **80 MiB / 830
tokens** of KV — well below the trace mean (192 MiB / 2k tok ≈ trace
mean per request KV, and we know agentic averages 33k tokens, p99
125k). For anything bigger PD-disagg pays more than it can recover.
| Prefill size | T_prefill | PD-disagg cost (MB2 T_transfer) | PD-disagg benefit (D=8 × T_prefill) | Ratio |
|---:|---:|---:|---:|---:|
| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s | 0.7 % |
| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s | 0.9 % |
| 125k tok (~p99) | 57 s | 1.9 s | 456 s | 0.4 % |
On the **phase-isolation axis alone**, PD-disagg wins by 100×–250×.
The reason static PD-disagg nonetheless **fails in agentic** is a
*different* failure mode: the D-side KV pool cannot fit p90+ requests
(p99 = 11.5 GiB; D-instance pool ≈ 38 GiB; 4P+4D halves system-wide
decode capacity → TTFT p50 62×, success rate 99.5% → 52% in colleague's
4P+4D experiment). The structural problem is **capacity** (see
`figs/f4b_pdsep_kv_wall.png`), not transfer-cost vs phase-isolation
trade-off.
## Reproduction
@@ -174,4 +187,7 @@ ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop
3 × 5 × 3 sweep. CSV: `analysis/mb1/summary.csv`. Per-config JSONs on
dash1 at `/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/`.
Figures: `figs/mb1_interference.png`, `figs/pd_cost_vs_benefit.png`.
Figure: `figs/mb1_interference.png`. The figure
`figs/pd_cost_vs_benefit.png` from the original commit `029821c` was
based on the wrong "benefit ≤ decode duration" accounting; **deleted in
the correction commit**.

analysis/mb2/README.md

@@ -24,12 +24,25 @@ get cheaper by co-locating P and D on the same node — the ~9.7 GB/s
ceiling applies regardless. Halving the transfer cost cannot be bought
back by topology.
**Headline for the paper §3.2**: at the agentic tail, **pure KV transfer
takes 1.5–10 s**. A median agentic decode is **50–200 ms** of tool-call
output. So **PD-disaggregation adds 8–100× decode-time of transfer on
top of every routed request**. Phase isolation (the thing PD-disagg
trades transfer cost for) can only win back at most one decode duration
— for agentic that's negligible. The arithmetic is one-sided.
**What MB2 actually measures**: the **per-request charge** that
PD-disagg pays for every routed request — `T_transfer ≈ KV_size / 9.7
GB/s`. For agentic this is **8 ms (192 MiB / trace lower) – 1.9 s
(11.5 GiB / p99)**.
**⚠ Correction (2026-05-27)**: an earlier version of this README
framed §3.2 as "transfer cost (1.5–10 s) >> decode duration (50–200 ms),
so PD-disagg loses on cost-vs-benefit." That accounting was wrong:
PD-disagg's phase-isolation benefit is **per-prefill-event** and equals
`D × T_prefill` (aggregate across stalled decode streams), not the
single-request decode duration. With trace-mean `T_prefill = 4.5 s` and
D = 8, the benefit is ~36 s — far larger than the ~0.32 s transfer
cost. PD-disagg's phase-isolation axis is a *win*, not a loss.
The actual reason static PD-disagg fails in agentic is **D-side KV
capacity** (`figs/f4b_pdsep_kv_wall.png`), not a cost-vs-benefit
imbalance. See `RESULTS_SUMMARY.md` section 4 for the corrected
framing. MB2 still serves as the source of the per-request transfer
cost number used in that analysis.
---
@@ -137,43 +150,44 @@ analysis; not done yet.
treats them as additional samples (same sizes); the per-size
aggregates use all of them.
## Implications for §3.2 PD-disagg cost argument
## Implications for §3.2 PD-disagg argument
For each PD-disagg-routed request, transfer wall-time is:
```
T_transfer(KV_size) = max( pure_transfer(KV_size), rx_overhead )
≈ KV_size / 9.7 GB/s for KV_size <= 3 GiB
T_transfer(KV_size) ≈ KV_size / 9.7 GB/s for KV_size ≤ 3 GiB
≈ 0.3–10 s for KV_size in [3, 12] GiB
```
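As a rough illustration of the linear regime, the estimate can be written as a one-line function. This is a sketch under stated assumptions, not the MB2 harness: the names are hypothetical, 9.7 GB/s is read as 9.7 × 10⁹ bytes per second, and the linear model is only claimed for KV sizes up to about 3 GiB (larger transfers were measured at 0.3–10 s with high variance).

```python
# Illustrative sketch (hypothetical names): linear-regime transfer-time estimate.
# Assumes "9.7 GB/s" means 9.7e9 bytes/s and that KV sizes are binary (MiB/GiB).

MIB = 1024 ** 2
GIB = 1024 ** 3
LINEAR_REGIME_LIMIT_BYTES = 3 * GIB  # above this, measured times vary widely


def estimate_transfer_seconds(kv_bytes, bandwidth_bytes_per_s=9.7e9):
    """Estimate T_transfer as KV_size / bandwidth; valid only in the linear regime."""
    if kv_bytes > LINEAR_REGIME_LIMIT_BYTES:
        raise ValueError("linear model is only claimed for KV sizes up to 3 GiB")
    return kv_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    for label, size in [("768 MiB", 768 * MIB), ("3 GiB", 3 * GIB)]:
        print(f"{label}: ~{estimate_transfer_seconds(size) * 1000:.0f} ms")
```

Raising an error above 3 GiB is a deliberate choice in this sketch, so the linear estimate is not silently extrapolated into the range where the measurements show 0.3–10 s.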
Agentic decode wall-time is typically 50–200 ms (tool-call output of
a few tens of tokens at ~50 tok/s). So the **transfer/decode ratio**
under intra-node best-case Mooncake is:
This is the **per-request transfer charge** of PD-disagg. It's a
real cost, but in the context of phase-isolation accounting it is
*small* compared to the benefit:
| KV size | T_transfer @9.7 GB/s | typical decode | T_transfer / T_decode |
|---|---:|---:|---:|
| 192 MiB (2k tok) | 20 ms | 100 ms | 0.2× |
| 768 MiB (8k tok) | 84 ms | 100 ms | 0.8× |
| 3 GiB (33k tok ≈ trace mean) | 321 ms | 100 ms | **3.2×** |
| 6 GiB (~p90) | 1900 ms | 100 ms | **19×** |
| 12 GiB (~p99) | 2800 ms | 100 ms | **28×** (median) **100×** (p99 variance) |
| Prefill | T_prefill (MB1) | T_transfer (MB2) | Phase-isolation benefit at D=8 = D × T_prefill |
|---:|---:|---:|---:|
| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s |
| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s |
| 125k tok (~p99) | 57 s | 1.9 s | 456 s |
PD-disagg's promised payoff is *eliminating prefill–decode interference
on the decode instance*. The maximum benefit it can buy is bounded
above by the **decode duration itself** (you cannot recover more time
than the decode existed). For agentic that's 50–200 ms. The cost is
the table column above: 0.3–10 s of transfer per routed request.
On the phase-isolation axis alone, PD-disagg recovers two orders of
magnitude more decode time than it pays in transfer. **It is NOT this
axis that defeats static PD-disagg in agentic** — see colleague's
4P+4D experiment (TTFT p50 62×, success rate 99.5% → 52%) which is
driven by **D-side KV-pool overflow** on long-context requests
(`figs/f4b_pdsep_kv_wall.png`), not by transfer latency.
**Cost > Benefit by 5× to 100× across the agentic distribution.** Below
~3 GiB the ratio is small (≤1×); above 3 GiB the ratio explodes; above
6 GiB even individual draws can take 10 s for a single transfer.
What MB2 contributes to the paper is therefore:
- The **per-request transfer cost number** (used as the cost input
to the cost-benefit accounting above).
- The empirical observation that **Mooncake's transfer cost is
topology-independent** — intra-node and inter-node both go through
the RDMA NIC and hit the same 9.7 GB/s ceiling. PD-disagg's
transfer cost does not get cheaper by co-locating P and D.
This data alone is not the whole §3.2 argument — we still need to
account for D-side KV capacity (`f4b`, separate axis), cache reuse loss,
and static-partition mismatch (MB3 / MB4 / MB5). But it nails one
of the two key cost axes with measured numbers from vanilla mooncake,
not the dash0 patched build.
The dominant §3.2 failure mode of static PD-disagg in agentic is
**capacity**, not transfer cost. MB3 / MB4 / MB5 will quantify the
remaining axes (D-pool occupancy, cache reuse degradation under PD
routing, static-partition mismatch).
## Open questions / next runs

figs/mb1_interference.png: binary file changed (before: 122 KiB, after: 114 KiB)

figs/pd_cost_vs_benefit.png: binary file deleted (before: 161 KiB)
microbench/fresh_setup/plot_mb1.py

@@ -1,20 +1,19 @@
#!/usr/bin/env python3
"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.
"""Plot MB1 phase-interference data.
Two outputs:
Single output: figs/mb1_interference.png — effective per-stream TPOT
during a prefill burst, vs prefill size, one line per concurrent decode
batch size D.
mb1_interference.png
Effective TPOT during prefill vs prefill size, one line per D.
Log-log. Annotates typical agentic decode duration (~100 ms) as a
horizontal band so reader can spot when decode would be stalled.
pd_cost_vs_benefit.png
The §3.2 headline. X axis: KV size (MiB). Two stacked curves:
- benefit ceiling (MB1) — at most one decode-duration per request
of phase isolation can be recovered. Drawn as a flat 100 ms line.
- cost (MB2) — Mooncake pure_transfer p50 at that size.
Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
structurally loses.
Earlier versions of this script also produced figs/pd_cost_vs_benefit.png
which composed a "max PD-disagg benefit = decode duration (50–200 ms)
band" against the MB2 transfer-cost curve. That accounting was wrong
(see RESULTS_SUMMARY.md §4 correction): phase-isolation benefit is
per-prefill-event, equal to D × T_prefill across stalled streams, not
capped by a single request's decode duration. That figure has been
removed; the math it implied was structurally backwards. The dominant
reason static PD-disagg fails in agentic is **D-side KV capacity**
(see figs/f4b_pdsep_kv_wall.png), not cost-vs-benefit on phase isolation.
"""
from __future__ import annotations
@@ -25,21 +24,16 @@ from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--mb1", type=Path, required=True)
    p.add_argument("--mb2-intra", type=Path, required=True)
    p.add_argument("--mb2-inter", type=Path, default=None)
    p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
    p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
    p.add_argument("--out", type=Path, default=Path("figs/mb1_interference.png"))
    args = p.parse_args()
    mb1 = json.loads(args.mb1.read_text())["summary"]
    # ---- mb1_interference.png ----
    fig, ax = plt.subplots(figsize=(9, 5.5))
    Ds = sorted({s["decode_batch_size"] for s in mb1})
    colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
@@ -50,79 +44,19 @@ def main() -> None:
        ys = [s["effective_tpot_during_ms"] for s in rows]
        ax.plot(xs, ys, "o-", lw=2, markersize=7,
                color=colors.get(D, "gray"),
                label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")
    for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
                      (100, "agentic decode (~100 ms)"),
                      (200, "long agentic decode (~200 ms)")]:
        ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
        ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")
                label=f"D={D} (baseline TPOT {rows[0]['baseline_tpot_ms']:.1f} ms)")
    ax.set_xscale("log"); ax.set_yscale("log")
    ax.set_xlabel("Prefill burst size (tokens, log)")
    ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
    ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20). "
                 "Per-prefill aggregate stall = D × T_prefill.")
    ax.grid(True, which="both", alpha=0.3)
    ax.legend(loc="upper left", fontsize=9)
    args.out_interf.parent.mkdir(parents=True, exist_ok=True)
    fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig)
    print(f"wrote {args.out_interf}")
    # ---- pd_cost_vs_benefit.png ----
    mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
    mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
    intra_x_mib = [s["kv_mib"] for s in mb2_intra]
    intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra]
    fig, ax = plt.subplots(figsize=(9, 5.5))
    ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
            markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
    if args.mb2_inter:
        mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
        mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
        inter_x = [s["kv_mib"] for s in mb2_inter]
        inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
        ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
                alpha=0.7, label="MB2 inter-node (same numbers)")
    # Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
    ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
               label="MB1 max benefit ≤ agentic decode (~100 ms)")
    ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
               label="benefit range (50–200 ms decode)")
    # Mark agentic-tail request sizes
    for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
                        (3072, "p90\n(~33k tok)"),
                        (6144, "p95\n(~65k tok)"),
                        (11500, "p99\n(11.5 GiB)")]:
        ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
        ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
                ha="center", va="bottom")
    ax.set_xscale("log"); ax.set_yscale("log")
    ax.set_xlim(40, 14000)
    ax.set_ylim(1, 12000)
    ax.set_xlabel("Per-request KV size (MiB, log)")
    ax.set_ylabel("Time per request (ms, log)")
    ax.set_title("§3.2 headline — PD-disagg KV transfer cost vs phase-isolation benefit\n"
                 "(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
    ax.grid(True, which="both", alpha=0.3)
    ax.legend(loc="upper left", fontsize=9)
    # Add explanatory annotation
    ax.text(10000, 5000,
            "Cost > benefit for ANY KV size above\n"
            "the green band (~80 MiB / ~830 tokens).\n"
            "Below: cost is marginal (<10 ms) but\n"
            "benefit is also small (decode is short).",
            fontsize=9, color="#333",
            ha="right", va="top",
            bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))
    fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig)
    print(f"wrote {args.out_cb}")
    args.out.parent.mkdir(parents=True, exist_ok=True)
    fig.tight_layout(); fig.savefig(args.out, dpi=150); plt.close(fig)
    print(f"wrote {args.out}")
if __name__ == "__main__":