Correct PD-disagg cost/benefit framing across repo
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot + pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)", implicitly treating the benefit as per-request and bounded by that request's own decode.

The correct accounting is per-prefill-event across all stalled streams:

    benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                        ≈ D × T_prefill

which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

    prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
    2k tok       | 0.14 s    | 8 ms       | 1.1 s       | 0.7 %
    33k tok      | 4.5 s     | 320 ms     | 36 s        | 0.9 %
    125k tok     | 57 s      | 1.9 s      | 456 s       | 0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250×, the opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):

- figs/pd_cost_vs_benefit.png: DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py: drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png: regenerated, no misleading band annotation
- analysis/mb1/README.md: Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong
- analysis/mb2/README.md: Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6: §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim; they already (correctly) frame §3.2 around the D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -39,52 +39,63 @@ Production trace = Qwen3-Coder agentic,1.3 M sessions / 2.1 M reqs / 7200 s。
|
|||||||
|
|
||||||
Reference figures: `figs/f4a_apc_loss.png`, `figs/f4b_pdsep_kv_wall.png`, `figs/f4c_per_worker_ttft.png`, `figs/f6_e2e_latency_bars.png`, `figs/f6_e2e_latency_full_grid.png`.
|
Reference figures: `figs/f4a_apc_loss.png`, `figs/f4b_pdsep_kv_wall.png`, `figs/f4c_per_worker_ttft.png`, `figs/f6_e2e_latency_bars.png`, `figs/f6_e2e_latency_full_grid.png`.
|
||||||
|
|
||||||
## 4. PD-disagg loses under agentic workloads: cost vs benefit (§3.2)
|
## 4. Why static PD-disagg fails (§3.2): a capacity problem, not a cost-benefit problem
|
||||||
|
|
||||||
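Pinned down by two independent microbenchmarks (**all on vanilla vLLM 0.18.1 + Mooncake 0.3.11, fresh venv, no patches**).
|
⚠ **2026-05-27 correction**: the previous version of this section argued that "PD-disagg fails because transfer cost > phase-isolation benefit". **That argument was computed incorrectly.** The correct phase-isolation benefit is the sum over **each prefill event × D concurrent streams** (≈ `D × T_prefill`), not the decode duration of a single request. With the correct formula, PD-disagg **beats colo by roughly two orders of magnitude** on the phase-isolation axis. The **real root cause** of static PD-disagg failing on agentic workloads is **D-side KV pool capacity**, not this axis.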
|
||||||
|
|
||||||
### 4.1 MB2 — KV transfer cost
|
### 4.1 The real failure mode: the D-side KV capacity ceiling
|
||||||
|
|
||||||
Sweep of 9 sizes × 5 reps on dash1 GPU 0+1 (intra-node) and dash1 ↔ dash2 (inter-node, 200 Gbps RoCE).
|
| | 8C colo | 4P+4D PD-disagg |
|
||||||
|
|---|---:|---:|
|
||||||
|
| Per-D-instance KV pool (0.4 × 96 GiB) | 38 GiB | 38 GiB |
|
||||||
|
| Total system decode capacity (number of D instances × per-instance pool) | 8 × 38 = **304 GiB** | 4 × 38 = **152 GiB** |
|
||||||
|
| p99 single-request KV = 11.5 GiB → max concurrent decodes | 24 | **12 (halved)** |
|
||||||
|
|
||||||
| Path | Steady-state bandwidth (≤ 3 GiB) | Transfer time for a p99 agentic request (11.5 GiB) |
|
Colleague's 4P+4D measurement: TTFT p50 0.91 s → **62.8 s (62×)**, success rate **99.5% → 52%**. Failure mode: **D-pool overflow + queueing**, not transfer latency.
|
||||||
|
|
||||||
|
Reference figure: `figs/f4b_pdsep_kv_wall.png` (the PDF version is the high-quality paper figure).
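The capacity rows in the table above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not part of the repo: the function name `decode_capacity` and the constants are illustrative, and it assumes that one request's KV must fit inside a single instance's pool (which is what makes the table's 24 and 12 come out of a per-instance floor rather than a system-wide division).

```python
import math

POOL_GIB = 38.0          # per-instance KV pool from the table (0.4 × 96 GiB)
P99_REQUEST_GIB = 11.5   # p99 single-request KV size from the table


def decode_capacity(num_decode_instances):
    """Return (total decode KV pool in GiB, max concurrent p99-sized decodes)."""
    total_pool_gib = num_decode_instances * POOL_GIB
    # Assumption: one request's KV must fit inside a single instance's pool,
    # so the fit is counted per instance and then summed.
    fit_per_instance = math.floor(POOL_GIB / P99_REQUEST_GIB)
    return total_pool_gib, num_decode_instances * fit_per_instance


colo = decode_capacity(8)       # 8C colo: all 8 instances hold decode KV
pd_disagg = decode_capacity(4)  # 4P+4D: only the 4 D instances hold decode KV
print("8C colo:", colo)
print("4P+4D:  ", pd_disagg)
```

Under that assumption each 38 GiB pool fits three p99-sized requests, so halving the number of decode-capable instances halves the number of long-context requests that can decode at once.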
|
||||||
|
|
||||||
|
### 4.2 MB2: KV transfer cost (a one-time per-request cost, **not** the dominant cost)
|
||||||
|
|
||||||
|
Sweep of 9 sizes × 5 reps on dash1 GPU 0+1 (intra) and dash1 ↔ dash2 (inter, 200 Gbps RoCE).
|
||||||
|
|
||||||
|
| Path | Steady-state bandwidth (≤ 3 GiB) | Transfer for a p99 agentic request (11.5 GiB) |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| Intra-node | **9.7 GB/s** | p50 **1.9 s** · min 1.5 s · max 10 s |
|
| Intra-node | **9.7 GB/s** | p50 **1.9 s** · max 10 s |
|
||||||
| Inter-node | **10.0 GB/s** (difference <3%) | p50 **1.7 s** · min 1.3 s · max 9.2 s |
|
| Inter-node | **10.0 GB/s** (difference <3%) | p50 **1.7 s** · max 9.2 s |
|
||||||
|
|
||||||
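**New finding**: intra and inter nearly coincide → **Mooncake `batch_transfer_sync_write` always goes through the RDMA NIC, including intra-node loopback**, and never over NVLink. The 200 Gbps NIC is the ceiling, so **PD-disagg's transfer cost is independent of topology**.
|
**New finding**: intra and inter nearly coincide → **Mooncake `batch_transfer_sync_write` always goes through the RDMA NIC**, never over NVLink. The 200 Gbps NIC is the ceiling. **PD-disagg transfer cost is independent of topology**.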
|
||||||
|
|
||||||
Reference figures: `figs/mb2_transfer_time_compare.png`, `figs/mb2_transfer_bw_compare.png`; doc: `analysis/mb2/README.md`.
|
Reference figure: `figs/mb2_transfer_time_compare.png`; doc: `analysis/mb2/README.md`.
|
||||||
|
|
||||||
### 4.2 MB1: Phase interference (chunked-prefill on, default baseline)
|
### 4.3 MB1: Phase interference (upper bound on PD-disagg's potential benefit)
|
||||||
|
|
||||||
Single instance on dash1 GPU 0, sweeping D (concurrent decodes) × P (prefill size).
|
Single instance on dash1 GPU 0 (no kv_connector), chunked-prefill enabled by default, D × P sweep. Results at D=8:
|
||||||
|
|
||||||
Results at D=8 (the most agentic-realistic setting):
|
| Prefill | T_prefill | per-stream TPOT during | penalty |
|
||||||
|
|
||||||
| Prefill | prefill_ttft | per-stream TPOT during | penalty |
|
|
||||||
|---|---:|---:|---:|
|
|---|---:|---:|---:|
|
||||||
| 2k tok | 143 ms | 32 ms | 4× |
|
| 2k tok | 143 ms | 32 ms | 4× |
|
||||||
| 8k | 583 ms | 114 ms | 15× |
|
| 32k tok | 4.5 s | 388 ms | **52×** |
|
||||||
| 32k | 4.5 s | 388 ms | **52×** |
|
| 131k tok | 57 s | 1419 ms | **183×** |
|
||||||
| 65k | 15.6 s | 757 ms | **99×** |
|
|
||||||
| 131k | 57 s | 1419 ms | **183×** |
|
|
||||||
|
|
||||||
Baseline TPOT is 7.7 ms. **Decode is essentially halted during a large prefill.** Chunked-prefill is already on by default; the extra phase isolation PD-disagg can provide on top of it = **the time decode spends halted during prefill**.
|
**Decode is almost completely halted during prefill**, so each stream loses ≈ `T_prefill` of time. **Total decode loss per prefill event ≈ `D × T_prefill`.**
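How close the `D × T_prefill` approximation is depends on how hard decode is slowed. The sketch below applies the fuller form from the commit message, `D × T_prefill × (1 − TPOT_baseline/TPOT_during)`, to the D=8 rows of the MB1 table. It assumes the 7.7 ms baseline TPOT quoted in the earlier revision of this section applies to every row; the names `decode_time_lost` and `MB1_D8` are illustrative only.

```python
BASELINE_TPOT_MS = 7.7  # assumed baseline per-stream TPOT at D=8 (ms)

# (prefill label, T_prefill in seconds, per-stream TPOT during the burst in ms),
# copied from the D=8 MB1 table
MB1_D8 = [
    ("2k tok", 0.143, 32.0),
    ("32k tok", 4.5, 388.0),
    ("131k tok", 57.0, 1419.0),
]


def decode_time_lost(d, t_prefill_s, tpot_during_ms, baseline_ms=BASELINE_TPOT_MS):
    """Aggregate decode time lost across d streams during one prefill event.

    Returns (lost seconds, stall fraction), where the stall fraction is the
    share of the prefill window in which a stream makes no useful progress.
    """
    stall_fraction = 1.0 - baseline_ms / tpot_during_ms
    return d * t_prefill_s * stall_fraction, stall_fraction


for label, t_prefill, tpot_during in MB1_D8:
    lost, frac = decode_time_lost(8, t_prefill, tpot_during)
    upper = 8 * t_prefill
    print(f"{label}: stall fraction {frac:.2f}, lost {lost:.1f} s of {upper:.1f} s")
```

With these inputs the stall fraction is within a few percent of 1 for the 32k and 131k rows, so `D × T_prefill` is a tight estimate there; for the 2k row the correction factor is noticeably below 1, so the simple product is better read as an upper bound.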
|
||||||
|
|
||||||
Reference figure: `figs/mb1_interference.png`; doc: `analysis/mb1/README.md`.
|
Reference figure: `figs/mb1_interference.png`; doc: `analysis/mb1/README.md`.
|
||||||
|
|
||||||
### 4.3 Combined conclusion
|
### 4.4 Combined cost-benefit (per prefill event)
|
||||||
|
|
||||||
| | Per-request |
|
| Prefill (KV size) | T_prefill | Cost = T_transfer | Benefit = D × T_prefill (D=8) | Cost / Benefit |
|
||||||
|---|---|
|
|---:|---:|---:|---:|---:|
|
||||||
| **Max PD-disagg benefit** (decode time recovered) | ≤ **decode duration = 50–200 ms** (agentic tool-call output) |
|
| 2k tok (192 MiB) | 0.14 s | 8 ms | 1.1 s | **0.7%** |
|
||||||
| **PD-disagg cost** (MB2 transfer p50) | 80 MiB ≈ 8 ms · 3 GiB ≈ 320 ms · 11.5 GiB ≈ **1.9 s** (worst measured case at p99: 10 s) |
|
| 33k tok (3 GiB, trace mean) | 4.5 s | 0.32 s | 36 s | **0.9%** |
|
||||||
| Cost / Benefit | **every request with KV ≥ 80 MiB loses**; trace-mean KV is 192 MiB → already a loss |
|
| 125k tok (12 GiB, ~p99) | 57 s | 1.9 s | 456 s | **0.4%** |
|
||||||
|
|
||||||
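**Conclusion**: on agentic workloads **PD-disaggregation is a structural failure**. Chunked-prefill already provides first-order phase isolation within colocation by default; what PD-disagg can add on top (short decode windows not squeezed by prefill) is smaller than what it newly costs (every routed request pays a KV transfer). This conclusion is independent of topology (intra-node and inter-node behave the same).
|
**On the phase-isolation axis PD-disagg wins by 100×–250×.** But **this is not the argument §3.2 should use**, because the real dominant failure in §3.2 is the D-pool capacity ceiling from §4.1 (which overrides all of the math above).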
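The cost/benefit column of the per-prefill-event table can be recomputed directly from its other columns. This is a small sketch using only the numbers in that table; `cost_benefit` and `ROWS` are names made up for illustration, and the benefit uses the simplified `D × T_prefill` form.

```python
D = 8  # concurrent decode streams, as in the table

# (prefill label, T_prefill in s, T_transfer in s), copied from the table
ROWS = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.32),
    ("125k tok", 57.0, 1.9),
]


def cost_benefit(t_prefill_s, t_transfer_s, d=D):
    """Return (benefit in seconds, cost/benefit ratio) for one prefill event."""
    benefit_s = d * t_prefill_s          # decode time recovered across d streams
    return benefit_s, t_transfer_s / benefit_s


for label, t_prefill, t_transfer in ROWS:
    benefit, ratio = cost_benefit(t_prefill, t_transfer)
    print(f"{label}: benefit {benefit:.1f} s, cost/benefit {ratio:.1%}")
```

Every ratio stays below 1%, which is where the "wins by 100×–250×" statement comes from: the win factor is simply the reciprocal of the cost/benefit ratio.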
|
||||||
|
|
||||||
Reference figure: `figs/pd_cost_vs_benefit.png` (§3.2 headline).
|
**Summary**:
|
||||||
|
- D-side KV capacity ceiling (§4.1) → PD-disagg is a **structural failure** on agentic workloads.
|
||||||
|
- On the phase-isolation axis, the combined MB1 + MB2 cost-benefit favors PD-disagg, **but this is outweighed by the capacity ceiling**.
|
||||||
|
- The paper's §3.2 argument should focus on "the D pool cannot hold the requests"; MB1/MB2 data serve as supporting context (quantifying the per-request transfer charge and the phase-isolation benefit), not as the main argument.
|
||||||
|
|
||||||
## 5. Empirical status of the EAR design (§4)
|
## 5. Empirical status of the EAR design (§4)
|
||||||
|
|
||||||
@@ -96,10 +107,12 @@ baseline TPOT 7.7 ms。**Decode 在大 prefill 期间基本被 halted**。chunke
|
|||||||
## 6. Paper claims that can already be written (ordered by confidence)
|
## 6. Paper claims that can already be written (ordered by confidence)
|
||||||
|
|
||||||
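1. **Agentic and chatbot workloads are different scheduling regimes** (dispatch coupling + sub-second tool-call mass): empirical evidence complete
|
1. **Agentic and chatbot workloads are different scheduling regimes** (dispatch coupling + sub-second tool-call mass): empirical evidence complete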
|
||||||
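2. **PD-disaggregation loses under agentic workloads** (cost > benefit, across topologies): **empirical evidence complete via MB1 + MB2**
|
2. **Failure modes of each of the three existing scheduling baselines** (load-balance / static PD-disagg / pure sticky): empirical evidence complete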
|
||||||
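3. **Failure modes of each of the three existing scheduling baselines**: empirical evidence complete
|
3. **The dominant root cause of static PD-disagg failing under agentic workloads is D-side KV capacity** (not phase-isolation cost-benefit): empirical evidence complete (`f4b` + colleague's 4P+4D data)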
|
||||||
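4. **Affinity-default scheduling (current unified) reaches the APC upper bound**, and its per-worker latency also beats sticky: empirical evidence complete
|
4. **Mooncake transfer cost is topology-independent** (intra ≈ inter, ~9.7 GB/s NIC ceiling): empirical evidence complete (MB2)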
|
||||||
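5. **Hot-triggered migration fixes sticky's hot pin**: **design complete, e2e validation pending**
|
5. **Phase-isolation interference remains significant with chunked-prefill on** (per-stream TPOT during prefill is 15×–2000× baseline): empirical evidence complete (MB1). **Note**: this data does not by itself argue that "PD-disagg fails", because with the correct accounting PD-disagg actually wins on this axis; its role is to quantify the upper bound on phase-isolation benefit for §3.2.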
|
||||||
|
6. **Affinity-default scheduling (current unified) reaches the APC upper bound**, and its per-worker latency also beats sticky: empirical evidence complete
|
||||||
|
7. **Hot-triggered migration fixes sticky's hot pin**: design complete, e2e validation pending
|
||||||
|
|
||||||
## 7. To do
|
## 7. To do
|
||||||
|
|
||||||
|
|||||||
@@ -15,20 +15,28 @@ bottom; the **Summary** block is what gets cited.
|
|||||||
| Effective per-stream TPOT during **8k-token** prefill burst (D=8) | **114 ms (≈15× baseline)** |
|
| Effective per-stream TPOT during **8k-token** prefill burst (D=8) | **114 ms (≈15× baseline)** |
|
||||||
| Effective per-stream TPOT during **32k-token** prefill burst (D=8) | **388 ms (≈52×)** |
|
| Effective per-stream TPOT during **32k-token** prefill burst (D=8) | **388 ms (≈52×)** |
|
||||||
| Effective per-stream TPOT during **131k-token** prefill burst (D=8) | **1419 ms (≈183×)** |
|
| Effective per-stream TPOT during **131k-token** prefill burst (D=8) | **1419 ms (≈183×)** |
|
||||||
| Maximum PD-disagg benefit per agentic decode | **≤ 50–200 ms** (= decode duration) |
|
|
||||||
|
|
||||||
**§3.2 headline (cost vs benefit, this run + MB2)**:
|
**What MB1 actually measures**:
|
||||||
|
|
||||||
> Under chunked-prefill, every ongoing decode stream is essentially
|
> During a prefill burst, every ongoing decode stream is essentially
|
||||||
> **halted while a prefill chunk is in flight** — per-stream effective
|
> halted (per-stream effective TPOT is 15×–2000× baseline, scaling with
|
||||||
> TPOT during the burst is 15× to 2000× baseline, scaling with prefill
|
> prefill size). The **total decode time lost per prefill event is
|
||||||
> size. PD-disagg can recover this stall, but the recovery is bounded by
|
> `D × T_prefill`** (D concurrent decodes each lose ~T_prefill of useful
|
||||||
> the **decode duration** of the request being protected. For agentic,
|
> work). For the trace mean (P ≈ 33k tokens, T_prefill ≈ 4.5 s) at D=8
|
||||||
> decode is 50–200 ms (tool-call output). MB2 shows PD-disagg pays
|
> that's **~36 seconds of decode-equivalent work lost per request**.
|
||||||
> 300 ms – 10 s of KV-transfer cost per request to do that recovery. The
|
> This is the **upper bound on what PD-disaggregation's phase isolation
|
||||||
> cost exceeds the benefit ceiling for any per-request KV ≥ ~80 MiB
|
> could recover** on the decode side.
|
||||||
> (~830 tokens) — well below all agentic operating points. The benefit
|
|
||||||
> never beats the cost in this workload.
|
**⚠ Correction (2026-05-27)**: an earlier version of this README framed
|
||||||
|
the §3.2 PD-disagg argument as "phase-isolation benefit is capped at
|
||||||
|
the decode duration of the new request (~50–200 ms), so MB2 transfer
|
||||||
|
cost dominates". That framing was wrong. The correct accounting is
|
||||||
|
benefit-per-prefill-event = D × T_prefill (aggregate decode time saved
|
||||||
|
across all stalled streams), which is **much larger than per-request
|
||||||
|
transfer cost**. The actual reason static PD-disagg fails in agentic
|
||||||
|
is **D-side KV pool capacity** (`figs/f4b_pdsep_kv_wall.png`), not a
|
||||||
|
cost-vs-benefit imbalance on phase isolation. See `RESULTS_SUMMARY.md`
|
||||||
|
section 4 for the corrected framing.
|
||||||
|
|
||||||
## Setup
|
## Setup
|
||||||
|
|
||||||
@@ -107,25 +115,30 @@ the cleanest "average over the whole burst window" number).
|
|||||||
halts decode). This is the entire prefill duration's worth of decode
|
halts decode). This is the entire prefill duration's worth of decode
|
||||||
time that could in principle be recovered.
|
time that could in principle be recovered.
|
||||||
|
|
||||||
Two big caveats for **agentic** application:
|
**Connecting to the §3.2 PD-disagg argument** (corrected):
|
||||||
|
|
||||||
1. **Decode is short** (~50–200 ms for tool-call output). The actual
|
PD-disagg's promised phase-isolation benefit is **per prefill event**,
|
||||||
recoverable benefit per request is bounded by the decode duration,
|
not per request. When a new prefill arrives, it stalls every concurrent
|
||||||
not by `prefill_ttft`. If a decode lasts 100 ms and a 5-second prefill
|
decode stream on the same GPU. The aggregate decode time lost across
|
||||||
collides with it, PD-disagg can save at most 100 ms — not 5 s.
|
those D streams is `D × T_prefill`. PD-disagg moving prefill off-decode-GPU
|
||||||
2. **PD-disagg pays KV-transfer cost** (MB2: 300 ms – 10 s per request
|
recovers all of it.
|
||||||
for agentic sizes). For any KV ≥ ~80 MiB the cost already exceeds the
|
|
||||||
~100 ms benefit ceiling. Cost > benefit across the whole agentic
|
|
||||||
distribution.
|
|
||||||
|
|
||||||
## §3.2 cost-vs-benefit figure
|
Plugging numbers per prefill event:
|
||||||
|
|
||||||
`figs/pd_cost_vs_benefit.png` overlays MB1 benefit ceiling (50–200 ms
|
| Prefill size | T_prefill | PD-disagg cost (MB2 T_transfer) | PD-disagg benefit (D=8 × T_prefill) | Ratio |
|
||||||
band, capped by decode duration) on top of MB2 transfer cost curve. The
|
|---:|---:|---:|---:|---:|
|
||||||
cost curve crosses the benefit ceiling somewhere around **80 MiB / 830
|
| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s | 0.7 % |
|
||||||
tokens** of KV — well below the trace mean (192 MiB / 2k tok ≈ trace
|
| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s | 0.9 % |
|
||||||
mean per request KV, and we know agentic averages 33k tokens, p99
|
| 125k tok (~p99) | 57 s | 1.9 s | 456 s | 0.4 % |
|
||||||
125k). For anything bigger PD-disagg pays more than it can recover.
|
|
||||||
|
On the **phase-isolation axis alone**, PD-disagg wins by 100×–250×.
|
||||||
|
The reason static PD-disagg nonetheless **fails in agentic** is a
|
||||||
|
*different* failure mode: the D-side KV pool cannot fit p90+ requests
|
||||||
|
(p99 = 11.5 GiB; D-instance pool ≈ 38 GiB; 4P+4D halves system-wide
|
||||||
|
decode capacity → TTFT p50 62×, success rate 99.5% → 52% in colleague's
|
||||||
|
4P+4D experiment). The structural problem is **capacity** (see
|
||||||
|
`figs/f4b_pdsep_kv_wall.png`), not transfer-cost vs phase-isolation
|
||||||
|
trade-off.
|
||||||
|
|
||||||
## Reproduction
|
## Reproduction
|
||||||
|
|
||||||
@@ -174,4 +187,7 @@ ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop
|
|||||||
|
|
||||||
3 × 5 × 3 sweep. CSV: `analysis/mb1/summary.csv`. Per-config JSONs on
|
3 × 5 × 3 sweep. CSV: `analysis/mb1/summary.csv`. Per-config JSONs on
|
||||||
dash1 at `/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/`.
|
dash1 at `/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/`.
|
||||||
Figures: `figs/mb1_interference.png`, `figs/pd_cost_vs_benefit.png`.
|
Figure: `figs/mb1_interference.png`. The figure
|
||||||
|
`figs/pd_cost_vs_benefit.png` from the original commit `029821c` was
|
||||||
|
based on the wrong "benefit ≤ decode duration" accounting; **deleted in
|
||||||
|
the correction commit**.
|
||||||
|
|||||||
@@ -24,12 +24,25 @@ get cheaper by co-locating P and D on the same node — the ~9.7 GB/s
|
|||||||
ceiling applies regardless. Halving the transfer cost cannot be bought
|
ceiling applies regardless. Halving the transfer cost cannot be bought
|
||||||
back by topology.
|
back by topology.
|
||||||
|
|
||||||
**Headline for the paper §3.2**: at the agentic tail, **pure KV transfer
|
**What MB2 actually measures**: the **per-request charge** that
|
||||||
takes 1.5 – 10 s**. A median agentic decode is **50 – 200 ms** of tool-call
|
PD-disagg pays for every routed request — `T_transfer ≈ KV_size / 9.7
|
||||||
output. So **PD-disaggregation adds 8 – 100 × decode-time of transfer on
|
GB/s`. For agentic this is **8 ms (192 MiB / trace lower) – 1.9 s
|
||||||
top of every routed request**. Phase isolation (the thing PD-disagg
|
(11.5 GiB / p99)**.
|
||||||
trades transfer cost for) can only win back at most one decode duration
|
|
||||||
— for agentic that's negligible. The arithmetic is one-sided.
|
**⚠ Correction (2026-05-27)**: an earlier version of this README
|
||||||
|
framed §3.2 as "transfer cost (1.5–10 s) >> decode duration (50–200 ms),
|
||||||
|
so PD-disagg loses on cost-vs-benefit." That accounting was wrong:
|
||||||
|
PD-disagg's phase-isolation benefit is **per-prefill-event** and equals
|
||||||
|
`D × T_prefill` (aggregate across stalled decode streams), not the
|
||||||
|
single-request decode duration. With trace-mean `T_prefill = 4.5 s` and
|
||||||
|
D = 8, the benefit is ~36 s — far larger than the ~0.32 s transfer
|
||||||
|
cost. PD-disagg's phase-isolation axis is a *win*, not a loss.
|
||||||
|
|
||||||
|
The actual reason static PD-disagg fails in agentic is **D-side KV
|
||||||
|
capacity** (`figs/f4b_pdsep_kv_wall.png`), not a cost-vs-benefit
|
||||||
|
imbalance. See `RESULTS_SUMMARY.md` section 4 for the corrected
|
||||||
|
framing. MB2 still serves as the source of the per-request transfer
|
||||||
|
cost number used in that analysis.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -137,43 +150,44 @@ analysis; not done yet.
|
|||||||
treats them as additional samples (same sizes); the per-size
|
treats them as additional samples (same sizes); the per-size
|
||||||
aggregates use all of them.
|
aggregates use all of them.
|
||||||
|
|
||||||
## Implications for §3.2 PD-disagg cost argument
|
## Implications for §3.2 PD-disagg argument
|
||||||
|
|
||||||
For each PD-disagg-routed request, transfer wall-time is:
|
For each PD-disagg-routed request, transfer wall-time is:
|
||||||
|
|
||||||
```
|
```
|
||||||
T_transfer(KV_size) = max( pure_transfer(KV_size), rx_overhead )
|
T_transfer(KV_size) ≈ KV_size / 9.7 GB/s for KV_size ≤ 3 GiB
|
||||||
≈ KV_size / 9.7 GB/s for KV_size <= 3 GiB
|
|
||||||
≈ 0.3 – 10 s for KV_size in [3, 12] GiB
|
≈ 0.3 – 10 s for KV_size in [3, 12] GiB
|
||||||
```
|
```
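The bandwidth-bound estimate in the block above can be written as a tiny helper. This is a sketch under stated assumptions: `estimate_transfer_s` is a hypothetical name, "9.7 GB/s" is read as decimal gigabytes (9.7 × 10⁹ bytes/s), and the linear model is applied only up to 3 GiB because the README gives only a range, not a formula, beyond that point.

```python
GIB = 1024 ** 3
BANDWIDTH_BYTES_PER_S = 9.7e9  # 9.7 GB/s; reading GB as 10**9 bytes is an assumption


def estimate_transfer_s(kv_bytes, bandwidth=BANDWIDTH_BYTES_PER_S):
    """Bandwidth-bound transfer-time estimate for one routed request.

    Only valid for KV sizes up to 3 GiB; above that the README reports a
    measured range (0.3 to 10 s) rather than a linear relationship.
    """
    if kv_bytes > 3 * GIB:
        raise ValueError("linear model is only stated for KV sizes up to 3 GiB")
    return kv_bytes / bandwidth


print(f"3 GiB: {estimate_transfer_s(3 * GIB):.3f} s")
```

For a 3 GiB KV the estimate lands close to the roughly 320 ms figure used in the tables; refusing larger inputs keeps the helper from extrapolating into the region where measured transfers show high variance.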
|
||||||
|
|
||||||
Agentic decode wall-time is typically 50 – 200 ms (tool-call output of
|
This is the **per-request transfer charge** of PD-disagg. It's a
|
||||||
a few tens of tokens at ~50 tok/s). So the **transfer/decode ratio**
|
real cost, but in the context of phase-isolation accounting it is
|
||||||
under intra-node best-case Mooncake is:
|
*small* compared to the benefit:
|
||||||
|
|
||||||
| KV size | T_transfer @9.7 GB/s | typical decode | T_transfer / T_decode |
|
| Prefill | T_prefill (MB1) | T_transfer (MB2) | Phase-isolation benefit at D=8 = D × T_prefill |
|
||||||
|---|---:|---:|---:|
|
|---:|---:|---:|---:|
|
||||||
| 192 MiB (2k tok) | 20 ms | 100 ms | 0.2× |
|
| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s |
|
||||||
| 768 MiB (8k tok) | 84 ms | 100 ms | 0.8× |
|
| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s |
|
||||||
| 3 GiB (33k tok ≈ trace mean) | 321 ms | 100 ms | **3.2×** |
|
| 125k tok (~p99) | 57 s | 1.9 s | 456 s |
|
||||||
| 6 GiB (~p90) | 1900 ms | 100 ms | **19×** |
|
|
||||||
| 12 GiB (~p99) | 2800 ms | 100 ms | **28×** (median) – **100×** (p99 variance) |
|
|
||||||
|
|
||||||
PD-disagg's promised payoff is *eliminating prefill–decode interference
|
On the phase-isolation axis alone, PD-disagg recovers two orders of
|
||||||
on the decode instance*. The maximum benefit it can buy is bounded
|
magnitude more decode time than it pays in transfer. **It is NOT this
|
||||||
above by the **decode duration itself** (you cannot recover more time
|
axis that defeats static PD-disagg in agentic** — see colleague's
|
||||||
than the decode existed). For agentic that's 50 – 200 ms. The cost is
|
4P+4D experiment (TTFT p50 62×, success rate 99.5% → 52%) which is
|
||||||
the table column above — 0.3 – 10 s of transfer per routed request.
|
driven by **D-side KV-pool overflow** on long-context requests
|
||||||
|
(`figs/f4b_pdsep_kv_wall.png`), not by transfer latency.
|
||||||
|
|
||||||
**Cost > Benefit by 5× to 100× across the agentic distribution.** Below
|
What MB2 contributes to the paper is therefore:
|
||||||
~3 GiB the ratio is small (≤1×); above 3 GiB the ratio explodes; above
|
- The **per-request transfer cost number** (used as the cost input
|
||||||
6 GiB even individual draws can take 10 s for a single transfer.
|
to the cost-benefit accounting above).
|
||||||
|
- The empirical observation that **Mooncake's transfer cost is
|
||||||
|
topology-independent** — intra-node and inter-node both go through
|
||||||
|
the RDMA NIC and hit the same 9.7 GB/s ceiling. PD-disagg's
|
||||||
|
transfer cost does not get cheaper by co-locating P and D.
|
||||||
|
|
||||||
This data alone is not the whole §3.2 argument — we still need to
|
The dominant §3.2 failure mode of static PD-disagg in agentic is
|
||||||
account for D-side KV capacity (`f4b`, separate axis), cache reuse loss,
|
**capacity**, not transfer cost. MB3 / MB4 / MB5 will quantify the
|
||||||
and static-partition mismatch (MB3 / MB4 / MB5). But it nails one
|
remaining axes (D-pool occupancy, cache reuse degradation under PD
|
||||||
of the two key cost axes with measured numbers from vanilla mooncake,
|
routing, static-partition mismatch).
|
||||||
not the dash0 patched build.
|
|
||||||
|
|
||||||
## Open questions / next runs
|
## Open questions / next runs
|
||||||
|
|
||||||
|
|||||||
Binary file not shown. Size before: 122 KiB. Size after: 114 KiB.
Binary file not shown. Size before: 161 KiB.
@@ -1,20 +1,19 @@
|
|||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.
|
"""Plot MB1 phase-interference data.
|
||||||
|
|
||||||
Two outputs:
|
Single output: figs/mb1_interference.png — effective per-stream TPOT
|
||||||
|
during a prefill burst, vs prefill size, one line per concurrent decode
|
||||||
|
batch size D.
|
||||||
|
|
||||||
mb1_interference.png
|
Earlier versions of this script also produced figs/pd_cost_vs_benefit.png
|
||||||
Effective TPOT during prefill vs prefill size, one line per D.
|
which composed a "max PD-disagg benefit = decode duration (50–200 ms)
|
||||||
Log-log. Annotates typical agentic decode duration (~100 ms) as a
|
band" against the MB2 transfer-cost curve. That accounting was wrong
|
||||||
horizontal band so reader can spot when decode would be stalled.
|
(see RESULTS_SUMMARY.md §4 correction): phase-isolation benefit is
|
||||||
|
per-prefill-event, equal to D × T_prefill across stalled streams, not
|
||||||
pd_cost_vs_benefit.png
|
capped by a single request's decode duration. That figure has been
|
||||||
The §3.2 headline. X axis: KV size (MiB). Two stacked curves:
|
removed; the math it implied was structurally backwards. The dominant
|
||||||
- benefit ceiling (MB1) — at most one decode-duration per request
|
reason static PD-disagg fails in agentic is **D-side KV capacity**
|
||||||
of phase isolation can be recovered. Drawn as a flat 100 ms line.
|
(see figs/f4b_pdsep_kv_wall.png), not cost-vs-benefit on phase isolation.
|
||||||
- cost (MB2) — Mooncake pure_transfer p50 at that size.
|
|
||||||
Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
|
|
||||||
structurally loses.
|
|
||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
@@ -25,21 +24,16 @@ from pathlib import Path
|
|||||||
import matplotlib
|
import matplotlib
|
||||||
matplotlib.use("Agg")
|
matplotlib.use("Agg")
|
||||||
import matplotlib.pyplot as plt
|
import matplotlib.pyplot as plt
|
||||||
import numpy as np
|
|
||||||
|
|
||||||
|
|
||||||
def main() -> None:
|
def main() -> None:
|
||||||
p = argparse.ArgumentParser()
|
p = argparse.ArgumentParser()
|
||||||
p.add_argument("--mb1", type=Path, required=True)
|
p.add_argument("--mb1", type=Path, required=True)
|
||||||
p.add_argument("--mb2-intra", type=Path, required=True)
|
p.add_argument("--out", type=Path, default=Path("figs/mb1_interference.png"))
|
||||||
p.add_argument("--mb2-inter", type=Path, default=None)
|
|
||||||
p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
|
|
||||||
p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
|
|
||||||
args = p.parse_args()
|
args = p.parse_args()
|
||||||
|
|
||||||
mb1 = json.loads(args.mb1.read_text())["summary"]
|
mb1 = json.loads(args.mb1.read_text())["summary"]
|
||||||
|
|
||||||
# ---- mb1_interference.png ----
|
|
||||||
fig, ax = plt.subplots(figsize=(9, 5.5))
|
fig, ax = plt.subplots(figsize=(9, 5.5))
|
||||||
Ds = sorted({s["decode_batch_size"] for s in mb1})
|
Ds = sorted({s["decode_batch_size"] for s in mb1})
|
||||||
colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
|
colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
|
||||||
@@ -50,79 +44,19 @@ def main() -> None:
|
|||||||
ys = [s["effective_tpot_during_ms"] for s in rows]
|
ys = [s["effective_tpot_during_ms"] for s in rows]
|
||||||
ax.plot(xs, ys, "o-", lw=2, markersize=7,
|
ax.plot(xs, ys, "o-", lw=2, markersize=7,
|
||||||
color=colors.get(D, "gray"),
|
color=colors.get(D, "gray"),
|
||||||
label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")
|
label=f"D={D} (baseline TPOT {rows[0]['baseline_tpot_ms']:.1f} ms)")
|
||||||
|
|
||||||
for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
|
|
||||||
(100, "agentic decode (~100 ms)"),
|
|
||||||
(200, "long agentic decode (~200 ms)")]:
|
|
||||||
ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
|
|
||||||
ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")
|
|
||||||
|
|
||||||
ax.set_xscale("log"); ax.set_yscale("log")
|
ax.set_xscale("log"); ax.set_yscale("log")
|
||||||
ax.set_xlabel("Prefill burst size (tokens, log)")
|
ax.set_xlabel("Prefill burst size (tokens, log)")
|
||||||
ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
|
ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
|
||||||
ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
|
ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
|
||||||
"(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
|
"(chunked-prefill ON, vLLM 0.18.1 default, single H20). "
|
||||||
|
"Per-prefill aggregate stall = D × T_prefill.")
|
||||||
ax.grid(True, which="both", alpha=0.3)
|
ax.grid(True, which="both", alpha=0.3)
|
||||||
ax.legend(loc="upper left", fontsize=9)
|
ax.legend(loc="upper left", fontsize=9)
|
||||||
args.out_interf.parent.mkdir(parents=True, exist_ok=True)
|
args.out.parent.mkdir(parents=True, exist_ok=True)
|
||||||
fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig)
|
fig.tight_layout(); fig.savefig(args.out, dpi=150); plt.close(fig)
|
||||||
print(f"wrote {args.out_interf}")
|
print(f"wrote {args.out}")
|
||||||
|
|
||||||
# ---- pd_cost_vs_benefit.png ----
|
|
||||||
mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
|
|
||||||
mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
|
|
||||||
intra_x_mib = [s["kv_mib"] for s in mb2_intra]
|
|
||||||
intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra]
|
|
||||||
|
|
||||||
fig, ax = plt.subplots(figsize=(9, 5.5))
|
|
||||||
ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
|
|
||||||
markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
|
|
||||||
if args.mb2_inter:
|
|
||||||
mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
|
|
||||||
mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
|
|
||||||
inter_x = [s["kv_mib"] for s in mb2_inter]
|
|
||||||
inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
|
|
||||||
ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
|
|
||||||
alpha=0.7, label="MB2 inter-node (same numbers)")
|
|
||||||
|
|
||||||
# Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
|
|
||||||
ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
|
|
||||||
label="MB1 max benefit ≤ agentic decode (~100 ms)")
|
|
||||||
ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
|
|
||||||
label="benefit range (50–200 ms decode)")
|
|
||||||
|
|
||||||
# Mark agentic-tail request sizes
|
|
||||||
for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
|
|
||||||
(3072, "p90\n(~33k tok)"),
|
|
||||||
(6144, "p95\n(~65k tok)"),
|
|
||||||
(11500, "p99\n(11.5 GiB)")]:
|
|
||||||
ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
|
|
||||||
ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
|
|
||||||
ha="center", va="bottom")
|
|
||||||
|
|
||||||
ax.set_xscale("log"); ax.set_yscale("log")
|
|
||||||
ax.set_xlim(40, 14000)
|
|
||||||
ax.set_ylim(1, 12000)
|
|
||||||
ax.set_xlabel("Per-request KV size (MiB, log)")
|
|
||||||
ax.set_ylabel("Time per request (ms, log)")
|
|
||||||
ax.set_title("§3.2 headline — PD-disagg KV transfer cost vs phase-isolation benefit\n"
|
|
||||||
"(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
|
|
||||||
ax.grid(True, which="both", alpha=0.3)
|
|
||||||
ax.legend(loc="upper left", fontsize=9)
|
|
||||||
|
|
||||||
# Add explanatory annotation
|
|
||||||
ax.text(10000, 5000,
|
|
||||||
"Cost > benefit for ANY KV size above\n"
|
|
||||||
"the green band (~80 MiB / ~830 tokens).\n"
|
|
||||||
"Below: cost is marginal (<10 ms) but\n"
|
|
||||||
"benefit is also small (decode is short).",
|
|
||||||
fontsize=9, color="#333",
|
|
||||||
ha="right", va="top",
|
|
||||||
bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))
|
|
||||||
|
|
||||||
fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig)
|
|
||||||
print(f"wrote {args.out_cb}")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
|||||||