Gahow Wang da39ab6804 Correct PD-disagg cost/benefit framing across repo

The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 22:04:49 +08:00

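Established conclusions so far (2026-05-27)

A summary of the claims the EAR project can currently support with measured data. Each claim is tagged with its figure or data path.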

1. Workload characteristics (§2)

Production trace = Qwen3-Coder agentic, 1.3 M sessions / 2.1 M requests / 7200 s.

| Property | Data | Evidence |
| --- | --- | --- |
| KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1%; theoretical APC upper bound 79.6% | `figs/f2a_reuse_topology.png` |
| Sessions are extremely skewed | top 1%/5%/10%/25%/50% of sessions = 46.5%/66.5%/74.6%/87.5%/96.0% of input mass | `figs/f2b_session_skew.png` |
| Single-request KV is already large | p50 1.8 GiB / p90 8.0 GiB / p95 9.6 GiB / p99 11.5 GiB; KV pool 38 GiB/instance (0.4 × H20 96 GiB) → only 3 p99 requests fit per instance | `figs/f2c_kv_footprint_cdf.png` |

Conclusion: the cache is session-local, so scheduling must preserve session affinity. Single-request KV already approaches the pool limit, so a 4P+4D PD-disagg split directly halves system decode capacity.

2. Dispatch Coupling (§2.3)

| Metric | Agentic (Qwen3-Coder) | Chatbot (qwen3-max) |
| --- | --- | --- |
| Inter-turn `T_external` p50 | 1.6 s | 7.2 s |
| Share with `gap < 1 s` | 39% | 4% |
| Share with `gap < 5 s` | 67% | 29% |
| p99 | 738 s | 43 s |

Reference figure: figs/f3a_inter_turn_gap.png.

Conclusion: agentic has a sub-second tool-call mode that chatbot lacks (39% vs 4%). When W_turn ≳ T_external (which holds on agentic for any scheduler with W_turn > 1 s), Little's Law L = Λ · N · (W_turn(L) + T_external) enters a closed-loop regime, and a scheduler's ε regression is amplified through the KV-contention feedback loop into a several-fold wall-clock gap. Measured: lmetric needs 49 min of wall-clock to run a 600 s trace = 8x amplification.
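
The closed-loop behaviour of L = Λ · N · (W_turn(L) + T_external) can be illustrated with a fixed-point iteration. This is only a sketch under stated assumptions: Λ is read as the session arrival rate and N as turns per session, and the linear model `W_turn(L) = w0 + slope · L` is hypothetical (the trace does not give the shape of W_turn). All numeric values except T_external = 1.6 s are made up for illustration.

```python
def closed_loop_occupancy(arrival_rate, turns, t_external, w0, slope, iters=500):
    """Iterate L <- Lambda * N * (W_turn(L) + T_external) to its fixed point.

    Assumes a hypothetical linear latency model W_turn(L) = w0 + slope * L.
    The iteration converges only when arrival_rate * turns * slope < 1.
    """
    load = 0.0
    for _ in range(iters):
        w_turn = w0 + slope * load
        load = arrival_rate * turns * (w_turn + t_external)
    return load


# Hypothetical operating point; only t_external = 1.6 s comes from the table above.
base = closed_loop_occupancy(arrival_rate=1.0, turns=10, t_external=1.6, w0=1.0, slope=0.08)
worse = closed_loop_occupancy(arrival_rate=1.0, turns=10, t_external=1.6, w0=1.1, slope=0.08)

# Without feedback (slope = 0) the same 0.1 s regression adds Lambda * N * 0.1 = 1.0
# to the occupancy; with feedback it adds 1 / (1 - Lambda * N * slope) times as much.
open_loop_delta = 1.0 * 10 * 0.1
print(base, worse, (worse - base) / open_loop_delta)
```

In this toy model the loop gain is Λ · N · slope, so the amplification factor 1 / (1 − gain) grows without bound as the gain approaches 1, which is one way to picture how a small per-turn regression can become a large wall-clock gap.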

3. Three failure classes of existing schedulers (§3)

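| Baseline | Failure mode | Data |
| --- | --- | --- |
| load-balance / LMetric | Loses locality | lmetric APC 56.9% (vs the 79.6% upper bound); LMetric is only +3.3pp better than load_only, because the cache signal is swallowed by num_req in the multiplicative score `(pending+input−hit) × num_req` |
| Static PD-disagg | D-side KV capacity wall + transfer cost | See the cost-vs-benefit analysis in §4 |
| Pure sticky | Every worker is dragged down by hot sessions, not a single hotspot | sticky median worker 20.3 s vs unified 10.3 s; system e2e p90 sticky 34.6 s vs unified 18.0 s (the max/median ratio is a misleading measure; §3.3 uses absolute per-worker latency) |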

Reference figures: figs/f4a_apc_loss.png, figs/f4b_pdsep_kv_wall.png, figs/f4c_per_worker_ttft.png, figs/f6_e2e_latency_bars.png, figs/f6_e2e_latency_full_grid.png.
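
To make the "cache signal is swallowed by num_req" point concrete, here is a small sketch of the multiplicative score quoted in the table. The function name, the token counts, and the "lower score wins" convention are assumptions for illustration; only the formula `(pending+input−hit) × num_req` comes from the table.

```python
def lmetric_score(pending_tokens, input_tokens, hit_tokens, num_req):
    """Multiplicative score from the table: (pending + input - hit) * num_req."""
    return (pending_tokens + input_tokens - hit_tokens) * num_req


# Hypothetical workers: "warm" already caches 80% of the prompt but runs
# twice as many requests as "cold", which has no cache hit at all.
warm = lmetric_score(pending_tokens=10_000, input_tokens=10_000, hit_tokens=8_000, num_req=8)
cold = lmetric_score(pending_tokens=10_000, input_tokens=10_000, hit_tokens=0, num_req=4)

# Assumption: the router picks the worker with the lower score.
chosen = "warm" if warm < cold else "cold"
print(warm, cold, chosen)
```

With these numbers the 80% cache hit shrinks the token term by only 40% (20,000 to 12,000), while the 2× difference in num_req doubles the score, so the request is routed away from its cache.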

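4. Why static PD-disagg fails (§3.2): a capacity problem, not a cost-benefit problem

⚠ Correction, 2026-05-27: the previous version of this section argued that "PD-disagg fails because transfer cost > phase-isolation benefit". That argument was miscalculated. The correct phase-isolation benefit is summed over the D concurrent streams for each prefill event (≈ D × T_prefill); it is not the decode duration of a single request. With the correct formula, PD-disagg beats colo on the phase-isolation axis by one to two orders of magnitude. The real root cause of static PD-disagg failing on agentic is D-side KV pool capacity, not this axis.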

4.1 The real failure mode: the D-side KV capacity ceiling

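| | 8C colo | 4P+4D PD-disagg |
| --- | --- | --- |
| Per-D-instance KV pool (0.4 × 96 GiB) | 38 GiB | 38 GiB |
| Total system decode capacity (D instances × pool) | 8 × 38 = 304 GiB | 4 × 38 = 152 GiB |
| p99 single-request KV = 11.5 GiB → max concurrent decodes | 24 | 12 (halved) |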

Colleague's 4P+4D measurement: TTFT p50 0.91 s → 62.8 s (62×), success rate 99.5% → 52%. Failure mode: D-pool overflow plus queueing, not transfer latency.
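
The 24 vs 12 figures in the table follow from a floor division per instance. A minimal sketch, assuming every request is p99-sized and that a request's KV cannot be split across instances (the function name is hypothetical):

```python
import math


def max_concurrent_decodes(n_decode_instances, pool_gib=38.0, req_kv_gib=11.5):
    """Worst-case concurrent decodes when every request holds req_kv_gib of KV.

    Assumes a request's KV must fit entirely inside one instance's pool.
    """
    per_instance = math.floor(pool_gib / req_kv_gib)  # 38 / 11.5 -> 3 requests
    return n_decode_instances * per_instance


colo = max_concurrent_decodes(8)      # 8C colo: every instance can decode
pd_disagg = max_concurrent_decodes(4)  # 4P+4D: only the 4 D instances decode
print(colo, pd_disagg)
```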

Reference figure: figs/f4b_pdsep_kv_wall.png (the pdf version is the high-quality paper figure).

4.2 MB2: KV transfer cost (a one-off per-request cost, not the dominant cost)

Swept 9 sizes × 5 reps on dash1 GPU 0+1 (intra) and dash1 ↔ dash2 (inter, 200 Gbps RoCE).

| Path | Steady-state bandwidth (≤ 3 GiB) | Transfer of a p99 agentic request (11.5 GiB) |
| --- | --- | --- |
| Intra-node | 9.7 GB/s | p50 1.9 s · max 10 s |
| Inter-node | 10.0 GB/s (<3% difference) | p50 1.7 s · max 9.2 s |

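New finding: the intra-node and inter-node curves nearly coincide → Mooncake batch_transfer_sync_write always goes through the RDMA NIC and never uses NVLink. The 200 Gbps NIC is the ceiling. PD-disagg transfer cost is independent of topology.

Reference figure: figs/mb2_transfer_time_compare.png; doc: analysis/mb2/README.md.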

4.3 MB1: Phase interference (upper bound on PD-disagg's potential benefit)

Single instance on dash1 GPU 0 (no kv_connector), chunked-prefill on by default, D × P sweep. Results at D=8:

| Prefill | T_prefill | Per-stream TPOT during prefill | Penalty |
| --- | --- | --- | --- |
| 2k tok | 143 ms | 32 ms | 4× |
| 32k tok | 4.5 s | 388 ms | 52× |
| 131k tok | 57 s | 1419 ms | 183× |

During a prefill, decode is almost completely halted, so each stream loses ≈ T_prefill of time. The total decode loss per prefill event is ≈ D × T_prefill.

Reference figure: figs/mb1_interference.png; doc: analysis/mb1/README.md.
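
How tight the ≈ D × T_prefill approximation is can be checked against the MB1 rows. The sketch below assumes the penalty column is TPOT_during / TPOT_baseline, so the fraction of decode progress lost during a prefill is 1 − 1/penalty; the function name is hypothetical.

```python
def decode_loss_s(d_streams, t_prefill_s, penalty):
    """Aggregate decode time lost per prefill event.

    Assumes penalty = TPOT_during / TPOT_baseline, so each stream keeps
    1/penalty of its normal progress and loses the remaining fraction.
    """
    stall_fraction = 1.0 - 1.0 / penalty
    return d_streams * t_prefill_s * stall_fraction


# (label, T_prefill in seconds, penalty) taken from the D=8 table above.
mb1_rows = [("2k tok", 0.143, 4), ("32k tok", 4.5, 52), ("131k tok", 57.0, 183)]
for label, t_prefill, penalty in mb1_rows:
    exact = decode_loss_s(8, t_prefill, penalty)
    approx = 8 * t_prefill
    print(f"{label:>8}: loss {exact:7.2f} s vs D x T_prefill {approx:7.2f} s")
```

Under this reading the approximation is within about 2% for the 32k and 131k rows, but overstates the 2k row by a quarter, since a 4× penalty still leaves each stream 25% of its normal progress.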

4.4 Joint cost-benefit (per prefill event)

| Prefill (KV size) | T_prefill | Cost = T_transfer | Benefit = D × T_prefill (D=8) | Cost / Benefit |
| --- | --- | --- | --- | --- |
| 2k tok (192 MiB) | 0.14 s | 8 ms | 1.1 s | 0.7% |
| 33k tok (3 GiB, trace mean) | 4.5 s | 0.32 s | 36 s | 0.9% |
| 125k tok (12 GiB, ~p99) | 57 s | 1.9 s | 456 s | 0.4% |

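On the phase-isolation axis, PD-disagg wins by 100×–250×. But this is not the argument §3.2 should use, because the dominant failure in §3.2 is the D-pool capacity ceiling of §4.1, which overrides all of the math above.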
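
The cost/benefit column can be reproduced directly from the other columns. A minimal sketch, using the rounded T_prefill and T_transfer values from the table and the D × T_prefill approximation (the function name is hypothetical):

```python
def pd_cost_benefit(t_prefill_s, t_transfer_s, d_streams=8):
    """Return (benefit in seconds, cost/benefit ratio) for one prefill event.

    Benefit uses the approximation D * T_prefill; cost is one KV transfer.
    """
    benefit_s = d_streams * t_prefill_s
    return benefit_s, t_transfer_s / benefit_s


# (label, T_prefill in seconds, T_transfer in seconds) from the table above.
rows = [("2k tok", 0.14, 0.008), ("33k tok", 4.5, 0.32), ("125k tok", 57.0, 1.9)]
ratios = []
for label, t_prefill, t_transfer in rows:
    benefit, ratio = pd_cost_benefit(t_prefill, t_transfer)
    ratios.append(ratio)
    print(f"{label:>8}: benefit {benefit:6.1f} s, cost/benefit {ratio:.1%}")
```

Inverting the ratios (1/0.9% ≈ 110 and 1/0.4% = 250) gives the 100×–250× range quoted for this axis.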

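Summary:

The D-side KV capacity ceiling (§4.1) → PD-disagg fails structurally on agentic.
On the combined MB1 + MB2 cost-benefit, PD-disagg wins on the phase-isolation axis, but the capacity ceiling overwhelms that win.
The paper's §3.2 argument should focus on "the D pool cannot hold the requests"; MB1/MB2 data serves as supporting context (quantifying the per-request transfer charge and the phase-isolation benefit), not as the main argument.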

5. Empirical status of the EAR design (§4)

| Pillar | Validated | To be validated |
| --- | --- | --- |
| Affinity-default routing (Pillar 1) | ✅ The current `unified` algorithm = LMetric + high-cache affinity; APC 79.4% (97% of the 79.6% upper bound), TTFT p90 7.3 s, median worker p90 10.3 s | none |
| Hot-triggered session migration (Pillar 2) | Substrate works: the `kv_both` connector is net positive on trace replay (TTFT p90 −18.6%, −36.6% after the DR-fix); the "+45% kv_both penalty" from the original elastic_migration_v2 paper is obsolete | The e2e policy layer (trigger threshold + target selection inside the feedback loop) has not been validated directly |

6. Paper claims that can already be written (ordered by confidence)

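Agentic and chatbot are different scheduling regimes (dispatch coupling + sub-second tool-call mass): fully validated
Each of the three existing scheduling baselines has its own failure mode (load-balance / static PD-disagg / pure sticky): fully validated
The dominant root cause of static PD-disagg failing on agentic is D-side KV capacity (not the phase-isolation cost-benefit): fully validated (f4b + colleague's 4P+4D data)
Mooncake transfer cost is independent of topology (intra ≈ inter, ~9.7 GB/s NIC ceiling): fully validated (MB2)
Phase-isolation interference remains significant with chunked-prefill on (per-stream TPOT during prefill is 15×–2000× baseline): fully validated (MB1). Note: this data does not by itself argue that "PD-disagg fails", because with the correct accounting PD-disagg actually wins on this axis; its role is to quantify the upper bound on the phase-isolation benefit for §3.2.
Affinity-default scheduling (current unified) reaches the APC upper bound and also beats sticky on per-worker latency: fully validated
Hot-triggered migration fixes sticky's hot pin: design complete, e2e validation pending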

7. To do

MB3-5 (end-to-end PD-disagg deployment): D-pool runtime occupancy, cache reuse × PD interaction, PD ratio sweep. These belong to the full §5 experiment matrix.
EAR Pillar 2 migration e2e validation (re-measure on top of the connector_tax DR-fix)
§5.4 wall-clock amplification sweep (5 baselines × 3 runs, to nail down the empirical closure of the dispatch-coupling argument)
Scale-out validation (dash1+dash2 = 16 GPUs, expanding to 80 GPUs once dash0 + 3-node are available)
