The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot + pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)", implicitly treating the benefit as per-request and bounded by that request's own decode. The correct accounting is per-prefill-event across all stalled streams:

    benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during) ≈ D × T_prefill

which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

    prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
    2k tok       | 0.14 s    | 8 ms       | 1.1 s       | 0.7 %
    33k tok      | 4.5 s     | 320 ms     | 36 s        | 0.9 %
    125k tok     | 57 s      | 1.9 s      | 456 s       | 0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250×, the opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency.
Changes (all touched files explicitly listed; no `git add -u`):

- figs/pd_cost_vs_benefit.png: DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py: drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png: regenerated, no misleading band annotation
- analysis/mb1/README.md: Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong
- analysis/mb2/README.md: Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6: §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim; they already (correctly) frame §3.2 around the D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
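Established conclusions so far (2026-05-27)
A summary of the claims the EAR project can currently support with measured data. Each claim is tagged with its figure or data path.
1. Workload properties (§2)
Production trace = Qwen3-Coder agentic, 1.3 M sessions / 2.1 M reqs / 7200 s.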
| Property | Data | Evidence |
|---|---|---|
| KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1%; theoretical APC upper bound 79.6% | figs/f2a_reuse_topology.png |
| Sessions are extremely skewed | top 1%/5%/10%/25%/50% = 46.5%/66.5%/74.6%/87.5%/96.0% of input mass | figs/f2b_session_skew.png |
| Single-request KV is already large | p50 1.8 GiB / p90 8.0 GiB / p95 9.6 GiB / p99 11.5 GiB; KV pool 38 GiB/instance (0.4 × H20 96 GiB) → only 3 p99 requests fit per instance | figs/f2c_kv_footprint_cdf.png |
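Conclusion: the cache is session-local, so scheduling must preserve session affinity. A single request's KV is a large fraction of the pool (p99 11.5 GiB of 38 GiB), so 4P+4D PD-disagg cuts the system's decode capacity in half.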
2. Dispatch Coupling (§2.3)
| Metric | Agentic (Qwen3-Coder) | Chatbot (qwen3-max) |
|---|---|---|
| Inter-turn T_external p50 | 1.6 s | 7.2 s |
| Share of gaps < 1 s | 39% | 4% |
| Share of gaps < 5 s | 67% | 29% |
| p99 | 738 s | 43 s |

Reference figure: figs/f3a_inter_turn_gap.png.
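Conclusion: agentic traffic has a sub-second tool-call mode that chatbot traffic lacks (39% vs 4% of gaps under 1 s). When W_turn ≳ T_external (which holds on agentic for any scheduler with W_turn > 1 s), Little's Law L = Λ · N · (W_turn(L) + T_external) enters a closed-loop regime, and an ε regression in the scheduler is amplified through the KV-contention feedback loop into a multi-fold wall-clock gap. Measured: lmetric needed 49 min of wall-clock time to replay a 600 s trace = 8x amplification.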
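To make the closed-loop amplification concrete, here is a minimal sketch that solves L = Λ · N · (W_turn(L) + T_external) by fixed-point iteration. Everything beyond that formula is an assumption for illustration: `request_rate` stands in for Λ · N, the linear contention model W_turn(L) = w0 + k · L is hypothetical, and all numeric inputs except T_external = 1.6 s (the agentic p50 above) are made up.

```python
def steady_state_in_flight(request_rate, t_external, w0, k, iters=2000):
    """Iterate L <- request_rate * (W_turn(L) + t_external) to a fixed point.

    request_rate plays the role of Lambda * N (requests per second).
    W_turn(L) = w0 + k * L is a hypothetical linear contention model.
    """
    gain = request_rate * k  # feedback-loop gain; must stay below 1
    if gain >= 1.0:
        raise ValueError("feedback gain >= 1: no finite steady state")
    in_flight = 0.0
    for _ in range(iters):
        w_turn = w0 + k * in_flight
        in_flight = request_rate * (w_turn + t_external)
    return in_flight


def turn_latency(request_rate, t_external, w0, k):
    """Per-turn service time W_turn at the steady state."""
    in_flight = steady_state_in_flight(request_rate, t_external, w0, k)
    return w0 + k * in_flight


# Hypothetical inputs; only t_external = 1.6 s comes from the table above.
base = turn_latency(request_rate=10.0, t_external=1.6, w0=1.0, k=0.05)
worse = turn_latency(request_rate=10.0, t_external=1.6, w0=1.0, k=0.08)
print(f"W_turn {base:.2f} s -> {worse:.2f} s ({worse / base:.1f}x) for a 1.6x larger k")
```

Under this toy model the steady state is L = request_rate · (w0 + T_external) / (1 − request_rate · k), so a modest increase in the contention slope k produces a disproportionately larger W_turn as the gain approaches 1. It does not reproduce the measured lmetric run; it only shows the shape of the feedback.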
3. Three failure classes of existing schedulers (§3)
| Baseline | Failure mode | Data |
|---|---|---|
| load-balance / LMetric | Loses locality | lmetric APC 56.9% (vs. the 79.6% upper bound); LMetric is only +3.3 pp better than load_only, because in the multiplicative score (pending+input−hit) × num_req the cache signal is swallowed by num_req |
| Static PD-disagg | D-side KV capacity wall + transfer cost | see §4 cost-vs-benefit |
| Pure sticky | Hot sessions drag down every worker, not just a single hotspot | sticky median worker 20.3 s vs unified 10.3 s; system e2e p90 sticky 34.6 s vs unified 18.0 s (the max/median ratio is a misleading metric; §3.3 uses absolute per-worker latency) |

Reference figures: figs/f4a_apc_loss.png, figs/f4b_pdsep_kv_wall.png, figs/f4c_per_worker_ttft.png, figs/f6_e2e_latency_bars.png, figs/f6_e2e_latency_full_grid.png.
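The claim that num_req swallows the cache signal in (pending+input−hit) × num_req can be illustrated with a small sketch. The `Worker` fields, the token counts, and the "lowest score wins" rule are assumptions made for illustration; only the score form is taken from the table above, and `work_only_score` is a contrast case, not a scheduler from this project.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    pending_tokens: int  # tokens already queued on the worker (hypothetical unit)
    cached_tokens: int   # prefix tokens of this request already cached there
    num_req: int         # requests currently running on the worker


def multiplicative_score(w, input_tokens):
    # Score form quoted in the table: (pending + input - hit) * num_req.
    return (w.pending_tokens + input_tokens - w.cached_tokens) * w.num_req


def work_only_score(w, input_tokens):
    # Same token term without the num_req factor, for contrast only.
    return w.pending_tokens + input_tokens - w.cached_tokens


def pick(workers, input_tokens, score):
    # Assumption: the router sends the request to the lowest-scoring worker.
    return min(workers, key=lambda w: score(w, input_tokens)).name


workers = [
    Worker("warm", pending_tokens=8_000, cached_tokens=25_000, num_req=3),
    Worker("cold", pending_tokens=4_000, cached_tokens=0, num_req=1),
]
input_tokens = 33_000

print(pick(workers, input_tokens, multiplicative_score))  # cold: 37000 < 48000
print(pick(workers, input_tokens, work_only_score))       # warm: 16000 < 37000
```

In this made-up case the warm worker would skip 25k of 33k prefill tokens, yet two extra running requests are enough to triple its score and send the request to the cold worker.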
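4. Why static PD-disagg fails (§3.2): a capacity problem, not a cost-benefit problem
⚠ Correction (2026-05-27): the previous version of this section argued that "PD-disagg fails because transfer cost > phase-isolation benefit". That argument was miscalculated. The correct phase-isolation benefit is summed per prefill event over the D concurrent streams (≈ D × T_prefill); it is not the decode duration of a single request. With the correct formula, PD-disagg beats colo on the phase-isolation axis by one to two orders of magnitude. The real root cause of static PD-disagg failing on agentic workloads is the D-side KV pool capacity, not this axis.
4.1 The real failure mode: the D-side KV capacity ceiling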
| | 8C colo | 4P+4D PD-disagg |
|---|---|---|
| Per-D-instance KV pool (0.4 × 96 GiB) | 38 GiB | 38 GiB |
| Total system decode capacity (number of D instances × per-instance pool) | 8 × 38 = 304 GiB | 4 × 38 = 152 GiB |
| Max concurrent decodes at p99 single-request KV = 11.5 GiB | 24 | 12 (halved) |
Colleague's 4P+4D measurement: TTFT p50 0.91 s → 62.8 s (62×), success rate 99.5% → 52%. Failure mode: D-pool overflow + queueing, not transfer latency.
Reference figure: figs/f4b_pdsep_kv_wall.png (the PDF version is the high-quality paper figure).
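The 24 vs 12 row can be reproduced with a few lines of arithmetic. This sketch assumes that one request's KV must fit entirely inside a single instance's pool, which is why it floors per instance instead of dividing the system total; the function name is illustrative.

```python
import math


def max_concurrent_decodes(num_decode_instances, pool_gib, request_kv_gib):
    """Upper bound on concurrent decodes when every request has the given KV size.

    Assumption: a request's KV lives entirely in one instance's pool,
    so capacity is floored per instance before multiplying.
    """
    per_instance = math.floor(pool_gib / request_kv_gib)
    return num_decode_instances * per_instance


POOL_GIB = 38.0            # per-instance KV pool from the table
P99_REQUEST_KV_GIB = 11.5  # p99 single-request KV from the table

colo = max_concurrent_decodes(8, POOL_GIB, P99_REQUEST_KV_GIB)
disagg = max_concurrent_decodes(4, POOL_GIB, P99_REQUEST_KV_GIB)
print(f"8C colo: {colo} concurrent p99 decodes, 4P+4D: {disagg}")
```

The halving comes purely from the number of instances that hold decode KV; the per-instance limit of 3 p99-sized requests is the same in both layouts.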
4.2 MB2: KV transfer cost (a one-time per-request cost, not the dominant cost)
Sweep of 9 sizes × 5 reps over dash1 GPU 0+1 (intra) and dash1 ↔ dash2 (inter, 200 Gbps RoCE).
| Path | Steady-state bandwidth (≤ 3 GiB) | Transfer time for a p99 agentic request (11.5 GiB) |
|---|---|---|
| Intra-node | 9.7 GB/s | p50 1.9 s · max 10 s |
| Inter-node | 10.0 GB/s (<3% difference) | p50 1.7 s · max 9.2 s |
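New finding: the intra-node and inter-node curves nearly coincide → Mooncake batch_transfer_sync_write always goes through the RDMA NIC and never uses NVLink. The 200 Gbps NIC is the ceiling. PD-disagg transfer cost is independent of topology.
Reference figure: figs/mb2_transfer_time_compare.png; doc: analysis/mb2/README.md.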
4.3 MB1: Phase interference (upper bound on PD-disagg's potential benefit)
Single instance on dash1 GPU 0 (no kv_connector), chunked prefill enabled by default, sweep over D × P. Results at D=8:
| Prefill | T_prefill | Per-stream TPOT during prefill | Penalty |
|---|---|---|---|
| 2k tok | 143 ms | 32 ms | 4× |
| 32k tok | 4.5 s | 388 ms | 52× |
| 131k tok | 57 s | 1419 ms | 183× |
During a prefill, decode is almost completely halted, so each stream loses ≈ T_prefill of time. Total decode loss per prefill event ≈ D × T_prefill.
Reference figure: figs/mb1_interference.png; doc: analysis/mb1/README.md.
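As a rough check on how tight the ≈ D × T_prefill approximation is, the sketch below applies the per-prefill accounting D × T_prefill × (1 − TPOT_baseline/TPOT_during) to the D=8 rows above. It assumes the penalty column equals TPOT_during / TPOT_baseline, so that TPOT_baseline/TPOT_during = 1/penalty; the function name is illustrative.

```python
D = 8  # concurrent decode streams in the MB1 rows above

# (label, T_prefill in seconds, penalty = TPOT_during / TPOT_baseline, assumed)
MB1_ROWS = [
    ("2k tok", 0.143, 4.0),
    ("32k tok", 4.5, 52.0),
    ("131k tok", 57.0, 183.0),
]


def decode_stall_seconds(d, t_prefill, penalty):
    """Aggregate decode time lost across d streams during one prefill event."""
    return d * t_prefill * (1.0 - 1.0 / penalty)


for label, t_prefill, penalty in MB1_ROWS:
    exact = decode_stall_seconds(D, t_prefill, penalty)
    approx = D * t_prefill
    print(f"{label:>8}: stall {exact:7.2f} s vs D*T_prefill {approx:7.2f} s")
```

Under that reading of the penalty column, the approximation is within about 2% for the 32k and 131k rows and overstates the 2k row by roughly a third, since a 4× slowdown still leaves a quarter of the decode throughput.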
4.4 Combined cost-benefit (per prefill event)
| Prefill (KV size) | T_prefill | Cost = T_transfer | Benefit = D × T_prefill (D=8) | Cost / Benefit |
|---|---|---|---|---|
| 2k tok (192 MiB) | 0.14 s | 8 ms | 1.1 s | 0.7% |
| 33k tok (3 GiB, trace mean) | 4.5 s | 0.32 s | 36 s | 0.9% |
| 125k tok (12 GiB, ~p99) | 57 s | 1.9 s | 456 s | 0.4% |
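PD-disagg wins on the phase-isolation axis by 100×–250×. But this is not the argument §3.2 should use, because the dominant failure in §3.2 is the D-pool capacity ceiling of §4.1, which overrides all of the math above.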
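The ratios in the table above can be recomputed directly from its T_prefill and T_transfer columns. The sketch also derives a break-even stream count, T_transfer / T_prefill, which is not in the original table and is added here only as an illustration of how far the balance tilts; function names are hypothetical.

```python
D = 8  # concurrent decode streams, as in the table

# (label, T_prefill in seconds, T_transfer in seconds), copied from the table
ROWS = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.32),
    ("125k tok", 57.0, 1.9),
]


def cost_over_benefit(d, t_prefill, t_transfer):
    """Transfer cost divided by the approximate benefit d * t_prefill."""
    return t_transfer / (d * t_prefill)


def break_even_streams(t_prefill, t_transfer):
    """Stream count at which d * t_prefill equals t_transfer."""
    return t_transfer / t_prefill


for label, t_prefill, t_transfer in ROWS:
    ratio = cost_over_benefit(D, t_prefill, t_transfer)
    d_star = break_even_streams(t_prefill, t_transfer)
    print(f"{label:>8}: cost/benefit {ratio:.1%}, "
          f"benefit/cost {1 / ratio:.0f}x, break-even D {d_star:.3f}")
```

With these inputs the break-even stream count is well below 1 in every row, so on this axis alone the transfer pays for itself even with a single stalled decode stream.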
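Summary:
- D-side KV capacity ceiling (§4.1) → PD-disagg fails structurally on agentic workloads.
- On the phase-isolation axis, the combined MB1 + MB2 cost-benefit favors PD-disagg, but the capacity ceiling outweighs this.
- The paper's §3.2 argument should focus on "the D pool cannot hold the requests"; MB1/MB2 data serve as supporting context (quantifying the per-request transfer charge and the phase-isolation benefit), not as the main argument.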
5. Empirical status of the EAR design (§4)
| Pillar | Validated | Pending |
|---|---|---|
| Affinity-default routing (Pillar 1) | ✅ The current unified algorithm = LMetric + high-cache affinity; APC 79.4% (reaches 97% of the 79.6% upper bound), TTFT p90 7.3 s, median worker p90 10.3 s | none |
| Hot-triggered session migration (Pillar 2) | Substrate works: the kv_both connector is net positive on trace replay (TTFT p90 −18.6%, −36.6% after the DR-fix); the "+45% kv_both penalty" from the original elastic_migration_v2 paper is obsolete | The e2e policy layer (trigger threshold + target selection inside the feedback loop) has not been validated directly |
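6. Paper claims that can already be written (ordered by confidence)
- Agentic and chatbot workloads are different scheduling regimes (dispatch coupling + sub-second tool-call mass): fully validated
- Each of the three existing scheduling baselines has its own failure mode (load-balance / static PD-disagg / pure sticky): fully validated
- The dominant root cause of static PD-disagg failing on agentic workloads is D-side KV capacity (not phase-isolation cost-benefit): fully validated (f4b + colleague's 4P+4D data)
- Mooncake transfer cost is independent of topology (intra ≈ inter, ~9.7 GB/s NIC ceiling): fully validated (MB2)
- Phase interference remains significant with chunked prefill on (per-stream TPOT during prefill is 15×–2000× baseline): fully validated (MB1). Note: this data does not by itself argue that "PD-disagg fails", because with the correct accounting PD-disagg actually wins on this axis; its role is to quantify the upper bound on the phase-isolation benefit for §3.2.
- Affinity-default scheduling (current unified) reaches the APC upper bound and also beats sticky on per-worker latency: fully validated
- Hot-triggered migration fixes sticky's hot pin: design complete, e2e validation pending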
7. To do
- MB3-5 (end-to-end PD-disagg deployment): D-pool runtime occupancy, cache reuse × PD interaction, PD ratio sweep. These belong to the full §5 experiment matrix.
- EAR Pillar 2 migration e2e validation (re-measure on top of the connector_tax DR-fix)
- §5.4 wall-clock amplification sweep (5 baselines × 3 runs, to nail down the empirical closure of the dispatch-coupling argument)
- Scale-out validation (dash1+dash2 = 16 GPUs; extend to 80 GPUs once dash0 + 3-node are available)