Add RESULTS_SUMMARY.md — concise Chinese summary of current findings
One-page distillation of what the paper can claim today, with figure /
data path next to each row. Sections:
1. Workload properties: intra-session reuse, skew, KV footprint
2. Dispatch Coupling: agentic vs chatbot inter-turn gap regime
3. Three failure modes of existing scheduling: load-balance / static PD-disagg / pure sticky
4. PD-disagg cost vs benefit: MB2 (transfer 9.7 GB/s ceiling,
topology-independent) + MB1 (decode halted during prefill 15-200x),
joined into the §3.2 cost > benefit headline for any KV ≥ 80 MiB
5. EAR empirical status: Pillar 1 (affinity) validated, Pillar 2 (migration)
substrate validated + strategy-layer pending
6. Paper claims that can already be written (ordered by confidence)
7. To do (MB3-5, migration e2e, wall-clock sweep, scale-out)
Designed to be the one doc to read when re-entering the project after
a break.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
109
RESULTS_SUMMARY.md
Normal file
@@ -0,0 +1,109 @@
# Conclusions established so far (2026-05-27)

A summary of the claims the EAR project can currently support with measured data. Each entry is tagged with its figure or data path.

---

## 1. Workload properties (§2)

Production trace = Qwen3-Coder agentic, 1.3 M sessions / 2.1 M reqs / 7200 s.

| Property | Data | Evidence |
|---|---|---|
| **KV reuse is almost entirely intra-session** | intra 93.2% / cross 5.7% / shared 1.1%; theoretical APC upper bound 79.6% | `figs/f2a_reuse_topology.png` |
| **Sessions are extremely skewed** | top 1%/5%/10%/25%/50% of sessions = 46.5%/66.5%/74.6%/87.5%/**96.0%** of input mass | `figs/f2b_session_skew.png` |
| **Per-request KV is already large** | p50 1.8 GiB / p90 8.0 GiB / p95 9.6 GiB / **p99 11.5 GiB**; KV pool 38 GiB/instance (0.4 × H20 96 GiB) → only **3** p99 requests fit per instance | `figs/f2c_kv_footprint_cdf.png` |

**Conclusion**: the cache is session-local, so scheduling must preserve session affinity. A single request's KV is already close to the pool limit, so **PD-disagg 4P+4D directly halves the system's decode capacity**.
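
The capacity arithmetic behind this conclusion can be sketched in a few lines of Python. The memory figures are the ones quoted in the table; reading "4P+4D" as 8 instances split into 4 prefill and 4 decode is an assumption, as is counting capacity in whole p99-sized requests.

```python
import math

# Figures quoted in the table above.
GPU_MEMORY_GIB = 96.0      # H20 memory
KV_POOL_FRACTION = 0.4     # share of GPU memory used as the KV pool
P99_REQUEST_KV_GIB = 11.5  # p99 per-request KV footprint

# Assumption: "4P+4D" means 8 instances, 4 prefill + 4 decode.
TOTAL_INSTANCES = 8
DECODE_INSTANCES_PD = 4


def requests_per_instance(pool_gib: float, request_gib: float) -> int:
    """Whole requests of a given KV size that fit in one instance's pool."""
    return math.floor(pool_gib / request_gib)


kv_pool_gib = GPU_MEMORY_GIB * KV_POOL_FRACTION  # about 38.4, quoted as 38 GiB
fit = requests_per_instance(kv_pool_gib, P99_REQUEST_KV_GIB)

# Assumption: colocated, every instance can hold decode KV;
# under 4P+4D only the 4 decode instances can.
colocated_slots = TOTAL_INSTANCES * fit
pd_slots = DECODE_INSTANCES_PD * fit

print(f"KV pool per instance: {kv_pool_gib:.1f} GiB")
print(f"p99 requests per instance: {fit}")
print(f"p99 decode slots, colocated vs 4P+4D: {colocated_slots} vs {pd_slots}")
```

This only counts worst-case (p99) requests; it is a back-of-envelope check, not a replacement for the measured KV wall in `figs/f4b_pdsep_kv_wall.png`.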

## 2. Dispatch Coupling (§2.3)

| Metric | Agentic (Qwen3-Coder) | Chatbot (qwen3-max) |
|---|---:|---:|
| Inter-turn `T_external` p50 | **1.6 s** | 7.2 s |
| Fraction with `gap < 1 s` | **39%** | 4% |
| Fraction with `gap < 5 s` | 67% | 29% |
| Inter-turn `T_external` p99 | 738 s | 43 s |

Reference figure: `figs/f3a_inter_turn_gap.png`.

**Conclusion**: agentic has a **sub-second tool-call mode** that chatbot lacks (39% vs 4%). When `W_turn ≳ T_external` (any scheduler with W_turn > 1 s meets this condition on agentic), Little's Law `L = Λ · N · (W_turn(L) + T_external)` enters a closed-loop regime, and a scheduler's ε regression is amplified through the KV-contention feedback loop into a multi-x wall-clock gap. **Measured**: lmetric took 49 min of wall-clock to run a 600 s trace = **8x amplification**.
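
The closed-loop reading of Little's Law can be sketched as a fixed-point problem. Everything below is illustrative: the linear service model `W_turn(L) = w0 + k * L` is a hypothetical stand-in (the summary does not specify the shape of `W_turn(L)`), `rate` lumps `Λ · N` into one number, and none of the values come from the trace.

```python
def solve_in_flight(rate, t_external, w0, k, iters=10_000, tol=1e-12):
    """Fixed-point iteration for L = rate * (W_turn(L) + T_external).

    rate       -- Lambda * N lumped together (illustrative value)
    t_external -- inter-turn gap in seconds
    w0, k      -- hypothetical linear model W_turn(L) = w0 + k * L
    """
    if rate * k >= 1.0:
        raise ValueError("no stable fixed point: rate * k must be < 1")
    load = 0.0
    for _ in range(iters):
        new_load = rate * (w0 + k * load + t_external)
        if abs(new_load - load) < tol:
            return new_load
        load = new_load
    return load


def turn_latency(load, w0, k):
    """Per-turn latency under the same hypothetical linear model."""
    return w0 + k * load


RATE, T_EXTERNAL, K = 2.0, 1.6, 0.45   # illustrative, not measured
W0_GOOD, W0_WORSE = 1.0, 1.1           # scheduler B is 0.1 s worse when idle

load_good = solve_in_flight(RATE, T_EXTERNAL, W0_GOOD, K)
load_worse = solve_in_flight(RATE, T_EXTERNAL, W0_WORSE, K)

delta_open_loop = W0_WORSE - W0_GOOD
delta_closed_loop = (turn_latency(load_worse, W0_WORSE, K)
                     - turn_latency(load_good, W0_GOOD, K))
print(f"open-loop regression:   {delta_open_loop:.2f} s")
print(f"closed-loop regression: {delta_closed_loop:.2f} s")
print(f"amplification:          {delta_closed_loop / delta_open_loop:.1f}x")
```

Under this linear model the fixed point has the closed form `L = rate * (w0 + T_external) / (1 - rate * k)`, so a small regression in `w0` is multiplied by `1 / (1 - rate * k)`. The closer `rate * k` gets to 1, the larger the amplification, which is the qualitative effect described above.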

## 3. Three failure modes of existing scheduling (§3)

| Baseline | Failure mode | Data |
|---|---|---|
| **load-balance / LMetric** | Loses locality | lmetric APC **56.9%** (vs the 79.6% upper bound); LMetric beats load_only by only +3.3 pp, because the cache signal is swallowed by num_req in the multiplicative score `(pending+input−hit) × num_req` |
| **Static PD-disagg** | D-side KV capacity wall + transfer cost | See §4 cost-vs-benefit |
| **Pure sticky** | Every worker is dragged down by hot sessions, not just a single hotspot | sticky median worker 20.3 s vs unified 10.3 s; system e2e p90 sticky 34.6 s vs unified 18.0 s (**the max/median ratio is a misleading metric**; §3.3 uses absolute per-worker latency) |

Reference figures: `figs/f4a_apc_loss.png`, `figs/f4b_pdsep_kv_wall.png`, `figs/f4c_per_worker_ttft.png`, `figs/f6_e2e_latency_bars.png`, `figs/f6_e2e_latency_full_grid.png`.
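
The way `num_req` can drown out the cache signal in the multiplicative score is easy to see with a toy example. This is a minimal sketch, not LMetric's implementation: the `Worker` class, the token units, the numbers, and the "lowest score wins" rule are all assumptions made for illustration; only the score formula comes from the table above.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    pending: int   # queued work, assumed to be in tokens
    hit: int       # prompt tokens already cached on this worker
    num_req: int   # requests currently assigned to this worker


def multiplicative_score(worker: Worker, input_tokens: int) -> int:
    """Score quoted in the table: (pending + input - hit) * num_req."""
    return (worker.pending + input_tokens - worker.hit) * worker.num_req


def pick(workers, input_tokens):
    # Assumption: the worker with the lowest score is selected.
    return min(workers, key=lambda w: multiplicative_score(w, input_tokens))


INPUT_TOKENS = 8_000
# "warm" already holds 7k of the 8k prompt tokens but has 5 requests.
warm = Worker("warm", pending=20_000, hit=7_000, num_req=5)
# "cold" has no cached prefix but only 3 requests.
cold = Worker("cold", pending=20_000, hit=0, num_req=3)

for w in (warm, cold):
    print(w.name, multiplicative_score(w, INPUT_TOKENS))

chosen = pick([warm, cold], INPUT_TOKENS)
print("chosen:", chosen.name)
```

In this toy setup the cache hit lowers the first factor by 25%, while two extra requests raise the second factor by about 67%, so the cold worker is picked and the cached prefix is lost.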

## 4. PD-disagg loses under agentic: cost vs benefit (§3.2)

Pinned down by two independent microbenchmarks (**all on vanilla vLLM 0.18.1 + Mooncake 0.3.11, fresh venv, no patches**).

### 4.1 MB2: KV transfer cost

A sweep of 9 sizes × 5 reps over dash1 GPU 0+1 (intra-node) and dash1 ↔ dash2 (inter-node, 200 Gbps RoCE).

| Path | Steady-state bandwidth (≤ 3 GiB) | Transfer time for a p99 agentic request (11.5 GiB) |
|---|---|---|
| Intra-node | **9.7 GB/s** | p50 **1.9 s** · min 1.5 s · max 10 s |
| Inter-node | **10.0 GB/s** (difference <3%) | p50 **1.7 s** · min 1.3 s · max 9.2 s |

**New finding**: the intra-node and inter-node curves nearly coincide → **Mooncake `batch_transfer_sync_write` always goes through the RDMA NIC, including intra-node loopback**, and never uses NVLink. The 200 Gbps NIC is the ceiling, so **PD-disagg transfer cost is independent of topology**.

Reference figures: `figs/mb2_transfer_time_compare.png`, `figs/mb2_transfer_bw_compare.png`; doc `analysis/mb2/README.md`.
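
For a quick sanity check, an idealized transfer time can be derived from the steady-state bandwidth alone. This sketch assumes GB/s means 10^9 bytes per second and ignores setup latency and any bandwidth change at larger sizes; the measured values in the table take precedence over these estimates.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
GB = 10 ** 9  # assumption: bandwidth is reported in decimal gigabytes


def ideal_transfer_seconds(size_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Size divided by bandwidth: a lower-bound style estimate, no fixed overhead."""
    return size_bytes / (bandwidth_gb_per_s * GB)


# Steady-state bandwidths quoted in the table above.
PATHS = {"intra-node": 9.7, "inter-node": 10.0}
SIZES = {"80 MiB": 80 * MIB, "3 GiB": 3 * GIB, "11.5 GiB": 11.5 * GIB}

estimates = {
    (path, label): ideal_transfer_seconds(size, bw)
    for path, bw in PATHS.items()
    for label, size in SIZES.items()
}

for (path, label), seconds in estimates.items():
    print(f"{path:10s} {label:>8s}: {seconds * 1000:8.1f} ms")
```

For the 11.5 GiB case the idealized figure comes out below the measured p50 in the table, so it should be read as a floor rather than a prediction.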

### 4.2 MB1: Phase interference (chunked-prefill on, default baseline)

Single instance on dash1 GPU 0, sweeping D (concurrent decodes) × P (prefill size).

Results for D=8 (the most agentic-realistic setting):

| Prefill | prefill_ttft | Per-stream TPOT during prefill | Penalty |
|---|---:|---:|---:|
| 2k tok | 143 ms | 32 ms | 4× |
| 8k | 583 ms | 114 ms | 15× |
| 32k | 4.5 s | 388 ms | **52×** |
| 65k | 15.6 s | 757 ms | **99×** |
| 131k | 57 s | 1419 ms | **183×** |

Baseline TPOT is 7.7 ms. **Decode is essentially halted during a large prefill.** Chunked-prefill is already on by default, so the extra phase isolation PD-disagg can provide on top of it = **the portion of time during which decode is halted by prefill**.

Reference figure: `figs/mb1_interference.png`; doc `analysis/mb1/README.md`.
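
The penalty column can be approximately recomputed from the other numbers. This sketch assumes penalty = (per-stream TPOT during prefill) / (baseline TPOT), which the summary does not state explicitly; because the inputs are rounded, the recomputed ratios differ from the reported ones by a few percent.

```python
BASELINE_TPOT_MS = 7.7  # baseline TPOT quoted above

# (prefill size, per-stream TPOT during prefill in ms, reported penalty)
ROWS = [
    ("2k", 32, 4),
    ("8k", 114, 15),
    ("32k", 388, 52),
    ("65k", 757, 99),
    ("131k", 1419, 183),
]


def penalty(tpot_ms: float, baseline_ms: float = BASELINE_TPOT_MS) -> float:
    """Assumed definition: slowdown of decode TPOT relative to the baseline."""
    return tpot_ms / baseline_ms


recomputed = {size: penalty(tpot) for size, tpot, _ in ROWS}

for size, tpot, reported in ROWS:
    ratio = recomputed[size]
    drift = abs(ratio - reported) / reported
    print(f"{size:>5s}: recomputed {ratio:6.1f}x, reported {reported}x, drift {drift:.1%}")
```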

### 4.3 Combined conclusion

| | Per-request |
|---|---|
| **Max PD-disagg benefit** (decode time recovered) | ≤ **decode duration = 50–200 ms** (agentic tool-call output) |
| **PD-disagg cost** (MB2 transfer p50) | 80 MiB ≈ 8 ms · 3 GiB ≈ 320 ms · 11.5 GiB ≈ **1.9 s** (worst measured case for p99: 10 s) |
| Cost / Benefit | **Every request with KV ≥ 80 MiB loses**; the trace's mean KV is 192 MiB → already losing |

**Conclusion**: on agentic workloads, **PD-disaggregation fails structurally**. Chunked-prefill, on by default, already provides first-order phase isolation within colocation. What PD-disagg can add on top of that (the short decode window no longer being squeezed by prefill) is smaller than the new cost it introduces (every routed request pays a KV transfer). This conclusion is independent of topology (intra-node and inter-node behave the same).

Reference figure: `figs/pd_cost_vs_benefit.png` (§3.2 headline).

## 5. Empirical status of the EAR design (§4)

| Pillar | Validated | Pending |
|---|---|---|
| **Affinity-default routing** (Pillar 1) | ✅ The current `unified` algorithm = LMetric + high-cache affinity; APC **79.4%** (97% of the 79.6% upper bound), TTFT p90 **7.3 s**, median worker p90 **10.3 s** | None |
| **Hot-triggered session migration** (Pillar 2) | Substrate works: the `kv_both` connector is net positive on trace replay (TTFT p90 −18.6%, −36.6% after the DR-fix); the "+45% kv_both penalty" from the original elastic_migration_v2 paper is obsolete | The e2e strategy layer (trigger threshold + target selection inside the feedback loop) has not been validated directly |

## 6. Paper claims that can already be written (ordered by confidence)

1. **Agentic and chatbot are different regimes for scheduling** (dispatch coupling + sub-second tool-call mass): fully evidenced
2. **PD-disaggregation loses under agentic** (cost > benefit, across topologies): **fully evidenced by MB1 + MB2**
3. **Each of the three existing scheduling baselines has its own failure mode**: fully evidenced
4. **Affinity-default scheduling (current unified) reaches the APC upper bound** and also clearly beats sticky on per-worker latency: fully evidenced
5. **Hot-triggered migration fixes sticky's hot pin**: **design complete, e2e validation pending**

## 7. To do

- **MB3-5** (end-to-end PD-disagg deployment): D-pool runtime occupancy, cache reuse × PD interaction, PD ratio sweep. These belong to the full §5 experiment matrix
- **EAR Pillar 2 migration e2e validation** (re-measure on top of the connector_tax DR-fix)
- **§5.4 wall-clock amplification sweep** (5 baselines × 3 runs, to close the dispatch-coupling argument empirically)
- **Scale-out validation** (dash1+dash2 = 16 GPUs; extend to 80 GPUs once dash0 + 3-node become available)