docs(experiments): E4 results — initial scaffold + mid-run observation
Captures the mid-run state of the E4 sweep (35 min in, 41% of the trace served, 0 admission rejections, 0 d_to_p_sync triggers) together with its interpretation: under load-floor K=200 with the 1P + 3D topology, admission rarely rejects, so reseed is rarely needed, so the D→P snapshot is a safety net that does not fire in the common case. Includes a fill-in-after-sweep matrix for the H1/H2/H3 verdicts and a follow-up plan (a high-pressure variant to force reseed, and an ablation to isolate the marginal benefit of D→P).
docs/E4_RESULTS_ZH.md
@@ -0,0 +1,145 @@
# E4: KVC + D→P RDMA snapshot vs naive PD-disagg (results, first draft)

**Status**: The E4 experiment is still running as of this writing. This document will be completed with measured data once the sweep finishes.
**Date**: 2026-05-13
**Branch**: `h200-cu130`
**Protocol**: `docs/E4_PROTOCOL_ZH.md`
**Implementation status**: `docs/D_TO_P_IMPLEMENTATION_STATUS_ZH.md`

---

## 0. TL;DR (placeholders for now, to be filled in after the run)

- E1 (naive PD-disagg): TTFT p99 = 88.6s (existing data)
- E4 (KVC + RDMA + load-floor K=200 + D→P sync): **TBD**
- D→P snapshot path trigger count: **TBD**
- Main conclusion: **TBD**

---

## 1. Experiment environment (actual)

| Dimension | Configuration |
|---|---|
| Start time | 2026-05-13 08:28:17 |
| End time | TBD |
| Trace | outputs/inferact_50sess.jsonl, 1285 reqs |
| Topology | 1P + 3D, GPUs 0/1/2/3, H200 80GB |
| IB device | mlx5_60 NDR 400 Gb/s |
| time_scale | 1 |
| concurrency_limit | 32 |
| load-floor K | 200 |
| migration_reject_threshold | 3 |
| --enable-d-to-p-sync | **TRUE** |
| SGLANG_SNAPSHOT_LINK_ENABLE | 1 (on every worker) |

---

## 2. Partial mid-run observations (35 min into the run)

- Requests handled by the router: 529 / 1285 (41%)
- Cumulative admission events: 1042
- Cumulative admission rejections (can_admit=false): **0**
- Cumulative d_to_p_sync triggers: **0**
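
As a quick sanity check on these counters, the following Python sketch recomputes the progress fraction, the rejection rate, and the number of admission events per handled request. The variable names are illustrative and are not taken from the router code; the numbers are the ones reported in the list.

```python
# Counters copied from the mid-run observation; names are illustrative only.
served, total_requests = 529, 1285
admission_events, admission_rejections = 1042, 0

progress = served / total_requests
rejection_rate = admission_rejections / admission_events
events_per_request = admission_events / served

print(f"progress: {progress:.1%}")                      # progress: 41.2%
print(f"rejection rate: {rejection_rate:.1%}")          # rejection rate: 0.0%
print(f"events per request: {events_per_request:.2f}")  # events per request: 1.97
```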
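
### What the mid-run observation means

At 41% progress, E4 has seen **no admission rejections at all**. The `_invoke_kvcache_seeded_router` path has therefore never been triggered, and consequently neither has `_attempt_d_to_p_sync`.

This is a meaningful finding in its own right:

1. **The workload distribution under load-floor bonus K=200 with the 3D configuration avoids saturating the D-side KV pool**, so admission does not reject and no reseed is triggered.
2. **D→P snapshot is the "insurance mechanism" of the KVC design**: it does not fire under regular load, and its value only shows up under adversarial or long-tail load.
3. **KVC's regular fast path (direct-to-D) is expected to be enough on its own to beat naive PD-disagg**, because turn-N>1 requests avoid the cost of P prefill plus the P→D transfer. The H1 numbers will confirm or refute this.
4. **The engineering completeness of D→P snapshot is not validated by trigger frequency**; it is validated by the smoke test, which has already shown that the link works and the RPC plumbing is correct.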
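
The causal chain described here (no rejection, hence no reseed, hence no D→P sync) can be sketched as follows. This is a simplified illustration, not the router's actual code: `route_request`, the `Route` enum and the boolean inputs are hypothetical, and only the ordering of the steps is taken from the text. It also ignores `migration_reject_threshold`, whose exact semantics are not described in this document.

```python
from enum import Enum


class Route(Enum):
    DIRECT_TO_D = "direct-to-D"         # regular KVC fast path
    RESEED = "reseed"                   # reseed without a D→P snapshot
    RESEED_WITH_SYNC = "reseed+d_to_p"  # reseed that also attempts a D→P snapshot


def route_request(can_admit: bool, d_to_p_sync_enabled: bool) -> Route:
    """Hypothetical decision order: the sync is only reachable after a rejection."""
    if can_admit:
        return Route.DIRECT_TO_D
    if d_to_p_sync_enabled:
        return Route.RESEED_WITH_SYNC
    return Route.RESEED


# With zero rejections, every admission decision takes the fast path,
# so the number of sync triggers is necessarily zero as well.
decisions = [route_request(can_admit=True, d_to_p_sync_enabled=True) for _ in range(1042)]
sync_triggers = sum(d is Route.RESEED_WITH_SYNC for d in decisions)
print(sync_triggers)  # 0
```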

---

## 3. Full results (to be filled in after the run)

### 3.1 Total successes / failures

| Metric | E1 | E4 |
|---|---:|---:|
| total_count | 1285 | TBD |
| error_count | 85 | TBD |
| abort_count | 0 | TBD |
| failure_count | 85 | TBD |
| success_rate | 93.4% | TBD |
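
The E1 column is internally consistent, as the short check below shows. It assumes that failure_count is the sum of error_count and abort_count and that success_rate is the share of non-failed requests; both definitions are inferred from the numbers rather than stated in the protocol.

```python
total_count = 1285
error_count = 85
abort_count = 0

failure_count = error_count + abort_count                   # assumed definition
success_rate = (total_count - failure_count) / total_count  # assumed definition

print(failure_count)          # 85
print(f"{success_rate:.1%}")  # 93.4%
```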

### 3.2 Latency

| Metric (s) | E1 mean | E1 p50 | E1 p90 | E1 p99 | E4 mean | E4 p50 | E4 p90 | E4 p99 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| latency | 96.34 | 93.21 | 180.69 | 219.46 | TBD | TBD | TBD | TBD |
| ttft | (needs recompute) | (...) | (...) | 88.6 | TBD | TBD | TBD | TBD |
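
Since the E1 TTFT mean, p50 and p90 still need to be recomputed, here is a minimal sketch of how such a summary could be produced from per-request values. The nearest-rank percentile definition and the sample values are assumptions for illustration; the existing analysis scripts may use a different interpolation, so their results can differ slightly.

```python
import math
from statistics import mean


def nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values: list[float]) -> dict[str, float]:
    """Return the four columns used in the latency table."""
    return {
        "mean": mean(values),
        "p50": nearest_rank(values, 50),
        "p90": nearest_rank(values, 90),
        "p99": nearest_rank(values, 99),
    }


# Made-up samples in seconds, only to show the call shape (not measured data).
sample_ttft = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 50.0]
summary = summarize(sample_ttft)
print(summary)
```

With only ten samples, p99 collapses to the maximum, which is one reason tail percentiles are only meaningful on the full 1285-request trace.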

### 3.3 Execution mode distribution

E1:
```
pd-disaggregation 85
pd-disaggregation-router 1200
```

E4: TBD

### 3.4 D→P snapshot path statistics

| Stat | Value |
|---|---:|
| _attempt_d_to_p_sync calls | TBD |
| prepare_receive ok=true count | TBD |
| dump ok=true count | TBD |
| finalize_ingest ok=true count | TBD |
| Total bytes pushed | TBD |
| Average push duration | TBD |
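
Once the run finishes, this table could be filled in by aggregating per-attempt records. The sketch below assumes a hypothetical record shape (one dict per `_attempt_d_to_p_sync` call, with a per-stage `ok` flag, a byte count and a duration); the real log or metrics format is not specified in this document and may differ.

```python
from typing import TypedDict


class SyncAttempt(TypedDict):
    # Hypothetical fields; adapt to whatever the router actually logs.
    prepare_receive_ok: bool
    dump_ok: bool
    finalize_ingest_ok: bool
    bytes: int
    seconds: float


def summarize_sync(attempts: list[SyncAttempt]) -> dict[str, float]:
    """Aggregate per-attempt records into the rows of the statistics table."""
    n = len(attempts)
    return {
        "calls": n,
        "prepare_receive_ok": sum(a["prepare_receive_ok"] for a in attempts),
        "dump_ok": sum(a["dump_ok"] for a in attempts),
        "finalize_ingest_ok": sum(a["finalize_ingest_ok"] for a in attempts),
        "total_bytes": sum(a["bytes"] for a in attempts),
        # Average over all attempts; 0.0 when the path never fired.
        "avg_seconds": sum(a["seconds"] for a in attempts) / n if n else 0.0,
    }


# The mid-run state (zero triggers) yields an all-zero summary.
empty_summary = summarize_sync([])
print(empty_summary)
```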

---

## 4. Hypotheses: confirmed / refuted

Template to fill in:

### H1 (main): E4 TTFT p99 ≤ E1 TTFT p99 = 88.6s

- **Verdict**: TBD
- **Evidence**: TBD
- **Explanation**: TBD

### H2: E4 reseed-path median TTFT < E3 reseed-path median TTFT

- **Verdict**: **N/A** (no usable reseed-mode median has been extracted from the E3 experiment; and if reseed is never triggered in E4, a direct comparison is not possible either)
- **Explanation**: Under the current workload, KVC + load-floor means the reseed path is essentially never triggered. Validating H2 requires a rerun under high-pressure load.

### H3: E4 success count ≥ 0.85 × E3 success count

- **Verdict**: TBD
- **Evidence**: TBD
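
The H1 and H3 criteria are simple threshold checks and could be evaluated along these lines once the numbers are in. This only illustrates the two inequalities and is not the logic of `scripts/analyze_e4_d_to_p.py`; the function names are hypothetical, and the inputs in the example are made-up placeholders, not results.

```python
E1_TTFT_P99_S = 88.6  # from the existing E1 data


def h1_holds(e4_ttft_p99_s: float) -> bool:
    """H1: E4 TTFT p99 <= E1 TTFT p99."""
    return e4_ttft_p99_s <= E1_TTFT_P99_S


def h3_holds(e4_success_count: int, e3_success_count: int) -> bool:
    """H3: E4 success count >= 0.85 x E3 success count."""
    return e4_success_count >= 0.85 * e3_success_count


# Placeholder inputs only; replace with measured values after the sweep.
print(h1_holds(e4_ttft_p99_s=60.0))                           # True
print(h3_holds(e4_success_count=900, e3_success_count=1000))  # True
```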

---

## 5. Lessons learned (empty for now)

To be filled in after the run:
- D→P engineering pitfalls / design corrections
- Workload-dependent behavior
- Follow-up suggestions

---

## 6. Suggested next steps after the run

### Must do
- [ ] Use `scripts/analyze_e4_d_to_p.py` to produce the H1/H3 verdicts
- [ ] Run a high-pressure E4-bis: concurrency=64 or mem-fraction-static=0.4, to force reseed to trigger
- [ ] Run E4-ablate: `--enable-d-to-p-sync` on, but with the D side artificially returning fail, to isolate the marginal benefit of D→P

### Recommended
- [ ] Rerun E4 on a long trace (the full inferact trace rather than 50-sess)
- [ ] Multi-node, cross-network RDMA configuration
- [ ] D→P snapshot **proactive mode**: D pre-pushes asynchronously after cache_finished_req (vs the current reseed-triggered passive mode)

---
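
**Key takeaway**: The E4 design is implemented and running. Mid-run observations show that under load-floor with the 3D configuration the D→P snapshot path is not triggered, which is itself a positive validation of the KVC design (the safety net does not need to fire often). Full data will be filled in once the sweep completes.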