PD_DISAGG_INVESTIGATION: snapshot Phase 0 done + sweep in flight
Phase 0 infrastructure (vendored proxy, dual-file vLLM patcher, per-instance + cross-config plotters) is fully assembled and smoke-validated. Sweep RUN_TAG=20260527_164040 (4 configs × 3 reps on w600) is running on dash1.

Also realigned the figure list with what `aggregate_mb5.py` actually produces (mb5_kv_timeline, mb5_peak_utilization, mb5_latency_compare, mb5_summary.csv).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
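@@ -10,6 +10,27 @@ Living TODO doc. Tracks the experiment status of the four hypotheses H1/H2/H3/H4, and for each

**Core gap**: we only have the headline number, "PD-disagg is about 10× slower on the surface", but **no system-level breakdown that tells a reviewer why it is slower and in which component**. What we need are time series of system metrics (D-pool occupancy, scheduler queue depth, KV transfer queue, GPU SM utilization, etc.) that point directly at the bottleneck.
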
### Progress (2026-05-27 16:50)

**Phase 0 done**: the whole MB5 pipeline is standing up in a fresh venv on dash1:

- `mb5_launch.sh`: a single launcher for 8C / 6P+2D / 4P+4D / 2P+6D; `stop_all` includes a stale-port guard
- `mb5_pd_proxy.py`: vendored copy of vLLM's official `mooncake_connector_proxy.py`, patched for a `min_tokens` compatibility bug on the prefill leg
- `instrument_kv_snapshot.py`: patches the V1 scheduler to expose the per-request KV block allocation at the end of `schedule()`, and fixes the vLLM 0.18.1 AttributeError caused by `MooncakeConnectorWorker.bootstrap_server` being uninitialized in kv_consumer mode
- `plot_kv_pool_timeline.py`: per-instance KV pool timeline (stacked-area)
- `aggregate_mb5.py`: aggregates across configs and reps; outputs 4 comparison figures + 1 CSV

**PD-disagg smoke (4P+4D × 20 reqs)**: 20/20 success, mean latency 3.9 s, p99 17 s; all 8 PIDs wrote snapshots (601 total).
The matching 8C × 20 reqs data point will be reported once the sweep finishes.

**Phase 1 running**:

- RUN_TAG=`20260527_164040`
- CONFIGS=`8C 6P+2D 4P+4D 2P+6D` × REPS=3
- TRACE=`w600_r0.0015_st30.jsonl` (~13 min/rep)
- ETA ~3 h

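As a quick sanity check on the ETA, the sweep matrix above can be enumerated and timed. This is a sketch only: `MINUTES_PER_REP` is the ~13 min figure from the list, while `OVERHEAD_MIN` (launch + teardown per run) is an assumed placeholder, not a measured value, and the helper names are hypothetical.

```python
from itertools import product

CONFIGS = ["8C", "6P+2D", "4P+4D", "2P+6D"]
REPS = 3
MINUTES_PER_REP = 13  # approximate w600 replay time per rep (from the list above)
OVERHEAD_MIN = 2      # ASSUMPTION: launch + teardown per run, not measured


def sweep_plan(configs, reps):
    """Return every (config, rep) run in the order a CONFIG x REP loop would visit them."""
    return [(c, r) for c, r in product(configs, range(reps))]


def eta_hours(n_runs, minutes_per_rep, overhead_min):
    """Total wall-clock estimate in hours for n_runs sequential runs."""
    return n_runs * (minutes_per_rep + overhead_min) / 60.0


plan = sweep_plan(CONFIGS, REPS)
print(len(plan), "runs; replay only:", eta_hours(len(plan), MINUTES_PER_REP, 0), "h")
print("with assumed overhead:", eta_hours(len(plan), MINUTES_PER_REP, OVERHEAD_MIN), "h")
```

Replay alone is 12 runs × 13 min, about 2.6 h; the gap to the ~3 h ETA is presumably launch and teardown time.
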
**Pending**: sweep results → run `aggregate_mb5.py` → write the Phase 2 system analysis.

---

## 4 independent failure hypotheses
@@ -30,52 +51,30 @@ Living TODO doc. Tracks the experiment status of the four hypotheses H1/H2/H3/H4, and for each
**Conclusion**: use the official example shipped in the vLLM repo,
`third_party/vllm/examples/online_serving/disaggregated_serving/mooncake_connector/`:

- `run_mooncake_connector.sh`: parameterizes the P/D GPU lists, ports, and bootstrap; **directly supports any P:D ratio** (e.g., `PREFILL_GPUS=0,1,2,3 DECODE_GPUS=4,5,6,7` brings up 4P+4D)
- `mooncake_connector_proxy.py`: the official FastAPI proxy; round-robin over P and round-robin over D. For each request it fires and forgets to P with `do_remote_decode={transfer_id}`, and in parallel uses P's (bootstrap_addr, engine_id) to trigger D with `do_remote_prefill={remote_bootstrap_addr, remote_engine_id, transfer_id}`
- `run_mooncake_connector.sh`: parameterizes the P/D GPU lists, ports, and bootstrap; **directly supports any P:D ratio**
- `mooncake_connector_proxy.py`: the official FastAPI proxy; round-robin over P and round-robin over D; vendored to `microbench/fresh_setup/mb5_pd_proxy.py` with a `min_tokens=1` fix

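The routing described in the list above can be sketched as follows. This illustrates the idea and is not the code in `mooncake_connector_proxy.py`: the field names come from the bullet text, but the exact payload shape (for example, whether `do_remote_decode` is a boolean next to `transfer_id`), the `Instance` type, the `RoundRobinRouter` class, and the example URLs are all assumptions.

```python
import itertools
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    url: str
    bootstrap_addr: str = ""  # only meaningful for prefill (kv_producer) instances
    engine_id: str = ""


class RoundRobinRouter:
    """Pick one P and one D per request, each from its own round-robin cycle."""

    def __init__(self, prefill, decode):
        self._p = itertools.cycle(prefill)
        self._d = itertools.cycle(decode)

    def plan_request(self):
        p, d = next(self._p), next(self._d)
        transfer_id = uuid.uuid4().hex  # shared by both legs so D can find P's KV
        # ASSUMED payload shapes; only the field names are taken from the text.
        prefill_params = {"do_remote_decode": True, "transfer_id": transfer_id}
        decode_params = {
            "do_remote_prefill": True,
            "remote_bootstrap_addr": p.bootstrap_addr,
            "remote_engine_id": p.engine_id,
            "transfer_id": transfer_id,
        }
        return (p.url, prefill_params), (d.url, decode_params)


# Hypothetical 2P+2D layout, for illustration only.
router = RoundRobinRouter(
    prefill=[Instance("http://p0", "p0:8998", "eng-p0"), Instance("http://p1", "p1:8998", "eng-p1")],
    decode=[Instance("http://d0"), Instance("http://d1")],
)
first = router.plan_request()
second = router.plan_request()
third = router.plan_request()
```

Both legs carry the same `transfer_id`, and the decode leg names the prefill instance that was picked for this particular request, which is what ties the two halves together.
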
Deployment shape: P instances use `kv_role:kv_producer` (with bootstrap_server), D instances use `kv_role:kv_consumer` (**no bootstrap_server**, exactly the mode where we previously hit the `AttributeError`).
Deployment shape: P instances use `kv_role:kv_producer` (with bootstrap_server), D instances use `kv_role:kv_consumer`. The latter's `bootstrap_server` AttributeError is fixed by the `instrument_kv_snapshot.py` patch.

- [ ] Adapt `run_mooncake_connector.sh` to our environment: model path `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`, TP=1, and the required vLLM launch flags (`--max-model-len 200000`, etc.)
- [ ] Re-validate `kv_consumer` mode: the AttributeError we hit earlier may have come from wrong kv_transfer_params (missing transfer_id or wrong field names). After the MB2 debugging we have nailed down the correct handshake; the official proxy uses the same handshake, so this will most likely just work
- [ ] If kv_consumer still raises the AttributeError, add a minimal patch to vLLM (we saw earlier that `self.bootstrap_server` is not initialized in consumer mode); this is a one-line fix
### TODO 0.2: Wrap the official launcher ✅ DONE

### TODO 0.2: Wrap the official launcher
`microbench/fresh_setup/mb5_launch.sh` is the single launcher, supporting `8C / 6P+2D / 4P+4D / 2P+6D`;
the companion `mb5_run.sh` orchestrator iterates over CONFIG × REP, including launch/replay/teardown.

Based directly on `run_mooncake_connector.sh`:
### TODO 0.3: System-level instrumentation ✅ DONE (per-request KV)

- [ ] `microbench/fresh_setup/mb5_pd_launch.sh`: directly sourceable; parameters `PREFILL_GPUS` / `DECODE_GPUS` / `PREFILL_PORTS` / `BOOTSTRAP_PORTS` / `DECODE_PORTS` / `PROXY_PORT`; launches P + D + the official proxy
- [ ] Inject the correct model path + vLLM flags (`--max-model-len 200000 --gpu-memory-utilization 0.9 --enable-prefix-caching --max-num-batched-tokens 8192`)
- [ ] Config coverage: start with 4, namely `8C colo` + `6P+2D` + `4P+4D` + `2P+6D`
- [ ] Note: the colo baseline `8C` does not go through this launcher; it reuses the existing 8-instance unified setup on dash0 (the fairest control)
We chose a layer deeper than `/metrics`: **patch the V1 scheduler to dump the per-request KV block allocation of every `schedule()` round directly** (10 Hz throttle).

### TODO 0.3: System-level instrumentation
- `instrument_kv_snapshot.py` output schema: `{t_unix, step, total_blocks, free_blocks, used_blocks, running:[{req_id, n_blocks, n_computed, n_prompt, n_tokens, status}], waiting:[...]}`
- One jsonl per EngineCore PID, all written to `MB5_LOG_DIR`
- Compared with prometheus `/metrics`: (a) no polling needed, (b) per-request data rather than only aggregates, (c) we can work out who is holding the D-pool at any given moment

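Given that schema, point (c), finding out who holds the pool at a given moment, reduces to a few lines per snapshot. This is a minimal sketch, assuming each jsonl line is one JSON object with exactly the fields listed above; the helper names and the sample record are made up for illustration.

```python
import json


def iter_snapshots(lines):
    """Parse jsonl lines, skipping blanks and any truncated (unparseable) line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written last line


def utilization(snap):
    """Fraction of KV blocks in use for one snapshot record."""
    total = snap["total_blocks"]
    return snap["used_blocks"] / total if total else 0.0


def top_holders(snap, k=3):
    """Running requests holding the most KV blocks, largest first."""
    ranked = sorted(snap["running"], key=lambda r: r["n_blocks"], reverse=True)
    return [(r["req_id"], r["n_blocks"]) for r in ranked[:k]]


# Synthetic record following the schema above (values are invented).
sample_line = json.dumps({
    "t_unix": 1780000000.0, "step": 42,
    "total_blocks": 100, "free_blocks": 40, "used_blocks": 60,
    "running": [
        {"req_id": "b", "n_blocks": 15, "n_computed": 200, "n_prompt": 200, "n_tokens": 230, "status": "RUNNING"},
        {"req_id": "a", "n_blocks": 45, "n_computed": 700, "n_prompt": 700, "n_tokens": 710, "status": "RUNNING"},
    ],
    "waiting": [],
})
snaps = list(iter_snapshots([sample_line, "", '{"t_unix": 17']))
print(utilization(snaps[0]), top_holders(snaps[0]))
```

Applying `utilization` to every record of one PID's file gives the per-instance occupancy series; `top_holders` identifies the requests pinning the pool at a chosen timestamp.
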
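Do not look only at latency / success rate; we need time series of system behavior to attribute the bottleneck.
If we later need prometheus `/metrics` (admission denial events and the like), we can add a sampler; the current per-request data is already enough to support the Phase 1 + Phase 2 analysis.

**Metrics that must be collected** (once per second, per instance):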
### TODO 0.4: D-pool occupancy timeline visualization ✅ DONE

- [ ] `vllm:gpu_cache_usage_perc`: KV pool utilization (core H1 evidence)
- [ ] `vllm:num_requests_running`: number of concurrent in-flight requests
- [ ] `vllm:num_requests_waiting`: scheduler queue length
- [ ] `vllm:time_to_first_token_seconds` (histogram): per-instance TTFT distribution
- [ ] `vllm:time_per_output_token_seconds` (histogram): TPOT distribution
- [ ] D-pool admission-control rejection events (events where vllm refuses new requests); check whether vllm exposes a metric for this
- [ ] Mooncake side: `send_blocks` events (the MB2 instrumentation already exists); optionally transfer queue depth

**Implementation**:

- [ ] Write `metrics_sampler.py`: periodically GET `/metrics`, parse the prometheus text, output jsonl
- [ ] One sampler process per instance, or one central sampler process that pulls from all instances
- [ ] Output schema: `{t_unix, instance_id, role, kv_pool_perc, num_running, num_waiting, ...}`

### TODO 0.4: D-pool occupancy timeline visualization

- [ ] Write `plot_d_pool_timeline.py`: heatmap or stacked area
  - x axis: trace replay time
  - y axis: each D-instance (heatmap) or total KV pool utilization (stacked area)
  - color: utilization 0–100%
  - mark a red line at 90% (the "vllm stops admitting new requests" threshold, per a colleague's older data)
- [ ] Output: one figure per config, stacked for comparison
- `plot_kv_pool_timeline.py`: per-instance view (stacked-area: time × blocks × per-request color bands; waiting queue depth subplot underneath)
- `aggregate_mb5.py`: cross-config / cross-rep aggregate view (cluster-wide KV timeline, peak-utilization bars, latency p50/p90/p99 bars, summary CSV)

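For the latency bars, one plausible way to compute p50/p90/p99 for a run is the nearest-rank method below. This is a sketch: whether `aggregate_mb5.py` uses nearest-rank or interpolated percentiles is not stated here, the function names are hypothetical, and the latency values are invented.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_s):
    """p50/p90/p99 of end-to-end latencies (seconds) for one run."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 90, 99)}


# Invented end-to-end latencies in seconds, for illustration only.
example = [1.2, 0.8, 3.9, 2.1, 17.0, 1.5, 2.7, 0.9, 1.1, 4.4]
summary = latency_summary(example)
print(summary)
```

With only a handful of requests per run, p99 collapses to the maximum under nearest-rank, so tail numbers from small smoke runs are best read as "worst observed" rather than as a stable percentile.
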
---

@@ -106,14 +105,13 @@ reps = 3
- [ ] **TTFT per request scatter** (colored by which-instance): check whether TTFT falls over as soon as the D-pool fills up
- [ ] **Admission denial events** (if vllm exposes them)

### Output figures
### Output figures (`aggregate_mb5.py` will write all of these)

- [ ] `mb5_latency_bars.png`: config × TTFT/TPOT/E2E p90 bar
- [ ] `mb5_success.png`: success rate per config
- [ ] `mb5_wallclock.png`: measured trace replay time vs 8C colo
- [ ] `mb5_d_pool_timeline.png`: 4 configs × 8 instances heatmap
- [ ] `mb5_queue_depth_timeline.png`: same structure as above
- [ ] `mb5_diagnostic_summary.png`: one overview figure that packs the 5 above together, for the paper
- [ ] `figs/mb5/mb5_kv_timeline.png`: 4 panels (one per config), cluster-wide KV % timeline, faint per-rep line + bold median
- [ ] `figs/mb5/mb5_peak_utilization.png`: bar chart of peak vs steady KV per config, with ±std error bars
- [ ] `figs/mb5/mb5_latency_compare.png`: bar chart of p50/p90/p99 e2e latency per config
- [ ] `figs/mb5/mb5_summary.csv`: flat per-(config, rep) table (latency, KV, prefix cache, success rate)
- [ ] (manual) `figs/mb5/mb5_per_instance_timeline.png`: pick 1 rep per config, plot per-instance stacked-area via `plot_kv_pool_timeline.py`, for the paper §3.x system breakdown

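The peak-versus-steady bars with ±std error bars can be derived from per-rep utilization series roughly as follows. This is a sketch under assumptions: "steady" is taken here to be the median utilization over a run, which may not match the definition used in `aggregate_mb5.py`; the function names are hypothetical and the series are synthetic.

```python
import statistics


def peak_and_steady(series):
    """Peak = maximum utilization; steady = median utilization (ASSUMED definition)."""
    return max(series), statistics.median(series)


def across_reps(per_rep_series):
    """Mean and sample std of peak/steady utilization across the reps of one config."""
    peaks, steadies = zip(*(peak_and_steady(s) for s in per_rep_series))

    def std(xs):
        return statistics.stdev(xs) if len(xs) > 1 else 0.0

    return {
        "peak_mean": statistics.mean(peaks),
        "peak_std": std(peaks),
        "steady_mean": statistics.mean(steadies),
        "steady_std": std(steadies),
    }


# Three synthetic reps of cluster-wide KV utilization (fractions of the pool).
reps = [
    [0.10, 0.40, 0.90, 0.50, 0.40],
    [0.20, 0.50, 0.80, 0.50, 0.30],
    [0.10, 0.30, 1.00, 0.60, 0.40],
]
stats = across_reps(reps)
print(stats)
```

With only 3 reps the sample standard deviation is a coarse spread indicator, so the error bars are better read as run-to-run variation than as a confidence interval.
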
### Decision criteria (when H1 counts as confirmed)
