PD_DISAGG_INVESTIGATION: snapshot Phase 0 done + sweep in flight

Phase 0 infrastructure (vendored proxy, dual-file vLLM patcher, per-instance + cross-config plotters) is fully assembled and smoke-validated. Sweep RUN_TAG=20260527_164040 (4 configs × 3 reps on w600) is running on dash1. Also realigned the figure list with what `aggregate_mb5.py` actually produces (mb5_kv_timeline, mb5_peak_utilization, mb5_latency_compare, mb5_summary.csv).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:51:28 +08:00
parent a66f24d242
commit b13ca10d19
1 changed files with 42 additions and 44 deletions
--- a/microbench/fresh_setup/PD_DISAGG_INVESTIGATION.md
+++ b/microbench/fresh_setup/PD_DISAGG_INVESTIGATION.md
@@ -10,6 +10,27 @@ Living TODO document. Tracks the experiment status of the four hypotheses H1/H2/H3/H4, and each

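 **Core gap**: we only have the headline number that PD-disagg performs 10× worse on the surface, but **no system-level breakdown that tells a reviewer why it is worse and which component is responsible**. What we need are time series of system metrics such as D-pool occupancy, scheduler queue depth, KV transfer queue, and GPU SM utilization that point directly at the bottleneck.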

+### Progress (2026-05-27 16:50)
+
+**Phase 0 done**: the full MB5 pipeline is standing up on the dash1 fresh-venv:
+
+- `mb5_launch.sh`: single launcher for 8C / 6P+2D / 4P+4D / 2P+6D; `stop_all` includes a stale-port guard
+- `mb5_pd_proxy.py`: vendored copy of vLLM's official `mooncake_connector_proxy.py`, patched to fix a `min_tokens` compatibility bug on the prefill leg
+- `instrument_kv_snapshot.py`: patches the V1 scheduler to expose the per-request KV block allocation at the end of each `schedule()` call, and fixes the vLLM 0.18.1 AttributeError caused by `MooncakeConnectorWorker.bootstrap_server` being uninitialized in kv_consumer mode
+- `plot_kv_pool_timeline.py`: per-instance KV pool timeline (stacked-area)
+- `aggregate_mb5.py`: aggregates across configs and reps; outputs 4 comparison figures + 1 CSV
+
+**PD-disagg smoke (4P+4D × 20 reqs)**: 20/20 success, mean latency 3.9 s, p99 17 s; all 8 PIDs wrote snapshots (601 total).
+The matching 8C × 20 reqs data point for comparison will be reported once the sweep completes.
+
+**Phase 1 running**:
+- RUN_TAG=`20260527_164040`
+- CONFIGS=`8C 6P+2D 4P+4D 2P+6D` × REPS=3
+- TRACE=`w600_r0.0015_st30.jsonl` (~13 min/rep)
+- ETA ~3 h
+
+**Pending**: sweep results → run `aggregate_mb5.py` → write the Phase 2 system analysis.
+
 ---

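 ## 4 independent failure hypotheses
@@ -30,52 +51,30 @@ Living TODO document. Tracks the experiment status of the four hypotheses H1/H2/H3/H4, and each
 **Conclusion**: use the official example shipped in the vLLM repo,
 `third_party/vllm/examples/online_serving/disaggregated_serving/mooncake_connector/`: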

- `run_mooncake_connector.sh`: parameterizes the P/D GPU lists, ports, and bootstrap; **directly supports arbitrary P:D ratios** (e.g., `PREFILL_GPUS=0,1,2,3 DECODE_GPUS=4,5,6,7` launches 4P+4D)
- `mooncake_connector_proxy.py`: official FastAPI proxy, round-robin over P + round-robin over D; each request is sent fire-and-forget to P with `do_remote_decode={transfer_id}`, and in parallel P's (bootstrap_addr, engine_id) is used to trigger D with `do_remote_prefill={remote_bootstrap_addr, remote_engine_id, transfer_id}`
+- `run_mooncake_connector.sh`: parameterizes the P/D GPU lists, ports, and bootstrap; **directly supports arbitrary P:D ratios**
+- `mooncake_connector_proxy.py`: official FastAPI proxy, round-robin over P + round-robin over D; vendored to `microbench/fresh_setup/mb5_pd_proxy.py` with the `min_tokens=1` fix added

-Deployment shape: P instances use `kv_role:kv_producer` (with a bootstrap_server); D instances use `kv_role:kv_consumer` (**no bootstrap_server**, exactly the mode in which we previously hit the `AttributeError`).
+Deployment shape: P instances use `kv_role:kv_producer` (with a bootstrap_server); D instances use `kv_role:kv_consumer`. The latter's `bootstrap_server` AttributeError is fixed by the `instrument_kv_snapshot.py` patch.

- [ ] Adapt `run_mooncake_connector.sh` to our environment: model path `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`, TP=1, required vLLM launch flags (`--max-model-len 200000`, etc.)
- [ ] Re-validate `kv_consumer` mode: the AttributeError we hit earlier may have come from incorrect kv_transfer_params (missing transfer_id or wrong field names). After the MB2 debugging we nailed down the correct handshake, and the official proxy uses the same handshake, so this will most likely just work
- [ ] If kv_consumer still raises the AttributeError, add a minimal patch to vLLM (we saw earlier that `self.bootstrap_server` is not initialized in consumer mode); this is a one-line fix
+### TODO 0.2: Wrap the official launcher ✅ DONE

-### TODO 0.2: Wrap the official launcher
+`microbench/fresh_setup/mb5_launch.sh` is the single launcher, supporting `8C / 6P+2D / 4P+4D / 2P+6D`;
+the companion `mb5_run.sh` orchestrator iterates over CONFIG × REP, including launch/replay/teardown.

-Based directly on `run_mooncake_connector.sh`, modified:
+### TODO 0.3: System-level instrumentation ✅ DONE (per-request KV)

- [ ] `microbench/fresh_setup/mb5_pd_launch.sh`: directly sourceable; parameters `PREFILL_GPUS` / `DECODE_GPUS` / `PREFILL_PORTS` / `BOOTSTRAP_PORTS` / `DECODE_PORTS` / `PROXY_PORT`; launches P + D + the official proxy
- [ ] Inject the correct model path + vLLM flags (`--max-model-len 200000 --gpu-memory-utilization 0.9 --enable-prefix-caching --max-num-batched-tokens 8192`)
- [ ] Config coverage: start with 4: `8C colo` + `6P+2D` + `4P+4D` + `2P+6D`
- [ ] Note: the colo baseline `8C` does not go through this launcher; it reuses the existing 8-instance unified setup on dash0 (the fairest comparison)
+We chose a layer deeper than `/metrics`: **patch the V1 scheduler to dump the per-request KV block allocation of every `schedule()` round** (throttled to 10 Hz).

-### TODO 0.3: System-level instrumentation
+- `instrument_kv_snapshot.py` output schema: `{t_unix, step, total_blocks, free_blocks, used_blocks, running:[{req_id, n_blocks, n_computed, n_prompt, n_tokens, status}], waiting:[...]}`
+- One jsonl per EngineCore PID, all written to `MB5_LOG_DIR`
+- Compared with prometheus `/metrics`: (a) no polling needed, (b) per-request data rather than only aggregates, (c) we can work back to which requests occupy the D-pool at any given moment

-Latency / success rate alone is not enough; we need time series of system behavior to attribute the bottleneck.
+If prometheus `/metrics` is needed later (e.g., for admission denial events), a sampler can be added; the current per-request data is already enough to support the Phase 1 + Phase 2 analysis.

-**Metrics that must be collected** (once per second, per instance):
+### TODO 0.4: D-pool occupancy timeline visualization ✅ DONE

- [ ] `vllm:gpu_cache_usage_perc`: KV pool utilization (core H1 evidence)
- [ ] `vllm:num_requests_running`: number of concurrent in-flight requests
- [ ] `vllm:num_requests_waiting`: number of requests queued in the scheduler
- [ ] `vllm:time_to_first_token_seconds` (histogram): per-instance TTFT distribution
- [ ] `vllm:time_per_output_token_seconds` (histogram): TPOT distribution
- [ ] D-pool admission-control rejection events (events where vllm refuses new requests); check whether vllm exposes a metric for this
- [ ] Mooncake side: `send_blocks` events (the MB2 instrumentation already exists); optionally transfer queue depth
-
-**Implementation**:
-
- [ ] Write `metrics_sampler.py`: periodically GET `/metrics`, parse the prometheus text, and output jsonl
- [ ] One sampler process per instance, or one central sampler process that pulls from all instances
- [ ] Output schema: `{t_unix, instance_id, role, kv_pool_perc, num_running, num_waiting, ...}`
-
-### TODO 0.4: D-pool occupancy timeline visualization
-
- [ ] Write `plot_d_pool_timeline.py`: heatmap or stacked area
-  - x-axis: trace replay time
-  - y-axis: each D-instance (heatmap) or total KV pool utilization (stacked area)
-  - Color: utilization 0–100%
-  - Mark the 90% red line (the "vllm stops admitting new requests" threshold, per a colleague's older data)
- [ ] Output: one figure per config, stacked for comparison
+- `plot_kv_pool_timeline.py`: per-instance view (stacked-area: time × block count × per-request color bands; waiting queue depth subplot underneath)
+- `aggregate_mb5.py`: cross-config / cross-rep aggregate view (cluster-wide KV timeline, peak utilization bars, latency p50/p90/p99 bars, summary CSV)

 ---
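
The snapshot schema listed under TODO 0.3 lends itself to simple offline analysis. Below is a minimal Python sketch, assuming one JSON object per line with the fields quoted above. The sample records, the `status` strings, and the helper names (`parse_snapshots`, `utilization_series`, `peak_utilization`, `top_holders`) are invented for illustration and are not taken from `instrument_kv_snapshot.py` or `plot_kv_pool_timeline.py`.

```python
import json

# Hypothetical records in the schema quoted above; every value is invented.
_RECORDS = [
    {"t_unix": 100.0, "step": 1, "total_blocks": 1000, "free_blocks": 600,
     "used_blocks": 400,
     "running": [
         {"req_id": "a", "n_blocks": 300, "n_computed": 4000,
          "n_prompt": 4800, "n_tokens": 4800, "status": "RUNNING"},
         {"req_id": "b", "n_blocks": 100, "n_computed": 1600,
          "n_prompt": 1600, "n_tokens": 1610, "status": "RUNNING"},
     ],
     "waiting": []},
    {"t_unix": 100.1, "step": 2, "total_blocks": 1000, "free_blocks": 100,
     "used_blocks": 900,
     "running": [
         {"req_id": "a", "n_blocks": 500, "n_computed": 8000,
          "n_prompt": 8000, "n_tokens": 8001, "status": "RUNNING"},
         {"req_id": "b", "n_blocks": 150, "n_computed": 2400,
          "n_prompt": 1600, "n_tokens": 2400, "status": "RUNNING"},
         {"req_id": "c", "n_blocks": 250, "n_computed": 0,
          "n_prompt": 4000, "n_tokens": 4000, "status": "RUNNING"},
     ],
     "waiting": [
         {"req_id": "d", "n_blocks": 0, "n_computed": 0,
          "n_prompt": 9000, "n_tokens": 9000, "status": "WAITING"},
     ]},
]
SAMPLE_JSONL = "\n".join(json.dumps(r) for r in _RECORDS)


def parse_snapshots(text):
    """Parse jsonl text into a list of snapshot dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def utilization_series(snaps):
    """Return (t_unix, KV pool utilization in percent) for each snapshot."""
    return [(s["t_unix"], 100.0 * s["used_blocks"] / s["total_blocks"])
            for s in snaps if s["total_blocks"] > 0]


def peak_utilization(snaps):
    """Highest utilization percentage seen in the series."""
    return max(pct for _, pct in utilization_series(snaps))


def top_holders(snap, k=3):
    """The k running requests holding the most KV blocks in one snapshot."""
    ranked = sorted(snap["running"], key=lambda r: r["n_blocks"], reverse=True)
    return [(r["req_id"], r["n_blocks"]) for r in ranked[:k]]


if __name__ == "__main__":
    snaps = parse_snapshots(SAMPLE_JSONL)
    print(utilization_series(snaps))
    print(peak_utilization(snaps), top_holders(snaps[-1], k=2))
```

`top_holders` is the per-request attribution that an aggregate gauge cannot provide: for any snapshot it names the requests holding the pool, which is point (c) in the comparison with `/metrics`.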

@@ -106,14 +105,13 @@ reps = 3
 - [ ] **TTFT per request scatter** (colored by which-instance): check whether TTFT collapses outright once the D-pool is full
 - [ ] **Admission denial events** (if vllm exposes them)

-### Output figures
+### Output figures (`aggregate_mb5.py` will write all of these)

- [ ] `mb5_latency_bars.png`: config × TTFT/TPOT/E2E p90 bars
- [ ] `mb5_success.png`: success rate per config
- [ ] `mb5_wallclock.png`: measured trace replay time vs 8C colo
- [ ] `mb5_d_pool_timeline.png`: 4 configs × 8 instances heatmap
- [ ] `mb5_queue_depth_timeline.png`: same structure as above
- [ ] `mb5_diagnostic_summary.png`: one summary figure combining the 5 above, for the paper
+- [ ] `figs/mb5/mb5_kv_timeline.png`: 4 panels (one per config), cluster-wide KV % timeline, faint per-rep lines + bold median
+- [ ] `figs/mb5/mb5_peak_utilization.png`: bar chart of peak vs steady KV per config, with ±std error bars
+- [ ] `figs/mb5/mb5_latency_compare.png`: bar chart of p50/p90/p99 e2e latency per config
+- [ ] `figs/mb5/mb5_summary.csv`: flat per-(config, rep) table (latency, KV, prefix cache, success rate)
+- [ ] (manual) `figs/mb5/mb5_per_instance_timeline.png`: pick 1 rep per config and plot the per-instance stacked-area view via `plot_kv_pool_timeline.py`, for the paper's §3.x system breakdown
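
The "faint per-rep line + bold median" and "±std error bars" in the figure list reduce to two small aggregations across reps. Here is a hedged sketch of that arithmetic, assuming the per-rep utilization series have already been resampled onto a shared time grid and that the error bar is the sample standard deviation. The data and the function names (`median_line`, `peak_mean_std`) are hypothetical and may not match what `aggregate_mb5.py` does.

```python
import statistics

# Hypothetical cluster-wide KV utilization (%) for 3 reps of one config,
# sampled on the same time grid. Values are invented for illustration.
REPS_4P4D = [
    [10.0, 50.0, 90.0],
    [20.0, 60.0, 80.0],
    [30.0, 40.0, 85.0],
]


def median_line(reps):
    """Point-wise median across reps: the bold line in a timeline panel."""
    return [statistics.median(values) for values in zip(*reps)]


def peak_mean_std(reps):
    """Mean and sample standard deviation of the per-rep peak utilization."""
    peaks = [max(rep) for rep in reps]
    return statistics.mean(peaks), statistics.stdev(peaks)


if __name__ == "__main__":
    print(median_line(REPS_4P4D))
    print(peak_mean_std(REPS_4P4D))
```

With REPS=3 the sample standard deviation rests on only three points, so the error bars are best read as a rough spread indicator and not as a confidence interval.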

 ### Decision criteria (when H1 counts as confirmed)