Add PD_DISAGG_INVESTIGATION.md — living TODO for proving H1–H4
We don't have paper-grade evidence yet that PD-disagg fails in agentic. The corrected MB1+MB2 accounting puts the phase-isolation cost-benefit on PD-disagg's side; the only direct support is a colleague's one data point on a patched dash0 build (TTFT p50 62×, success 52%) and the f4b geometric capacity argument. To close §3.2 properly we need fresh-venv empirical replication PLUS system-level instrumentation that tells the reviewer *which* component is the bottleneck, not just headline latency.

This document tracks the four candidate failure hypotheses (H1 D-pool capacity, H2 static-partition mismatch, H3 cache reuse + P-pool hotspot, H4 end-to-end throughput loss), their current evidence status, and the phased experiment plan to address each.

Key findings already recorded:

- Phase 0 TODO 0.1 (find standard PD-disagg deployment) is done: vLLM ships an official example at examples/online_serving/disaggregated_serving/mooncake_connector/ with a kv_producer+kv_consumer launcher and a Mooncake-aware proxy that supports arbitrary P:D ratios via env vars. Per user direction, we will NOT polish PD-disagg policy ourselves; we use the official recipe as the "PD-disagg" baseline in §3.2 / §5.2.
- Phase 1 (MB5+3 combined: PD ratio sweep with D-pool occupancy logging) is the critical path. It is designed to either confirm H1 with system breakdown evidence (D-pool ≥ 90% for ≥ 30% of trace + queue depth spike) or falsify it (some ratio matches 8C colo, in which case §3.2 needs rewriting).
- The D-pool occupancy timeline is the single most important new instrumentation: it turns "PD-disagg is 10× worse" into "PD-disagg is 10× worse BECAUSE the D pool sits at >90% for X% of the trace".

Configurations to run on dash1 8-GPU first: 8C (colo baseline), 6P+2D, 4P+4D, 2P+6D × 3 reps × w600 trace.
Open question still in the doc: vLLM 0.18.1 raised an AttributeError on self.bootstrap_server in kv_consumer mode when we hit it during the MB2 sanity check. The likely cause was bad kv_transfer_params on our side (missing transfer_id, wrong field names), which we have since fixed. The official proxy uses the same handshake we now have, so it should just work. If not, the fallback is a single-line patch that initializes self.bootstrap_server = None for consumer mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
microbench/fresh_setup/PD_DISAGG_INVESTIGATION.md
@@ -0,0 +1,177 @@
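# Why PD-disagg does not work under agentic workloads: a system-behavior-level argument

Living TODO document. It records the experiment status of the four hypotheses H1/H2/H3/H4, plus the instrumentation, figures, and decision criteria each experiment needs. All microbench file names keep the `MB` prefix.

---
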
## Current status (2026-05-27)

**Problem statement**: The only prior evidence that "PD-disagg fails in agentic" is a single data point from a colleague (TTFT p50 62×, success 52%, on the patched dash0 stack). With the corrected MB1+MB2 accounting, phase isolation actually favors PD-disagg, **so we currently have no paper-grade empirical evidence**. This document tracks the experiments and system-level analysis we need.

**Core gap**: We only have the headline number "PD-disagg looks 10× worse", with **no system-level breakdown that tells the reviewer why it is worse and in which component**. What we need are time series of system metrics (D-pool occupancy, scheduler queue depth, KV transfer queue, GPU SM utilization) that point directly at the bottleneck.

---

## Four independent failure hypotheses

| Hypothesis | Description | Current evidence | Status |
|---|---|---|---|
| **H1** D-pool capacity ceiling | A p99 single request takes 11.5 GiB against a 38 GiB pool per D-instance; 4P+4D halves system decode capacity; long contexts overflow → queueing | colleague's old data + `f4b` geometric argument | **need fresh-venv replication + system breakdown** |
| **H2** Static-partition mismatch | The workload's P:D ratio drifts over time; a static ratio cannot adapt | none | not started |
| **H3** Cache-reuse degradation + P-pool hotspot | Round-robin P loses affinity (APC drops); sticky P reproduces the §3.3 hot pin | none (logical extension) | not started |
| **H4** End-to-end throughput loss | Each phase looks OK in isolation but wall-clock loses (amplified by dispatch coupling) | none | not started |
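
The H1 capacity argument reduces to a small piece of arithmetic. The sketch below works it through using only the two figures in the H1 row (11.5 GiB per p99 request, 38 GiB pool per D-instance). It assumes, and this is not stated in the table, that a colocated instance has the same 38 GiB pool and that all eight colo instances can hold decoding requests. The function names are illustrative.

```python
import math

P99_REQUEST_GIB = 11.5  # p99 single-request footprint, from the H1 row
POOL_GIB = 38.0         # pool size per D-instance, from the H1 row


def p99_slots(pool_gib, request_gib):
    """Number of p99-sized requests that fit in one pool at the same time."""
    return math.floor(pool_gib / request_gib)


def decode_slots(num_decode_capable_instances):
    """System-wide p99-sized decode slots.

    Assumption: every decode-capable instance has the same POOL_GIB pool.
    """
    return num_decode_capable_instances * p99_slots(POOL_GIB, P99_REQUEST_GIB)


if __name__ == "__main__":
    for name, n in [("8C", 8), ("6P+2D", 2), ("4P+4D", 4), ("2P+6D", 6)]:
        print(name, decode_slots(n))
```

Under these assumptions each pool holds three p99-sized requests at once, so 4P+4D has 12 such slots against 24 for 8C, which is the "halves system decode capacity" claim in the table.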

---

## Phase 0: Infrastructure (must be done first)

### TODO 0.1: Find the standard vLLM PD-disagg deployment ✅ DONE

**Conclusion**: use the official example shipped in the vLLM repo,
`third_party/vllm/examples/online_serving/disaggregated_serving/mooncake_connector/`:

- `run_mooncake_connector.sh`: parameterizes the P/D GPU lists, ports, and bootstrap, and **directly supports arbitrary P:D ratios** (e.g., `PREFILL_GPUS=0,1,2,3 DECODE_GPUS=4,5,6,7` brings up 4P+4D)
- `mooncake_connector_proxy.py`: the official FastAPI proxy, round-robin over P and round-robin over D. For each request it fires and forgets a call to P with `do_remote_decode={transfer_id}`, and in parallel uses P's (bootstrap_addr, engine_id) to trigger D with `do_remote_prefill={remote_bootstrap_addr, remote_engine_id, transfer_id}`

Deployment shape: P instances use `kv_role:kv_producer` (with a bootstrap_server); D instances use `kv_role:kv_consumer` (**no bootstrap_server**, which is exactly the mode where we previously hit the `AttributeError`).

- [ ] Adapt `run_mooncake_connector.sh` to our environment: model path `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`, TP=1, and the required vLLM launch flags (`--max-model-len 200000`, etc.)
- [ ] Re-verify `kv_consumer` mode: the AttributeError we hit earlier was probably caused by wrong kv_transfer_params (missing transfer_id or wrong field names). After the MB2 debugging we have nailed down the correct handshake, and the official proxy uses the same handshake, so this will most likely just work
- [ ] If kv_consumer still raises the AttributeError, add a minimal patch to vLLM (we saw earlier that `self.bootstrap_server` is not initialized in consumer mode); it is a one-line fix
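
As a reading aid for the proxy behavior described above, here is a minimal sketch of round-robin P/D pairing with a shared transfer id. It is not the code of `mooncake_connector_proxy.py`: the instance dictionaries, the `make_dispatcher` helper, the returned plan layout, and the use of `uuid4` for the id are all hypothetical. Only the field names (`transfer_id`, `remote_bootstrap_addr`, `remote_engine_id`) are taken from the description.

```python
import itertools
import uuid


def make_dispatcher(prefill_instances, decode_instances):
    """Pair every request with the next P and the next D, round-robin.

    Each prefill instance is a dict with the hypothetical keys
    'url', 'bootstrap_addr' and 'engine_id'; each decode instance has 'url'.
    """
    p_cycle = itertools.cycle(prefill_instances)
    d_cycle = itertools.cycle(decode_instances)

    def dispatch():
        p, d = next(p_cycle), next(d_cycle)
        transfer_id = uuid.uuid4().hex  # assumption: any unique string works
        return {
            "prefill_url": p["url"],
            "decode_url": d["url"],
            # Fields the text lists for do_remote_decode (sent to P).
            "do_remote_decode": {"transfer_id": transfer_id},
            # Fields the text lists for do_remote_prefill (sent to D).
            "do_remote_prefill": {
                "remote_bootstrap_addr": p["bootstrap_addr"],
                "remote_engine_id": p["engine_id"],
                "transfer_id": transfer_id,
            },
        }

    return dispatch
```

The exact wire format of `kv_transfer_params` should be read from the official proxy rather than from this sketch; the point here is only that P and D are chosen independently and tied together by one `transfer_id`.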

### TODO 0.2: Wrap the official launcher

Build directly on `run_mooncake_connector.sh`:

- [ ] `microbench/fresh_setup/mb5_pd_launch.sh`: directly sourceable, takes `PREFILL_GPUS` / `DECODE_GPUS` / `PREFILL_PORTS` / `BOOTSTRAP_PORTS` / `DECODE_PORTS` / `PROXY_PORT`, and starts P + D + the official proxy
- [ ] Inject the correct model path + vLLM flags (`--max-model-len 200000 --gpu-memory-utilization 0.9 --enable-prefix-caching --max-num-batched-tokens 8192`)
- [ ] Config coverage: start with four: `8C colo` + `6P+2D` + `4P+4D` + `2P+6D`
- [ ] Note: the colo baseline `8C` does not go through this launcher; it reuses the existing 8-instance unified setup on dash0 (the fairest comparison)
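
One way to drive the launcher from config names such as `6P+2D` is to translate each name into the `PREFILL_GPUS` / `DECODE_GPUS` values the script takes. The helper below is a hypothetical sketch: it assumes GPUs are numbered 0..7 and that P instances take the lowest indices, which matches the `PREFILL_GPUS=0,1,2,3 DECODE_GPUS=4,5,6,7` example for 4P+4D but is otherwise our choice. `8C` is rejected on purpose because the colo baseline does not use this launcher.

```python
import re


def pd_env(config_name, num_gpus=8):
    """Map a name like '6P+2D' to launcher environment variables.

    Assumption: GPUs are indexed 0..num_gpus-1 and P takes the low indices.
    """
    match = re.fullmatch(r"(\d+)P\+(\d+)D", config_name)
    if match is None:
        raise ValueError(f"not a PD config name: {config_name!r}")
    n_prefill, n_decode = int(match.group(1)), int(match.group(2))
    if n_prefill < 1 or n_decode < 1 or n_prefill + n_decode != num_gpus:
        raise ValueError(f"{config_name!r} does not split {num_gpus} GPUs")
    gpus = [str(i) for i in range(num_gpus)]
    return {
        "PREFILL_GPUS": ",".join(gpus[:n_prefill]),
        "DECODE_GPUS": ",".join(gpus[n_prefill:]),
    }


if __name__ == "__main__":
    for name in ["6P+2D", "4P+4D", "2P+6D"]:
        print(name, pd_env(name))
```

The port variables are left out because the document does not fix a port numbering scheme.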

### TODO 0.3: System-level instrumentation

Latency and success rate alone are not enough; we need time series of system behavior to attribute the bottleneck.

**Metrics that must be collected** (once per second, per instance):

- [ ] `vllm:gpu_cache_usage_perc`: KV pool occupancy (the core H1 evidence); confirm the exact name against this build's `/metrics` output, since some vLLM releases expose it as `vllm:kv_cache_usage_perc`
- [ ] `vllm:num_requests_running`: number of concurrent in-flight requests
- [ ] `vllm:num_requests_waiting`: number of requests queued in the scheduler
- [ ] `vllm:time_to_first_token_seconds` (histogram): per-instance TTFT distribution
- [ ] `vllm:time_per_output_token_seconds` (histogram): TPOT distribution
- [ ] D-pool admission-control rejection events (emitted when vllm refuses a new request): check whether vllm exposes a metric for this
- [ ] Mooncake side: `send_blocks` events (the MB2 instrumentation already exists); optionally transfer queue depth

**Implementation**:

- [ ] Write `metrics_sampler.py`: periodically GET `/metrics`, parse the Prometheus text, and write jsonl
- [ ] One sampler process per instance, or one central sampler that pulls from all instances
- [ ] Output schema: `{t_unix, instance_id, role, kv_pool_perc, num_running, num_waiting, ...}`
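
A minimal sketch of what the core of `metrics_sampler.py` could look like, using only the standard library. It covers the three gauges in the schema and skips the histograms. The regular expression handles the common `name{labels} value` line shape of the Prometheus text format but not label values that contain `}`, and when a metric appears with several label sets the last one wins; both are simplifications. The function names and the `role` values are assumptions.

```python
import json
import re
import time
import urllib.request

# Prometheus metric name -> schema field from TODO 0.3.
# Adjust the names to whatever the /metrics endpoint of the build reports.
GAUGES = {
    "vllm:gpu_cache_usage_perc": "kv_pool_perc",
    "vllm:num_requests_running": "num_running",
    "vllm:num_requests_waiting": "num_waiting",
}

SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def parse_gauges(text):
    """Extract the gauges listed in GAUGES from Prometheus exposition text."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) in GAUGES:
            out[GAUGES[match.group(1)]] = float(match.group(2))
    return out


def make_row(text, instance_id, role, t_unix=None):
    """Build one jsonl line in the TODO 0.3 schema."""
    row = {
        "t_unix": time.time() if t_unix is None else t_unix,
        "instance_id": instance_id,
        "role": role,
    }
    row.update(parse_gauges(text))
    return json.dumps(row)


def fetch_metrics(base_url, timeout=2.0):
    """GET <base_url>/metrics and return the body as text."""
    with urllib.request.urlopen(base_url + "/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A sampler loop would call `fetch_metrics` once per second for each instance and append `make_row(...)` to a file. A single central process keeps the timestamps on one clock, which makes cross-instance timelines easier to align.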

### TODO 0.4: D-pool occupancy timeline visualization

- [ ] Write `plot_d_pool_timeline.py`: heatmap or stacked area
  - x-axis: trace replay time
  - y-axis: one row per D-instance (heatmap) or total KV pool occupancy (stacked area)
  - color: occupancy 0–100%
  - mark a red line at 90% (the "vllm stops admitting new requests" threshold, per the colleague's old data)
- [ ] Output: one figure per config, stacked for comparison
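
Before plotting, the sampler rows have to be pivoted into an instance × time grid. The sketch below shows one way to do that with plain Python; it assumes rows carry the `t_unix`, `instance_id`, and `kv_pool_perc` fields from the TODO 0.3 schema, and the function name and the choice of keeping the maximum per bin are ours.

```python
def occupancy_grid(rows, bin_seconds=1.0):
    """Pivot sampler rows into (instances, grid).

    grid[i][b] is the highest kv_pool_perc seen for instances[i] in time
    bin b, or None when that instance has no sample in the bin.
    """
    rows = list(rows)
    if not rows:
        return [], []
    t0 = min(r["t_unix"] for r in rows)
    t1 = max(r["t_unix"] for r in rows)
    instances = sorted({r["instance_id"] for r in rows})
    index = {name: i for i, name in enumerate(instances)}
    n_bins = int((t1 - t0) // bin_seconds) + 1
    grid = [[None] * n_bins for _ in instances]
    for r in rows:
        b = int((r["t_unix"] - t0) // bin_seconds)
        i = index[r["instance_id"]]
        current = grid[i][b]
        value = r["kv_pool_perc"]
        grid[i][b] = value if current is None else max(current, value)
    return instances, grid
```

With `None` replaced by NaN, the grid can be handed to a heatmap call such as matplotlib's `imshow`, one row per D-instance.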

---

## Phase 1: MB5+3, PD ratio sweep + measured D-pool occupancy

**Question**: Can any static P:D ratio beat 8C colo? If not, where is the bottleneck?

### Matrix

```
configs = [8C, 6P+2D, 4P+4D, 2P+6D]
trace   = traces/w600_r0.0015_st30.jsonl  (1.2k req, ~13 min if not stalled)
reps    = 3
```

### Primary metrics (per config, mean of 3 reps)

- [ ] TTFT / TPOT / E2E (mean / p50 / p90 / p99)
- [ ] Success rate
- [ ] **Wall-clock time to drain trace** (→ dispatch coupling amplification)
- [ ] APC (effective prefix cache hit ratio)
- [ ] System throughput (req/s steady state)
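
To make "mean / p50 / p90 / p99, mean of 3 reps" unambiguous, the sketch below summarizes one rep's per-request latencies and then averages the summaries across reps. It uses the nearest-rank percentile definition, which is our choice; tools that interpolate between samples (NumPy's default, for instance) can report slightly different values. The function names are illustrative.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile for integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values):
    """Summary of one rep's per-request latencies (TTFT, TPOT or E2E)."""
    return {
        "mean": statistics.fmean(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }


def mean_over_reps(summaries):
    """Average each summary statistic across reps."""
    return {key: statistics.fmean(s[key] for s in summaries) for key in summaries[0]}
```

Averaging per-rep percentiles is not the same as taking a percentile over the pooled requests of all reps; whichever is used should be stated next to the figure.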

### System breakdown (the kill-shot evidence)

- [ ] **D-pool occupancy timeline** per D-instance
- [ ] **Scheduler queue depth timeline** (waiting requests over time)
- [ ] **TTFT per request scatter** (colored by instance): check whether TTFT collapses outright once the D pool is full
- [ ] **Admission denial events** (if vllm exposes them)

### Output figures

- [ ] `mb5_latency_bars.png`: config × TTFT/TPOT/E2E p90 bar
- [ ] `mb5_success.png`: success rate per config
- [ ] `mb5_wallclock.png`: measured trace replay time vs 8C colo
- [ ] `mb5_d_pool_timeline.png`: 4 configs × 8 instances heatmap
- [ ] `mb5_queue_depth_timeline.png`: same layout as above
- [ ] `mb5_diagnostic_summary.png`: one summary figure that packs the five above, for the paper

### Decision criteria (when H1 counts as confirmed)

H1 counts as **confirmed** if all of the following hold:

1. PD-disagg has TTFT p90 ≥ 2× 8C colo on every non-8C config
2. The D-pool occupancy timeline shows the D pool at ≥ 90% for more than 30% of the trace duration
3. Scheduler queue depth is significantly higher under PD-disagg than under 8C colo
4. At least one PD config shows a drop in success rate

H1 counts as **falsified** if some PD ratio brings every metric roughly level with 8C colo; in that case §3.2 has to be rewritten.
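
The four confirmation criteria can be written down as an explicit check so that the verdict is reproducible. The sketch below is one possible reading, with several assumptions that the criteria leave open: criterion 2 is taken to hold if any PD config exceeds the 30% saturation share, "significantly higher" in criterion 3 is modeled by a hypothetical `queue_factor` multiplier on mean queue depth, and the input dictionaries and their keys are invented for illustration.

```python
def frac_saturated(occupancy, threshold=0.9):
    """Share of evenly spaced occupancy samples at or above the threshold."""
    if not occupancy:
        return 0.0
    return sum(1 for x in occupancy if x >= threshold) / len(occupancy)


def h1_verdict(colo, pd_runs, queue_factor=2.0):
    """Evaluate the four H1 criteria.

    colo: dict with 'ttft_p90', 'mean_queue_depth', 'success_rate'.
    pd_runs: config name -> same keys plus 'd_pool_occupancy' (samples in 0..1).
    queue_factor: assumed stand-in for "significantly higher".
    """
    runs = list(pd_runs.values())
    checks = {
        "ttft_p90_2x_everywhere": all(
            r["ttft_p90"] >= 2 * colo["ttft_p90"] for r in runs
        ),
        "d_pool_saturated_30pct": any(
            frac_saturated(r["d_pool_occupancy"]) > 0.30 for r in runs
        ),
        "queue_depth_higher": all(
            r["mean_queue_depth"] > queue_factor * colo["mean_queue_depth"]
            for r in runs
        ),
        "success_drop_somewhere": any(
            r["success_rate"] < colo["success_rate"] for r in runs
        ),
    }
    checks["confirmed"] = all(checks.values())
    return checks
```

Returning the individual checks alongside the verdict makes it visible which criterion failed when H1 is not confirmed. The falsification side ("roughly level with 8C colo") would need its own tolerance and is not modeled here.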

---

## Phase 2: System analysis (pin down the bottleneck)

Once the Phase 1 data is in, write the system-level analysis:

- [ ] For the 4P+4D config: which D-instance reaches 90%, and during which time window? Are the requests arriving in that window the large-KV ones?
- [ ] Compare 8C against 4P+4D: what does each instance's occupancy look like in 8C over the same time window? Is that because the prefill workload helps churn KV out?
- [ ] Wall-clock difference attribution: use the system metrics to decompose the X seconds by which PD-disagg is slower than 8C into (queue_wait + decode + transfer + ...)
- [ ] Check the vllm log for events along the lines of "scheduler skipped due to no available KV blocks"
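
The first item in the list above amounts to finding the time windows in which one D-instance sits at or above 90%. A minimal sketch, assuming the samples for one instance are already sorted by time as `(t_seconds, occupancy)` pairs; the function name is illustrative.

```python
def saturated_windows(samples, threshold=0.9):
    """Return [(start_t, end_t), ...] for runs of samples >= threshold.

    samples: list of (t_seconds, occupancy) pairs sorted by time.
    end_t is the timestamp of the last saturated sample in the run.
    """
    windows = []
    start = None
    last_t = None
    for t, occupancy in samples:
        if occupancy >= threshold:
            if start is None:
                start = t  # a new saturated run begins here
            last_t = t
        elif start is not None:
            windows.append((start, last_t))
            start = None
    if start is not None:
        windows.append((start, last_t))  # run still open at end of trace
    return windows
```

The resulting windows can then be joined against request arrival times to check whether large-KV requests arrive inside them, and against the 8C timeline over the same intervals.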
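
Expected write-up, roughly:

> "Under the 4P+4D config, the D pool stays above 90% occupancy from second X to second Y of the trace. During this window, requests that have finished prefill (coming from the P pool) wait Z seconds on average before entering decode, which pushes TTFT p90 from 18s under 8C to 60s. Under the 8C config the KV pool only reaches 60% at the same point, because decode starts immediately after prefill with no admission queueing."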

---

## Phase 3: MB4 cache-reuse × PD-routing (secondary)

Prioritize this only if H1 is not pinned down; otherwise treat it as a supporting experiment.

- [ ] Fix 4P+4D and sweep P-routing: round-robin / session-sticky / kv-aware
- [ ] Measure: APC, per-worker hotspots in the P pool, end-to-end TTFT
- [ ] Expected: round-robin loses ~23pp of APC; the sticky P pool reproduces the §3.3 hot pin
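
To illustrate the trade-off between the first two routing policies, the sketch below routes a list of session ids to P workers under round-robin and under session-sticky routing and counts requests per worker. It is a toy model with hypothetical names: it does not measure APC, and kv-aware routing is left out because it needs cache state that the sketch does not model.

```python
import itertools
import zlib
from collections import Counter


def round_robin_router(num_workers):
    """Ignore the session and hand out workers in turn."""
    counter = itertools.count()
    return lambda session_id: next(counter) % num_workers


def sticky_router(num_workers):
    """Always send a session to the same worker.

    crc32 is used because, unlike hash() on str, it is stable across runs.
    """
    return lambda session_id: zlib.crc32(session_id.encode("utf-8")) % num_workers


def load_per_worker(router, session_ids):
    """Count how many requests each worker receives."""
    return Counter(router(s) for s in session_ids)
```

With a trace dominated by one long session, round-robin spreads load evenly but scatters that session's prefixes across workers, while sticky routing keeps the prefixes together and concentrates the load on one worker, which is the hot-pin pattern the expectation above refers to.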

---

## Phase 4: Optional scale-out (16 GPU on dash1+dash2)

- [ ] Start only after the Phase 1 conclusions are stable
- [ ] Configs: 16C / 8P+8D / 4×(2P+2D) (intra-node disagg + cross-node coordination)
- [ ] Extrapolate to 80 GPUs with the cost model (if available)

---

## Timeline (rough estimate)

- Phase 0 (infrastructure): 2-3 days. Survey the standard vllm deployment + adapt the launcher + add metrics sampling
- Phase 1 (MB5+3 sweep): 1 day to run + 1 day to analyze = 2 days
- Phase 2 (system analysis, writing the paper section): 1-2 days
- Phase 3 (MB4): 2-3 days (optional)

Total: ~1 week to pin down H1 plus the Phase 2 writing.

---

## Notes / open questions

- The `bootstrap_server` AttributeError in vLLM 0.18.1 `kv_consumer` mode has to be tackled head-on this time, because a standard PD-disagg deployment typically uses producer/consumer roles, not kv_both
- Should the D-pool occupancy timeline be added to MB3's experiment scope? I lean yes (Phase 1 covers it naturally)
- Is the cost model (task #12) still needed? That depends on whether the Phase 1 conclusions require analytical extrapolation