# Why PD-disagg actually fails under agentic workloads: a system-behavior-level argument

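Living TODO document. It tracks the experiment status of the four hypotheses H1/H2/H3/H4, along with the instrumentation, figures, and decision criteria each experiment needs. All microbench filenames keep the `MB` prefix.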

---

## Current status (2026-05-27)

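**Problem statement**: Our only evidence so far that "PD-disagg fails under agentic workloads" is a single data point from a colleague (TTFT p50 62×, success 52%, on the patched dash0 stack). Once MB1+MB2 did the accounting correctly, phase isolation actually favored PD-disagg, **so we currently have no paper-grade empirical evidence**. This document tracks the experiments and system-level analysis we need.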

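**Core gap**: We only have the headline number ("PD-disagg looks about 10× slower"), but **no system-level breakdown that tells a reviewer why it is slower and which component is responsible**. What we need are time series of system metrics (D-pool occupancy, scheduler queue depth, KV transfer queue, GPU SM utilization) that point directly at the bottleneck.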

### Progress (2026-05-27 16:50)

**Phase 0 done**: the full MB5 pipeline is up and running on the dash1 fresh venv:

- `mb5_launch.sh`: single launcher for 8C / 6P+2D / 4P+4D / 2P+6D; `stop_all` includes a stale-port guard
- `mb5_pd_proxy.py`: vendored copy of vLLM's official `mooncake_connector_proxy.py`, patched to fix a `min_tokens` compatibility bug on the prefill leg
- `instrument_kv_snapshot.py`: patches the V1 scheduler to expose the per-request KV block allocation at the end of each `schedule()` call, and fixes the vLLM 0.18.1 AttributeError caused by `MooncakeConnectorWorker.bootstrap_server` being uninitialized in kv_consumer mode
- `plot_kv_pool_timeline.py`: per-instance KV pool timeline (stacked area)
- `aggregate_mb5.py`: aggregates across configs and reps; writes 4 comparison figures + 1 CSV

**PD-disagg smoke (4P+4D × 20 reqs)**: 20/20 success, mean latency 3.9s, p99 17s, all 8 PIDs wrote snapshots (601 total).
The matching 8C × 20 reqs data point will be reported once the sweep finishes.

**Phase 1 running**:
- RUN_TAG=`20260527_164040`
- CONFIGS=`8C 6P+2D 4P+4D 2P+6D` × REPS=3
- TRACE=`w600_r0.0015_st30.jsonl` (~13 min/rep)
- ETA ~3 h

**Pending**: sweep results → run `aggregate_mb5.py` → write the Phase 2 system analysis.

---

## Four independent failure hypotheses

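| Hypothesis | Description | Current evidence | Status |
|---|---|---|---|
| **H1** D-pool capacity ceiling | A p99 single request needs 11.5 GiB against a 38 GiB pool per D instance; 4P+4D halves system decode capacity; long contexts overflow → queueing | colleague's old data + `f4b` geometric argument | **need fresh-venv replication + system breakdown** |
| **H2** Static pool-split mismatch | The workload's P:D ratio drifts over time; a static ratio cannot adapt | none | not started |
| **H3** Cache-reuse degradation + P-pool hotspot | Round-robin P loses affinity (APC drops); sticky P reproduces the §3.3 hot pin | none (logical extension) | not started |
| **H4** End-to-end throughput loss | Each phase looks fine in isolation but wall-clock loses (dispatch coupling amplifies it) | none | not started |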
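
The H1 capacity argument can be sanity-checked with a few lines of arithmetic. The sketch below uses only the two figures from the H1 row (11.5 GiB per p99 request, 38 GiB pool) and assumes that every instance has the same pool size and that all 8 instances of 8C can hold decode KV. Both are assumptions for illustration, and for 8C the result is an upper bound because that pool is shared with prefill.

```python
import math

P99_REQUEST_GIB = 11.5  # p99 single-request KV footprint (H1 row)
POOL_GIB = 38.0         # KV pool per instance (given for D instances; assumed equal everywhere)

def p99_slots(num_decode_instances: int) -> int:
    """Number of p99-sized requests that fit cluster-wide without overflow."""
    per_instance = math.floor(POOL_GIB / P99_REQUEST_GIB)
    return per_instance * num_decode_instances

# Instances able to hold decode KV in each config (8C: all 8, by assumption).
for name, n_decode in [("8C", 8), ("6P+2D", 2), ("4P+4D", 4), ("2P+6D", 6)]:
    print(f"{name}: up to {p99_slots(n_decode)} concurrent p99-sized requests")
```

Under these assumptions each instance holds 3 p99-sized requests, so 4P+4D offers half the decode slots of 8C, which matches the "halves system decode capacity" claim in the table.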

---

## Phase 0: Infrastructure (must be done first)

### TODO 0.1: Find the standard vLLM PD-disagg deployment ✅ DONE

**Conclusion**: use the official example shipped in the vLLM repo,
`third_party/vllm/examples/online_serving/disaggregated_serving/mooncake_connector/`:

- `run_mooncake_connector.sh`: parameterizes the P/D GPU lists, ports, and bootstrap; **supports arbitrary P:D ratios directly**
- `mooncake_connector_proxy.py`: official FastAPI proxy, round-robin over P and round-robin over D; vendored to `microbench/fresh_setup/mb5_pd_proxy.py` with the `min_tokens=1` fix

Deployment shape: P instances use `kv_role:kv_producer` (with bootstrap_server), D instances use `kv_role:kv_consumer`. The `bootstrap_server` AttributeError on the latter is fixed by the `instrument_kv_snapshot.py` patch.

### TODO 0.2: Wrap the official launcher ✅ DONE

`microbench/fresh_setup/mb5_launch.sh` is a single launcher supporting `8C / 6P+2D / 4P+4D / 2P+6D`;
its companion orchestrator `mb5_run.sh` iterates over CONFIG × REP, including launch/replay/teardown.

### TODO 0.3: System-level instrumentation ✅ DONE (per-request KV)

We went one level deeper than `/metrics`: **patch the V1 scheduler to dump the per-request KV block allocation on every `schedule()` round** (throttled to 10 Hz).

- `instrument_kv_snapshot.py` output schema: `{t_unix, step, total_blocks, free_blocks, used_blocks, running:[{req_id, n_blocks, n_computed, n_prompt, n_tokens, status}], waiting:[...]}`
- One jsonl file per EngineCore PID, all written to `MB5_LOG_DIR`
- Compared with Prometheus `/metrics`: (a) no polling needed, (b) per-request data rather than aggregates only, (c) we can reconstruct which requests were holding the D pool at any given moment

If we later need Prometheus `/metrics` (for example, admission denial events), we can add a sampler; the current per-request data is already enough to support the Phase 1 + Phase 2 analysis.
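
As an illustration of how one snapshot line can answer "who is holding the pool right now", here is a minimal sketch that reads a single jsonl record with the schema above. The function name `summarize_snapshot`, the sample values, and the `status` strings are hypothetical; only the field names come from the schema, and the `waiting` entries are only counted because their layout is not spelled out.

```python
import json

def summarize_snapshot(line: str, top_k: int = 3) -> dict:
    """Reduce one snapshot record to occupancy, queue depth, and largest holders."""
    snap = json.loads(line)
    total = snap["total_blocks"]
    used_frac = snap["used_blocks"] / total if total else 0.0
    # Rank running requests by how many KV blocks they currently hold.
    holders = sorted(snap["running"], key=lambda r: r["n_blocks"], reverse=True)
    return {
        "t_unix": snap["t_unix"],
        "used_frac": used_frac,
        "n_waiting": len(snap["waiting"]),
        "top_holders": [(r["req_id"], r["n_blocks"]) for r in holders[:top_k]],
    }

# Illustrative record; every value here is made up.
example_line = json.dumps({
    "t_unix": 1.0, "step": 42,
    "total_blocks": 1000, "free_blocks": 250, "used_blocks": 750,
    "running": [
        {"req_id": "req-b", "n_blocks": 250, "n_computed": 4000,
         "n_prompt": 3900, "n_tokens": 4000, "status": "running"},
        {"req_id": "req-a", "n_blocks": 500, "n_computed": 8000,
         "n_prompt": 7900, "n_tokens": 8000, "status": "running"},
    ],
    "waiting": [{"req_id": "req-c"}],
})
print(summarize_snapshot(example_line))
```

Applying this to every line of a per-PID file yields the `(t_unix, used_frac)` series that the timeline plots and the saturation checks are built on.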

### TODO 0.4: D-pool occupancy timeline visualization ✅ DONE

- `plot_kv_pool_timeline.py`: per-instance view (stacked area: time × blocks × one color band per request; waiting queue depth in a subplot below)
- `aggregate_mb5.py`: cross-config / cross-rep aggregate view (cluster-wide KV timeline, peak occupancy bar chart, latency p50/p90/p99 bar chart, summary CSV)

---

## Phase 1: MB5+3, PD ratio sweep + measured D-pool occupancy

**Question**: Can any static P:D ratio beat 8C colo? If not, where is the bottleneck?

### Matrix

```
configs = [8C, 6P+2D, 4P+4D, 2P+6D]
trace = traces/w600_r0.0015_st30.jsonl (1.2k req, ~13 min if not stalled)
reps = 3
```
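
The matrix can be expanded mechanically. The sketch below parses the config names into per-role instance counts and estimates pure replay time from the ~13 min/rep figure; `parse_config` is a hypothetical helper, not part of `mb5_run.sh`, and launch/teardown overhead is not included in the estimate.

```python
import re

def parse_config(name: str) -> dict:
    """Parse '8C' or '4P+4D' into instance counts per role."""
    colo = re.fullmatch(r"(\d+)C", name)
    if colo:
        return {"colocated": int(colo.group(1)), "prefill": 0, "decode": 0}
    pd = re.fullmatch(r"(\d+)P\+(\d+)D", name)
    if pd:
        return {"colocated": 0, "prefill": int(pd.group(1)), "decode": int(pd.group(2))}
    raise ValueError(f"unrecognized config: {name!r}")

CONFIGS = ["8C", "6P+2D", "4P+4D", "2P+6D"]
REPS = 3
MINUTES_PER_REP = 13  # approximate replay time per rep, if the run does not stall

# One (config, rep) pair per run, in the order an orchestrator might iterate them.
runs = [(config, rep) for config in CONFIGS for rep in range(REPS)]
replay_minutes = len(runs) * MINUTES_PER_REP
print(f"{len(runs)} runs, about {replay_minutes} min of replay")
```

This gives 12 runs and roughly 156 minutes of replay, in line with the ~3 h ETA once per-run launch and teardown are added.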

### Primary metrics (per config, mean of 3 reps)

- [ ] TTFT / TPOT / E2E (mean / p50 / p90 / p99)
- [ ] Success rate
- [ ] **Wall-clock time to drain trace** (→ dispatch coupling amplification)
- [ ] APC (effective prefix cache hit ratio)
- [ ] System throughput (req/s steady state)
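
For the latency rows, a minimal sketch of the summary statistics follows. It uses the nearest-rank percentile definition, which is an assumption; `aggregate_mb5.py` may use a different (for example interpolating) definition, so small differences at p99 are possible on short runs.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile for integer p in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Multiply before dividing so exact ranks are not disturbed by float rounding.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(values):
    """mean / p50 / p90 / p99 for one metric (TTFT, TPOT, or E2E) of one rep."""
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }

def mean_over_reps(summaries):
    """Average each statistic across reps, as in 'mean of 3 reps'."""
    return {k: sum(s[k] for s in summaries) / len(summaries) for k in summaries[0]}
```

With only 20 requests, as in the smoke test, nearest-rank p99 is simply the maximum, so tail numbers from small runs should be read with care.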

### System breakdown (the kill-shot evidence)

- [ ] **D-pool occupancy timeline** per D-instance
- [ ] **Scheduler queue depth timeline** (waiting requests over time)
- [ ] **TTFT per request scatter** (colored by instance): check whether TTFT collapses outright once the D pool fills up
- [ ] **Admission denial events** (if vLLM exposes them)

### Output figures (`aggregate_mb5.py` will write all of these)

- [ ] `figs/mb5/mb5_kv_timeline.png`: 4 panels (one per config), cluster-wide KV % timeline, faint per-rep lines + bold median
- [ ] `figs/mb5/mb5_peak_utilization.png`: bar chart of peak vs steady KV per config, with ±std error bars
- [ ] `figs/mb5/mb5_latency_compare.png`: bar chart of p50/p90/p99 e2e latency per config
- [ ] `figs/mb5/mb5_summary.csv`: flat per-(config, rep) table (latency, KV, prefix cache, success rate)
- [ ] (manual) `figs/mb5/mb5_per_instance_timeline.png`: pick 1 rep per config and plot the per-instance stacked area via `plot_kv_pool_timeline.py`, for the paper §3.x system breakdown

### Decision criteria (when H1 counts as confirmed)

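H1 counts as **confirmed** if all of the following hold:
1. PD-disagg has TTFT p90 ≥ 2× 8C colo on every non-8C config
2. The D-pool occupancy timeline shows the D pool at ≥ 90% for more than 30% of the trace duration
3. Scheduler queue depth is significantly higher under PD-disagg than under 8C colo
4. At least one PD config shows a drop in success rate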

H1 counts as **falsified** if some PD ratio brings every metric roughly on par with 8C colo, in which case §3.2 must be rewritten.
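
The four confirmation criteria can be encoded as a small check so the verdict is mechanical once the sweep lands. This is a sketch under stated assumptions: criterion 2 is read as "at least one PD config", "significantly higher" in criterion 3 is replaced by a placeholder ratio of 2× (`queue_ratio`), and a success-rate drop is measured against 8C. None of these readings are fixed by the criteria themselves, and the numbers in the example are invented.

```python
from dataclasses import dataclass

@dataclass
class ConfigResult:
    name: str
    ttft_p90_s: float
    frac_time_d_pool_ge_90: float  # share of trace duration with D pool >= 90% used
    mean_queue_depth: float
    success_rate: float

def evaluate_h1(colo, pd_results, queue_ratio=2.0):
    """Return each criterion's outcome plus the overall H1 verdict."""
    checks = {
        # 1. every PD config has TTFT p90 >= 2x the 8C value
        "ttft_p90_ge_2x": all(r.ttft_p90_s >= 2 * colo.ttft_p90_s for r in pd_results),
        # 2. D pool >= 90% for more than 30% of the trace (assumed: in any PD config)
        "d_pool_saturated": any(r.frac_time_d_pool_ge_90 > 0.30 for r in pd_results),
        # 3. queue depth "significantly higher" (assumed threshold: queue_ratio x colo)
        "queue_depth_higher": all(
            r.mean_queue_depth >= queue_ratio * colo.mean_queue_depth for r in pd_results
        ),
        # 4. at least one PD config loses success rate relative to 8C (assumed baseline)
        "success_rate_drop": any(r.success_rate < colo.success_rate for r in pd_results),
    }
    checks["h1_confirmed"] = all(checks.values())
    return checks

# Invented numbers, for illustration only.
colo = ConfigResult("8C", 10.0, 0.0, 1.0, 1.0)
pd_results = [
    ConfigResult("6P+2D", 40.0, 0.55, 9.0, 0.80),
    ConfigResult("4P+4D", 25.0, 0.35, 4.0, 1.00),
    ConfigResult("2P+6D", 21.0, 0.10, 2.5, 1.00),
]
print(evaluate_h1(colo, pd_results))
```

Returning the per-criterion flags rather than a single boolean makes it easy to see which condition blocks confirmation when the verdict is negative.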

---

## Phase 2: System analysis (pin down the bottleneck)

Once the Phase 1 data is in, write the system-level analysis:

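- [ ] For the 4P+4D config: which D instance reaches 90%, and during which interval? Were the requests arriving then the large-KV ones?
- [ ] Compare 8C and 4P+4D: what does each instance's occupancy look like in 8C over the same time window? Is it lower because the prefill workload helps churn KV out?
- [ ] Wall-clock difference attribution: use system metrics to decompose the X seconds by which PD-disagg trails 8C into (queue_wait + decode + transfer + ...)
- [ ] Check the vLLM log for events along the lines of "scheduler skipped due to no available KV blocks"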
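
For the first item, the "which interval" question reduces to finding contiguous runs of samples at or above the threshold. A minimal sketch, assuming the input is a time-sorted list of `(t_unix, used_frac)` pairs for one D instance (for example `used_blocks / total_blocks` from each snapshot); the function name is hypothetical.

```python
def saturated_intervals(samples, threshold=0.90):
    """Return [(start_t, end_t)] for each contiguous run with used_frac >= threshold.

    `samples` must be sorted by time. Interval ends are the timestamps of the
    last saturated sample, so a single isolated sample yields start == end.
    """
    intervals = []
    start = None
    last_t = None
    for t, frac in samples:
        if frac >= threshold:
            if start is None:
                start = t      # a new saturated run begins here
            last_t = t
        elif start is not None:
            intervals.append((start, last_t))  # run ended at the previous sample
            start = None
    if start is not None:
        intervals.append((start, last_t))      # trace ended while still saturated
    return intervals

# Illustrative series, not measured data.
series = [(0, 0.50), (1, 0.92), (2, 0.95), (3, 0.60), (4, 0.91)]
print(saturated_intervals(series))
```

The resulting windows can then be joined against request arrival times to check whether the requests admitted during saturation were the large-KV ones.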

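Expected paragraph:
> "Under the 4P+4D config, the D pool stays above 90% occupancy from second X to second Y of the trace. During that window, requests that have finished prefill (coming from the P pool) wait Z seconds on average before entering decode, which pushes TTFT p90 from 18s on 8C to 60s. Under 8C at the same moment the KV pool only reaches 60%, because decode starts immediately after prefill with no admission queue."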

---

## Phase 3: MB4 cache-reuse × PD-routing (secondary)

Prioritize this only if H1 has not been pinned down; otherwise treat it as a supporting experiment.

- [ ] Fix 4P+4D and sweep P-routing: round-robin / session-sticky / kv-aware
- [ ] Measure: APC, P-pool per-worker hotspot, end-to-end TTFT
- [ ] Expected: round-robin loses ~23pp APC; a sticky P pool reproduces the §3.3 hot pin
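
The trade-off behind the expected result can be seen in a toy model of the two simpler policies. This is a sketch, not the proxy's implementation: the worker names, the session IDs, and the hash-based sticky rule are all hypothetical, and the kv-aware policy is not modeled.

```python
import hashlib
from collections import Counter
from itertools import cycle

P_WORKERS = ["p0", "p1", "p2", "p3"]  # hypothetical names for the 4 prefill instances

def make_round_robin(workers):
    """Round-robin: ignores the session, so consecutive turns land on different workers."""
    it = cycle(workers)
    return lambda session_id: next(it)

def session_sticky(session_id, workers=P_WORKERS):
    """Sticky: a stable hash pins every turn of a session to one worker."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]

# Skewed toy workload: one long agentic session dominates the trace.
requests = ["sess-A"] * 6 + ["sess-B", "sess-C"]

rr = make_round_robin(P_WORKERS)
print("round-robin:", Counter(rr(s) for s in requests))
print("sticky:     ", Counter(session_sticky(s) for s in requests))
```

Round-robin spreads load evenly but scatters `sess-A` across all four workers, so its prefix cache is rarely reused; the sticky rule keeps the cache warm but concentrates at least six of the eight requests on a single worker, which is the hotspot pattern the sweep is meant to measure.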

---

## Phase 4: Optional scale-out (16 GPU on dash1+dash2)

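- [ ] Start only after the Phase 1 conclusions are stable
- [ ] Configs: 16C / 8P+8D / 4×(2P+2D) (intra-node disagg + cross-node coordination)
- [ ] Extrapolate to 80 GPUs with the cost model (if available)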

---

## Timeline (rough estimate)

- Phase 0 (infrastructure): 2-3 days (survey the standard vLLM deployment + adapt the launcher + add metrics sampling)
- Phase 1 (MB5+3 sweep): 1 day to run + 1 day to analyze = 2 days
- Phase 2 (system analysis, written up as a paper section): 1-2 days
- Phase 3 (MB4): 2-3 days (optional)

Total: ~1 week to pin down H1 and write up Phase 2.

---

## Notes / open questions

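- The `bootstrap_server` AttributeError in vLLM 0.18.1 `kv_consumer` mode has to be dealt with head-on this time, because a standard PD-disagg deployment typically uses producer/consumer roles, not kv_both
- Should the D-pool occupancy timeline be added to MB3's experiment scope? I lean yes (Phase 1 covers it naturally)
- Is the cost model (task #12) still needed? That depends on whether the Phase 1 conclusions need analytical extrapolation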