MB5 analysis: per-role KV split proves static-partition mismatch

aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a PID->role map parsed from vllm_logs filenames. The cluster average hid the result: 6P+2D/4P+4D look ~45% utilized but the decode pool is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced (matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.

PD_DISAGG_RESULTS.md: consolidated writeup with figures inline. Verdict: no static P:D ratio beats 8C colocation. The binding constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D, P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner, the MB1 phase-isolation benefit is real) but loses TTFT and sheds load. Round-robin P routing also zeroes prefix-cache reuse; a session-affinity re-run of 6P+2D is in flight to test the fix.

Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization, mb5_latency_compare + mb5_summary.csv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:05:17 +08:00
parent e8980ce957
commit 8596135680
8 changed files with 424 additions and 33 deletions
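
A quick arithmetic check of the commit message's central claim, offered as a minimal sketch and not as code from `aggregate_mb5.py`. It assumes every instance contributes the same KV capacity (an assumption, not stated in the source), so the cluster average is the instance-count-weighted mean of the per-role steady utilizations reported in §5 of the results doc. The helper name `cluster_average` is hypothetical.

```python
# Assumption: each instance holds the same amount of KV capacity, so the
# cluster-wide average is weighted by the number of instances per role.
def cluster_average(n_prefill, p_util, n_decode, d_util):
    """Return the cluster-average KV utilization (%) for a P/D split."""
    total = n_prefill + n_decode
    return (n_prefill * p_util + n_decode * d_util) / total


# Steady-state per-role utilization (%) taken from the results table:
# (prefill instances, P-pool steady, decode instances, D-pool steady)
configs = {
    "6P+2D": (6, 31.0, 2, 74.0),
    "4P+4D": (4, 29.0, 4, 60.0),
}

averages = {name: cluster_average(*args) for name, args in configs.items()}

for name, avg in averages.items():
    d_util = configs[name][3]
    print(f"{name}: cluster average {avg:.2f}%, decode pool {d_util:.0f}%")
```

Under that equal-capacity assumption both configs land in the low-to-mid 40s on cluster average, in line with the ~45% quoted above, even though the decode pool's steady utilization is 60–74%. That gap is what the per-role split makes visible.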
--- a/microbench/fresh_setup/PD_DISAGG_RESULTS.md
+++ b/microbench/fresh_setup/PD_DISAGG_RESULTS.md
@@ -0,0 +1,227 @@
+# PD-disaggregation under an agentic workload — does it work?
+
+**Consolidated results doc.** Self-contained writeup of every PD-disagg
+argument and experiment, with figures inline. For the live experiment TODO
+list see [PD_DISAGG_INVESTIGATION.md](PD_DISAGG_INVESTIGATION.md).
+
+Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instruct
+· vLLM 0.18.1 (V1, chunked-prefill on) · Mooncake 0.3.11 · Trace:
+`w600_r0.0015_st30.jsonl` (1214 requests, agentic multi-turn).
+
+---
+
+## TL;DR (verdict)
+
+**No static prefill/decode split beats 8-way colocation (8C) on this agentic
+workload.** Every disaggregated ratio we tried is dominated by 8C on the
+metric the user actually feels (TTFT, end-to-end latency, request
+completion), and the failure *moves* with the ratio:
+
+- **D-heavy bottleneck** (6P+2D, 4P+4D): the decode pool saturates (peak
+  **99.6% / 97.5%**) while the prefill pool sits at **~30%** — half the
+  cluster's KV is stranded on the wrong side.
+- **P-heavy bottleneck** (2P+6D): the 2 prefill instances can't keep up,
+  the prefill pool jams at **99.7%**, **872 requests** pile up in the queue
+  and **91% of requests never complete**.
+- **8C** keeps a single elastic pool that absorbs whichever phase is hot at
+  the moment → steady utilization **34%**, **100% completion**, fastest
+  wall-clock, best p50/p90 latency.
+
+PD-disagg *does* deliver the phase-isolation win we predicted in MB1 — its
+**TPOT is 10–35× cleaner** — but that win is swamped by TTFT inflation,
+request loss, and a total collapse of prefix-cache reuse under the stock
+round-robin router.
+
+This is the empirical backing for the paper's claim: **agentic workloads
+have time-varying P:D demand that no static partition can track; colocation
+wins because its pool is elastic.** (H1 *and* H2 from the investigation doc,
+unified by one mechanism.)
+
+---
+
+## 1. Why this experiment exists
+
+Earlier cost accounting (MB1 phase-interference, MB2 KV-transfer cost) showed
+that on the **phase-isolation axis alone**, PD-disagg actually *wins*: it
+removes prefill→decode interference, and the transfer cost is small relative
+to the interference it avoids. So "PD-disagg is bad for agentic" could not be
+argued from phase isolation — we needed a system-level experiment that
+measures the whole picture (queueing, pool capacity, cache reuse), not just
+the isolated phase cost.
+
+See [analysis/mb1](../../analysis/mb1) and [analysis/mb2](../../analysis/mb2)
+for that accounting. This doc is the system-level answer.
+
+---
+
+## 2. Setup
+
+| | |
+|---|---|
+| Configs | `8C` (8× kv_both colo), `6P+2D`, `4P+4D`, `2P+6D` (prefill+decode split) |
+| PD routing | stock **round-robin** on both P and D (vLLM official `mooncake_connector_proxy`) |
+| Trace | `w600_r0.0015_st30.jsonl`, 1214 requests, agentic multi-turn |
+| Reps | 1 (rep1) for this analysis, chosen for iteration speed; the earlier 3-rep sweep showed 8C and 4P+4D are consistent run-to-run, but 6P+2D completion varied (see §7) |
+| KV instrumentation | V1 scheduler patched to dump per-request KV block allocation every 100 ms per EngineCore (see `instrument_kv_snapshot.py`) |
+
+8C is the fair baseline: 8 colocated instances, replayer round-robins across
+them directly (no proxy). PD configs route through the proxy.
+
+---
+
+## 3. Headline result — no PD ratio beats 8C
+
+All numbers are rep1.
+
+| Metric | **8C** | 6P+2D | 4P+4D | 2P+6D |
+|---|---|---|---|---|
+| **completion** | **100%** | 100% | 100% | **9%** 💀 |
+| wall-clock (drain trace) | **2994 s** | 3419 s | 4171 s | 5762 s |
+| prefix-cache hit | **19.4%** | 0% | 0% | 0% |
+| TTFT mean | **18.0 s** | 44.8 s | 70.0 s | 106.8 s |
+| TTFT p50 | **7.0 s** | 41.0 s | 56.4 s | 23.6 s |
+| TTFT p90 | **53.1 s** | 86.7 s | 153.1 s | 498 s |
+| E2E p50 | **10.8 s** | 44.5 s | 59.5 s | 26.3 s |
+| E2E p90 | **83.3 s** | 91.8 s | 157.1 s | 499 s |
+
+![e2e latency by config](../../figs/mb5/mb5_latency_compare.png)
+
+> ⚠️ **Read the percentiles with the completion rate.** Latency percentiles
+> are computed over *successful* requests only. 2P+6D's "p99 = 577 s" covers
+> just the 9% that finished — the other 91% never returned, so its real
+> experience is far worse than any latency bar suggests.
+
+8C wins E2E p50 by about **4×** over 6P+2D, the best PD config (10.8 s vs 44.5 s),
+and wins p90 against every PD config. The only metric where PD edges 8C is E2E
+**p99** (6P+2D 148 s vs 8C 194 s), and that is the flip side of the next result.
+
+---
+
+## 4. The duality — PD wins TPOT, loses TTFT
+
+PD-disagg delivers exactly the phase-isolation benefit MB1 predicted: with no
+prefill stealing decode steps, **inter-token latency is dramatically cleaner.**
+
+| TPOT | **8C** | 6P+2D | 4P+4D | 2P+6D |
+|---|---|---|---|---|
+| mean | 87 ms | 11 ms | 9 ms | 6 ms |
+| p90 | 230 ms | 18 ms | 14 ms | 8 ms |
+| p99 | **1129 ms** | **26 ms** | **20 ms** | **12 ms** |
+
+PD's TPOT is **10–35× lower** through p90 and lower still at p99 (1129 ms vs
+12–26 ms): once on a dedicated decode instance, a request streams without
+interruption. 8C's 1.1 s TPOT p99 *is* the chunked-prefill interference tax
+(decode steps occasionally stalled behind an 8k-token prefill chunk), as in MB1.
+
+**But the win is local.** Mean TTFT inflates 2.5–6× because every request now pays
+P→D handoff + admission into a smaller, saturated decode pool. For this
+workload's modest output lengths, TTFT dominates total time, so the TPOT win
+never pays for itself. This is the cost/benefit imbalance made concrete:
+phase isolation is real, but it is the wrong thing to optimize when the pool
+is the binding constraint.
+
+---
+
+## 5. Root cause — per-role KV pool occupancy (the kill shot)
+
+The cluster-average KV utilization is *misleading* and nearly hid the result:
+
+![cluster KV timeline](../../figs/mb5/mb5_kv_timeline.png)
+
+6P+2D and 4P+4D look only ~42–46% utilized on cluster average — yet they have
+128–152 requests queued. The average hides that **one pool is pegged while
+the other idles.** Splitting the KV pool by role exposes it:
+
+![per-role KV pool: P-pool vs D-pool](../../figs/mb5/mb5_role_split.png)
+
+| Config | P-pool steady | D-pool steady | D-pool **peak** | binding side |
+|---|---|---|---|---|
+| 8C | — single shared pool — | 34% | 72% | none (elastic) |
+| 6P+2D | 31% | **74%** | **99.6%** | **decode** |
+| 4P+4D | 29% | **60%** | **97.5%** | **decode** |
+| 2P+6D | **92%** | 95% | 96% | **prefill** (P jams first) |
+
+![peak vs steady utilization](../../figs/mb5/mb5_peak_utilization.png)
+
+**The mechanism, unified:**
+
+- A static P:D split fixes the KV capacity on each side at deploy time.
+- The agentic workload's instantaneous P:D demand *drifts* (bursts of new
+  sessions = prefill-heavy; long tool-call-driven turns = decode-heavy).
+- Whichever side is undersized *for the current phase* saturates and
+  back-pressures the whole pipeline, while the other side's KV sits stranded.
+  - 6P+2D / 4P+4D → decode side too small → D-pool hits ~100%, prefilled
+    requests queue for a decode slot → TTFT explodes (this is **H1**).
+  - 2P+6D → prefill side too small → P-pool hits ~100%, requests can't even
+    start → 872 queued, 91% dropped.
+- **8C colocation has no partition**: prefill and decode share one pool, so
+  the pool elastically reallocates to whichever phase is hot. Steady
+  utilization stays at 34% with 100% completion.
+
+This is **H1 (D-pool capacity ceiling)** and **H2 (static-partition
+mismatch)** turning out to be the *same* phenomenon seen from two ratios.
+
+---
+
+## 6. The routing handicap — and whether smarter routing rescues PD
+
+Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That
+is not fundamental to disaggregation — it is the stock proxy round-robining
+the **prefill** side: consecutive turns of one agentic session land on
+*different* producers, so each turn re-prefills the whole conversation from
+scratch. That both inflates TTFT and piles extra load on the prefill pool
+(directly worsening the 2P+6D collapse).
+
+The correct PD scheduling policy (as the design argues): **P should be chosen
+by session affinity** (reuse the producer's prefix cache) while **D is chosen
+by load balance** (decode KV is freshly transferred per turn, so D gains
+nothing from affinity). We added this as an env-gated mode in the proxy
+(`MB5_P_ROUTING=session`, consistent hash on `X-Session-Id`; D stays
+round-robin) and re-ran the best-performing disaggregated config, **6P+2D**.
+
+> **Status: session-affinity 6P+2D run in progress.** Results below will be
+> filled in when it completes; the question it answers is *how much of the
+> gap to 8C does restoring prefix-cache reuse close.*
+
+<!-- SESSION_AFFINITY_RESULTS -->
+*(pending)*
+
+---
+
+## 7. Caveats / honesty
+
+- **Single rep** for this analysis. The earlier 3-rep sweep showed 8C and
+  4P+4D are tight run-to-run, but 6P+2D completion varied (rep1 100% vs rep2
+  56% vs rep3 80%) — i.e. the D-pool sits right at the cliff edge, so 6P+2D's
+  "100% rep1" is optimistic. The qualitative ranking is robust; exact numbers
+  on the marginal configs are not.
+- **Latency percentiles count successes only** (see §3 warning). For failing
+  configs the latency bars *understate* the damage.
+- **Round-robin baseline.** §6 addresses the routing fairness concern head-on
+  with a session-affinity re-run.
+- Trace is a single agentic workload; conclusions are about *this* class of
+  workload (sub-second tool-call cadence, multi-turn sessions), not all LLM
+  serving.
+
+---
+
+## 8. Reproduce
+
+```bash
+# from repo root, after microbench/fresh_setup/deploy.sh dash1
+# 1. round-robin baseline sweep (1 rep)
+ssh dash1 'CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=1 RUN_TAG=<tag> \
+    bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb5_run.sh'
+
+# 2. reduce on dash1 (numpy-only; handles the multi-GB snapshot dirs)
+ssh dash1 '.venv/bin/python scripts/aggregate_mb5.py --sweep-root mb5_runs \
+    --tag <tag> --configs "8C 6P+2D 4P+4D 2P+6D" --reps 1 \
+    --reduce-to mb5_runs/reduced_<tag>.json'
+
+# 3. pull the compact JSON, render figures locally
+scp dash1:.../mb5_runs/reduced_<tag>.json analysis/mb5/
+.venv/bin/python microbench/fresh_setup/aggregate_mb5.py \
+    --from-reduced analysis/mb5/reduced_<tag>.json --out-dir figs/mb5
+
+# session-affinity arm: prefix the run with MB5_P_ROUTING=session
+```
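
The routing policy argued for in §6 (prefill chosen by session affinity, decode chosen by load balance) could look roughly like the sketch below. This is not the proxy's code: `PDRouter`, its methods, and the host names are hypothetical. It also uses a stable hash modulo the producer count as a simpler stand-in for the consistent hash the doc mentions; a real consistent-hash ring would additionally limit remapping when producers are added or removed.

```python
import hashlib
from itertools import count


class PDRouter:
    """Hypothetical router: session-affine prefill, round-robin decode."""

    def __init__(self, prefill_hosts, decode_hosts):
        self.prefill_hosts = list(prefill_hosts)
        self.decode_hosts = list(decode_hosts)
        self._decode_counter = count()

    def pick_prefill(self, session_id: str) -> str:
        # A stable digest (unlike Python's salted built-in hash()) keeps a
        # session on the same producer across processes and restarts, so
        # later turns can reuse that producer's prefix cache.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.prefill_hosts)
        return self.prefill_hosts[index]

    def pick_decode(self) -> str:
        # Decode KV is transferred fresh each turn, so affinity buys nothing;
        # plain round-robin spreads load across the decode pool.
        index = next(self._decode_counter) % len(self.decode_hosts)
        return self.decode_hosts[index]


# Illustrative 6P+2D layout with made-up host names.
router = PDRouter(
    prefill_hosts=[f"prefill-{i}" for i in range(6)],
    decode_hosts=[f"decode-{i}" for i in range(2)],
)

first_turn = router.pick_prefill("session-abc")
second_turn = router.pick_prefill("session-abc")
decode_sequence = [router.pick_decode() for _ in range(4)]
print(first_turn == second_turn, decode_sequence)
```

Both turns of `session-abc` resolve to the same producer, while the decode picks alternate between the two decode hosts. Hash-based affinity does not balance load by itself, so a few long sessions can still skew the prefill pool.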