agentic-kvc/microbench/fresh_setup/PD_DISAGG_RESULTS.md
Gahow Wang 8596135680 MB5 analysis: per-role KV split proves static-partition mismatch
aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a
  PID->role map parsed from vllm_logs filenames. The cluster average
  hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool
  is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving
  host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced
  (matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.

PD_DISAGG_RESULTS.md: consolidated writeup with figures inline.
Verdict: no static P:D ratio beats 8C colocation. The binding
constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D,
P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays
elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner,
the MB1 phase-isolation benefit is real) but loses TTFT and sheds
load. Round-robin P routing also zeroes prefix-cache reuse; a
session-affinity re-run of 6P+2D is in flight to test the fix.

Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization,
mb5_latency_compare + mb5_summary.csv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 12:05:17 +08:00


PD-disaggregation under an agentic workload — does it work?

Consolidated results doc. Self-contained writeup of every PD-disagg argument and experiment, with figures inline. For the live experiment TODO list see PD_DISAGG_INVESTIGATION.md.

Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instruct · vLLM 0.18.1 (V1, chunked-prefill on) · Mooncake 0.3.11 · Trace: w600_r0.0015_st30.jsonl (1214 requests, agentic multi-turn).


TL;DR (verdict)

No static prefill/decode split beats 8-way colocation (8C) on this agentic workload. Every disaggregated ratio we tried is dominated by 8C on the metric the user actually feels (TTFT, end-to-end latency, request completion), and the failure moves with the ratio:

  • Decode-side bottleneck (6P+2D, 4P+4D): the decode pool saturates (peak 99.6% / 97.5%) while the prefill pool sits at ~30%, leaving roughly 70% of the prefill pool's KV stranded on the wrong side.
  • Prefill-side bottleneck (2P+6D): the 2 prefill instances can't keep up, the prefill pool jams at 99.7%, 872 requests pile up in the queue, and 91% of requests never complete.
  • 8C keeps a single elastic pool that absorbs whichever phase is hot at the moment → steady utilization 34%, 100% completion, fastest wall-clock, best p50/p90 latency.

PD-disagg does deliver the phase-isolation win we predicted in MB1 (its TPOT is 10-35× cleaner), but that win is swamped by TTFT inflation, request loss, and a total collapse of prefix-cache reuse under the stock round-robin router.

This is the empirical backing for the paper's claim: agentic workloads have time-varying P:D demand that no static partition can track; colocation wins because its pool is elastic. (H1 and H2 from the investigation doc, unified by one mechanism.)


1. Why this experiment exists

Earlier cost accounting (MB1 phase-interference, MB2 KV-transfer cost) showed that on the phase-isolation axis alone, PD-disagg actually wins: it removes prefill→decode interference, and the transfer cost is small relative to the interference it avoids. So "PD-disagg is bad for agentic" could not be argued from phase isolation — we needed a system-level experiment that measures the whole picture (queueing, pool capacity, cache reuse), not just the isolated phase cost.

See analysis/mb1 and analysis/mb2 for that accounting. This doc is the system-level answer.


2. Setup

| Item | Value |
|---|---|
| Configs | 8C (8× kv_both colo), 6P+2D, 4P+4D, 2P+6D (prefill+decode split) |
| PD routing | stock round-robin on both P and D (vLLM official mooncake_connector_proxy) |
| Trace | w600_r0.0015_st30.jsonl, 1214 requests, agentic multi-turn |
| Reps | 1 (rep1) for this analysis; an earlier 3-rep sweep showed 8C and 4P+4D are consistent run to run, while 6P+2D completion varied (see §7); we use rep1 for iteration speed |
| KV instrumentation | V1 scheduler patched to dump per-request KV block allocation every 100 ms per EngineCore (see instrument_kv_snapshot.py) |

8C is the fair baseline: 8 colocated instances, replayer round-robins across them directly (no proxy). PD configs route through the proxy.


3. Headline result — no PD ratio beats 8C

All numbers are rep1.

| Metric | 8C | 6P+2D | 4P+4D | 2P+6D |
|---|---|---|---|---|
| completion | 100% | 100% | 100% | 9% 💀 |
| wall-clock (drain trace) | 2994 s | 3419 s | 4171 s | 5762 s |
| prefix-cache hit | 19.4% | 0% | 0% | 0% |
| TTFT mean | 18.0 s | 44.8 s | 70.0 s | 106.8 s |
| TTFT p50 | 7.0 s | 41.0 s | 56.4 s | 23.6 s |
| TTFT p90 | 53.1 s | 86.7 s | 153.1 s | 498 s |
| E2E p50 | 10.8 s | 44.5 s | 59.5 s | 26.3 s |
| E2E p90 | 83.3 s | 91.8 s | 157.1 s | 499 s |

[Figure: e2e latency by config]

⚠️ Read the percentiles with the completion rate. Latency percentiles are computed over successful requests only. 2P+6D's "p99 = 577 s" covers just the 9% that finished — the other 91% never returned, so its real experience is far worse than any latency bar suggests.
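One way to make the percentiles comparable across configs is to count every unfinished request as unbounded latency. The sketch below is illustrative only: `completion_aware_percentile` and its nearest-rank method are assumptions made for this example, not what aggregate_mb5.py computes.

```python
import math


def completion_aware_percentile(latencies_s, total_requests, q):
    """Nearest-rank percentile over ALL requests.

    latencies_s    : latencies of the requests that finished
    total_requests : number of requests issued (finished + unfinished)
    q              : percentile in (0, 100]
    Unfinished requests are padded in as +inf, so a config that drops
    most of its load cannot report a flattering p50.
    """
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    if total_requests <= 0 or total_requests < len(latencies_s):
        raise ValueError("total_requests must cover every finished request")
    missing = total_requests - len(latencies_s)
    padded = sorted(latencies_s) + [math.inf] * missing
    rank = math.ceil(q * total_requests / 100)  # 1-based nearest rank
    return padded[rank - 1]


# Hypothetical numbers: 3 of 10 requests finished.
finished = [20.0, 25.0, 30.0]
print(completion_aware_percentile(finished, 10, 20))
print(completion_aware_percentile(finished, 10, 50))
```

With only 3 of 10 requests finished, any percentile above the 30th is infinite under this definition, which is the point: a success-only p50 would instead report 25 s.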

8C wins p50 by 4× and p90 decisively. The only metric where a PD config edges 8C is E2E p99 (6P+2D 148 s vs 8C 194 s) — and that is the flip side of the next result.


4. The duality — PD wins TPOT, loses TTFT

PD-disagg delivers exactly the phase-isolation benefit MB1 predicted: with no prefill stealing decode steps, inter-token latency is dramatically cleaner.

| TPOT | 8C | 6P+2D | 4P+4D | 2P+6D |
|---|---|---|---|---|
| mean | 87 ms | 11 ms | 9 ms | 6 ms |
| p90 | 230 ms | 18 ms | 14 ms | 8 ms |
| p99 | 1129 ms | 26 ms | 20 ms | 12 ms |

PD's TPOT is roughly 10-35× lower, and the gap is wider still at p99 (1129 ms for 8C against 12-26 ms for the PD configs): once a request reaches a dedicated decode instance it streams without interruption. 8C's 1.1 s TPOT p99 is the chunked-prefill interference tax (decode steps occasionally stalled behind an 8k-token prefill chunk), consistent with MB1.

But the win is local. Mean TTFT inflates 2.5-6× because every request now pays the P→D handoff plus admission into a smaller, saturated decode pool. For this workload's modest output lengths, TTFT dominates total time, so the TPOT win never pays for itself. This is the cost/benefit imbalance made concrete: phase isolation is real, but it is the wrong thing to optimize when the pool is the binding constraint.
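The two sides of the trade can be checked directly against the rep1 tables. The snippet below is a small arithmetic sketch; the helper `ratio_vs_8c` is a name introduced here, and the inputs are copied from the TTFT mean row in §3 and the TPOT table above.

```python
# Values copied from the rep1 tables (TTFT mean in seconds, TPOT in ms).
ttft_mean_s = {"8C": 18.0, "6P+2D": 44.8, "4P+4D": 70.0, "2P+6D": 106.8}
tpot_ms = {
    "mean": {"8C": 87, "6P+2D": 11, "4P+4D": 9, "2P+6D": 6},
    "p90": {"8C": 230, "6P+2D": 18, "4P+4D": 14, "2P+6D": 8},
    "p99": {"8C": 1129, "6P+2D": 26, "4P+4D": 20, "2P+6D": 12},
}


def ratio_vs_8c(row, invert=False):
    """Ratio of each PD config to 8C (or 8C to each PD config if invert)."""
    base = row["8C"]
    return {
        cfg: (base / value if invert else value / base)
        for cfg, value in row.items()
        if cfg != "8C"
    }


# How much worse PD is on TTFT (PD / 8C).
ttft_inflation = ratio_vs_8c(ttft_mean_s)
# How much better PD is on TPOT (8C / PD), per statistic.
tpot_gain = {stat: ratio_vs_8c(row, invert=True) for stat, row in tpot_ms.items()}

for cfg, factor in ttft_inflation.items():
    print(f"{cfg}: TTFT mean x{factor:.1f} worse than 8C")
for stat, gains in tpot_gain.items():
    for cfg, factor in gains.items():
        print(f"{cfg}: TPOT {stat} x{factor:.1f} better than 8C")
```

The TTFT factors come out near 2.5, 3.9 and 5.9 for 6P+2D, 4P+4D and 2P+6D, so the latency the user waits on gets worse in exactly the configs where per-token latency looks best.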


5. Root cause — per-role KV pool occupancy (the kill shot)

The cluster-average KV utilization is misleading and nearly hid the result:

[Figure: cluster KV timeline]

6P+2D and 4P+4D look only ~42-46% utilized on cluster average, yet they have 128-152 requests queued. The average hides that one pool is pegged while the other idles. Splitting the KV pool by role exposes it:

[Figure: per-role KV pool: P-pool vs D-pool]

| Config | P-pool steady | D-pool steady | D-pool peak | Binding side |
|---|---|---|---|---|
| 8C | (single shared pool) | 34% (shared) | 72% (shared) | none (elastic) |
| 6P+2D | 31% | 74% | 99.6% | decode |
| 4P+4D | 29% | 60% | 97.5% | decode |
| 2P+6D | 92% | 95% | 96% | prefill (P jams first) |
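The way a cluster average masks a pegged pool can be reproduced with a few lines. This is a minimal sketch, assuming every instance has the same KV capacity and using a hypothetical PID-to-role map with made-up PIDs 0-7; the real map is parsed from the vllm_logs filenames, whose format is not shown here.

```python
from collections import defaultdict


def split_by_role(util_by_pid, role_by_pid):
    """Return (cluster_avg, {role: avg}) for per-instance KV utilization.

    Assumes equal KV capacity per instance, so a plain mean is the
    capacity-weighted mean.
    """
    by_role = defaultdict(list)
    for pid, util in util_by_pid.items():
        by_role[role_by_pid[pid]].append(util)
    cluster_avg = sum(util_by_pid.values()) / len(util_by_pid)
    role_avg = {role: sum(vals) / len(vals) for role, vals in by_role.items()}
    return cluster_avg, role_avg


# 6P+2D with the steady-state figures from the table: P at 31%, D at 74%.
role_by_pid = {pid: ("P" if pid < 6 else "D") for pid in range(8)}
util_by_pid = {
    pid: (31.0 if role == "P" else 74.0) for pid, role in role_by_pid.items()
}

cluster_avg, role_avg = split_by_role(util_by_pid, role_by_pid)
print(f"cluster average: {cluster_avg:.2f}%")
print(f"per role: {role_avg}")
```

Six instances at 31% and two at 74% average to 41.75%, in line with the ~42% cluster figure for 6P+2D; the same arithmetic for 4P+4D (29% and 60%) gives 44.5%.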

[Figure: peak vs steady utilization]

The mechanism, unified:

  • A static P:D split fixes the KV capacity on each side at deploy time.
  • The agentic workload's instantaneous P:D demand drifts (bursts of new sessions = prefill-heavy; long tool-call-driven turns = decode-heavy).
  • Whichever side is undersized for the current phase saturates and back-pressures the whole pipeline, while the other side's KV sits stranded.
    • 6P+2D / 4P+4D → decode side too small → D-pool hits ~100%, prefilled requests queue for a decode slot → TTFT explodes (this is H1).
    • 2P+6D → prefill side too small → P-pool hits ~100%, requests can't even start → 872 queued, 91% dropped.
  • 8C colocation has no partition: prefill and decode share one pool, so the pool elastically reallocates to whichever phase is hot. Steady utilization stays at 34% with 100% completion.

This is H1 (D-pool capacity ceiling) and H2 (static-partition mismatch) turning out to be the same phenomenon seen from two ratios.
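The mechanism can be illustrated with a toy capacity model. Everything in it is an assumption made for illustration: the demand trace is invented, capacity is counted in abstract "KV units" (one per instance), and queueing, transfer cost and cache effects are ignored. It only shows why a fixed split strands capacity when the P:D mix drifts.

```python
def served_static(p_demand, d_demand, p_cap, d_cap):
    """Static split: each phase can only use its own pool."""
    return min(p_demand, p_cap) + min(d_demand, d_cap)


def served_shared(p_demand, d_demand, total_cap):
    """Colocation: both phases draw from one pool."""
    return min(p_demand + d_demand, total_cap)


# Hypothetical (prefill, decode) demand per time step, in KV units.
# It drifts from prefill-heavy (new sessions) to decode-heavy (long turns).
demand = [(6, 2), (5, 3), (2, 6), (1, 7)]
TOTAL_CAP = 8

results = {}
for name, (p_cap, d_cap) in {
    "6P+2D": (6, 2),
    "4P+4D": (4, 4),
    "2P+6D": (2, 6),
}.items():
    results[name] = sum(served_static(p, d, p_cap, d_cap) for p, d in demand)
results["8C"] = sum(served_shared(p, d, TOTAL_CAP) for p, d in demand)

offered = sum(p + d for p, d in demand)
for name, served in results.items():
    print(f"{name}: served {served} of {offered} units")
```

In this toy trace total demand never exceeds total capacity, so the shared pool serves everything, while each static split leaves demand unserved in the steps where the mix does not match its ratio.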


6. The routing handicap — and whether smarter routing rescues PD

Every PD config above shows prefix-cache hit = 0%, versus 8C's 19%. That is not fundamental to disaggregation — it is the stock proxy round-robining the prefill side: consecutive turns of one agentic session land on different producers, so each turn re-prefills the whole conversation from scratch. That both inflates TTFT and piles extra load on the prefill pool (directly worsening the 2P+6D collapse).

The correct PD scheduling policy (as the design argues): P should be chosen by session affinity (reuse the producer's prefix cache) while D is chosen by load balance (decode KV is freshly transferred per turn, so D gains nothing from affinity). We added this as an env-gated mode in the proxy (MB5_P_ROUTING=session, consistent hash on X-Session-Id; D stays round-robin) and are re-running the best-performing disaggregated config, 6P+2D.
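The policy described above (session affinity for P, load balance for D) can be sketched as follows. This is not the proxy's code: `PrefillRing`, `Router`, the virtual-node count and the use of SHA-256 are all assumptions chosen for the example; only the idea of a consistent hash on the session id for P and round-robin for D comes from the text.

```python
import bisect
import hashlib
import itertools


class PrefillRing:
    """Consistent-hash ring: a session id always maps to the same producer."""

    def __init__(self, producers, vnodes=64):
        # Place several virtual nodes per producer to even out the ring.
        self._ring = sorted(
            (self._hash(f"{producer}#{i}"), producer)
            for producer in producers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def pick(self, session_id):
        # First ring position clockwise from the session's hash (wraps around).
        idx = bisect.bisect_right(self._keys, self._hash(session_id))
        return self._ring[idx % len(self._ring)][1]


class Router:
    """P by session affinity, D by round-robin."""

    def __init__(self, producers, decoders):
        self._ring = PrefillRing(producers)
        self._decoders = itertools.cycle(decoders)

    def route(self, session_id):
        return self._ring.pick(session_id), next(self._decoders)


router = Router([f"P{i}" for i in range(6)], ["D0", "D1"])
first = router.route("session-42")
second = router.route("session-42")
print(first, second)
```

Consecutive turns of one session land on the same producer, so its prefix cache can be reused, while the decode target keeps alternating. A ring is used instead of `hash % N` so that adding or removing a producer remaps only part of the sessions.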

Status: session-affinity 6P+2D run in progress. Results below will be filled in when it completes; the question it answers is how much of the gap to 8C restoring prefix-cache reuse closes.

(pending)


7. Caveats / honesty

  • Single rep for this analysis. The earlier 3-rep sweep showed 8C and 4P+4D are tight run-to-run, but 6P+2D completion varied (rep1 100% vs rep2 56% vs rep3 80%) — i.e. the D-pool sits right at the cliff edge, so 6P+2D's "100% rep1" is optimistic. The qualitative ranking is robust; exact numbers on the marginal configs are not.
  • Latency percentiles count successes only (see §3 warning). For failing configs the latency bars understate the damage.
  • Round-robin baseline. §6 addresses the routing fairness concern head-on with a session-affinity re-run.
  • Trace is a single agentic workload; conclusions are about this class of workload (sub-second tool-call cadence, multi-turn sessions), not all LLM serving.

8. Reproduce

# from repo root, after microbench/fresh_setup/deploy.sh dash1
# 1. round-robin baseline sweep (1 rep)
ssh dash1 'CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=1 RUN_TAG=<tag> \
    bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb5_run.sh'

# 2. reduce on dash1 (numpy-only; handles the multi-GB snapshot dirs)
ssh dash1 '.venv/bin/python scripts/aggregate_mb5.py --sweep-root mb5_runs \
    --tag <tag> --configs "8C 6P+2D 4P+4D 2P+6D" --reps 1 \
    --reduce-to mb5_runs/reduced_<tag>.json'

# 3. pull the compact JSON, render figures locally
scp dash1:.../mb5_runs/reduced_<tag>.json analysis/mb5/
.venv/bin/python microbench/fresh_setup/aggregate_mb5.py \
    --from-reduced analysis/mb5/reduced_<tag>.json --out-dir figs/mb5

# session-affinity arm: prefix the run with MB5_P_ROUTING=session
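Steps 2 and 3 rely on a two-stage split: a reduce pass that runs on the serving host without matplotlib, and a render pass that runs locally from the compact JSON. A minimal sketch of that pattern follows. The flag names `--reduce-to` and `--from-reduced` come from the commands above; the JSON layout, the `DEMO_SAMPLES` data, the `--out` flag and the function names are hypothetical and do not describe aggregate_mb5.py's actual schema.

```python
import argparse
import json

# Hypothetical stand-in for per-config KV utilization samples.
DEMO_SAMPLES = {"8C": [0.20, 0.50, 0.30], "6P+2D": [0.40, 0.45, 0.42]}


def reduce_snapshots(samples):
    """Collapse raw samples into a compact, JSON-serializable summary."""
    return {
        cfg: {"peak": max(vals), "mean": sum(vals) / len(vals)}
        for cfg, vals in samples.items()
    }


def render(reduced, out_path):
    """Plot peak utilization per config; needs matplotlib."""
    # Lazy import: the reduce path never touches matplotlib.
    import matplotlib

    matplotlib.use("Agg")  # headless backend, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    configs = list(reduced)
    ax.bar(configs, [reduced[c]["peak"] for c in configs])
    ax.set_ylabel("peak KV utilization")
    fig.savefig(out_path)


def main(argv=None):
    parser = argparse.ArgumentParser()
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--reduce-to", help="write compact JSON here")
    mode.add_argument("--from-reduced", help="read compact JSON from here")
    parser.add_argument("--out", default="peak.png")
    args = parser.parse_args(argv)

    if args.reduce_to:
        with open(args.reduce_to, "w") as fh:
            json.dump(reduce_snapshots(DEMO_SAMPLES), fh)
    else:
        with open(args.from_reduced) as fh:
            render(json.load(fh), args.out)
```

Calling `main(["--reduce-to", "reduced.json"])` exercises only the reduce path, so it works on a host without matplotlib installed; `main(["--from-reduced", "reduced.json"])` is the half that needs the plotting stack.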