# PD-disaggregation under an agentic workload — does it work?

**Consolidated results doc.** Self-contained writeup of every PD-disagg argument and experiment, with figures inline. For the live experiment TODO list see [PD_DISAGG_INVESTIGATION.md](PD_DISAGG_INVESTIGATION.md).

Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instruct · vLLM 0.18.1 (V1, chunked-prefill on) · Mooncake 0.3.11 · Trace: `w600_r0.0015_st30.jsonl` (1214 requests, agentic multi-turn).

---

## ⚠️ CORRECTION (2026-05-30) — read before §6

A contamination was found in the "fresh" vLLM used for the runs below. `scripts/deploy_vllm_patches.sh` had copied our fork commit **`e13391e`** over the pip-installed release; that commit calls `evict_blocks(sent_block_ids)` on `finished_sending`, i.e. it **evicts a producer's prefix-cache blocks on every KV transfer**. So a disaggregated producer could never keep a session's prefix warm, *regardless of routing*. We have since gated that behind `VLLM_EVICT_SENT_BLOCKS` (default off) and re-run everything on the corrected stack.

**Retracted (was a pure artifact of `e13391e`):**

- **All of §6** ("smarter routing does not save PD" / "session-affinity is *strictly worse*" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit ~0.2%"). On the corrected stack, **session-affinity recovers producer reuse to full parity with colo (APC 71–82%)** — the collapse was the eviction bug starving the very cache session-affinity exists to fill, not a routing pathology.
- The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix *exactly as well as colo* once routing is session-sticky.

**Still stands (independent of `e13391e`):**

- **§3 round-robin** numbers — RR sends consecutive turns to *different* producers, so its ~0% prefix-hit is a **routing** artifact (not the eviction bug) and is reproduced on the clean stack; RR PD still loses to 8C.
- **§4** PD wins TPOT (decode isolation) — robust.
- **§5.1** the consumer counter-underflow crash — a real, separate vLLM 0.18.1 bug.
- **§5** the D-pool capacity-ceiling mechanism (decode side pegs while prefill strands) — real.

**Corrected verdict (the real reason PD loses on agentic).** It is *not* "routing can't help." On the clean stack PD is **regime-dependent**: it *wins* at low load / decode-heavy / low-reuse, and loses the **agentic corner** (high reuse + short output + large context + high concurrency) through a structural crossover — its static P:D split cannot simultaneously provide the prefix-cache capacity (needs many producers) *and* the decode capacity (needs many decoders) that agentic demands at once, while colo's elastic pool provides both. See the three-axis ablation: **reuse** erodes the edge (1.57×→1.10×), **shape** rotates the best ratio and is catastrophic at the prefill extreme, and **concurrency** tips PD at N=64 (APC craters 71%→1.4%, TPS −30%) while colo scales cleanly.

→ Figures: [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/) · data: [`analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/) · the clean re-run of *this exact* w600 experiment (ratio-swept) is the Fig 4 anchor.

---

## TL;DR (verdict)

**No static prefill/decode split beats 8-way colocation (8C) on this agentic workload.** Every disaggregated ratio we tried is dominated by 8C on the metric the user actually feels (TTFT, end-to-end latency, request completion), and the failure *moves* with the ratio:

- **D-heavy bottleneck** (6P+2D, 4P+4D): the decode pool saturates (peak **99.6% / 97.5%**) while the prefill pool sits at **~30%** — half the cluster's KV is stranded on the wrong side.
- **P-heavy bottleneck** (2P+6D): the 2 prefill instances can't keep up, the prefill pool jams at **99.7%**, **872 requests** pile up in the queue and **91% of requests never complete**.
- **8C** keeps a single elastic pool that absorbs whichever phase is hot at the moment → steady utilization **34%**, **100% completion**, fastest wall-clock, best p50/p90 latency.

PD-disagg *does* deliver the phase-isolation win we predicted in MB1 — its **TPOT p99 is 43–94× lower** — but that win is swamped by TTFT inflation and request loss.

**Smarter routing does not save it (§6).** We added the "correct" PD policy — session-affinity on the prefill side to recover prefix-cache reuse, load-balance on decode — and swept it across all four ratios. It is *strictly worse* than round-robin at every ratio (4P+4D: 100% → 36% completion), success *decreases* as you add decode capacity (59→36→24→19%), and the GPUs sit at **~0% utilization** — the cluster stalls on KV-transfer coordination, not compute. Session-affinity reproduces the producer **hot-pinning** pathology from §3.3.

This is the empirical backing for the paper's claim: **agentic workloads have time-varying P:D demand that no static partition can track; colocation wins because its pool is elastic — and no routing knob rescues the static split.** (H1 *and* H2 from the investigation doc, unified by one mechanism.)

---

## 1. Why this experiment exists

Earlier cost accounting (MB1 phase-interference, MB2 KV-transfer cost) showed that on the **phase-isolation axis alone**, PD-disagg actually *wins*: it removes prefill→decode interference, and the transfer cost is small relative to the interference it avoids. So "PD-disagg is bad for agentic" could not be argued from phase isolation — we needed a system-level experiment that measures the whole picture (queueing, pool capacity, cache reuse), not just the isolated phase cost. See [analysis/mb1](../../analysis/mb1) and [analysis/mb2](../../analysis/mb2) for that accounting. This doc is the system-level answer.

---

## 2. Setup

| Parameter | Value |
|---|---|
| Configs | `8C` (8× kv_both colo), `6P+2D`, `4P+4D`, `2P+6D` (prefill+decode split) |
| PD routing | stock **round-robin** on both P and D (vLLM official `mooncake_connector_proxy`) |
| Trace | `w600_r0.0015_st30.jsonl`, 1214 requests, agentic multi-turn |
| Reps | 1 (rep1) for this analysis; the 3-rep sweep confirmed run-to-run consistency before we converged on rep1 for iteration speed |
| KV instrumentation | V1 scheduler patched to dump per-request KV block allocation every 100 ms per EngineCore (see `instrument_kv_snapshot.py`) |

8C is the fair baseline: 8 colocated instances, replayer round-robins across them directly (no proxy). PD configs route through the proxy.

---

## 3. Headline result — no PD ratio beats 8C

All numbers are rep1.

| Metric | **8C** | 6P+2D | 4P+4D | 2P+6D |
|---|---|---|---|---|
| **completion** | **100%** | 100% | 100% | **9%** 💀 |
| wall-clock (drain trace) | **2994 s** | 3419 s | 4171 s | 5762 s |
| prefix-cache hit | **19.4%** | 0% | 0% | 0% |
| TTFT mean | **18.0 s** | 44.8 s | 70.0 s | 106.8 s |
| TTFT p50 | **7.0 s** | 41.0 s | 56.4 s | 23.6 s |
| TTFT p90 | **53.1 s** | 86.7 s | 153.1 s | 498 s |
| E2E p50 | **10.8 s** | 44.5 s | 59.5 s | 26.3 s |
| E2E p90 | **83.3 s** | 91.8 s | 157.1 s | 499 s |

![e2e latency by config](../../figs/mb5/mb5_latency_compare.png)

> ⚠️ **Read the percentiles with the completion rate.** Latency percentiles
> are computed over *successful* requests only. 2P+6D's "p99 = 577 s" covers
> just the 9% that finished — the other 91% never returned, so its real
> experience is far worse than any latency bar suggests.

8C wins p50 by **4×** and p90 decisively. The only metric where a PD config edges 8C is E2E **p99** (6P+2D 148 s vs 8C 194 s) — and that is the flip side of the next result.

---

## 4. The duality — PD wins TPOT, loses TTFT

PD-disagg delivers exactly the phase-isolation benefit MB1 predicted: with no prefill stealing decode steps, **inter-token latency is dramatically cleaner.**

| TPOT | **8C** | 6P+2D | 4P+4D | 2P+6D |
|---|---|---|---|---|
| mean | 87 ms | 11 ms | 9 ms | 6 ms |
| p90 | 230 ms | 18 ms | 14 ms | 8 ms |
| p99 | **1129 ms** | **26 ms** | **20 ms** | **12 ms** |

PD's TPOT p99 is **43–94× lower** (1129 ms vs 12–26 ms) — once a request reaches a dedicated decode instance it streams without interruption. 8C's 1.1 s TPOT p99 *is* the chunked-prefill interference tax (decode steps occasionally stalled behind an 8k-token prefill chunk), consistent with MB1.

**But the win is local.** TTFT inflates 2.5–6× because every request now pays P→D handoff + admission into a smaller, saturated decode pool. For this workload's modest output lengths, TTFT dominates total time, so the TPOT win never pays for itself. This is the cost/benefit imbalance made concrete: phase isolation is real, but it is the wrong thing to optimize when the pool is the binding constraint.

---

## 5. Root cause — per-role KV pool occupancy (the kill shot)

The cluster-average KV utilization is *misleading* and nearly hid the result:

![cluster KV timeline](../../figs/mb5/mb5_kv_timeline.png)

6P+2D and 4P+4D look only ~42–46% utilized on cluster average — yet they have 128–152 requests queued.
The average hides that **one pool is pegged while the other idles.** Splitting the KV pool by role exposes it:

![per-role KV pool: P-pool vs D-pool](../../figs/mb5/mb5_role_split.png)

| Config | P-pool steady | D-pool steady | D-pool **peak** | binding side |
|---|---|---|---|---|
| 8C | — single shared pool — | 34% | 72% | none (elastic) |
| 6P+2D | 31% | **74%** | **99.6%** | **decode** |
| 4P+4D | 29% | **60%** | **97.5%** | **decode** |
| 2P+6D | **92%** | 95% | 96% | **prefill** (P jams first) |

![peak vs steady utilization](../../figs/mb5/mb5_peak_utilization.png)

**The mechanism, unified:**

- A static P:D split fixes the KV capacity on each side at deploy time.
- The agentic workload's instantaneous P:D demand *drifts* (bursts of new sessions = prefill-heavy; long tool-call-driven turns = decode-heavy).
- Whichever side is undersized *for the current phase* saturates and back-pressures the whole pipeline, while the other side's KV sits stranded.
- 6P+2D / 4P+4D → decode side too small → D-pool hits ~100%, prefilled requests queue for a decode slot → TTFT explodes (this is **H1**).
- 2P+6D → prefill side too small → P-pool hits ~100%, requests can't even start → 872 queued, 91% dropped.
- **8C colocation has no partition**: prefill and decode share one pool, so the pool elastically reallocates to whichever phase is hot. Steady utilization stays at 34% with 100% completion.

This is **H1 (D-pool capacity ceiling)** and **H2 (static-partition mismatch)** turning out to be the *same* phenomenon seen from two ratios.

### 5.1 The same pressure crashes consumers (a vLLM 0.18.1 fragility)

D-pool saturation doesn't just slow things down — under this workload it **crashes the decode instances**. The exact chain, from the 6P+2D consumer logs:

1. D-pool fills to **97.2%** (the capacity ceiling above).
2. A large request needs its KV pulled to the consumer, but the transfer fails: `Mooncake transfer engine returned -1` (observed on a **112,793-token** request — agentic sessions have very long multi-turn contexts, and the pool had no room).
3. `kv_load_failure_policy=fail` fails that request — by itself recoverable.
4. **But** the failure path computes `PromptTokenStats.local_cache_hit = num_cached + recomputed − num_external_computed`, which goes **negative** when the external transfer exceeded the scheduler's cached count.
5. `loggers.record()` calls `Counter.inc(negative)` → prometheus_client raises *"Counters can only be incremented by non-negative amounts"* → the **EngineCore dies**.
6. Once the consumer's engine is dead, **every** subsequent request fails.

The signature is a cliff, not a slope: in the session-routing 6P+2D run, all 80 successes landed in the first ~110 s, then **zero** of the next ~2,800 s. This same intermittent consumer death is almost certainly why the round-robin 6P+2D reps varied so wildly (100% / 56% / 80%) — the consumer crashed at different points in each rep.

**Two takeaways:** (a) PD-disagg under agentic context lengths hits KV-transfer failures that colocation never does (8C never transfers — it prefills and decodes in the same pool); (b) vLLM 0.18.1's failure handling amplifies one failed request into a total collapse. We patched the counter underflow (`instrument_kv_snapshot.py`, clamp to ≥ 0) so a transfer failure stays a single failed request, which is required to compare routing arms fairly in §6.

---

## 6. The routing handicap — and whether smarter routing rescues PD

> 🛑 **RETRACTED (2026-05-30) — this entire section is an artifact of `e13391e`.**
> The session-affinity runs below were starved by the producer-eviction bug, so
> they could never collect prefix-cache reuse. On the corrected stack
> session-affinity reaches **APC parity with colo (71–82%)** and does *not* stall
> at 0% GPU util.
> The real mechanism is the capacity/concurrency crossover, not a
> routing pathology — see the CORRECTION banner at the top and
> [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/). Text kept below for the
> record only.

Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That is not fundamental to disaggregation — it is the stock proxy round-robining the **prefill** side: consecutive turns of one agentic session land on *different* producers, so each turn re-prefills the whole conversation from scratch. That both inflates TTFT and piles extra load on the prefill pool (directly worsening the 2P+6D collapse).

The correct PD scheduling policy (as the design argues): **P should be chosen by session affinity** (reuse the producer's prefix cache) while **D is chosen by load balance** (decode KV is freshly transferred per turn, so D gains nothing from affinity). We added this as an env-gated mode in the proxy (`MB5_P_ROUTING=session`, consistent hash on `X-Session-Id`; D stays round-robin) and swept it across **all four P:D ratios**. All runs below are on the **metrics-fixed stack** (§5.1 clamp), so consumers no longer crash and failures are genuine KV-transfer/capacity failures — an apples-to-apples comparison of the two routing policies.

### 6.1 Session-affinity does NOT rescue PD — it makes it worse

| Config | rr success | **session success** | rr TTFT mean | direction |
|---|---|---|---|---|
| 6P+2D | 73% | **59%** | 89 s | session worse |
| 4P+4D | **100%** | **36%** | 71 s | session much worse |
| 3P+5D | — | **24%** | — | ↓ |
| 2P+6D | 9%* | **19%** | — | ↓ |

\* rr 2P+6D from the original sweep (prefill-bound, 9%).

Two results, both decisive:

1. **At every ratio, session-affinity is worse than round-robin.** The most damning point is 4P+4D, where round-robin completes **100%** but session-affinity completes only **36%**.
2. **Session-affinity success *decreases monotonically* as you add decode capacity** (59% → 36% → 24% → 19% going 6P+2D → 4P+4D → 3P+5D → 2P+6D). Adding D does not help — it hurts. This refutes the natural hypothesis ("session prefill is faster, so it needs more D").

### 6.2 The smoking gun: GPUs sit at ~0% utilization

During the session-affinity runs the cluster is **not compute-bound — it is stalled**. Sampled GPU utilization mid-run:

```
session 3P+5D :  0  0 100  0  0  0  0  0   (1 of 8 GPUs doing anything)
session 2P+6D :  0  0   0  0  0  0  0  0   (entirely idle)
```

Requests are piling up (transfer failures climbing into the hundreds) while **the hardware you paid for does nothing.** This is the deepest argument against PD-disagg for this workload: the binding constraint is KV-pool capacity and P→D transfer coordination, not FLOPs. Colocation (8C) keeps every GPU busy because prefill and decode interleave in one elastic pool with no cross-instance handoff.

### 6.3 Why session-affinity backfires (mechanism)

Session-affinity pins **all turns of a session to one producer**. Agentic sessions are heavy-tailed (a few very long multi-turn sessions — recall the 112k-token request in §5.1). Sticky routing concentrates those heavy sessions onto individual producers, whose KV pools fill and stall — the **same hot-pinning pathology as sticky routing in the colocated study (§3.3)**, now on the producer side. Round-robin avoids it by spreading each session's turns across producers.

With *fewer* producers (2P+6D), the concentration is worse, which is exactly why success keeps dropping as the ratio shifts D-ward. A failed transfer also pins the producer's KV (it is not freed on `kv_load_failure_policy=fail`), compounding the stall until the pipeline deadlocks at ~0% utilization.

The per-producer KV-pool timelines make the hot-pinning direct.
At the **same 4P+4D ratio**, round-robin holds all four producers within **1 percentage point** of each other (spread 0pp, CV 0.01); session-affinity blows the spread open to **49 percentage points** (one producer pegged at ~93% while another sits at 45%, CV 0.25 — a 25× jump in load imbalance):

![per-producer KV pool: round-robin vs session-affinity](../../figs/mb5/mb5_producer_hotspot.png)

Producer-side prefix-cache hit in the degraded state is ~0.2% (vs round-robin's ~5%) — session-affinity never even gets to *collect* the cache-reuse benefit it was supposed to provide, because the producers it concentrates load onto are thrashing.

### 6.4 Verdict on routing

Neither **ratio tuning** (§3, no static split beats 8C) nor **routing policy** (§6, session-affinity is strictly worse and ratio-tuning it only makes it worse) rescues static PD-disaggregation for this agentic workload. The failure is **structural**: a static prefill/decode partition cannot track time-varying P:D demand, the cross-instance KV handoff adds a capacity-coupled failure mode absent in colocation, and the routing knob that helps colocation (affinity) actively hurts disaggregation (producer hotspots). Colocation wins on completion, latency, *and* hardware utilization.

---

## 7. Caveats / honesty

- **Single rep** for this analysis. The earlier 3-rep round-robin sweep varied for 6P+2D (rep1 100% / rep2 56% / rep3 80%) — but §5.1 showed that variance was the *consumer-crash bug*, not genuine load behavior. On the metrics-fixed stack, round-robin 6P+2D completes a stable **73%** (the unpatched "100% rep1" in §3's table was a lucky no-crash run). 8C and rr 4P+4D are tight run-to-run. The qualitative ranking is robust.
- **Latency percentiles count successes only** (see §3 warning). For failing configs the latency bars *understate* the damage — and for the session-affinity runs, which stall at ~0% GPU util, the latency of the few survivors is especially unrepresentative.
- **Routing fairness addressed.** §6 tests the "correct" PD routing (session-affinity P + load-balanced D) across all ratios; it does not rescue PD, so the round-robin baseline in §3 is not an unfair handicap on the conclusion.
- **Session-affinity ratio sweep used near-final partials** (runs were stopped once the monotonic-decline trend and 0% GPU util were unambiguous, to save GPU time). Exact final percentages would shift by a few points; the trend and the stall are not in doubt.
- Trace is a single agentic workload; conclusions are about *this* class of workload (sub-second tool-call cadence, multi-turn sessions), not all LLM serving.

---

## 8. Reproduce

```bash
# from repo root, after microbench/fresh_setup/deploy.sh dash1

# 1. round-robin baseline sweep (1 rep)
ssh dash1 'CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=1 RUN_TAG= \
    bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb5_run.sh'

# 2. reduce on dash1 (numpy-only; handles the multi-GB snapshot dirs)
ssh dash1 '.venv/bin/python scripts/aggregate_mb5.py --sweep-root mb5_runs \
    --tag --configs "8C 6P+2D 4P+4D 2P+6D" --reps 1 \
    --reduce-to mb5_runs/reduced_.json'

# 3. pull the compact JSON, render figures locally
scp dash1:.../mb5_runs/reduced_.json analysis/mb5/
.venv/bin/python microbench/fresh_setup/aggregate_mb5.py \
    --from-reduced analysis/mb5/reduced_.json --out-dir figs/mb5

# session-affinity arm: prefix the run with MB5_P_ROUTING=session
```
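As a sanity check on the reduced numbers, the cluster-average KV utilization quoted in §5 (~42–46%) can be recomputed from the per-role steady-state figures in the §5 table. The sketch below is a minimal illustration, assuming every instance has an equal-sized KV pool so that the cluster average is an instance-count-weighted mean. That equal-size assumption and the helper name `cluster_average_util` are ours for illustration; neither comes from the aggregation scripts.

```python
def cluster_average_util(n_prefill, p_util_pct, n_decode, d_util_pct):
    """Instance-count-weighted mean KV utilization across both roles.

    Assumption (not from the source scripts): every instance has the
    same KV pool size, so each instance gets equal weight.
    """
    total = n_prefill + n_decode
    return (n_prefill * p_util_pct + n_decode * d_util_pct) / total


# Steady-state per-role utilization (percent), copied from the section 5 table:
# (prefill instances, P-pool steady, decode instances, D-pool steady)
ROLE_SPLIT = {
    "6P+2D": (6, 31, 2, 74),
    "4P+4D": (4, 29, 4, 60),
}

for name, args in ROLE_SPLIT.items():
    avg = cluster_average_util(*args)
    hottest = max(args[1], args[3])
    print(f"{name}: cluster avg {avg:.2f}%, hottest pool {hottest}%")
# 6P+2D: cluster avg 41.75%, hottest pool 74%
# 4P+4D: cluster avg 44.50%, hottest pool 60%
```

Under that assumption both averages land in the quoted ~42–46% band while the decode pool runs at 60–74% steady state, which is the gap the per-role split exposes.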