agentic-kvc/microbench/fresh_setup/PD_DISAGG_RESULTS.md

# PD-disaggregation under an agentic workload — does it work?

**Consolidated results doc.** Self-contained writeup of every PD-disagg
argument and experiment, with figures inline. For the live experiment TODO
list see [PD_DISAGG_INVESTIGATION.md](PD_DISAGG_INVESTIGATION.md).

Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instruct
· vLLM 0.18.1 (V1, chunked-prefill on) · Mooncake 0.3.11 · Trace:
`w600_r0.0015_st30.jsonl` (1214 requests, agentic multi-turn).

---

## ⚠️ CORRECTION (2026-05-30) — read before §6

The "fresh" vLLM used for the runs below turned out to be contaminated.
`scripts/deploy_vllm_patches.sh` had copied our fork commit **`e13391e`** over the
pip-installed release; that commit calls `evict_blocks(sent_block_ids)` on
`finished_sending`, i.e. it **evicts a producer's prefix-cache blocks on every KV
transfer**. So a disaggregated producer could never keep a session's prefix warm,
*regardless of routing*. We have since gated that behind `VLLM_EVICT_SENT_BLOCKS`
(default off) and re-run everything on the corrected stack.
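
For readers unfamiliar with the gating pattern, here is a minimal sketch of an opt-in eviction switch. Only the variable name `VLLM_EVICT_SENT_BLOCKS`, the `evict_blocks(sent_block_ids)` call, and the default-off behavior come from the text above. The hook name `on_finished_sending`, the accepted truthy strings, and the injected `evict_blocks` callable are assumptions made for illustration, not the actual patch.

```python
import os


def evict_sent_blocks_enabled(env=os.environ) -> bool:
    """True only when VLLM_EVICT_SENT_BLOCKS is explicitly switched on.

    The accepted truthy spellings are an assumption of this sketch.
    """
    value = env.get("VLLM_EVICT_SENT_BLOCKS", "0")
    return value.strip().lower() in {"1", "true", "yes"}


def on_finished_sending(sent_block_ids, evict_blocks, env=os.environ) -> bool:
    """Hypothetical hook: evict transferred blocks only if the gate is on.

    Returns True when eviction ran, False when the blocks were left in the
    producer's prefix cache (the default).
    """
    if not evict_sent_blocks_enabled(env):
        return False
    evict_blocks(sent_block_ids)
    return True
```

With the gate off, a producer keeps the blocks it just sent, which is what lets a later turn of the same session hit its prefix cache.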

**Retracted (was a pure artifact of `e13391e`):**
- **All of §6** ("smarter routing does not save PD" / "session-affinity is
  *strictly worse*" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit
  ~0.2%"). On the corrected stack, **session-affinity recovers producer reuse to
  full parity with colo (APC 71–82%)** — the collapse was the eviction bug starving
  the very cache session-affinity exists to fill, not a routing pathology.
- The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix
  *exactly as well as colo* once routing is session-sticky.

**Still stands (independent of `e13391e`):**
- **§3 round-robin** numbers — RR sends consecutive turns to *different* producers,
  so its ~0% prefix-hit is a **routing** artifact (not the eviction bug) and is
  reproduced on the clean stack; RR PD still loses to 8C.
- **§4** PD wins TPOT (decode isolation) — robust.
- **§5.1** the consumer counter-underflow crash — a real, separate vLLM 0.18.1 bug.
- **§5** the D-pool capacity-ceiling mechanism (decode side pegs while prefill
  strands) — real.

**Corrected verdict (the real reason PD loses on agentic).** It is *not* "routing
can't help." On the clean stack PD is **regime-dependent**: it *wins* at low
load / decode-heavy / low-reuse, and loses the **agentic corner** (high reuse +
short output + large context + high concurrency) through a structural crossover —
its static P:D split cannot simultaneously provide the prefix-cache capacity
(needs many producers) *and* the decode capacity (needs many decoders) that
agentic demands at once, while colo's elastic pool provides both. See the
three-axis ablation: **reuse** erodes the edge (1.57×→1.10×), **shape** rotates the
best ratio and is catastrophic at the prefill extreme, and **concurrency** tips PD
at N=64 (APC craters 71%→1.4%, TPS −30%) while colo scales cleanly.

→ Figures: [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/) ·
data: [`analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/) ·
the clean re-run of *this exact* w600 experiment (ratio-swept) is the Fig 4 anchor.

---

## TL;DR (verdict)

**No static prefill/decode split beats 8-way colocation (8C) on this agentic
workload.** Every disaggregated ratio we tried is dominated by 8C on the
metrics the user actually feels (TTFT, end-to-end latency, request
completion), and the failure *moves* with the ratio:

- **D-heavy bottleneck** (6P+2D, 4P+4D): the decode pool saturates (peak
  **99.6% / 97.5%**) while the prefill pool sits at **~30%**, so half or more
  of the cluster's KV is stranded on the wrong side.
- **P-heavy bottleneck** (2P+6D): the 2 prefill instances can't keep up,
  the prefill pool jams at **99.7%**, **872 requests** pile up in the queue
  and **91% of requests never complete**.
- **8C** keeps a single elastic pool that absorbs whichever phase is hot at
  the moment → steady utilization **34%**, **100% completion**, fastest
  wall-clock, best p50/p90 latency.

PD-disagg *does* deliver the phase-isolation win we predicted in MB1 (by the
§4 table its **TPOT is roughly 8–94× lower**, depending on ratio and
percentile), but that win is swamped by TTFT inflation and request loss.

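> 🛑 **RETRACTED (2026-05-30).** This paragraph, and the "no routing knob
> rescues the static split" clause after it, rest on §6, which was an artifact
> of `e13391e`. See the CORRECTION banner at the top.

**Smarter routing does not save it (§6).** We added the "correct" PD policy
(session-affinity on the prefill side to recover prefix-cache reuse, load-balance
on decode) and swept it across all four ratios. It is *strictly worse* than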
round-robin at every ratio (4P+4D: 100% → 36% completion), success *decreases*
as you add decode capacity (59→36→24→19%), and the GPUs sit at **~0%
utilization** — the cluster stalls on KV-transfer coordination, not compute.
Session-affinity reproduces the producer **hot-pinning** pathology from §3.3.

This is the empirical backing for the paper's claim: **agentic workloads
have time-varying P:D demand that no static partition can track; colocation
wins because its pool is elastic — and no routing knob rescues the static
split.** (H1 *and* H2 from the investigation doc, unified by one mechanism.)

---

## 1. Why this experiment exists

Earlier cost accounting (MB1 phase-interference, MB2 KV-transfer cost) showed
that on the **phase-isolation axis alone**, PD-disagg actually *wins*: it
removes prefill→decode interference, and the transfer cost is small relative
to the interference it avoids. So "PD-disagg is bad for agentic" could not be
argued from phase isolation — we needed a system-level experiment that
measures the whole picture (queueing, pool capacity, cache reuse), not just
the isolated phase cost.

See [analysis/mb1](../../analysis/mb1) and [analysis/mb2](../../analysis/mb2)
for that accounting. This doc is the system-level answer.

---

## 2. Setup

| | |
|---|---|
| Configs | `8C` (8× kv_both colo), `6P+2D`, `4P+4D`, `2P+6D` (prefill+decode split) |
| PD routing | stock **round-robin** on both P and D (vLLM official `mooncake_connector_proxy`) |
| Trace | `w600_r0.0015_st30.jsonl`, 1214 requests, agentic multi-turn |
| Reps | 1 (rep1) for this analysis; an earlier 3-rep sweep was tight run-to-run for 8C and 4P+4D but not for 6P+2D (see §7), and we converged on rep1 for iteration speed |
| KV instrumentation | V1 scheduler patched to dump per-request KV block allocation every 100 ms per EngineCore (see `instrument_kv_snapshot.py`) |

8C is the fair baseline: 8 colocated instances, with the replayer
round-robining across them directly (no proxy). PD configs route through the proxy.

---

## 3. Headline result — no PD ratio beats 8C

All numbers are rep1.

| Metric | **8C** | 6P+2D | 4P+4D | 2P+6D |
|---|---|---|---|---|
| **completion** | **100%** | 100% | 100% | **9%** 💀 |
| wall-clock (drain trace) | **2994 s** | 3419 s | 4171 s | 5762 s |
| prefix-cache hit | **19.4%** | 0% | 0% | 0% |
| TTFT mean | **18.0 s** | 44.8 s | 70.0 s | 106.8 s |
| TTFT p50 | **7.0 s** | 41.0 s | 56.4 s | 23.6 s |
| TTFT p90 | **53.1 s** | 86.7 s | 153.1 s | 498 s |
| E2E p50 | **10.8 s** | 44.5 s | 59.5 s | 26.3 s |
| E2E p90 | **83.3 s** | 91.8 s | 157.1 s | 499 s |

![e2e latency by config](../../figs/mb5/mb5_latency_compare.png)

> ⚠️ **Read the percentiles with the completion rate.** Latency percentiles
> are computed over *successful* requests only. 2P+6D's "p99 = 577 s" covers
> just the 9% that finished — the other 91% never returned, so its real
> experience is far worse than any latency bar suggests.

8C wins E2E p50 by **4×** or more over the two ratios that complete, and wins p90 at every ratio. The only metric where a PD config
edges 8C is E2E **p99** (6P+2D 148 s vs 8C 194 s) — and that is the flip side
of the next result.
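
The success-only caveat in the warning above can be made mechanical. The sketch below reports a percentile two ways: over completed requests only (what the table shows) and with every failed request counted as infinitely slow. The function names, the nearest-rank percentile rule, and the sample numbers are illustrative assumptions, not the reducer used for this doc.

```python
import math


def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in 0..100) of an ascending list."""
    if not sorted_vals:
        return math.nan
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def latency_summary(latencies_s, n_issued, q=90):
    """latencies_s holds E2E latencies of the requests that completed."""
    ok = sorted(latencies_s)
    # Count each failure as infinitely slow, so any percentile that falls
    # beyond the completion rate becomes unbounded.
    censored = ok + [math.inf] * (n_issued - len(ok))
    return {
        "completion": len(ok) / n_issued,
        "success_only": percentile(ok, q),
        "failures_as_inf": percentile(censored, q),
    }


# Made-up example: 10 requests issued, only 3 finished.
summary = latency_summary([10.0, 20.0, 30.0], n_issued=10, q=90)
print(summary)
```

Whenever completion is below the percentile being quoted, the second view is unbounded, which is why a success-only p90 for a config that finishes 9% of its requests says little about what users experience.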

---

## 4. The duality — PD wins TPOT, loses TTFT

PD-disagg delivers exactly the phase-isolation benefit MB1 predicted: with no
prefill stealing decode steps, **inter-token latency is dramatically cleaner.**

| TPOT | **8C** | 6P+2D | 4P+4D | 2P+6D |
|---|---|---|---|---|
| mean | 87 ms | 11 ms | 9 ms | 6 ms |
| p90 | 230 ms | 18 ms | 14 ms | 8 ms |
| p99 | **1129 ms** | **26 ms** | **20 ms** | **12 ms** |

By the table, PD's TPOT p99 is **43–94× lower** (13–29× at p90): once a
request reaches a dedicated decode instance it streams without interruption.
8C's 1.1 s TPOT p99 *is* the chunked-prefill interference tax (decode steps
occasionally stalled behind an 8k-token prefill chunk), consistent with MB1.

**But the win is local.** Mean TTFT inflates 2.5–6× because every request now pays
P→D handoff + admission into a smaller, saturated decode pool. For this
workload's modest output lengths, TTFT dominates total time, so the TPOT win
never pays for itself. This is the cost/benefit imbalance made concrete:
phase isolation is real, but it is the wrong thing to optimize when the pool
is the binding constraint.
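
A back-of-envelope way to see the imbalance is to ask how many decode tokens a request needs before PD's per-token saving repays its extra TTFT. The sketch assumes the simple model E2E = TTFT + n × TPOT and plugs in the *mean* values from the §3 and §4 tables. Combining means this way is crude, and the 2P+6D means cover only the 9% of requests that completed, so read the output as an order-of-magnitude guide.

```python
def breakeven_decode_tokens(ttft_a_s, tpot_a_s, ttft_b_s, tpot_b_s):
    """Decode tokens after the first at which config B's E2E matches config A's.

    Model: E2E = TTFT + n * TPOT. Returns None if B never catches up.
    """
    ttft_gap = ttft_b_s - ttft_a_s   # extra wait B pays before the first token
    tpot_gain = tpot_a_s - tpot_b_s  # time B saves on every later token
    if tpot_gain <= 0:
        return None
    return max(0.0, ttft_gap / tpot_gain)


# (mean TTFT in s, mean TPOT in s) taken from the tables in sections 3 and 4
colo = (18.0, 0.087)
pd = {"6P+2D": (44.8, 0.011), "4P+4D": (70.0, 0.009), "2P+6D": (106.8, 0.006)}

breakeven = {name: breakeven_decode_tokens(*colo, *cfg) for name, cfg in pd.items()}
for name, n in breakeven.items():
    print(f"{name}: about {n:.0f} decode tokens to break even with 8C")
```

Under this model the break-even lands at roughly 350, 670, and 1,100 decode tokens for 6P+2D, 4P+4D, and 2P+6D respectively. Requests that emit fewer tokens than that finish sooner on 8C despite its worse TPOT.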

---

## 5. Root cause — per-role KV pool occupancy (the kill shot)

The cluster-average KV utilization is *misleading* and nearly hid the result:

![cluster KV timeline](../../figs/mb5/mb5_kv_timeline.png)

6P+2D and 4P+4D look only ~42–46% utilized on cluster average — yet they have
128–152 requests queued. The average hides that **one pool is pegged while
the other idles.** Splitting the KV pool by role exposes it:

![per-role KV pool: P-pool vs D-pool](../../figs/mb5/mb5_role_split.png)

| Config | P-pool steady | D-pool steady | D-pool **peak** | binding side |
|---|---|---|---|---|
| 8C | — single shared pool — | 34% | 72% | none (elastic) |
| 6P+2D | 31% | **74%** | **99.6%** | **decode** |
| 4P+4D | 29% | **60%** | **97.5%** | **decode** |
| 2P+6D | **92%** | 95% | 96% | **prefill** (P jams first) |

![peak vs steady utilization](../../figs/mb5/mb5_peak_utilization.png)
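
The masking effect is plain arithmetic. Assuming every instance has the same KV capacity (an assumption of this sketch), the cluster average is the instance-weighted mean of the two per-role steady values in the table above:

```python
def cluster_average(n_p, p_util, n_d, d_util):
    """Instance-weighted mean KV utilization (%), equal capacity per instance."""
    return (n_p * p_util + n_d * d_util) / (n_p + n_d)


# (prefill instances, P-pool steady %, decode instances, D-pool steady %)
steady = {
    "6P+2D": (6, 31, 2, 74),
    "4P+4D": (4, 29, 4, 60),
    "2P+6D": (2, 92, 6, 95),
}

averages = {name: cluster_average(*row) for name, row in steady.items()}
for name, avg in averages.items():
    print(f"{name}: cluster average {avg:.1f}%")
```

For 6P+2D and 4P+4D this gives roughly 42% and 44.5%, close to what the cluster timeline shows, even though the decode pool in each is running at 60–74% steady and peaking near 100%.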

**The mechanism, unified:**

- A static P:D split fixes the KV capacity on each side at deploy time.
- The agentic workload's instantaneous P:D demand *drifts* (bursts of new
  sessions = prefill-heavy; long tool-call-driven turns = decode-heavy).
- Whichever side is undersized *for the current phase* saturates and
  back-pressures the whole pipeline, while the other side's KV sits stranded.
  - 6P+2D / 4P+4D → decode side too small → D-pool hits ~100%, prefilled
    requests queue for a decode slot → TTFT explodes (this is **H1**).
  - 2P+6D → prefill side too small → P-pool hits ~100%, requests can't even
    start → 872 queued, 91% dropped.
- **8C colocation has no partition**: prefill and decode share one pool, so
  the pool elastically reallocates to whichever phase is hot. Steady
  utilization stays at 34% with 100% completion.

This is **H1 (D-pool capacity ceiling)** and **H2 (static-partition
mismatch)** turning out to be the *same* phenomenon seen from two ratios.
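
The mechanism can be illustrated with a toy model. The demand series below is invented for illustration (8 units of total KV, demand drifting from prefill-heavy to decode-heavy while its sum stays at 6). It is not derived from the trace. A static split saturates whenever either side's demand exceeds that side's share, while a shared pool saturates only when the *sum* exceeds the total.

```python
def saturated_steps(demand, p_cap, d_cap):
    """Count steps where a static split vs. a shared pool runs out of KV.

    demand: list of (prefill_kv, decode_kv) needs per step, arbitrary units.
    """
    static = sum(1 for p, d in demand if p > p_cap or d > d_cap)
    shared = sum(1 for p, d in demand if p + d > p_cap + d_cap)
    return static, shared


# Invented demand: drifts from prefill-heavy to decode-heavy, total always 6.
demand = [(5, 1), (4, 2), (3, 3), (2, 4), (1, 5)]

results = {}
for p_cap, d_cap in [(6, 2), (4, 4), (2, 6)]:
    results[(p_cap, d_cap)] = saturated_steps(demand, p_cap, d_cap)
    static, shared = results[(p_cap, d_cap)]
    print(f"{p_cap}P+{d_cap}D: static saturates {static}/5 steps, shared {shared}/5")
```

Every static split saturates on some steps even though total demand never exceeds total capacity, and which steps fail depends on the ratio. That is the toy analogue of the failure moving from the decode side to the prefill side across 6P+2D, 4P+4D, and 2P+6D.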

### 5.1 The same pressure crashes consumers (a vLLM 0.18.1 fragility)

D-pool saturation doesn't just slow things down — under this workload it
**crashes the decode instances**. The exact chain, from the 6P+2D consumer
logs:

1. D-pool fills to **97.2%** (the capacity ceiling above).
2. A large request needs its KV pulled to the consumer, but the transfer
   fails: `Mooncake transfer engine returned -1` (observed on a **112,793-token**
   request — agentic sessions have very long multi-turn contexts, and the
   pool had no room).
3. `kv_load_failure_policy=fail` fails that request — by itself recoverable.
4. **But** the failure path computes `PromptTokenStats.local_cache_hit =
   num_cached + recomputed − num_external_computed`, which goes **negative**
   when the external transfer exceeded the scheduler's cached count.
5. `loggers.record()` calls `Counter.inc(negative)` → prometheus_client raises
   *"Counters can only be incremented by non-negative amounts"* → the
   **EngineCore dies**.
6. Once the consumer's engine is dead, **every** subsequent request fails.

The signature is a cliff, not a slope: in the session-routing 6P+2D run, all
80 successes landed in the first ~110 s, then **zero** in the next ~2,800 s.
This same intermittent consumer death is almost certainly why the
round-robin 6P+2D reps varied so wildly (100% / 56% / 80%) — the consumer
crashed at different points in each rep.

**Two takeaways:** (a) PD-disagg under agentic context lengths hits KV-transfer
failures that colocation never does (8C never transfers — it prefills and
decodes in the same pool); (b) vLLM 0.18.1's failure handling amplifies one
failed request into a total collapse. We patched the counter underflow
(`instrument_kv_snapshot.py`, clamp to ≥ 0) so a transfer failure stays a
single failed request, which is required to compare routing arms fairly in §6.
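
The underflow and the clamp can be reproduced in a few lines. The `Counter` class below is a stand-in that mirrors the prometheus_client rule quoted in step 5, so the sketch runs without that dependency. The formula is the one quoted in step 4. The `record` wrapper and the token counts are illustrative assumptions, not the vLLM code or our patch.

```python
class Counter:
    """Stand-in mirroring prometheus_client's rule that counters only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def local_cache_hit(num_cached, recomputed, num_external_computed):
    # Formula as quoted in step 4 of the crash chain.
    return num_cached + recomputed - num_external_computed


def record(counter, num_cached, recomputed, num_external_computed, clamp=True):
    hit = local_cache_hit(num_cached, recomputed, num_external_computed)
    # With clamp=True a negative value is recorded as 0 instead of raising.
    counter.inc(max(0, hit) if clamp else hit)


# Made-up token counts where the external transfer exceeds the cached count.
patched = Counter()
record(patched, num_cached=1000, recomputed=0, num_external_computed=4000)

try:
    record(Counter(), 1000, 0, 4000, clamp=False)
    crashed = False
except ValueError:
    crashed = True
print(patched.value, crashed)
```

In the unclamped path the exception escapes from a metrics call, which per the chain above is enough to take down the EngineCore. Clamping keeps the damage to the one request whose transfer failed.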

---

## 6. The routing handicap — and whether smarter routing rescues PD

> 🛑 **RETRACTED (2026-05-30) — this entire section is an artifact of `e13391e`.**
> The session-affinity runs below were starved by the producer-eviction bug, so
> they could never collect prefix-cache reuse. On the corrected stack
> session-affinity reaches **APC parity with colo (71–82%)** and does *not* stall
> at 0% GPU util. The real mechanism is the capacity/concurrency crossover, not a
> routing pathology — see the CORRECTION banner at the top and
> [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/). Text kept below for the
> record only.

Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That
is not fundamental to disaggregation — it is the stock proxy round-robining
the **prefill** side: consecutive turns of one agentic session land on
*different* producers, so each turn re-prefills the whole conversation from
scratch. That both inflates TTFT and piles extra load on the prefill pool
(directly worsening the 2P+6D collapse).

The correct PD scheduling policy (as the design argues): **P should be chosen
by session affinity** (reuse the producer's prefix cache) while **D is chosen
by load balance** (decode KV is freshly transferred per turn, so D gains
nothing from affinity). We added this as an env-gated mode in the proxy
(`MB5_P_ROUTING=session`, consistent hash on `X-Session-Id`; D stays
round-robin) and swept it across **all four P:D ratios**. All runs below are on
the **metrics-fixed stack** (§5.1 clamp), so consumers no longer crash and
failures are genuine KV-transfer/capacity failures — an apples-to-apples
comparison of the two routing policies.
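
For reference, the policy shape described above (session-sticky P, round-robin D) can be sketched as follows. This is a toy router, not the proxy's code: the class name, the SHA-256 hash, and the 64-virtual-node ring are assumptions. The text only states that P is picked by a consistent hash on `X-Session-Id` and D stays round-robin.

```python
import bisect
import hashlib
import itertools


def _h(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class PDRouter:
    """Toy router: consistent-hash ring for producers, round-robin for decoders."""

    def __init__(self, producers, decoders, vnodes=64):
        # Each producer owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_h(f"{p}#{i}"), p) for p in producers for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._decoders = itertools.cycle(decoders)

    def route(self, session_id: str):
        # First ring point clockwise of the session's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _h(session_id)) % len(self._ring)
        return self._ring[idx][1], next(self._decoders)


router = PDRouter(["p0", "p1", "p2", "p3"], ["d0", "d1", "d2", "d3"])
first = router.route("session-42")
second = router.route("session-42")
print(first, second)  # same producer both times, decoders advance in order
```

The ring matters only when the producer set changes: removing one producer remaps just the sessions that were pinned to it, so the other producers keep their warm prefixes.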

### 6.1 Session-affinity does NOT rescue PD — it makes it worse

| Config | rr success | **session success** | rr TTFT mean | direction |
|---|---|---|---|---|
| 6P+2D | 73% | **59%** | 89 s | session worse |
| 4P+4D | **100%** | **36%** | 71 s | session much worse |
| 3P+5D | — | **24%** | — | ↓ |
| 2P+6D | 9%* | **19%** | — | ↓ |

\* rr 2P+6D from the original sweep (prefill-bound, 9%).

Two results, both decisive:

1. **At every ratio, session-affinity is worse than round-robin.** The most
   damning point is 4P+4D, where round-robin completes **100%** but
   session-affinity completes only **36%**.
2. **Session-affinity success *decreases monotonically* as you add decode
   capacity** (59% → 36% → 24% → 19% going 6P+2D → 4P+4D → 3P+5D → 2P+6D).
   Adding D does not help — it hurts. This refutes the natural hypothesis
   ("session prefill is faster, so it needs more D").

### 6.2 The smoking gun: GPUs sit at ~0% utilization

During the session-affinity runs the cluster is **not compute-bound — it is
stalled**. Sampled GPU utilization mid-run:

```
session 3P+5D :  0  0 100  0  0  0  0  0     (1 of 8 GPUs doing anything)
session 2P+6D :  0  0  0  0  0  0  0  0       (entirely idle)
```

Requests are piling up (transfer failures climbing into the hundreds) while
**the hardware you paid for does nothing.** This is the deepest argument
against PD-disagg for this workload: the binding constraint is KV-pool
capacity and P→D transfer coordination, not FLOPs. Colocation (8C) keeps every
GPU busy because prefill and decode interleave in one elastic pool with no
cross-instance handoff.

### 6.3 Why session-affinity backfires (mechanism)

Session-affinity pins **all turns of a session to one producer**. Agentic
sessions are heavy-tailed (a few very long multi-turn sessions — recall the
112k-token request in §5.1). Sticky routing concentrates those heavy sessions
onto individual producers, whose KV pools fill and stall — the **same
hot-pinning pathology as sticky routing in the colocated study (§3.3)**, now on
the producer side. Round-robin avoids it by spreading each session's turns
across producers. With *fewer* producers (2P+6D), the concentration is worse,
which is exactly why success keeps dropping as the ratio shifts D-ward. A
failed transfer also pins the producer's KV (it is not freed on
`kv_load_failure_policy=fail`), compounding the stall until the pipeline
deadlocks at ~0% utilization.

The per-producer KV-pool timelines make the hot-pinning direct. At the **same
4P+4D ratio**, round-robin holds all four producers within **1 percentage
point** of each other (spread 0pp, CV 0.01); session-affinity blows the spread
open to **49 percentage points** (one producer pegged at ~93% while another
sits at 45%, CV 0.25 — a 25× jump in load imbalance):

![per-producer KV pool: round-robin vs session-affinity](../../figs/mb5/mb5_producer_hotspot.png)

Producer-side prefix-cache hit in the degraded state is ~0.2% (vs round-robin's
~5%) — session-affinity never even gets to *collect* the cache-reuse benefit it
was supposed to provide, because the producers it concentrates load onto are
thrashing.

### 6.4 Verdict on routing

Neither **ratio tuning** (§3, no static split beats 8C) nor **routing policy**
(§6, session-affinity is strictly worse and ratio-tuning it only makes it
worse) rescues static PD-disaggregation for this agentic workload. The failure
is **structural**: a static prefill/decode partition cannot track time-varying
P:D demand, the cross-instance KV handoff adds a capacity-coupled failure mode
absent in colocation, and the routing knob that helps colocation (affinity)
actively hurts disaggregation (producer hotspots). Colocation wins on
completion, latency, *and* hardware utilization.

---

## 7. Caveats / honesty

- **Single rep** for this analysis. The earlier 3-rep round-robin sweep
  varied for 6P+2D (rep1 100% / rep2 56% / rep3 80%) — but §5.1 showed that
  variance was the *consumer-crash bug*, not genuine load behavior. On the
  metrics-fixed stack, round-robin 6P+2D completes a stable **73%** (the
  unpatched "100% rep1" in §3's table was a lucky no-crash run). 8C and rr
  4P+4D are tight run-to-run. The qualitative ranking is robust.
- **Latency percentiles count successes only** (see §3 warning). For failing
  configs the latency bars *understate* the damage — and for the session-
  affinity runs, which stall at ~0% GPU util, the latency of the few survivors
  is especially unrepresentative.
- **Routing fairness addressed.** §6 tests the "correct" PD routing
  (session-affinity P + load-balanced D) across all ratios; it does not rescue
  PD, so the round-robin baseline in §3 is not an unfair handicap on the
  conclusion.
- **Session-affinity ratio sweep used near-final partials** (runs were stopped
  once the monotonic-decline trend and 0% GPU util were unambiguous, to save
  GPU time). Exact final percentages would shift by a few points; the trend
  and the stall are not in doubt.
- Trace is a single agentic workload; conclusions are about *this* class of
  workload (sub-second tool-call cadence, multi-turn sessions), not all LLM
  serving.

---

## 8. Reproduce

```bash
# from repo root, after microbench/fresh_setup/deploy.sh dash1
# 1. round-robin baseline sweep (1 rep)
ssh dash1 'CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=1 RUN_TAG=<tag> \
    bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb5_run.sh'

# 2. reduce on dash1 (numpy-only; handles the multi-GB snapshot dirs)
ssh dash1 '.venv/bin/python scripts/aggregate_mb5.py --sweep-root mb5_runs \
    --tag <tag> --configs "8C 6P+2D 4P+4D 2P+6D" --reps 1 \
    --reduce-to mb5_runs/reduced_<tag>.json'

# 3. pull the compact JSON, render figures locally
scp dash1:.../mb5_runs/reduced_<tag>.json analysis/mb5/
.venv/bin/python microbench/fresh_setup/aggregate_mb5.py \
    --from-reduced analysis/mb5/reduced_<tag>.json --out-dir figs/mb5

# session-affinity arm: prefix the run with MB5_P_ROUTING=session
```