
Gahow Wang a2111b6e18 PD-disagg docs: annotated corrections for e13391e contamination

Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).

  PD_DISAGG_RESULTS.md  top CORRECTION banner + §6 RETRACTED marker.
                        §6 (session-affinity hot-pin) was an `e13391e`
                        artifact under controlled concurrency; §3 RR, §4
                        TPOT win, §5 D-pool ceiling, §5.1 consumer crash
                        stand.
  RESULTS_SUMMARY.md    §4 confirm+refine note: clean ablation confirms
                        the D-pool capacity thesis and adds regime-
                        dependence.
  pd_separation_analysis.md  scoped caution: thesis confirmed; flags
                        only reuse-dependent figures for cross-check
                        (this study used a different stack).
  figs/mb5/CORRECTION.md  flags mb5_producer_hotspot.png as retracted;
                        §3 RR and §5 D-pool figures stand.

2026-05-31 20:14:14 +08:00


PD-disaggregation under an agentic workload — does it work?

Consolidated results doc. Self-contained writeup of every PD-disagg argument and experiment, with figures inline. For the live experiment TODO list see PD_DISAGG_INVESTIGATION.md.

Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instruct · vLLM 0.18.1 (V1, chunked-prefill on) · Mooncake 0.3.11 · Trace: w600_r0.0015_st30.jsonl (1214 requests, agentic multi-turn).

⚠️ CORRECTION (2026-05-30) — read before §6

A contamination was found in the "fresh" vLLM used for the runs below. scripts/deploy_vllm_patches.sh had copied our fork commit e13391e over the pip-installed release; that commit calls evict_blocks(sent_block_ids) on finished_sending, i.e. it evicts a producer's prefix-cache blocks on every KV transfer. So a disaggregated producer could never keep a session's prefix warm, regardless of routing. We have since gated that behind VLLM_EVICT_SENT_BLOCKS (default off) and re-run everything on the corrected stack.

Retracted (was a pure artifact of e13391e):

All of §6 ("smarter routing does not save PD" / "session-affinity is strictly worse" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit ~0.2%"). On the corrected stack, session-affinity recovers producer reuse to full parity with colo (APC 71–82%) — the collapse was the eviction bug starving the very cache session-affinity exists to fill, not a routing pathology.
The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix exactly as well as colo once routing is session-sticky.

Still stands (independent of e13391e):

§3 round-robin numbers — RR sends consecutive turns to different producers, so its ~0% prefix-hit is a routing artifact (not the eviction bug) and is reproduced on the clean stack; RR PD still loses to 8C.
§4 PD wins TPOT (decode isolation) — robust.
§5.1 the consumer counter-underflow crash — a real, separate vLLM 0.18.1 bug.
§5 the D-pool capacity-ceiling mechanism (decode side pegs while prefill strands) — real.

Corrected verdict (the real reason PD loses on agentic). It is not "routing can't help." On the clean stack PD is regime-dependent: it wins at low load / decode-heavy / low-reuse, and loses the agentic corner (high reuse + short output + large context + high concurrency) through a structural crossover — its static P:D split cannot simultaneously provide the prefix-cache capacity (needs many producers) and the decode capacity (needs many decoders) that agentic demands at once, while colo's elastic pool provides both. See the three-axis ablation: reuse erodes the edge (1.57×→1.10×), shape rotates the best ratio and is catastrophic at the prefill extreme, and concurrency tips PD at N=64 (APC craters 71%→1.4%, TPS −30%) while colo scales cleanly.

→ Figures: figs/mb5_pd_ablation/ · data: analysis/mb5_pd_ablation/ · the clean re-run of this exact w600 experiment (ratio-swept) is the Fig 4 anchor.

TL;DR (verdict)

No static prefill/decode split beats 8-way colocation (8C) on this agentic workload. Every disaggregated ratio we tried is dominated by 8C on the metric the user actually feels (TTFT, end-to-end latency, request completion), and the failure moves with the ratio:

Decode-side bottleneck (6P+2D, 4P+4D): the decode pool saturates (peak 99.6% / 97.5%) while the prefill pool sits at ~30%, so half or more of the cluster's KV is stranded on the wrong side.
Prefill-side bottleneck (2P+6D): the 2 prefill instances can't keep up, the prefill pool jams at 99.7%, 872 requests pile up in the queue, and 91% of requests never complete.
8C keeps a single elastic pool that absorbs whichever phase is hot at the moment → steady utilization 34%, 100% completion, fastest wall-clock, best p50/p90 latency.

PD-disagg does deliver the phase-isolation win we predicted in MB1 — its TPOT is 10–35× cleaner — but that win is swamped by TTFT inflation and request loss.

Smarter routing does not save it (§6). We added the "correct" PD policy (session-affinity on the prefill side to recover prefix-cache reuse, load-balance on decode) and swept it across all four ratios. It is strictly worse than round-robin at every ratio (4P+4D: 100% → 36% completion), success decreases as you add decode capacity (59→36→24→19%), and the GPUs sit at ~0% utilization: the cluster stalls on KV-transfer coordination, not compute. Session-affinity reproduces, on the producer side, the hot-pinning pathology that sticky routing showed in the colocated study (§3.3).

This is the empirical backing for the paper's claim: agentic workloads have time-varying P:D demand that no static partition can track; colocation wins because its pool is elastic — and no routing knob rescues the static split. (H1 and H2 from the investigation doc, unified by one mechanism.)

1. Why this experiment exists

Earlier cost accounting (MB1 phase-interference, MB2 KV-transfer cost) showed that on the phase-isolation axis alone, PD-disagg actually wins: it removes prefill→decode interference, and the transfer cost is small relative to the interference it avoids. So "PD-disagg is bad for agentic" could not be argued from phase isolation — we needed a system-level experiment that measures the whole picture (queueing, pool capacity, cache reuse), not just the isolated phase cost.

See analysis/mb1 and analysis/mb2 for that accounting. This doc is the system-level answer.

2. Setup


Configs	`8C` (8× kv_both colo), `6P+2D`, `4P+4D`, `2P+6D` (prefill+decode split)
PD routing	stock round-robin on both P and D (vLLM official `mooncake_connector_proxy`)
Trace	`w600_r0.0015_st30.jsonl`, 1214 requests, agentic multi-turn
Reps	1 (rep1) for this analysis; the 3-rep sweep confirmed run-to-run consistency before we converged on rep1 for iteration speed
KV instrumentation	V1 scheduler patched to dump per-request KV block allocation every 100 ms per EngineCore (see `instrument_kv_snapshot.py`)

8C is the fair baseline: 8 colocated instances, replayer round-robins across them directly (no proxy). PD configs route through the proxy.

3. Headline result — no PD ratio beats 8C

All numbers are rep1.

Metric	8C	6P+2D	4P+4D	2P+6D
completion	100%	100%	100%	9% 💀
wall-clock (drain trace)	2994 s	3419 s	4171 s	5762 s
prefix-cache hit	19.4%	0%	0%	0%
TTFT mean	18.0 s	44.8 s	70.0 s	106.8 s
TTFT p50	7.0 s	41.0 s	56.4 s	23.6 s
TTFT p90	53.1 s	86.7 s	153.1 s	498 s
E2E p50	10.8 s	44.5 s	59.5 s	26.3 s
E2E p90	83.3 s	91.8 s	157.1 s	499 s

⚠️ Read the percentiles with the completion rate. Latency percentiles are computed over successful requests only. 2P+6D's "p99 = 577 s" covers just the 9% that finished — the other 91% never returned, so its real experience is far worse than any latency bar suggests.
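The survivorship effect described above can be illustrated with a small sketch. The latencies below are hypothetical, not taken from the runs, and the nearest-rank `percentile` helper is an assumption rather than the aggregation script's method. It compares a percentile computed over successes only with one that counts every non-completed request as infinite latency.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical run: 100 requests, only 9 complete (mirrors a 9% completion rate).
success_latencies_s = [5.0, 12.0, 20.0, 24.0, 26.0, 30.0, 45.0, 120.0, 500.0]
n_failed = 91

# Success-only view: what a latency bar chart reports.
p50_success_only = percentile(success_latencies_s, 50)

# Completion-aware view: a request that never returns has unbounded latency.
all_requests = success_latencies_s + [math.inf] * n_failed
p50_all_requests = percentile(all_requests, 50)

if __name__ == "__main__":
    print("p50 over successes only:", p50_success_only)
    print("p50 over all requests:  ", p50_all_requests)
```

With most requests failing, the success-only median stays finite while the completion-aware median is unbounded, which is why the percentiles must be read next to the completion rate.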

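Against the two ratios that complete the trace, 8C wins E2E p50 by 4× or more (10.8 s vs 44.5 s and 59.5 s) and TTFT p50 by about 6–8×; it also has the lowest p90 in every column. The only metric where a PD config edges 8C is E2E p99 (6P+2D 148 s vs 8C 194 s), and that is the flip side of the next result.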

4. The duality — PD wins TPOT, loses TTFT

PD-disagg delivers exactly the phase-isolation benefit MB1 predicted: with no prefill stealing decode steps, inter-token latency is dramatically cleaner.

TPOT	8C	6P+2D	4P+4D	2P+6D
mean	87 ms	11 ms	9 ms	6 ms
p90	230 ms	18 ms	14 ms	8 ms
p99	1129 ms	26 ms	20 ms	12 ms

PD's TPOT is 10–35× lower, and by the table above the gap at p99 is wider still (1129 ms vs 12–26 ms): once a request reaches a dedicated decode instance it streams without interruption. 8C's 1.1 s TPOT p99 is the chunked-prefill interference tax (decode steps occasionally stalled behind an 8k-token prefill chunk), consistent with MB1.

But the win is local. TTFT inflates 2.5–6× because every request now pays P→D handoff + admission into a smaller, saturated decode pool. For this workload's modest output lengths, TTFT dominates total time, so the TPOT win never pays for itself. This is the cost/benefit imbalance made concrete: phase isolation is real, but it is the wrong thing to optimize when the pool is the binding constraint.
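A back-of-envelope sketch makes the imbalance concrete. It assumes a first-order model that is not part of the experiment: E2E ≈ TTFT + n_out × TPOT, fed with the mean TTFT from §3 and the mean TPOT from the table above. The function name `break_even_tokens` is hypothetical. Under that assumption it computes the output length at which a PD config's cleaner TPOT would offset its extra TTFT.

```python
# Mean TTFT in seconds (§3 table) and mean TPOT in seconds per token (§4 table).
MEAN_TTFT_S = {"8C": 18.0, "6P+2D": 44.8, "4P+4D": 70.0, "2P+6D": 106.8}
MEAN_TPOT_S = {"8C": 0.087, "6P+2D": 0.011, "4P+4D": 0.009, "2P+6D": 0.006}

def break_even_tokens(config: str, baseline: str = "8C") -> float:
    """Output tokens n at which TTFT + n * TPOT is equal for config and baseline."""
    extra_ttft = MEAN_TTFT_S[config] - MEAN_TTFT_S[baseline]
    tpot_saving = MEAN_TPOT_S[baseline] - MEAN_TPOT_S[config]
    if tpot_saving <= 0:
        # No per-token saving, so the extra TTFT is never recovered.
        return float("inf")
    return extra_ttft / tpot_saving

if __name__ == "__main__":
    for cfg in ("6P+2D", "4P+4D", "2P+6D"):
        print(f"{cfg}: break-even at ~{break_even_tokens(cfg):.0f} output tokens")
```

Treat the result as illustration only: means hide the distribution, and the 2P+6D figures cover only the requests that completed.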

5. Root cause — per-role KV pool occupancy (the kill shot)

The cluster-average KV utilization is misleading and nearly hid the result:

6P+2D and 4P+4D look only ~42–46% utilized on cluster average — yet they have 128–152 requests queued. The average hides that one pool is pegged while the other idles. Splitting the KV pool by role exposes it:

Config	P-pool steady	D-pool steady	D-pool peak	binding side
8C	— single shared pool —	34%	72%	none (elastic)
6P+2D	31%	74%	99.6%	decode
4P+4D	29%	60%	97.5%	decode
2P+6D	92%	95%	96%	prefill (P jams first)
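The averaging effect can be checked from the table itself. This is a minimal sketch assuming every instance has the same KV capacity, so the cluster average is the instance-count-weighted mean of the per-role steady utilizations; the helper name `cluster_average` is hypothetical.

```python
def cluster_average(n_prefill: int, p_util: float, n_decode: int, d_util: float) -> float:
    """Instance-weighted mean KV utilization (percent) across both roles."""
    total = n_prefill + n_decode
    return (n_prefill * p_util + n_decode * d_util) / total

# Steady-state per-role utilizations from the table above (percent).
avg_6p2d = cluster_average(6, 31.0, 2, 74.0)
avg_4p4d = cluster_average(4, 29.0, 4, 60.0)

if __name__ == "__main__":
    print(f"6P+2D cluster average: {avg_6p2d:.1f}%  (D-pool steady 74%, peak 99.6%)")
    print(f"4P+4D cluster average: {avg_4p4d:.1f}%  (D-pool steady 60%, peak 97.5%)")
```

Both averages land in the low-to-mid 40s, close to the cluster-average range quoted above, even though the decode pool in each config peaks near 100%.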

The mechanism, unified:

A static P:D split fixes the KV capacity on each side at deploy time.
The agentic workload's instantaneous P:D demand drifts (bursts of new sessions = prefill-heavy; long tool-call-driven turns = decode-heavy).
Whichever side is undersized for the current phase saturates and back-pressures the whole pipeline, while the other side's KV sits stranded.
- 6P+2D / 4P+4D → decode side too small → D-pool hits ~100%, prefilled requests queue for a decode slot → TTFT explodes (this is H1).
- 2P+6D → prefill side too small → P-pool hits ~100%, requests can't even start → 872 queued, 91% dropped.
8C colocation has no partition: prefill and decode share one pool, so the pool elastically reallocates to whichever phase is hot. Steady utilization stays at 34% with 100% completion.

This is H1 (D-pool capacity ceiling) and H2 (static-partition mismatch) turning out to be the same phenomenon seen from two ratios.
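The mechanism can be illustrated with a toy model. Everything in it is an assumption made for illustration: demand is expressed in abstract KV units per time step, the numbers are hypothetical, and there is no queueing or backlog. It compares how much demand a static P:D partition serves against a single shared pool of the same total size when demand alternates between prefill-heavy and decode-heavy steps.

```python
from dataclasses import dataclass

@dataclass
class Step:
    prefill_demand: float  # KV units wanted by prefill in this time step
    decode_demand: float   # KV units wanted by decode in this time step

def served_static(steps, p_cap: float, d_cap: float) -> float:
    """Each side can only use its own fixed share of the capacity."""
    return sum(
        min(s.prefill_demand, p_cap) + min(s.decode_demand, d_cap) for s in steps
    )

def served_elastic(steps, total_cap: float) -> float:
    """One shared pool: capacity flows to whichever phase is hot."""
    return sum(min(s.prefill_demand + s.decode_demand, total_cap) for s in steps)

# Hypothetical drifting demand: one prefill-heavy step, one decode-heavy step.
DRIFT = [Step(6, 1), Step(1, 6)]
TOTAL = 8

if __name__ == "__main__":
    for p_cap, d_cap in [(6, 2), (4, 4), (2, 6)]:
        print(f"static {p_cap}P+{d_cap}D served:", served_static(DRIFT, p_cap, d_cap))
    print("elastic pool served: ", served_elastic(DRIFT, TOTAL))
```

In this toy every static split leaves some demand unserved in at least one step, while the shared pool serves all of it. It says nothing about real magnitudes; those come from the measurements above.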

5.1 The same pressure crashes consumers (a vLLM 0.18.1 fragility)

D-pool saturation doesn't just slow things down — under this workload it crashes the decode instances. The exact chain, from the 6P+2D consumer logs:

D-pool fills to 97.2% (the capacity ceiling above).
A large request needs its KV pulled to the consumer, but the transfer fails: Mooncake transfer engine returned -1 (observed on a 112,793-token request — agentic sessions have very long multi-turn contexts, and the pool had no room).
kv_load_failure_policy=fail fails that request — by itself recoverable.
But the failure path computes PromptTokenStats.local_cache_hit = num_cached + recomputed − num_external_computed, which goes negative when the external transfer exceeded the scheduler's cached count.
loggers.record() calls Counter.inc(negative) → prometheus_client raises "Counters can only be incremented by non-negative amounts" → the EngineCore dies.
Once the consumer's engine is dead, every subsequent request fails.

The signature is a cliff, not a slope: in the session-routing 6P+2D run, all 80 successes landed in the first ~110 s, then zero over the next ~2,800 s. This same intermittent consumer death is almost certainly why the round-robin 6P+2D reps varied so wildly (100% / 56% / 80%): the consumer crashed at a different point in each rep.

Two takeaways: (a) PD-disagg under agentic context lengths hits KV-transfer failures that colocation never does (8C never transfers — it prefills and decodes in the same pool); (b) vLLM 0.18.1's failure handling amplifies one failed request into a total collapse. We patched the counter underflow (instrument_kv_snapshot.py, clamp to ≥ 0) so a transfer failure stays a single failed request, which is required to compare routing arms fairly in §6.
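The clamp can be sketched as follows. This is not the patch in instrument_kv_snapshot.py, which is not reproduced here: the function names are hypothetical, and `StrictCounter` is a stand-in that mimics a Prometheus-style counter rejecting negative increments so the example runs without prometheus_client installed.

```python
class StrictCounter:
    """Stand-in counter: like a Prometheus counter, it rejects negative increments."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount

def local_cache_hit(num_cached: int, recomputed: int, num_external_computed: int) -> int:
    """The expression from the failure path; it can go negative."""
    return num_cached + recomputed - num_external_computed

def record_clamped(counter: StrictCounter, num_cached: int, recomputed: int,
                   num_external_computed: int) -> int:
    """Clamp to >= 0 before incrementing, so one failed transfer cannot raise."""
    raw = local_cache_hit(num_cached, recomputed, num_external_computed)
    counter.inc(max(0, raw))
    return raw

if __name__ == "__main__":
    counter = StrictCounter()
    # Hypothetical token counts where the external transfer exceeds the cached count.
    raw = record_clamped(counter, num_cached=1000, recomputed=0,
                         num_external_computed=4000)
    print("raw value:", raw, "counter after clamped inc:", counter.value)
```

Clamping hides the negative value from the metric rather than fixing the accounting that produced it, which is enough to keep a single failed request from taking down the engine.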

6. The routing handicap — and whether smarter routing rescues PD

🛑 RETRACTED (2026-05-30) — this entire section is an artifact of e13391e. The session-affinity runs below were starved by the producer-eviction bug, so they could never collect prefix-cache reuse. On the corrected stack session-affinity reaches APC parity with colo (71–82%) and does not stall at 0% GPU util. The real mechanism is the capacity/concurrency crossover, not a routing pathology — see the CORRECTION banner at the top and figs/mb5_pd_ablation/. Text kept below for the record only.

Every PD config above shows prefix-cache hit = 0%, versus 8C's 19%. That is not fundamental to disaggregation — it is the stock proxy round-robining the prefill side: consecutive turns of one agentic session land on different producers, so each turn re-prefills the whole conversation from scratch. That both inflates TTFT and piles extra load on the prefill pool (directly worsening the 2P+6D collapse).

The correct PD scheduling policy (as the design argues): P should be chosen by session affinity (reuse the producer's prefix cache) while D is chosen by load balance (decode KV is freshly transferred per turn, so D gains nothing from affinity). We added this as an env-gated mode in the proxy (MB5_P_ROUTING=session, consistent hash on X-Session-Id; D stays round-robin) and swept it across all four P:D ratios. All runs below are on the metrics-fixed stack (§5.1 clamp), so consumers no longer crash and failures are genuine KV-transfer/capacity failures — an apples-to-apples comparison of the two routing policies.
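The policy described above can be sketched as a small router. This is not the proxy's actual code: only the session-keyed choice of P and the round-robin choice of D come from the text, while the class names, the SHA-256 hash, and the 64 virtual nodes per producer are assumptions made for illustration.

```python
import bisect
import hashlib
import itertools

def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64) -> None:
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [k for k, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring position clockwise of the key's hash, wrapping at the end.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

class PDRouter:
    """P by session affinity (reuse prefix cache), D by round-robin."""

    def __init__(self, producers, consumers) -> None:
        self._ring = HashRing(producers)
        self._next_consumer = itertools.cycle(consumers)

    def route(self, session_id: str):
        return self._ring.lookup(session_id), next(self._next_consumer)

if __name__ == "__main__":
    router = PDRouter(["p0", "p1", "p2", "p3"], ["d0", "d1", "d2", "d3"])
    for turn in range(3):
        # session_id would come from the X-Session-Id header.
        print("turn", turn, "->", router.route("session-42"))
```

Every turn of a session lands on the same producer while consecutive requests rotate through the consumers. A ring is used instead of a plain modulo so that changing the producer count remaps only part of the sessions.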

6.1 Session-affinity does NOT rescue PD — it makes it worse

Config	rr success	session success	rr TTFT mean	direction
6P+2D	73%	59%	89 s	session worse
4P+4D	100%	36%	71 s	session much worse
3P+5D	—	24%	—	↓
2P+6D	9%*	19%	—	↓

* rr 2P+6D from the original sweep (prefill-bound, 9%).

Two results, both decisive:

At every ratio, session-affinity is worse than round-robin. The most damning point is 4P+4D, where round-robin completes 100% but session-affinity completes only 36%.
Session-affinity success decreases monotonically as you add decode capacity (59% → 36% → 24% → 19% going 6P+2D → 4P+4D → 3P+5D → 2P+6D). Adding D does not help — it hurts. This refutes the natural hypothesis ("session prefill is faster, so it needs more D").

6.2 The smoking gun: GPUs sit at ~0% utilization

During the session-affinity runs the cluster is not compute-bound — it is stalled. Sampled GPU utilization mid-run:

session 3P+5D :  0  0 100  0  0  0  0  0     (1 of 8 GPUs doing anything)
session 2P+6D :  0  0  0  0  0  0  0  0       (entirely idle)

Requests are piling up (transfer failures climbing into the hundreds) while the hardware you paid for does nothing. This is the deepest argument against PD-disagg for this workload: the binding constraint is KV-pool capacity and P→D transfer coordination, not FLOPs. Colocation (8C) keeps every GPU busy because prefill and decode interleave in one elastic pool with no cross-instance handoff.

6.3 Why session-affinity backfires (mechanism)

Session-affinity pins all turns of a session to one producer. Agentic sessions are heavy-tailed (a few very long multi-turn sessions — recall the 112k-token request in §5.1). Sticky routing concentrates those heavy sessions onto individual producers, whose KV pools fill and stall — the same hot-pinning pathology as sticky routing in the colocated study (§3.3), now on the producer side. Round-robin avoids it by spreading each session's turns across producers. With fewer producers (2P+6D), the concentration is worse, which is exactly why success keeps dropping as the ratio shifts D-ward. A failed transfer also pins the producer's KV (it is not freed on kv_load_failure_policy=fail), compounding the stall until the pipeline deadlocks at ~0% utilization.

The per-producer KV-pool timelines make the hot-pinning direct. At the same 4P+4D ratio, round-robin holds all four producers within 1 percentage point of each other (spread 0pp, CV 0.01); session-affinity blows the spread open to 49 percentage points (one producer pegged at ~93% while another sits at 45%, CV 0.25 — a 25× jump in load imbalance):

Producer-side prefix-cache hit in the degraded state is ~0.2% (vs round-robin's ~5%) — session-affinity never even gets to collect the cache-reuse benefit it was supposed to provide, because the producers it concentrates load onto are thrashing.

6.4 Verdict on routing

Neither ratio tuning (§3, no static split beats 8C) nor routing policy (§6, session-affinity is strictly worse and ratio-tuning it only makes it worse) rescues static PD-disaggregation for this agentic workload. The failure is structural: a static prefill/decode partition cannot track time-varying P:D demand, the cross-instance KV handoff adds a capacity-coupled failure mode absent in colocation, and the routing knob that helps colocation (affinity) actively hurts disaggregation (producer hotspots). Colocation wins on completion, latency, and hardware utilization.

7. Caveats / honesty

Single rep for this analysis. The earlier 3-rep round-robin sweep varied for 6P+2D (rep1 100% / rep2 56% / rep3 80%) — but §5.1 showed that variance was the consumer-crash bug, not genuine load behavior. On the metrics-fixed stack, round-robin 6P+2D completes a stable 73% (the unpatched "100% rep1" in §3's table was a lucky no-crash run). 8C and rr 4P+4D are tight run-to-run. The qualitative ranking is robust.
Latency percentiles count successes only (see §3 warning). For failing configs the latency bars understate the damage, and for the session-affinity runs, which stall at ~0% GPU util, the latency of the few survivors is especially unrepresentative.
Routing fairness addressed. §6 tests the "correct" PD routing (session-affinity P + load-balanced D) across all ratios; it does not rescue PD, so the round-robin baseline in §3 is not an unfair handicap on the conclusion.
Session-affinity ratio sweep used near-final partials (runs were stopped once the monotonic-decline trend and 0% GPU util were unambiguous, to save GPU time). Exact final percentages would shift by a few points; the trend and the stall are not in doubt.
Trace is a single agentic workload; conclusions are about this class of workload (sub-second tool-call cadence, multi-turn sessions), not all LLM serving.

8. Reproduce

# from repo root, after microbench/fresh_setup/deploy.sh dash1
# 1. round-robin baseline sweep (1 rep)
ssh dash1 'CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=1 RUN_TAG=<tag> \
    bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb5_run.sh'

# 2. reduce on dash1 (numpy-only; handles the multi-GB snapshot dirs)
ssh dash1 '.venv/bin/python scripts/aggregate_mb5.py --sweep-root mb5_runs \
    --tag <tag> --configs "8C 6P+2D 4P+4D 2P+6D" --reps 1 \
    --reduce-to mb5_runs/reduced_<tag>.json'

# 3. pull the compact JSON, render figures locally
scp dash1:.../mb5_runs/reduced_<tag>.json analysis/mb5/
.venv/bin/python microbench/fresh_setup/aggregate_mb5.py \
    --from-reduced analysis/mb5/reduced_<tag>.json --out-dir figs/mb5

# session-affinity arm: prefix the run with MB5_P_ROUTING=session
