agentic-kvc/microbench/fresh_setup/PD_DISAGG_RESULTS.md
Gahow Wang a2111b6e18 PD-disagg docs: annotated corrections for e13391e contamination
Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).

  PD_DISAGG_RESULTS.md  top CORRECTION banner + §6 RETRACTED marker.
                        §6 (session-affinity hot-pin) was an `e13391e`
                        artifact under controlled concurrency; §3 RR, §4
                        TPOT win, §5 D-pool ceiling, §5.1 consumer crash
                        stand.
  RESULTS_SUMMARY.md    §4 confirm+refine note: clean ablation confirms
                        the D-pool capacity thesis and adds regime-
                        dependence.
  pd_separation_analysis.md  scoped caution: thesis confirmed; flags
                        only reuse-dependent figures for cross-check
                        (this study used a different stack).
  figs/mb5/CORRECTION.md  flags mb5_producer_hotspot.png as retracted;
                        §3 RR and §5 D-pool figures stand.
2026-05-31 20:14:14 +08:00


# PD-disaggregation under an agentic workload — does it work?
**Consolidated results doc.** Self-contained writeup of every PD-disagg
argument and experiment, with figures inline. For the live experiment TODO
list see [PD_DISAGG_INVESTIGATION.md](PD_DISAGG_INVESTIGATION.md).
Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instruct
· vLLM 0.18.1 (V1, chunked-prefill on) · Mooncake 0.3.11 · Trace:
`w600_r0.0015_st30.jsonl` (1214 requests, agentic multi-turn).
---
## ⚠️ CORRECTION (2026-05-30) — read before §6
A contamination was found in the "fresh" vLLM used for the runs below.
`scripts/deploy_vllm_patches.sh` had copied our fork commit **`e13391e`** over the
pip-installed release; that commit calls `evict_blocks(sent_block_ids)` on
`finished_sending`, i.e. it **evicts a producer's prefix-cache blocks on every KV
transfer**. So a disaggregated producer could never keep a session's prefix warm,
*regardless of routing*. We have since gated that behind `VLLM_EVICT_SENT_BLOCKS`
(default off) and re-run everything on the corrected stack.
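The gate described above can be pictured as a small opt-in check around the eviction call. This is a minimal sketch, not the fork's actual code: only `VLLM_EVICT_SENT_BLOCKS`, `evict_blocks(sent_block_ids)`, and `finished_sending` come from this document; the function names, the accepted truthy strings, and the injected `env` mapping are assumptions for illustration.

```python
import os


def evict_sent_blocks_enabled(env=os.environ):
    """Return True only when the eviction opt-in is explicitly set.

    The accepted values ("1", "true", "yes") are an assumption.
    """
    value = env.get("VLLM_EVICT_SENT_BLOCKS", "0")
    return value.strip().lower() in {"1", "true", "yes"}


def on_finished_sending(sent_block_ids, evict_blocks, env=os.environ):
    """Hypothetical hook run when a producer finishes a KV transfer.

    Returns the number of blocks evicted.
    """
    if not evict_sent_blocks_enabled(env):
        # Default path: keep the producer's prefix-cache blocks warm.
        return 0
    evict_blocks(sent_block_ids)
    return len(sent_block_ids)
```

With the variable unset the hook is a no-op, which is the behavior the corrected runs rely on.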
**Retracted (was a pure artifact of `e13391e`):**
- **All of §6** ("smarter routing does not save PD" / "session-affinity is
*strictly worse*" / "GPUs at ~0%" / "producer hot-pinning" / "producer prefix-hit
~0.2%"). On the corrected stack, **session-affinity recovers producer reuse to
full parity with colo (APC 71–82%)**; the collapse was the eviction bug starving
the very cache session-affinity exists to fill, not a routing pathology.
- The framing that PD reuse is "0% / fundamentally broken." PD reuses prefix
*exactly as well as colo* once routing is session-sticky.
**Still stands (independent of `e13391e`):**
- **§3 round-robin** numbers — RR sends consecutive turns to *different* producers,
so its ~0% prefix-hit is a **routing** artifact (not the eviction bug) and is
reproduced on the clean stack; RR PD still loses to 8C.
- **§4** PD wins TPOT (decode isolation) — robust.
- **§5.1** the consumer counter-underflow crash — a real, separate vLLM 0.18.1 bug.
- **§5** the D-pool capacity-ceiling mechanism (decode side pegs while prefill
strands) — real.
**Corrected verdict (the real reason PD loses on agentic).** It is *not* "routing
can't help." On the clean stack PD is **regime-dependent**: it *wins* at low
load / decode-heavy / low-reuse, and loses the **agentic corner** (high reuse +
short output + large context + high concurrency) through a structural crossover —
its static P:D split cannot simultaneously provide the prefix-cache capacity
(needs many producers) *and* the decode capacity (needs many decoders) that
agentic demands at once, while colo's elastic pool provides both. See the
three-axis ablation: **reuse** erodes the edge (1.57×→1.10×), **shape** rotates the
best ratio and is catastrophic at the prefill extreme, and **concurrency** tips PD
at N=64 (APC craters 71%→1.4%, TPS −30%) while colo scales cleanly.
→ Figures: [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/) ·
data: [`analysis/mb5_pd_ablation/`](../../analysis/mb5_pd_ablation/) ·
the clean re-run of *this exact* w600 experiment (ratio-swept) is the Fig 4 anchor.
---
## TL;DR (verdict)
**No static prefill/decode split beats 8-way colocation (8C) on this agentic
workload.** Every disaggregated ratio we tried is dominated by 8C on the
metric the user actually feels (TTFT, end-to-end latency, request
completion), and the failure *moves* with the ratio:
- **Decode-side bottleneck** (6P+2D, 4P+4D): the decode pool saturates (peak
**99.6% / 97.5%**) while the prefill pool sits at **~30%** — half the
cluster's KV is stranded on the wrong side.
- **Prefill-side bottleneck** (2P+6D): the 2 prefill instances can't keep up,
the prefill pool jams at **99.7%**, **872 requests** pile up in the queue
and **91% of requests never complete**.
- **8C** keeps a single elastic pool that absorbs whichever phase is hot at
the moment → steady utilization **34%**, **100% completion**, fastest
wall-clock, best p50/p90 latency.
PD-disagg *does* deliver the phase-isolation win we predicted in MB1 — its
**TPOT is 10–35× cleaner**, but that win is swamped by TTFT inflation and
request loss.
**Smarter routing does not save it (§6).** We added the "correct" PD policy —
session-affinity on the prefill side to recover prefix-cache reuse, load-balance
on decode — and swept it across all four ratios. It is *strictly worse* than
round-robin at every ratio (4P+4D: 100% → 36% completion), success *decreases*
as you add decode capacity (59→36→24→19%), and the GPUs sit at **~0%
utilization** — the cluster stalls on KV-transfer coordination, not compute.
Session-affinity reproduces the producer **hot-pinning** pathology from §3.3.
This is the empirical backing for the paper's claim: **agentic workloads
have time-varying P:D demand that no static partition can track; colocation
wins because its pool is elastic — and no routing knob rescues the static
split.** (H1 *and* H2 from the investigation doc, unified by one mechanism.)
---
## 1. Why this experiment exists
Earlier cost accounting (MB1 phase-interference, MB2 KV-transfer cost) showed
that on the **phase-isolation axis alone**, PD-disagg actually *wins*: it
removes prefill→decode interference, and the transfer cost is small relative
to the interference it avoids. So "PD-disagg is bad for agentic" could not be
argued from phase isolation — we needed a system-level experiment that
measures the whole picture (queueing, pool capacity, cache reuse), not just
the isolated phase cost.
See [analysis/mb1](../../analysis/mb1) and [analysis/mb2](../../analysis/mb2)
for that accounting. This doc is the system-level answer.
---
## 2. Setup
| | |
|---|---|
| Configs | `8C` (8× kv_both colo), `6P+2D`, `4P+4D`, `2P+6D` (prefill+decode split) |
| PD routing | stock **round-robin** on both P and D (vLLM official `mooncake_connector_proxy`) |
| Trace | `w600_r0.0015_st30.jsonl`, 1214 requests, agentic multi-turn |
| Reps | 1 (rep1) for this analysis; the 3-rep sweep confirmed run-to-run consistency before we converged on rep1 for iteration speed |
| KV instrumentation | V1 scheduler patched to dump per-request KV block allocation every 100 ms per EngineCore (see `instrument_kv_snapshot.py`) |
8C is the fair baseline: 8 colocated instances, replayer round-robins across
them directly (no proxy). PD configs route through the proxy.
---
## 3. Headline result — no PD ratio beats 8C
All numbers are rep1.
| Metric | **8C** | 6P+2D | 4P+4D | 2P+6D |
|---|---|---|---|---|
| **completion** | **100%** | 100% | 100% | **9%** 💀 |
| wall-clock (drain trace) | **2994 s** | 3419 s | 4171 s | 5762 s |
| prefix-cache hit | **19.4%** | 0% | 0% | 0% |
| TTFT mean | **18.0 s** | 44.8 s | 70.0 s | 106.8 s |
| TTFT p50 | **7.0 s** | 41.0 s | 56.4 s | 23.6 s |
| TTFT p90 | **53.1 s** | 86.7 s | 153.1 s | 498 s |
| E2E p50 | **10.8 s** | 44.5 s | 59.5 s | 26.3 s |
| E2E p90 | **83.3 s** | 91.8 s | 157.1 s | 499 s |
![e2e latency by config](../../figs/mb5/mb5_latency_compare.png)
> ⚠️ **Read the percentiles with the completion rate.** Latency percentiles
> are computed over *successful* requests only. 2P+6D's "p99 = 577 s" covers
> just the 9% that finished — the other 91% never returned, so its real
> experience is far worse than any latency bar suggests.
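
The warning above can be made concrete with a small sketch that reports a percentile next to the completion rate, and also shows what the same percentile looks like when unfinished requests are counted as never returning. The nearest-rank percentile definition and all function names are assumptions; the actual reduction in `aggregate_mb5.py` may differ.

```python
import math


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def success_only(latencies_s, total_requests, q=90):
    """Percentile over finished requests, paired with the completion rate."""
    done = sorted(latencies_s)
    return len(done) / total_requests, percentile(done, q)


def counting_failures(latencies_s, total_requests, q=90):
    """Same percentile, with every unfinished request treated as infinite."""
    missing = total_requests - len(latencies_s)
    padded = sorted(latencies_s) + [math.inf] * missing
    return percentile(padded, q)


# Hypothetical data: 10 requests issued, only 2 ever finish.
print(success_only([20.0, 30.0], 10))       # (0.2, 30.0)
print(counting_failures([20.0, 30.0], 10))  # inf
```

A finite success-only p90 alongside a 20% completion rate is exactly the situation in which the latency bar flatters the config.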
8C wins p50 by **4×** and p90 decisively. The only metric where a PD config
edges 8C is E2E **p99** (6P+2D 148 s vs 8C 194 s) — and that is the flip side
of the next result.
---
## 4. The duality — PD wins TPOT, loses TTFT
PD-disagg delivers exactly the phase-isolation benefit MB1 predicted: with no
prefill stealing decode steps, **inter-token latency is dramatically cleaner.**
| TPOT | **8C** | 6P+2D | 4P+4D | 2P+6D |
|---|---|---|---|---|
| mean | 87 ms | 11 ms | 9 ms | 6 ms |
| p90 | 230 ms | 18 ms | 14 ms | 8 ms |
| p99 | **1129 ms** | **26 ms** | **20 ms** | **12 ms** |
PD's TPOT p99 is **10–35× lower**: once a request reaches a dedicated decode
instance it streams without interruption. 8C's 1.1 s TPOT p99 *is* the
chunked-prefill interference tax (decode steps occasionally stalled behind an
8k-token prefill chunk), consistent with MB1.
**But the win is local.** TTFT inflates 2.5–6× because every request now pays
P→D handoff + admission into a smaller, saturated decode pool. For this
workload's modest output lengths, TTFT dominates total time, so the TPOT win
never pays for itself. This is the cost/benefit imbalance made concrete:
phase isolation is real, but it is the wrong thing to optimize when the pool
is the binding constraint.
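A rough way to see why the TPOT win does not pay for itself is to ask how many output tokens a request would need before PD's lower per-token cost recovers its extra TTFT. The sketch below assumes the simple model E2E ≈ TTFT + n_out × TPOT and plugs in the *mean* TTFT and TPOT values from the tables in §3 and §4. Mixing means this way is only an approximation, and `break_even_tokens` is a hypothetical helper, not part of the benchmark code.

```python
# Mean TTFT (s) and mean TPOT (s/token), copied from the §3 and §4 tables.
MEANS = {
    "8C": (18.0, 0.087),
    "6P+2D": (44.8, 0.011),
    "4P+4D": (70.0, 0.009),
    "2P+6D": (106.8, 0.006),
}


def break_even_tokens(pd_config, baseline="8C"):
    """Output length at which the PD config's E2E equals the baseline's,
    under the assumed model E2E = TTFT + n_out * TPOT."""
    ttft_base, tpot_base = MEANS[baseline]
    ttft_pd, tpot_pd = MEANS[pd_config]
    return (ttft_pd - ttft_base) / (tpot_base - tpot_pd)


for name in ("6P+2D", "4P+4D", "2P+6D"):
    print(f"{name}: break-even at ~{break_even_tokens(name):.0f} output tokens")
```

Under these assumptions the break-even points come out at roughly 350, 670, and 1,100 output tokens respectively, so a request with a shorter output than that is better off on 8C.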
---
## 5. Root cause — per-role KV pool occupancy (the kill shot)
The cluster-average KV utilization is *misleading* and nearly hid the result:
![cluster KV timeline](../../figs/mb5/mb5_kv_timeline.png)
6P+2D and 4P+4D look only ~42–46% utilized on cluster average, yet they have
128–152 requests queued. The average hides that **one pool is pegged while
the other idles.** Splitting the KV pool by role exposes it:
![per-role KV pool: P-pool vs D-pool](../../figs/mb5/mb5_role_split.png)
| Config | P-pool steady | D-pool steady | D-pool **peak** | binding side |
|---|---|---|---|---|
| 8C | — single shared pool — | 34% | 72% | none (elastic) |
| 6P+2D | 31% | **74%** | **99.6%** | **decode** |
| 4P+4D | 29% | **60%** | **97.5%** | **decode** |
| 2P+6D | **92%** | 95% | 96% | **prefill** (P jams first) |
![peak vs steady utilization](../../figs/mb5/mb5_peak_utilization.png)
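The way a cluster average masks a pegged pool can be reproduced from the steady-state columns of the table above. This sketch assumes every instance has the same KV capacity, so the cluster average is just the instance-weighted mean of the two role pools; `cluster_average` is a hypothetical helper.

```python
def cluster_average(p_instances, p_util, d_instances, d_util):
    """Instance-weighted mean utilization, assuming equal KV capacity per instance."""
    total = p_instances + d_instances
    return (p_instances * p_util + d_instances * d_util) / total


# Steady-state percentages taken from the per-role table.
print(cluster_average(6, 31, 2, 74))  # 6P+2D
print(cluster_average(4, 29, 4, 60))  # 4P+4D
```

Both results land in the low-to-mid 40s even though the decode pool alone sits at 60–74% steady and peaks near 100%, which is why the per-role split is the view to trust.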
**The mechanism, unified:**
- A static P:D split fixes the KV capacity on each side at deploy time.
- The agentic workload's instantaneous P:D demand *drifts* (bursts of new
sessions = prefill-heavy; long tool-call-driven turns = decode-heavy).
- Whichever side is undersized *for the current phase* saturates and
back-pressures the whole pipeline, while the other side's KV sits stranded.
- 6P+2D / 4P+4D → decode side too small → D-pool hits ~100%, prefilled
requests queue for a decode slot → TTFT explodes (this is **H1**).
- 2P+6D → prefill side too small → P-pool hits ~100%, requests can't even
start → 872 queued, 91% dropped.
- **8C colocation has no partition**: prefill and decode share one pool, so
the pool elastically reallocates to whichever phase is hot. Steady
utilization stays at 34% with 100% completion.
This is **H1 (D-pool capacity ceiling)** and **H2 (static-partition
mismatch)** turning out to be the *same* phenomenon seen from two ratios.
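The mechanism above reduces to a feasibility check: a static split must satisfy prefill and decode demand separately, while a shared pool only has to satisfy their sum. The toy model below uses made-up demand units (not measured values) and hypothetical function names, and ignores queueing and transfer cost entirely.

```python
def static_split_ok(p_demand, d_demand, p_cap, d_cap):
    """A static partition needs each side to fit within its own capacity."""
    return p_demand <= p_cap and d_demand <= d_cap


def elastic_pool_ok(p_demand, d_demand, total_cap):
    """A shared pool only needs the combined demand to fit."""
    return p_demand + d_demand <= total_cap


def surviving_splits(phases, total_cap):
    """Every (P, D) split of total_cap that serves all demand phases."""
    return [
        (p_cap, total_cap - p_cap)
        for p_cap in range(1, total_cap)
        if all(static_split_ok(p, d, p_cap, total_cap - p_cap) for p, d in phases)
    ]


# Hypothetical drifting demand: a prefill burst, then a decode-heavy stretch.
phases = [(5, 1), (1, 5)]
print(surviving_splits(phases, total_cap=8))             # []
print(all(elastic_pool_ok(p, d, 8) for p, d in phases))  # True
```

No split of 8 units survives both phases, yet the elastic pool never exceeds 6 of its 8 units: the same shape as the 6P+2D / 2P+6D / 8C result, with all the real-world detail stripped out.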
### 5.1 The same pressure crashes consumers (a vLLM 0.18.1 fragility)
D-pool saturation doesn't just slow things down — under this workload it
**crashes the decode instances**. The exact chain, from the 6P+2D consumer
logs:
1. D-pool fills to **97.2%** (the capacity ceiling above).
2. A large request needs its KV pulled to the consumer, but the transfer
fails: `Mooncake transfer engine returned -1` (observed on a **112,793-token**
request — agentic sessions have very long multi-turn contexts, and the
pool had no room).
3. `kv_load_failure_policy=fail` fails that request — by itself recoverable.
4. **But** the failure path computes `PromptTokenStats.local_cache_hit =
num_cached + recomputed − num_external_computed`, which goes **negative**
when the external transfer exceeded the scheduler's cached count.
5. `loggers.record()` calls `Counter.inc(negative)` → prometheus_client raises
*"Counters can only be incremented by non-negative amounts"* → the
**EngineCore dies**.
6. Once the consumer's engine is dead, **every** subsequent request fails.
The signature is a cliff, not a slope: in the session-routing 6P+2D run, all
80 successes landed in the first ~110 s, then **zero** of the next ~2,800 s.
This same intermittent consumer death is almost certainly why the
round-robin 6P+2D reps varied so wildly (100% / 56% / 80%) — the consumer
crashed at different points in each rep.
**Two takeaways:** (a) PD-disagg under agentic context lengths hits KV-transfer
failures that colocation never does (8C never transfers — it prefills and
decodes in the same pool); (b) vLLM 0.18.1's failure handling amplifies one
failed request into a total collapse. We patched the counter underflow
(`instrument_kv_snapshot.py`, clamp to ≥ 0) so a transfer failure stays a
single failed request, which is required to compare routing arms fairly in §6.
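The underflow and the clamp can be sketched in a few lines. The `Counter` class here is a minimal stand-in that only mimics the non-negative rule quoted in step 5, so the example runs without `prometheus_client`; the function names are hypothetical and the real patch may be structured differently.

```python
class Counter:
    """Stand-in for a Prometheus-style counter: rejects negative increments."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def local_cache_hit_unclamped(num_cached, recomputed, num_external_computed):
    """The expression from step 4; negative when the external count dominates."""
    return num_cached + recomputed - num_external_computed


def local_cache_hit_clamped(num_cached, recomputed, num_external_computed):
    """Same expression, clamped to >= 0 so a failed transfer cannot underflow."""
    return max(0, num_cached + recomputed - num_external_computed)


counter = Counter()
counter.inc(local_cache_hit_clamped(100, 0, 500))  # increments by 0, no exception
```

Clamping loses the exact accounting for the failed request, but it keeps the failure confined to that request instead of taking down the engine.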
---
## 6. The routing handicap — and whether smarter routing rescues PD
> 🛑 **RETRACTED (2026-05-30) — this entire section is an artifact of `e13391e`.**
> The session-affinity runs below were starved by the producer-eviction bug, so
> they could never collect prefix-cache reuse. On the corrected stack
> session-affinity reaches **APC parity with colo (71–82%)** and does *not* stall
> at 0% GPU util. The real mechanism is the capacity/concurrency crossover, not a
> routing pathology — see the CORRECTION banner at the top and
> [`figs/mb5_pd_ablation/`](../../figs/mb5_pd_ablation/). Text kept below for the
> record only.
Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That
is not fundamental to disaggregation — it is the stock proxy round-robining
the **prefill** side: consecutive turns of one agentic session land on
*different* producers, so each turn re-prefills the whole conversation from
scratch. That both inflates TTFT and piles extra load on the prefill pool
(directly worsening the 2P+6D collapse).
The correct PD scheduling policy (as the design argues): **P should be chosen
by session affinity** (reuse the producer's prefix cache) while **D is chosen
by load balance** (decode KV is freshly transferred per turn, so D gains
nothing from affinity). We added this as an env-gated mode in the proxy
(`MB5_P_ROUTING=session`, consistent hash on `X-Session-Id`; D stays
round-robin) and swept it across **all four P:D ratios**. All runs below are on
the **metrics-fixed stack** (§5.1 clamp), so consumers no longer crash and
failures are genuine KV-transfer/capacity failures — an apples-to-apples
comparison of the two routing policies.
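The policy described above (sticky producer per session, round-robin consumer) can be sketched as follows. This is an illustration only: the proxy's real hashing scheme is not shown in this document, so the sketch uses a stable SHA-256 hash modulo the producer count, which is simpler than a true consistent-hash ring. The `Router` class and instance names are hypothetical.

```python
import hashlib
import itertools


class Router:
    """Toy P/D router: session-sticky producers, round-robin consumers."""

    def __init__(self, producers, consumers):
        self.producers = list(producers)
        self._consumer_cycle = itertools.cycle(list(consumers))

    def pick_producer(self, session_id):
        # Stable across processes, unlike Python's built-in hash() for str.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.producers)
        return self.producers[index]

    def pick_consumer(self):
        # Decode KV is transferred fresh each turn, so no affinity is needed.
        return next(self._consumer_cycle)


router = Router(["p0", "p1", "p2", "p3"], ["d0", "d1", "d2", "d3"])
# Every turn of one session (the X-Session-Id value) maps to the same producer.
assert router.pick_producer("session-42") == router.pick_producer("session-42")
```

A hash-modulo mapping reshuffles most sessions when the producer count changes; a hash ring would limit that churn, which matters if producers are added or removed mid-run.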
### 6.1 Session-affinity does NOT rescue PD — it makes it worse
| Config | rr success | **session success** | rr TTFT mean | direction |
|---|---|---|---|---|
| 6P+2D | 73% | **59%** | 89 s | session worse |
| 4P+4D | **100%** | **36%** | 71 s | session much worse |
| 3P+5D | — | **24%** | — | ↓ |
| 2P+6D | 9%* | **19%** | — | ↓ |
\* rr 2P+6D from the original sweep (prefill-bound, 9%).
Two results, both decisive:
1. **At every ratio, session-affinity is worse than round-robin.** The most
damning point is 4P+4D, where round-robin completes **100%** but
session-affinity completes only **36%**.
2. **Session-affinity success *decreases monotonically* as you add decode
capacity** (59% → 36% → 24% → 19% going 6P+2D → 4P+4D → 3P+5D → 2P+6D).
Adding D does not help — it hurts. This refutes the natural hypothesis
("session prefill is faster, so it needs more D").
### 6.2 The smoking gun: GPUs sit at ~0% utilization
During the session-affinity runs the cluster is **not compute-bound — it is
stalled**. Sampled GPU utilization mid-run:
```
session 3P+5D : 0 0 100 0 0 0 0 0 (1 of 8 GPUs doing anything)
session 2P+6D : 0 0 0 0 0 0 0 0 (entirely idle)
```
Requests are piling up (transfer failures climbing into the hundreds) while
**the hardware you paid for does nothing.** This is the deepest argument
against PD-disagg for this workload: the binding constraint is KV-pool
capacity and P→D transfer coordination, not FLOPs. Colocation (8C) keeps every
GPU busy because prefill and decode interleave in one elastic pool with no
cross-instance handoff.
### 6.3 Why session-affinity backfires (mechanism)
Session-affinity pins **all turns of a session to one producer**. Agentic
sessions are heavy-tailed (a few very long multi-turn sessions — recall the
112k-token request in §5.1). Sticky routing concentrates those heavy sessions
onto individual producers, whose KV pools fill and stall — the **same
hot-pinning pathology as sticky routing in the colocated study (§3.3)**, now on
the producer side. Round-robin avoids it by spreading each session's turns
across producers. With *fewer* producers (2P+6D), the concentration is worse,
which is exactly why success keeps dropping as the ratio shifts D-ward. A
failed transfer also pins the producer's KV (it is not freed on
`kv_load_failure_policy=fail`), compounding the stall until the pipeline
deadlocks at ~0% utilization.
The per-producer KV-pool timelines make the hot-pinning direct. At the **same
4P+4D ratio**, round-robin holds all four producers within **1 percentage
point** of each other (spread 0pp, CV 0.01); session-affinity blows the spread
open to **49 percentage points** (one producer pegged at ~93% while another
sits at 45%, CV 0.25 — a 25× jump in load imbalance):
![per-producer KV pool: round-robin vs session-affinity](../../figs/mb5/mb5_producer_hotspot.png)
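The two imbalance numbers quoted above, spread in percentage points and coefficient of variation (CV), can be computed from per-producer utilization as sketched below. Whether the original analysis used population or sample standard deviation is not stated, so the population form is an assumption, and the sample values are hypothetical rather than the measured per-producer series.

```python
import statistics


def imbalance(utils_pct):
    """Return (spread in percentage points, coefficient of variation)."""
    spread_pp = max(utils_pct) - min(utils_pct)
    cv = statistics.pstdev(utils_pct) / statistics.fmean(utils_pct)
    return spread_pp, cv


# Hypothetical per-producer steady utilization (%), four producers each.
balanced = [50, 50, 51, 50]
hot_pinned = [93, 45, 60, 70]

print(imbalance(balanced))
print(imbalance(hot_pinned))
```

Spread captures the worst-case gap between two producers, while CV normalizes by the mean, so the two metrics together distinguish one hot outlier from a uniformly noisy pool.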
Producer-side prefix-cache hit in the degraded state is ~0.2% (vs round-robin's
~5%) — session-affinity never even gets to *collect* the cache-reuse benefit it
was supposed to provide, because the producers it concentrates load onto are
thrashing.
### 6.4 Verdict on routing
Neither **ratio tuning** (§3, no static split beats 8C) nor **routing policy**
(§6, session-affinity is strictly worse and ratio-tuning it only makes it
worse) rescues static PD-disaggregation for this agentic workload. The failure
is **structural**: a static prefill/decode partition cannot track time-varying
P:D demand, the cross-instance KV handoff adds a capacity-coupled failure mode
absent in colocation, and the routing knob that helps colocation (affinity)
actively hurts disaggregation (producer hotspots). Colocation wins on
completion, latency, *and* hardware utilization.
---
## 7. Caveats / honesty
- **Single rep** for this analysis. The earlier 3-rep round-robin sweep
varied for 6P+2D (rep1 100% / rep2 56% / rep3 80%) — but §5.1 showed that
variance was the *consumer-crash bug*, not genuine load behavior. On the
metrics-fixed stack, round-robin 6P+2D completes a stable **73%** (the
unpatched "100% rep1" in §3's table was a lucky no-crash run). 8C and rr
4P+4D are tight run-to-run. The qualitative ranking is robust.
- **Latency percentiles count successes only** (see §3 warning). For failing
configs the latency bars *understate* the damage — and for the session-
affinity runs, which stall at ~0% GPU util, the latency of the few survivors
is especially unrepresentative.
- **Routing fairness addressed.** §6 tests the "correct" PD routing
(session-affinity P + load-balanced D) across all ratios; it does not rescue
PD, so the round-robin baseline in §3 is not an unfair handicap on the
conclusion.
- **Session-affinity ratio sweep used near-final partials** (runs were stopped
once the monotonic-decline trend and 0% GPU util were unambiguous, to save
GPU time). Exact final percentages would shift by a few points; the trend
and the stall are not in doubt.
- Trace is a single agentic workload; conclusions are about *this* class of
workload (sub-second tool-call cadence, multi-turn sessions), not all LLM
serving.
---
## 8. Reproduce
```bash
# from repo root, after microbench/fresh_setup/deploy.sh dash1
# 1. round-robin baseline sweep (1 rep)
ssh dash1 'CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=1 RUN_TAG=<tag> \
bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb5_run.sh'
# 2. reduce on dash1 (numpy-only; handles the multi-GB snapshot dirs)
ssh dash1 '.venv/bin/python scripts/aggregate_mb5.py --sweep-root mb5_runs \
--tag <tag> --configs "8C 6P+2D 4P+4D 2P+6D" --reps 1 \
--reduce-to mb5_runs/reduced_<tag>.json'
# 3. pull the compact JSON, render figures locally
scp dash1:.../mb5_runs/reduced_<tag>.json analysis/mb5/
.venv/bin/python microbench/fresh_setup/aggregate_mb5.py \
--from-reduced analysis/mb5/reduced_<tag>.json --out-dir figs/mb5
# session-affinity arm: prefix the run with MB5_P_ROUTING=session
```