
Gahow Wang da39ab6804 Correct PD-disagg cost/benefit framing across repo

The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 22:04:49 +08:00


MB1 — Prefill–Decode Interference (chunked-prefill on, vLLM 0.18.1 default)

Persistent record of the phase-interference microbench used to put a quantitative upper bound on what PD-disaggregation can buy under the chunked-prefill-on baseline. Re-runs append a dated section at the bottom; the Summary block is what gets cited.

Summary (latest)

Headline	Value
Baseline single-stream TPOT (D=1, idle GPU)	4.8 ms
Effective per-stream TPOT during 8k-token prefill burst (D=8)	114 ms (≈15× the D=8 baseline of 7.7 ms)
Effective per-stream TPOT during 32k-token prefill burst (D=8)	388 ms (≈52×)
Effective per-stream TPOT during 131k-token prefill burst (D=8)	1419 ms (≈183×)

What MB1 actually measures:

During a prefill burst, every ongoing decode stream is essentially halted: per-stream effective TPOT rises to between 4× and roughly 2000× baseline across the sweep below, scaling with prefill size. The total decode time lost per prefill event is D × T_prefill (each of the D concurrent decodes loses ~T_prefill of useful work). For the trace mean (P ≈ 33k tokens, T_prefill ≈ 4.5 s) at D=8, that is ~36 seconds of decode-equivalent work lost per prefill event. This is the upper bound on what PD-disaggregation's phase isolation could recover on the decode side.

⚠ Correction (2026-05-27): an earlier version of this README framed the §3.2 PD-disagg argument as "phase-isolation benefit is capped at the decode duration of the new request (~50–200 ms), so MB2 transfer cost dominates". That framing was wrong. The correct accounting is benefit-per-prefill-event = D × T_prefill (aggregate decode time saved across all stalled streams), which is much larger than per-request transfer cost. The actual reason static PD-disagg fails in agentic is D-side KV pool capacity (figs/f4b_pdsep_kv_wall.png), not a cost-vs-benefit imbalance on phase isolation. See RESULTS_SUMMARY.md section 4 for the corrected framing.

Setup

Component	Value
Host	dash1, H20 96 GiB, driver 570.133.20
Venv	`/home/admin/cpfs/wjh/agentic-kv-fresh/.venv`
vLLM	0.18.1 official wheel (chunked-prefill default-on, V1 engine)
Model	`/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`
Launch flags	`--tensor-parallel-size 1 --enable-prefix-caching --gpu-memory-utilization 0.9 --max-model-len 200000 --max-num-batched-tokens 8192`
kv_connector	none (this measures pure single-GPU phase interference; PD-disagg cost lives in MB2)

Method

Adapted from microbench/interference/driver.py:

Start D streaming decode requests on /v1/chat/completions with a long max_tokens cap. Discard the first 32 tokens as warmup.
After 1 s, inject one prefill-only request with max_tokens=1 and an input of P synthetic tokens (uuid-seeded for zero prefix-cache reuse). Measure the prefill's TTFT.
Classify each decode token as during-prefill if its wall-clock timestamp falls inside [prefill_inject_ts, prefill_inject_ts + prefill_ttft]. Report inter-token p50 / p90 for the during-prefill tokens.
Process a baseline run (D streams, no prefill injection) the same way.
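
The binning step can be sketched roughly as follows. This is a minimal illustration and not the code in microbench/interference/driver.py: the function names are hypothetical, each inter-token gap is assigned to the window by the arrival time of its later token (an assumption), and the percentile uses the nearest-rank rule.

```python
import math


def split_gaps_by_prefill_window(token_ts, inject_ts, prefill_ttft):
    """Split one stream's inter-token gaps (ms) into during-prefill and outside.

    token_ts     -- wall-clock arrival times of the stream's tokens, in seconds
    inject_ts    -- wall-clock time at which the prefill request was injected
    prefill_ttft -- measured TTFT of the prefill request, in seconds
    A gap counts as during-prefill when the token that ends it arrives inside
    [inject_ts, inject_ts + prefill_ttft] (assumption for this sketch).
    """
    end_ts = inject_ts + prefill_ttft
    during, outside = [], []
    for prev, cur in zip(token_ts, token_ts[1:]):
        gap_ms = (cur - prev) * 1000.0
        if inject_ts <= cur <= end_ts:
            during.append(gap_ms)
        else:
            outside.append(gap_ms)
    return during, outside


def percentile(values, q):
    """Nearest-rank percentile for integer q in 0..100; None for empty input."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]
```

Calling `percentile(during, 50)` and `percentile(during, 90)` per stream would give the inter-token p50 / p90 described above; the baseline run would go through the same two functions with a notional window.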

We additionally compute the effective per-stream TPOT during the prefill burst as the single most informative summary:

eff_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)

This is the average time each decode stream takes to produce one token while a prefill is in flight. Comparing it with baseline TPOT gives the real per-stream throughput penalty. With chunked prefill, the during-prefill p50 looks deceptively fine because most inter-token intervals inside the burst are at normal speed; p90 sees the stall but is itself noisy. Effective TPOT averages over the whole burst window, which makes it the cleanest of the three numbers.
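
As a small worked example of the formula above, the function below computes effective TPOT from the three quantities it names. The function name and argument names are illustrative, not taken from analyze_mb1.py.

```python
def effective_tpot_during_ms(prefill_ttft_ms, num_tokens_during_prefill, num_streams):
    """Average inter-token time per decode stream while the prefill is in flight.

    num_tokens_during_prefill is the total across all D streams, so dividing
    by num_streams gives the per-stream token count used in the formula.
    """
    tokens_per_stream = num_tokens_during_prefill / num_streams
    if tokens_per_stream <= 0:
        raise ValueError("no decode tokens observed during the prefill window")
    return prefill_ttft_ms / tokens_per_stream


# D=1, P=2048 row of the results table: 163 ms prefill, 4.0 tokens on one stream.
eff = effective_tpot_during_ms(163, 4.0, 1)
print(f"effective TPOT during prefill: {eff:.2f} ms")
```

For that row the result is 40.75 ms, which the table rounds to 41 ms.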

Results — 2026-05-27, dash1 GPU 0, chunk_tokens=8192

3 values of D × 5 values of P × 3 reps each. Aggregated by analyze_mb1.py.

D	P (tok)	base TPOT (ms)	prefill_ttft (ms)	per-stream tokens during	effective TPOT during (ms)	penalty	max PD-disagg benefit per stream (ms)
1	2 048	4.79	163	4.0	41	8×	144
1	8 192	4.78	584	5.0	117	24×	560
1	32 768	4.78	4 515	5.0	903	189×	4 491
1	65 536	4.78	15 568	5.3	2 919	610×	15 542
1	131 072	4.78	56 765	5.7	10 017	2 094×	56 738
4	2 048	5.62	138	3.9	36	6×	117
4	8 192	6.08	574	4.5	128	21×	547
4	32 768	6.09	4 529	11.9	381	63×	4 457
4	65 536	5.85	15 587	19.8	789	135×	15 471
4	131 072	6.27	56 697	37.4	1 517	242×	56 463
8	2 048	7.71	143	4.5	32	4×	109
8	8 192	7.69	583	5.1	114	15×	544
8	32 768	7.42	4 520	11.7	387	52×	4 434
8	65 536	7.67	15 615	20.6	757	99×	15 457
8	131 072	7.74	56 991	40.2	1 419	183×	56 680

Reading the table:

Baseline TPOT grows mildly with D (4.8 ms → 7.7 ms as D goes 1 → 8). Multi-stream decoding has small but nonzero contention even without prefill.
Effective TPOT during grows mostly with P: a single 8k prefill stalls decode for ~580 ms regardless of D, so each stream emits only a handful of tokens during that 580 ms window — effective per-stream TPOT collapses to 100–130 ms. Larger prefill = more chunks = larger stall.
Penalty is the eff/baseline ratio. Above 50× for P ≥ 32k. Above 500× for D=1 at P ≥ 65k.
Max PD-disagg benefit per stream = prefill_ttft − per_stream_tokens × baseline_TPOT ≈ prefill_ttft (since interference essentially halts decode). This is the entire prefill duration's worth of decode time that could in principle be recovered.
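
The two derived columns can be recomputed from the other columns with a few lines. This is a sketch with hypothetical function names; small differences from the table are possible if analyze_mb1.py averages per repetition before rounding.

```python
def penalty(eff_tpot_ms, base_tpot_ms):
    """Ratio of effective TPOT during the prefill burst to baseline TPOT."""
    return eff_tpot_ms / base_tpot_ms


def max_benefit_per_stream_ms(prefill_ttft_ms, per_stream_tokens, base_tpot_ms):
    """Prefill duration minus the decode time the stream actually got done,
    valued at baseline TPOT."""
    return prefill_ttft_ms - per_stream_tokens * base_tpot_ms


# D=8, P=131072 row: base TPOT 7.74 ms, prefill 56 991 ms,
# 40.2 tokens per stream, effective TPOT 1 419 ms.
print(round(penalty(1419, 7.74)))                          # 183
print(round(max_benefit_per_stream_ms(56991, 40.2, 7.74)))  # 56680
```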

Connecting to the §3.2 PD-disagg argument (corrected):

PD-disagg's promised phase-isolation benefit is per prefill event, not per request. When a new prefill arrives, it stalls every concurrent decode stream on the same GPU. The aggregate decode time lost across those D streams is D × T_prefill. By moving prefill off the decode GPU, PD-disagg can in principle recover all of it.

Plugging numbers per prefill event:

Prefill size	T_prefill	PD-disagg cost (MB2 T_transfer)	PD-disagg benefit (D=8 × T_prefill)	Cost / benefit
2k tok (trace lower)	0.14 s	8 ms	1.1 s	0.7 %
33k tok (trace mean)	4.5 s	320 ms	36 s	0.9 %
125k tok (~p99)	57 s	1.9 s	456 s	0.4 %
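
The table above is plain arithmetic on the MB1 and MB2 numbers, sketched here so it can be redone for other values of D. The row values are copied from the table; the function and variable names are illustrative.

```python
# (label, T_prefill in seconds, T_transfer in seconds), copied from the table.
ROWS = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.320),
    ("125k tok", 57.0, 1.9),
]


def cost_benefit(t_prefill_s, t_transfer_s, d=8):
    """Return (aggregate decode time saved, transfer cost / that benefit)."""
    benefit_s = d * t_prefill_s          # D streams each lose ~T_prefill
    return benefit_s, t_transfer_s / benefit_s


for label, t_prefill, t_transfer in ROWS:
    benefit, ratio = cost_benefit(t_prefill, t_transfer)
    print(f"{label:>9}: benefit {benefit:6.1f} s, cost/benefit {ratio:.1%}")
```

Passing a different `d` shows how the ratio moves with decode concurrency: the benefit scales linearly with D while the per-request transfer cost does not.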

On the phase-isolation axis alone, PD-disagg wins by 100×–250×. The reason static PD-disagg nonetheless fails in agentic workloads is a different failure mode: the D-side KV pool cannot fit p90+ requests (p99 single-request KV = 11.5 GiB; D-instance pool ≈ 38 GiB; 4P+4D halves system-wide decode capacity → TTFT p50 62× worse, success rate 99.5% → 52% in colleague's 4P+4D experiment). The structural problem is capacity (see figs/f4b_pdsep_kv_wall.png), not a transfer-cost vs phase-isolation trade-off.
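
A back-of-the-envelope view of the capacity wall, using only the two sizes quoted above. The comparison against 8 decode-capable instances assumes, purely for illustration, that each of them would hold a KV pool of the same ≈ 38 GiB; that pool size for a colocated deployment is not stated in this README.

```python
import math

P99_REQUEST_KV_GIB = 11.5   # p99 single-request KV footprint (from the text)
D_POOL_GIB = 38.0           # per-D-instance KV pool (from the text)


def p99_requests_per_pool(pool_gib=D_POOL_GIB, req_gib=P99_REQUEST_KV_GIB):
    """How many p99-sized requests fit in one decode instance's KV pool."""
    return math.floor(pool_gib / req_gib)


def p99_slots(num_decode_instances):
    """System-wide count of p99-sized requests that can be resident at once."""
    return num_decode_instances * p99_requests_per_pool()


print(p99_requests_per_pool())   # 3 per D instance
print(p99_slots(4))              # 12 with 4P+4D
print(p99_slots(8))              # 24 if all 8 instances held decode KV (assumption)
```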

Reproduction

# vllm pair-free single-instance launch
ssh dash1 'GPU=0 PORT=8000 CHUNK_TOKENS=8192 \
  bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh start'

# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
  python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_driver.py \
    --host 127.0.0.1 --port 8000 \
    --model /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --decode-batch-sizes 1,4,8 --prefill-tokens 2048,8192,32768,65536,131072 \
    --reps 3 --output-dir /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results'

# pull + analyze
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/summary.csv \
    analysis/mb1/summary.csv
.venv/bin/python microbench/fresh_setup/analyze_mb1.py \
  --summary analysis/mb1/summary.csv --out analysis/mb1/breakdown.json
.venv/bin/python microbench/fresh_setup/plot_mb1.py \
  --mb1 analysis/mb1/breakdown.json \
  --mb2-intra analysis/mb2/intra_kvboth_breakdown.json \
  --mb2-inter analysis/mb2/inter_kvboth_breakdown.json

# teardown
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop'

Open questions / next runs

Chunk size sensitivity: this run uses --max-num-batched-tokens 8192. Sarathi-Serve goes smaller (e.g. 1024) and recovers more decode interleaving inside each prefill burst. Worth running chunk_tokens ∈ {1024, 2048, 4096, 16384} to map the chunk-size axis.
Higher D: 12, 16 streams to see whether the penalty saturates or keeps shrinking per-stream.
Cross-validate effective_TPOT_during with token-time-series plot: raw per-token timestamps could reveal whether the stall is a few big spikes or many small ones (currently inferred from p50/p90 spread).
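
For the time-series cross-check in the last item, one possible starting point is to flag every inter-token gap that exceeds a multiple of baseline TPOT and summarize the flagged gaps. This is only a sketch: the function names and the default `factor=10.0` threshold are arbitrary choices made for illustration and are not part of the existing scripts.

```python
def find_stalls(token_ts, base_tpot_ms, factor=10.0):
    """Return (start_ts, gap_ms) for each gap longer than factor x baseline TPOT.

    token_ts is one stream's per-token wall-clock timestamps in seconds.
    """
    threshold_ms = factor * base_tpot_ms
    stalls = []
    for prev, cur in zip(token_ts, token_ts[1:]):
        gap_ms = (cur - prev) * 1000.0
        if gap_ms > threshold_ms:
            stalls.append((prev, gap_ms))
    return stalls


def stall_summary(stalls):
    """Count, total, and longest stall: few long gaps vs many short ones."""
    gaps = [gap for _, gap in stalls]
    return {
        "count": len(gaps),
        "total_ms": sum(gaps),
        "longest_ms": max(gaps, default=0.0),
    }
```

A low `count` with a `longest_ms` close to `total_ms` would point to a few big spikes; a high `count` with a small `longest_ms` would point to many small ones.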

Run log

2026-05-27 — dash1 GPU 0, chunk_tokens=8192

3 × 5 × 3 sweep. CSV: analysis/mb1/summary.csv. Per-config JSONs on dash1 at /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/. Figure: figs/mb1_interference.png. The figure figs/pd_cost_vs_benefit.png from the original commit 029821c was based on the wrong "benefit ≤ decode duration" accounting; deleted in the correction commit.
