The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot + pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)", implicitly treating the benefit as per-request and bounded by that request's own decode. The correct accounting is per-prefill-event across all stalled streams:

    benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during) ≈ D × T_prefill

which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

| prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit |
|---|---|---|---|---|
| 2k tok | 0.14 s | 8 ms | 1.1 s | 0.7 % |
| 33k tok | 4.5 s | 320 ms | 36 s | 0.9 % |
| 125k tok | 57 s | 1.9 s | 456 s | 0.4 % |

On the phase-isolation axis alone, PD-disagg WINS by 100×–250×, the opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency.
Changes (all touched files explicitly listed; no `git add -u`):

- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim; they already (correctly) frame §3.2 around the D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MB1 — Prefill–Decode Interference (chunked-prefill on, vLLM 0.18.1 default)
Persistent record of the phase-interference microbench used to put a quantitative upper bound on what PD-disaggregation can buy under the chunked-prefill-on baseline. Re-runs append a dated section at the bottom; the Summary block is what gets cited.
Summary (latest)
| Headline | Value |
|---|---|
| Baseline single-stream TPOT (D=1, idle GPU) | 4.8 ms |
| Effective per-stream TPOT during 8k-token prefill burst (D=8) | 114 ms (≈15× the D=8 baseline of 7.7 ms) |
| Effective per-stream TPOT during 32k-token prefill burst (D=8) | 387 ms (≈52×) |
| Effective per-stream TPOT during 131k-token prefill burst (D=8) | 1419 ms (≈183×) |
What MB1 actually measures:
During a prefill burst, every ongoing decode stream is essentially halted: per-stream effective TPOT rises from about 4× baseline at P = 2k to about 2 000× at P = 131k, scaling with prefill size. The total decode time lost per prefill event is `D × T_prefill` (D concurrent decodes each lose ~T_prefill of useful work). For the trace mean (P ≈ 33k tokens, T_prefill ≈ 4.5 s) at D=8 that's ~36 seconds of decode-equivalent work lost per prefill event. This is the upper bound on what PD-disaggregation's phase isolation could recover on the decode side.
⚠ Correction (2026-05-27): an earlier version of this README framed
the §3.2 PD-disagg argument as "phase-isolation benefit is capped at
the decode duration of the new request (~50–200 ms), so MB2 transfer
cost dominates". That framing was wrong. The correct accounting is
benefit-per-prefill-event = D × T_prefill (aggregate decode time saved
across all stalled streams), which is much larger than per-request
transfer cost. The actual reason static PD-disagg fails in agentic
is D-side KV pool capacity (figs/f4b_pdsep_kv_wall.png), not a
cost-vs-benefit imbalance on phase isolation. See RESULTS_SUMMARY.md
section 4 for the corrected framing.
Setup
| Component | Value |
|---|---|
| Host | dash1, H20 96 GiB, driver 570.133.20 |
| Venv | /home/admin/cpfs/wjh/agentic-kv-fresh/.venv |
| vLLM | 0.18.1 official wheel (chunked-prefill default-on, V1 engine) |
| Model | /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Launch flags | --tensor-parallel-size 1 --enable-prefix-caching --gpu-memory-utilization 0.9 --max-model-len 200000 --max-num-batched-tokens 8192 |
| kv_connector | none (this measures pure single-GPU phase interference; PD-disagg cost lives in MB2) |
Method
Adapted from microbench/interference/driver.py:
- Start D streaming decode requests on `/v1/chat/completions` with a long `max_tokens` cap. Discard the first 32 tokens as warmup.
- After 1 s, inject one prefill-only request with `max_tokens=1` and an input of `P` synthetic tokens (uuid-seeded for zero prefix-cache reuse). Measure the prefill's TTFT.
- Bin the during-prefill tokens from each decode stream by whether their wall-clock falls inside `[prefill_inject_ts, prefill_inject_ts + prefill_ttft]`. Report inter-token p50 / p90.
- Bin a baseline run (D streams, no prefill injection) the same way.
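
The binning step can be sketched roughly as follows. This is an illustrative sketch, not the code in `driver.py`: the helper names `gaps_in_window` and `p50_p90`, the single-stream input, and the sample timestamps are all hypothetical, and integer milliseconds are used to avoid floating-point edge effects at the window boundary.

```python
from statistics import quantiles

def gaps_in_window(token_ts_ms, start_ms, end_ms):
    """Inter-token gaps (ms) between consecutive tokens that arrive inside [start_ms, end_ms]."""
    inside = [t for t in token_ts_ms if start_ms <= t <= end_ms]
    return [b - a for a, b in zip(inside, inside[1:])]

def p50_p90(gaps):
    """p50 and p90 of the gaps (needs at least two gaps)."""
    cuts = quantiles(gaps, n=10, method="inclusive")  # 9 cut points: p10 .. p90
    return cuts[4], cuts[8]

# Hypothetical token arrival times (ms) for one decode stream.
stream_ms = [900, 950, 1000, 1010, 1300, 1310, 1580, 1600, 1650]
prefill_inject_ts, prefill_ttft = 1000, 580   # window is [1000, 1580] ms

gaps = gaps_in_window(stream_ms, prefill_inject_ts, prefill_inject_ts + prefill_ttft)
p50, p90 = p50_p90(gaps)
print(gaps, p50, p90)
```

A baseline run would pass the same window through the same helpers, so the two sets of percentiles are directly comparable.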
We additionally compute the effective per-stream TPOT during the prefill burst as the single most informative summary:
eff_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)
This is the average rate at which each decode stream produces tokens while a prefill is in flight. Compared to baseline TPOT it gives the real per-stream throughput penalty (chunked-prefill p50 looks deceptively fine because most decode-token intervals during the burst are at normal speed; p90 sees the stall but is itself noisy; the effective TPOT is the cleanest "average over the whole burst window" number).
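
As a minimal sketch of the formula above (the function names are illustrative and not taken from `analyze_mb1.py`), the effective TPOT and its ratio to baseline can be computed like this. The inputs are the D=8, P=8 192 row of the results table, where `num_tokens_during_prefill` is assumed to be the total across all D streams.

```python
def effective_tpot_ms(prefill_ttft_ms, num_tokens_during_prefill, num_streams):
    """Average per-stream inter-token time (ms) over the whole prefill window."""
    tokens_per_stream = num_tokens_during_prefill / num_streams
    return prefill_ttft_ms / tokens_per_stream

def penalty(eff_tpot_ms, baseline_tpot_ms):
    """How many times slower each stream runs compared with the no-prefill baseline."""
    return eff_tpot_ms / baseline_tpot_ms

# D=8, P=8192: prefill TTFT 583 ms, 5.1 tokens per stream, baseline TPOT 7.69 ms.
D = 8
eff = effective_tpot_ms(583, 5.1 * D, D)
print(f"effective TPOT = {eff:.0f} ms, penalty = {penalty(eff, 7.69):.1f}x")
```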
Results — 2026-05-27, dash1 GPU 0, chunk_tokens=8192
3 D × 5 P × 3 reps. Aggregated by analyze_mb1.py.
| D | P (tok) | base TPOT (ms) | prefill_ttft (ms) | per-stream tokens during | effective TPOT during (ms) | penalty | max PD-disagg benefit per stream (ms) |
|---|---|---|---|---|---|---|---|
| 1 | 2 048 | 4.79 | 163 | 4.0 | 41 | 8× | 144 |
| 1 | 8 192 | 4.78 | 584 | 5.0 | 117 | 24× | 560 |
| 1 | 32 768 | 4.78 | 4 515 | 5.0 | 903 | 189× | 4 491 |
| 1 | 65 536 | 4.78 | 15 568 | 5.3 | 2 919 | 610× | 15 542 |
| 1 | 131 072 | 4.78 | 56 765 | 5.7 | 10 017 | 2 094× | 56 738 |
| 4 | 2 048 | 5.62 | 138 | 3.9 | 36 | 6× | 117 |
| 4 | 8 192 | 6.08 | 574 | 4.5 | 128 | 21× | 547 |
| 4 | 32 768 | 6.09 | 4 529 | 11.9 | 381 | 63× | 4 457 |
| 4 | 65 536 | 5.85 | 15 587 | 19.8 | 789 | 135× | 15 471 |
| 4 | 131 072 | 6.27 | 56 697 | 37.4 | 1 517 | 242× | 56 463 |
| 8 | 2 048 | 7.71 | 143 | 4.5 | 32 | 4× | 109 |
| 8 | 8 192 | 7.69 | 583 | 5.1 | 114 | 15× | 544 |
| 8 | 32 768 | 7.42 | 4 520 | 11.7 | 387 | 52× | 4 434 |
| 8 | 65 536 | 7.67 | 15 615 | 20.6 | 757 | 99× | 15 457 |
| 8 | 131 072 | 7.74 | 56 991 | 40.2 | 1 419 | 183× | 56 680 |
Reading the table:
- Baseline TPOT grows mildly with D (4.8 ms → 7.7 ms as D goes 1 → 8). Multi-stream decoding has small but nonzero contention even without prefill.
- Effective TPOT during grows mostly with P: a single 8k prefill stalls decode for ~580 ms regardless of D, so each stream emits only a handful of tokens during that 580 ms window — effective per-stream TPOT collapses to 100–130 ms. Larger prefill = more chunks = larger stall.
- Penalty is the eff/baseline ratio. Above 50× for P ≥ 32k. Above 500× for D=1 at P ≥ 65k.
- Max PD-disagg benefit per stream = `prefill_ttft − per_stream_tokens × baseline_TPOT` ≈ `prefill_ttft` (since interference essentially halts decode). This is the entire prefill duration's worth of decode time that could in principle be recovered.
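
The per-stream benefit in the last bullet can be rewritten in terms of the two TPOT values. The derivation below is a sketch that assumes all D streams see the same effective TPOT during the burst, and writes T_prefill for `prefill_ttft`:

```latex
\begin{aligned}
n_{\text{during}} &= \frac{T_{\text{prefill}}}{\mathrm{TPOT}_{\text{during}}} \\
\text{benefit}_{\text{stream}} &= T_{\text{prefill}} - n_{\text{during}}\,\mathrm{TPOT}_{\text{base}}
  = T_{\text{prefill}}\left(1 - \frac{\mathrm{TPOT}_{\text{base}}}{\mathrm{TPOT}_{\text{during}}}\right) \\
\text{benefit}_{\text{event}} &= D \cdot \text{benefit}_{\text{stream}} \approx D \cdot T_{\text{prefill}}
  \quad \text{when } \mathrm{TPOT}_{\text{during}} \gg \mathrm{TPOT}_{\text{base}}
\end{aligned}
```

For the D=8, P=32 768 row this gives 4 520 ms × (1 − 7.42/387) ≈ 4 433 ms per stream, which matches the 4 434 ms table entry up to rounding, and about 35.5 s summed over the 8 streams.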
Connecting to the §3.2 PD-disagg argument (corrected):
PD-disagg's promised phase-isolation benefit is per prefill event,
not per request. When a new prefill arrives, it stalls every concurrent
decode stream on the same GPU. The aggregate decode time lost across
those D streams is D × T_prefill. PD-disagg moving prefill off-decode-GPU
recovers all of it.
Plugging numbers per prefill event:
| Prefill size | T_prefill | PD-disagg cost (MB2 T_transfer) | PD-disagg benefit (D=8 × T_prefill) | Cost / benefit |
|---|---|---|---|---|
| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s | 0.7 % |
| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s | 0.9 % |
| 125k tok (~p99) | 57 s | 1.9 s | 456 s | 0.4 % |
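
The arithmetic behind the table can be reproduced with a short sketch. The values are copied from the table; the helper name `cost_benefit` is illustrative only.

```python
# (label, T_prefill in seconds, T_transfer in seconds), copied from the table.
rows = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.320),
    ("125k tok", 57.0, 1.9),
]
D = 8  # concurrent decode streams stalled by each prefill event

def cost_benefit(t_prefill_s, t_transfer_s, num_streams):
    """Return (aggregate decode time recovered in s, transfer cost as a fraction of it)."""
    benefit_s = num_streams * t_prefill_s
    return benefit_s, t_transfer_s / benefit_s

results = {label: cost_benefit(tp, tt, D) for label, tp, tt in rows}
for label, (benefit_s, ratio) in results.items():
    print(f"{label:>9}: benefit {benefit_s:6.1f} s, cost/benefit {ratio:.1%}")
```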
On the phase-isolation axis alone, PD-disagg wins by 100×–250×.
The reason static PD-disagg nonetheless fails in agentic is a
different failure mode: the D-side KV pool cannot fit p90+ requests
(p99 = 11.5 GiB; D-instance pool ≈ 38 GiB; 4P+4D halves system-wide
decode capacity → TTFT p50 62×, success rate 99.5% → 52% in colleague's
4P+4D experiment). The structural problem is capacity (see
figs/f4b_pdsep_kv_wall.png), not transfer-cost vs phase-isolation
trade-off.
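
A back-of-the-envelope sketch of the capacity argument follows. It rests on two assumptions that are not stated in this README: the co-located baseline runs 8 instances, and each has the same ≈38 GiB KV pool as a D instance. It also assumes a request's KV cannot be split across instances. The helper name `p99_slots` is hypothetical.

```python
import math

P99_KV_GIB = 11.5   # p99 single-request KV footprint (from the text)
POOL_GIB = 38.0     # KV pool per D instance (from the text)

def p99_slots(num_decode_instances, pool_gib=POOL_GIB, req_gib=P99_KV_GIB):
    """Number of p99-sized requests that fit at once across the decode instances."""
    return num_decode_instances * math.floor(pool_gib / req_gib)

colocated = p99_slots(8)   # assumption: all 8 instances decode, same pool size each
pd_4p4d = p99_slots(4)     # 4P+4D: only the 4 D instances hold decode KV
print(colocated, pd_4p4d)
```

Under these assumptions the 4P+4D split holds half as many p99-sized requests as the co-located layout, which is the halving described above.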
Reproduction
# vllm pair-free single-instance launch
ssh dash1 'GPU=0 PORT=8000 CHUNK_TOKENS=8192 \
bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh start'
# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_driver.py \
--host 127.0.0.1 --port 8000 \
--model /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--decode-batch-sizes 1,4,8 --prefill-tokens 2048,8192,32768,65536,131072 \
--reps 3 --output-dir /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results'
# pull + analyze
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/summary.csv \
analysis/mb1/summary.csv
.venv/bin/python microbench/fresh_setup/analyze_mb1.py \
--summary analysis/mb1/summary.csv --out analysis/mb1/breakdown.json
.venv/bin/python microbench/fresh_setup/plot_mb1.py \
--mb1 analysis/mb1/breakdown.json \
--mb2-intra analysis/mb2/intra_kvboth_breakdown.json \
--mb2-inter analysis/mb2/inter_kvboth_breakdown.json
# teardown
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop'
Open questions / next runs
- Chunk size sensitivity: this run uses `--max-num-batched-tokens 8192`. Sarathi-Serve goes smaller (e.g. 1024) and recovers more decode interleaving inside each prefill burst. Worth running chunk_tokens ∈ {1024, 2048, 4096, 16384} to map the chunk-size axis.
- Higher D: 12, 16 streams to see whether the penalty saturates or keeps shrinking per-stream.
- Cross-validate effective_TPOT_during with token-time-series plot: raw per-token timestamps could reveal whether the stall is a few big spikes or many small ones (currently inferred from p50/p90 spread).
Run log
2026-05-27 — dash1 GPU 0, chunk_tokens=8192
3 × 5 × 3 sweep. CSV: analysis/mb1/summary.csv. Per-config JSONs on
dash1 at /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/.
Figure: figs/mb1_interference.png. The figure
figs/pd_cost_vs_benefit.png from the original commit 029821c was
based on the wrong "benefit ≤ decode duration" accounting; deleted in
the correction commit.