Gahow Wang da39ab6804 Correct PD-disagg cost/benefit framing across repo
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:04:49 +08:00

MB1 — Prefill/Decode Interference (chunked-prefill on, vLLM 0.18.1 default)

Persistent record of the phase-interference microbench used to put a quantitative upper bound on what PD-disaggregation can buy under the chunked-prefill-on baseline. Re-runs append a dated section at the bottom; the Summary block is what gets cited.


Summary (latest)

| Headline | Value |
|---|---|
| Baseline single-stream TPOT (D=1, idle GPU) | 4.8 ms |
| Effective per-stream TPOT during 8k-token prefill burst (D=8) | 114 ms (≈15× the D=8 baseline of 7.7 ms) |
| Effective per-stream TPOT during 32k-token prefill burst (D=8) | 388 ms (≈52× the D=8 baseline) |
| Effective per-stream TPOT during 131k-token prefill burst (D=8) | 1419 ms (≈183× the D=8 baseline) |

What MB1 actually measures:

During a prefill burst, every ongoing decode stream is essentially halted: per-stream effective TPOT is 15×–2000× baseline, scaling with prefill size. The total decode time lost per prefill event is D × T_prefill (each of the D concurrent decodes loses ~T_prefill of useful work). For the trace mean (P ≈ 33k tokens, T_prefill ≈ 4.5 s) at D=8, that is ~36 seconds of decode-equivalent work lost per prefill event. This is the upper bound on what PD-disaggregation's phase isolation could recover on the decode side.

⚠ Correction (2026-05-27): an earlier version of this README framed the §3.2 PD-disagg argument as "phase-isolation benefit is capped at the decode duration of the new request (~50–200 ms), so MB2 transfer cost dominates". That framing was wrong. The correct accounting is benefit-per-prefill-event = D × T_prefill (aggregate decode time saved across all stalled streams), which is much larger than per-request transfer cost. The actual reason static PD-disagg fails in agentic workloads is D-side KV pool capacity (figs/f4b_pdsep_kv_wall.png), not a cost-vs-benefit imbalance on phase isolation. See RESULTS_SUMMARY.md §4 for the corrected framing.

Setup

| Component | Value |
|---|---|
| Host | dash1, H20 96 GiB, driver 570.133.20 |
| Venv | /home/admin/cpfs/wjh/agentic-kv-fresh/.venv |
| vLLM | 0.18.1 official wheel (chunked-prefill default-on, V1 engine) |
| Model | /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Launch flags | --tensor-parallel-size 1 --enable-prefix-caching --gpu-memory-utilization 0.9 --max-model-len 200000 --max-num-batched-tokens 8192 |
| kv_connector | none (this measures pure single-GPU phase interference; PD-disagg cost lives in MB2) |

Method

Adapted from microbench/interference/driver.py:

  1. Start D streaming decode requests on /v1/chat/completions with a long max_tokens cap. Discard the first 32 tokens as warmup.
  2. After 1 s, inject one prefill-only request with max_tokens=1 and an input of P synthetic tokens (uuid-seeded for zero prefix-cache reuse). Measure the prefill's TTFT.
  3. Classify each decode token as during-prefill if its wall-clock arrival time falls inside [prefill_inject_ts, prefill_inject_ts + prefill_ttft]. Report inter-token p50 / p90 over those tokens.
  4. Bin a baseline run (D streams, no prefill injection) the same way.
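
The windowing in steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the code in mb1_driver.py or analyze_mb1.py: the function names and the list-of-timestamps input are assumptions made for the example, and the percentile method (`statistics.quantiles` with `method="inclusive"`) is one reasonable choice rather than the one the real analysis necessarily uses.

```python
from statistics import quantiles


def gaps_in_window(token_ts, start, end):
    """Inter-token gaps (seconds) for tokens arriving inside [start, end].

    token_ts: sorted wall-clock arrival times of one decode stream's tokens
              (hypothetical input format, warmup tokens already discarded).
    """
    gaps = []
    for prev, cur in zip(token_ts, token_ts[1:]):
        if start <= cur <= end:
            gaps.append(cur - prev)
    return gaps


def p50_p90(gaps):
    """Return (p50, p90) of the gaps; needs at least two gaps."""
    # n=10 yields 9 cut points: index 4 is the median, index 8 is p90.
    cuts = quantiles(gaps, n=10, method="inclusive")
    return cuts[4], cuts[8]


# Toy stream: steady 10 ms tokens with one 500 ms stall in the middle.
stream = [0.00, 0.01, 0.02, 0.52, 0.53, 0.54]
window_gaps = gaps_in_window(stream, start=0.015, end=0.535)
p50, p90 = p50_p90(window_gaps)
```

In the toy stream the median gap stays near 10 ms while p90 is pulled up by the single stall, which is the same effect the Method section describes for chunked prefill.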

We additionally compute the effective per-stream TPOT during the prefill burst as the single most informative summary:

eff_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)

This is the average time per emitted token for each decode stream while a prefill is in flight. Comparing it to baseline TPOT gives the real per-stream throughput penalty. The percentile views are less reliable: chunked-prefill p50 looks deceptively fine because most decode-token intervals during the burst are at normal speed, and p90 sees the stall but is itself noisy. Effective TPOT is the cleanest "average over the whole burst window" number.
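
As a quick check of the formula, the sketch below recomputes one table row. The function name is hypothetical and the inputs are the rounded D=8, P=8192 values from the results table (prefill_ttft = 583 ms, 5.1 tokens per stream), so the result matches the reported value only to rounding.

```python
def effective_tpot_ms(prefill_ttft_ms, num_tokens_during_prefill, num_streams):
    """Effective per-stream TPOT (ms) during a prefill burst.

    Implements eff_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D).
    """
    per_stream_tokens = num_tokens_during_prefill / num_streams
    if per_stream_tokens <= 0:
        raise ValueError("no decode tokens were observed during the prefill window")
    return prefill_ttft_ms / per_stream_tokens


# D=8, P=8192: 5.1 tokens per stream means 40.8 tokens across all 8 streams.
eff = effective_tpot_ms(prefill_ttft_ms=583, num_tokens_during_prefill=5.1 * 8, num_streams=8)
print(f"{eff:.0f} ms")  # 114 ms
```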

Results — 2026-05-27, dash1 GPU 0, chunk_tokens=8192

Sweep: 3 values of D × 5 values of P × 3 reps. Aggregated by analyze_mb1.py.

| D | P (tok) | base TPOT (ms) | prefill_ttft (ms) | per-stream tokens during | effective TPOT during (ms) | penalty | max PD-disagg benefit per stream (ms) |
|---|---|---|---|---|---|---|---|
| 1 | 2 048 | 4.79 | 163 | 4.0 | 41 | 8× | 144 |
| 1 | 8 192 | 4.78 | 584 | 5.0 | 117 | 24× | 560 |
| 1 | 32 768 | 4.78 | 4 515 | 5.0 | 903 | 189× | 4 491 |
| 1 | 65 536 | 4.78 | 15 568 | 5.3 | 2 919 | 610× | 15 542 |
| 1 | 131 072 | 4.78 | 56 765 | 5.7 | 10 017 | 2 094× | 56 738 |
| 4 | 2 048 | 5.62 | 138 | 3.9 | 36 | 6× | 117 |
| 4 | 8 192 | 6.08 | 574 | 4.5 | 128 | 21× | 547 |
| 4 | 32 768 | 6.09 | 4 529 | 11.9 | 381 | 63× | 4 457 |
| 4 | 65 536 | 5.85 | 15 587 | 19.8 | 789 | 135× | 15 471 |
| 4 | 131 072 | 6.27 | 56 697 | 37.4 | 1 517 | 242× | 56 463 |
| 8 | 2 048 | 7.71 | 143 | 4.5 | 32 | 4× | 109 |
| 8 | 8 192 | 7.69 | 583 | 5.1 | 114 | 15× | 544 |
| 8 | 32 768 | 7.42 | 4 520 | 11.7 | 387 | 52× | 4 434 |
| 8 | 65 536 | 7.67 | 15 615 | 20.6 | 757 | 99× | 15 457 |
| 8 | 131 072 | 7.74 | 56 991 | 40.2 | 1 419 | 183× | 56 680 |

Reading the table:

  • Baseline TPOT grows mildly with D (4.8 ms → 7.7 ms as D goes 1 → 8). Multi-stream decoding has small but nonzero contention even without prefill.
  • Effective TPOT during the burst grows mostly with P: a single 8k prefill stalls decode for ~580 ms regardless of D, so each stream emits only a handful of tokens in that window and effective per-stream TPOT collapses to 100–130 ms. Larger prefill = more chunks = larger stall.
  • Penalty is the eff/baseline ratio. Above 50× for P ≥ 32k. Above 500× for D=1 at P ≥ 65k.
  • Max PD-disagg benefit per stream = prefill_ttft − per_stream_tokens × baseline_TPOT ≈ prefill_ttft (since interference essentially halts decode). This is the entire prefill duration's worth of decode time that could in principle be recovered.
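
The two derived columns can be reproduced from the measured ones. The sketch below uses hypothetical function names and the rounded values of two table rows, so it agrees with the table only to the nearest integer.

```python
def penalty(effective_tpot_ms, baseline_tpot_ms):
    """Slowdown factor: effective TPOT during the burst over baseline TPOT."""
    return effective_tpot_ms / baseline_tpot_ms


def max_benefit_per_stream_ms(prefill_ttft_ms, per_stream_tokens, baseline_tpot_ms):
    """Decode time one stream loses to the prefill, in ms.

    The stream emitted per_stream_tokens during the window, which would have
    taken per_stream_tokens * baseline_tpot_ms without interference; the rest
    of the window is stall.
    """
    return prefill_ttft_ms - per_stream_tokens * baseline_tpot_ms


# Row D=8, P=8192: base TPOT 7.69 ms, prefill_ttft 583 ms, 5.1 tokens, eff TPOT 114 ms.
print(round(penalty(114, 7.69)))                          # 15
print(round(max_benefit_per_stream_ms(583, 5.1, 7.69)))   # 544

# Row D=1, P=131072: base TPOT 4.78 ms, prefill_ttft 56765 ms, 5.7 tokens.
print(round(max_benefit_per_stream_ms(56765, 5.7, 4.78)))  # 56738
```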

Connecting to the §3.2 PD-disagg argument (corrected):

PD-disagg's promised phase-isolation benefit is per prefill event, not per request. When a new prefill arrives, it stalls every concurrent decode stream on the same GPU. The aggregate decode time lost across those D streams is D × T_prefill. PD-disagg moving prefill off-decode-GPU recovers all of it.

Plugging numbers per prefill event:

| Prefill size | T_prefill | PD-disagg cost (MB2 T_transfer) | PD-disagg benefit (D=8 × T_prefill) | Ratio (cost / benefit) |
|---|---|---|---|---|
| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s | 0.7 % |
| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s | 0.9 % |
| 125k tok (~p99) | 57 s | 1.9 s | 456 s | 0.4 % |
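
The table arithmetic can be re-derived with a few lines. This is a sketch using the rounded T_prefill and T_transfer values quoted above and the approximation benefit ≈ D × T_prefill; the function name is hypothetical.

```python
def pd_cost_benefit(t_prefill_s, t_transfer_s, num_decode_streams):
    """Return (benefit_s, cost_over_benefit) for one prefill event.

    benefit ≈ D × T_prefill: aggregate decode time lost across all stalled streams.
    cost    = T_transfer:    KV transfer time paid once per request.
    """
    benefit_s = num_decode_streams * t_prefill_s
    return benefit_s, t_transfer_s / benefit_s


rows = {
    "2k tok": (0.14, 0.008),
    "33k tok": (4.5, 0.320),
    "125k tok": (57.0, 1.9),
}
for label, (t_prefill, t_transfer) in rows.items():
    benefit, ratio = pd_cost_benefit(t_prefill, t_transfer, num_decode_streams=8)
    print(f"{label}: benefit {benefit:.1f} s, cost/benefit {ratio * 100:.1f} %")
```

Inverting the ratios gives the benefit-to-cost factor for each prefill size.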

On the phase-isolation axis alone, PD-disagg wins by 100×–250×. Static PD-disagg nonetheless fails in agentic workloads through a different failure mode: the D-side KV pool cannot fit p90+ requests. p99 single-request KV is 11.5 GiB against a D-instance pool of ≈ 38 GiB, and 4P+4D halves system-wide decode capacity; in a colleague's 4P+4D experiment, TTFT p50 was 62× worse and the success rate dropped from 99.5% to 52%. The structural problem is capacity (see figs/f4b_pdsep_kv_wall.png), not a trade-off between transfer cost and phase isolation.
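
A back-of-envelope view of the capacity wall, using only the two sizes quoted above. Two things here are assumptions for illustration and are not measured in this README: that the colocated baseline is 8 instances, and that each of them has the same ≈ 38 GiB KV pool as a D instance. Real occupancy depends on the full request-size distribution, not just p99.

```python
import math


def p99_requests_per_pool(pool_gib, p99_request_gib):
    """How many p99-sized requests fit in one KV pool at the same time."""
    return math.floor(pool_gib / p99_request_gib)


per_instance = p99_requests_per_pool(pool_gib=38.0, p99_request_gib=11.5)

# 4P+4D: only the 4 D instances hold decode-side KV.
pd_disagg_slots = 4 * per_instance
# Assumed colocated baseline: all 8 instances decode, each with the same pool.
colocated_slots = 8 * per_instance

print(per_instance, pd_disagg_slots, colocated_slots)  # 3 12 24
```

Under these assumptions a D instance holds about three p99-sized requests at once, and the 4P+4D split leaves half as many decode-side slots as the colocated layout.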

Reproduction

# vllm pair-free single-instance launch
ssh dash1 'GPU=0 PORT=8000 CHUNK_TOKENS=8192 \
  bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh start'

# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
  python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_driver.py \
    --host 127.0.0.1 --port 8000 \
    --model /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --decode-batch-sizes 1,4,8 --prefill-tokens 2048,8192,32768,65536,131072 \
    --reps 3 --output-dir /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results'

# pull + analyze
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/summary.csv \
    analysis/mb1/summary.csv
.venv/bin/python microbench/fresh_setup/analyze_mb1.py \
  --summary analysis/mb1/summary.csv --out analysis/mb1/breakdown.json
.venv/bin/python microbench/fresh_setup/plot_mb1.py \
  --mb1 analysis/mb1/breakdown.json \
  --mb2-intra analysis/mb2/intra_kvboth_breakdown.json \
  --mb2-inter analysis/mb2/inter_kvboth_breakdown.json

# teardown
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop'

Open questions / next runs

  • Chunk size sensitivity: this run uses --max-num-batched-tokens 8192. Sarathi-Serve goes smaller (e.g. 1024) and recovers more decode interleaving inside each prefill burst. Worth running chunk_tokens ∈ {1024, 2048, 4096, 16384} to map the chunk-size axis.
  • Higher D: 12, 16 streams to see whether the penalty saturates or keeps shrinking per-stream.
  • Cross-validate effective_TPOT_during with token-time-series plot: raw per-token timestamps could reveal whether the stall is a few big spikes or many small ones (currently inferred from p50/p90 spread).

Run log

2026-05-27 — dash1 GPU 0, chunk_tokens=8192

3 × 5 × 3 sweep. CSV: analysis/mb1/summary.csv. Per-config JSONs on dash1 at /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/. Figure: figs/mb1_interference.png. The figure figs/pd_cost_vs_benefit.png from the original commit 029821c was based on the wrong "benefit ≤ decode duration" accounting; deleted in the correction commit.