MB1: prefill-decode interference under chunked-prefill default; §3.2 headline

Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.

Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
  = prefill_ttft / (num_tokens_during_prefill / D). This is the average
  rate at which each decode stream produces tokens during the burst.
  p50 of inter-token intervals is deceptive (chunked-prefill makes most
  intervals look normal); the burst-average gives the true cost.

Results (D=8 row, the most agentic-realistic case):
  P (tokens) | prefill_ttft | per-stream TPOT during | penalty
       2048  |    143 ms    |      32 ms             |    4×
       8192  |    583 ms    |     114 ms             |   15×
      32768  |  4520 ms     |     388 ms             |   52×
      65536  | 15615 ms     |     757 ms             |   99×
     131072  | 56991 ms     |    1419 ms             |  183×

Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.

§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).

The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.

Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
  kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
  microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
  per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
  pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)

Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-27 21:25:09 +08:00
parent 90127c3389
commit 029821c1b6
9 changed files with 1151 additions and 0 deletions

177
analysis/mb1/README.md Normal file
View File

@@ -0,0 +1,177 @@
# MB1 — PrefillDecode Interference (chunked-prefill on, vLLM 0.18.1 default)
Persistent record of the phase-interference microbench used to put a
quantitative upper bound on **what PD-disaggregation can buy** under the
chunked-prefill-on baseline. Re-runs append a dated section at the
bottom; the **Summary** block is what gets cited.
---
## Summary (latest)
| Headline | Value |
|---|---|
| Baseline single-stream TPOT (D=1, idle GPU) | **4.8 ms** |
| Effective per-stream TPOT during **8k-token** prefill burst (D=8) | **114 ms (≈15× baseline)** |
| Effective per-stream TPOT during **32k-token** prefill burst (D=8) | **388 ms (≈52×)** |
| Effective per-stream TPOT during **131k-token** prefill burst (D=8) | **1419 ms (≈183×)** |
| Maximum PD-disagg benefit per agentic decode | **≤ 50200 ms** (= decode duration) |
**§3.2 headline (cost vs benefit, this run + MB2)**:
> Under chunked-prefill, every ongoing decode stream is essentially
> **halted while a prefill chunk is in flight** — per-stream effective
> TPOT during the burst is 15× to 2000× baseline, scaling with prefill
> size. PD-disagg can recover this stall, but the recovery is bounded by
> the **decode duration** of the request being protected. For agentic,
> decode is 50200 ms (tool-call output). MB2 shows PD-disagg pays
> 300 ms 10 s of KV-transfer cost per request to do that recovery. The
> cost exceeds the benefit ceiling for any per-request KV ≥ ~80 MiB
> (~830 tokens) — well below all agentic operating points. The benefit
> never beats the cost in this workload.
## Setup
| Component | Value |
|---|---|
| Host | dash1, H20 96 GiB, driver 570.133.20 |
| Venv | `/home/admin/cpfs/wjh/agentic-kv-fresh/.venv` |
| vLLM | 0.18.1 official wheel (chunked-prefill default-on, V1 engine) |
| Model | `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct` |
| Launch flags | `--tensor-parallel-size 1 --enable-prefix-caching --gpu-memory-utilization 0.9 --max-model-len 200000 --max-num-batched-tokens 8192` |
| kv_connector | **none** (this measures pure single-GPU phase interference; PD-disagg cost lives in MB2) |
## Method
Adapted from `microbench/interference/driver.py`:
1. Start D streaming decode requests on `/v1/chat/completions` with a
long max_tokens cap. Discard the first 32 tokens as warmup.
2. After 1 s, inject one prefill-only request with `max_tokens=1` and
an input of `P` synthetic tokens (uuid-seeded for zero prefix-cache
reuse). Measure the prefill's TTFT.
3. Bin the *during-prefill* tokens from each decode stream by whether
their wall-clock falls inside `[prefill_inject_ts, prefill_inject_ts +
prefill_ttft]`. Report inter-token p50 / p90.
4. Bin a baseline run (D streams, no prefill injection) the same way.
We additionally compute the **effective per-stream TPOT during the
prefill burst** as the single most informative summary:
```
eff_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)
```
This is the average rate at which each decode stream produces tokens
while a prefill is in flight. Compared to baseline TPOT it gives the
real per-stream throughput penalty (chunked-prefill p50 looks deceptively
fine because most decode-token intervals during the burst are at normal
speed; p90 sees the stall but is itself noisy; the effective TPOT is
the cleanest "average over the whole burst window" number).
## Results — 2026-05-27, dash1 GPU 0, chunk_tokens=8192
3 D × 5 P × 3 reps. Aggregated by `analyze_mb1.py`.
| D | P (tok) | base TPOT (ms) | prefill_ttft (ms) | per-stream tokens during | effective TPOT during (ms) | penalty | max PD-disagg benefit per stream (ms) |
|--:|--:|--:|--:|--:|--:|--:|--:|
| 1 | 2 048 | 4.79 | 163 | 4.0 | 41 | 8× | 144 |
| 1 | 8 192 | 4.78 | 584 | 5.0 | 117 | 24× | 560 |
| 1 | 32 768 | 4.78 | 4 515 | 5.0 | 903 | 189× | 4 491 |
| 1 | 65 536 | 4.78 | 15 568 | 5.3 | 2 919 | 610× | 15 542 |
| 1 | 131 072 | 4.78 | 56 765 | 5.7 | 10 017 | 2 094× | 56 738 |
| 4 | 2 048 | 5.62 | 138 | 3.9 | 36 | 6× | 117 |
| 4 | 8 192 | 6.08 | 574 | 4.5 | 128 | 21× | 547 |
| 4 | 32 768 | 6.09 | 4 529 | 11.9 | 381 | 63× | 4 457 |
| 4 | 65 536 | 5.85 | 15 587 | 19.8 | 789 | 135× | 15 471 |
| 4 | 131 072 | 6.27 | 56 697 | 37.4 | 1 517 | 242× | 56 463 |
| 8 | 2 048 | 7.71 | 143 | 4.5 | 32 | 4× | 109 |
| 8 | 8 192 | 7.69 | 583 | 5.1 | 114 | 15× | 544 |
| 8 | 32 768 | 7.42 | 4 520 | 11.7 | 387 | 52× | 4 434 |
| 8 | 65 536 | 7.67 | 15 615 | 20.6 | 757 | 99× | 15 457 |
| 8 | 131 072 | 7.74 | 56 991 | 40.2 | 1 419 | 183× | 56 680 |
**Reading the table**:
- *Baseline TPOT* grows mildly with D (4.8 ms → 7.7 ms as D goes 1 → 8).
Multi-stream decoding has small but nonzero contention even without
prefill.
- *Effective TPOT during* grows mostly with P: a single 8k prefill stalls
decode for ~580 ms regardless of D, so each stream emits only a handful
of tokens during that 580 ms window — effective per-stream TPOT
collapses to 100130 ms. Larger prefill = more chunks = larger stall.
- *Penalty* is the eff/baseline ratio. Above 50× for P ≥ 32k. Above
500× for D=1 at P ≥ 65k.
- *Max PD-disagg benefit per stream* = `prefill_ttft per_stream_tokens
× baseline_TPOT` ≈ `prefill_ttft` (since interference essentially
halts decode). This is the entire prefill duration's worth of decode
time that could in principle be recovered.
Two big caveats for **agentic** application:
1. **Decode is short** (~50200 ms for tool-call output). The actual
recoverable benefit per request is bounded by the decode duration,
not by `prefill_ttft`. If a decode lasts 100 ms and a 5-second prefill
collides with it, PD-disagg can save at most 100 ms — not 5 s.
2. **PD-disagg pays KV-transfer cost** (MB2: 300 ms 10 s per request
for agentic sizes). For any KV ≥ ~80 MiB the cost already exceeds the
~100 ms benefit ceiling. Cost > benefit across the whole agentic
distribution.
## §3.2 cost-vs-benefit figure
`figs/pd_cost_vs_benefit.png` overlays MB1 benefit ceiling (50200 ms
band, capped by decode duration) on top of MB2 transfer cost curve. The
cost curve crosses the benefit ceiling somewhere around **80 MiB / 830
tokens** of KV — well below the trace mean (192 MiB / 2k tok ≈ trace
mean per request KV, and we know agentic averages 33k tokens, p99
125k). For anything bigger PD-disagg pays more than it can recover.
## Reproduction
```bash
# vllm pair-free single-instance launch
ssh dash1 'GPU=0 PORT=8000 CHUNK_TOKENS=8192 \
bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh start'
# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_driver.py \
--host 127.0.0.1 --port 8000 \
--model /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--decode-batch-sizes 1,4,8 --prefill-tokens 2048,8192,32768,65536,131072 \
--reps 3 --output-dir /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results'
# pull + analyze
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/summary.csv \
analysis/mb1/summary.csv
.venv/bin/python microbench/fresh_setup/analyze_mb1.py \
--summary analysis/mb1/summary.csv --out analysis/mb1/breakdown.json
.venv/bin/python microbench/fresh_setup/plot_mb1.py \
--mb1 analysis/mb1/breakdown.json \
--mb2-intra analysis/mb2/intra_kvboth_breakdown.json \
--mb2-inter analysis/mb2/inter_kvboth_breakdown.json
# teardown
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop'
```
## Open questions / next runs
- **Chunk size sensitivity**: this run uses `--max-num-batched-tokens
8192`. Sarathi-Serve goes smaller (e.g. 1024) and recovers more
decode interleaving inside each prefill burst. Worth running
chunk_tokens ∈ {1024, 2048, 4096, 16384} to map the chunk-size axis.
- **Higher D**: 12, 16 streams to see whether the penalty saturates or
keeps shrinking per-stream.
- **Cross-validate effective_TPOT_during with token-time-series plot**:
raw per-token timestamps could reveal whether the stall is a few big
spikes or many small ones (currently inferred from p50/p90 spread).
## Run log
### 2026-05-27 — dash1 GPU 0, chunk_tokens=8192
3 × 5 × 3 sweep. CSV: `analysis/mb1/summary.csv`. Per-config JSONs on
dash1 at `/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/`.
Figures: `figs/mb1_interference.png`, `figs/pd_cost_vs_benefit.png`.

199
analysis/mb1/breakdown.json Normal file
View File

@@ -0,0 +1,199 @@
{
"summary": [
{
"decode_batch_size": 1,
"new_prefill_tokens": 2048,
"baseline_tpot_ms": 4.79,
"during_tpot_p50_ms_raw": 35.43,
"during_tpot_p90_ms_raw": 79.91,
"prefill_ttft_ms": 163.3,
"num_tokens_during_prefill_total": 4.0,
"per_stream_tokens_during": 4.0,
"effective_tpot_during_ms": 40.8,
"interference_penalty_x": 8.5,
"max_pd_disagg_benefit_ms_per_stream": 144.2
},
{
"decode_batch_size": 1,
"new_prefill_tokens": 8192,
"baseline_tpot_ms": 4.78,
"during_tpot_p50_ms_raw": 6.56,
"during_tpot_p90_ms_raw": 328.57,
"prefill_ttft_ms": 583.9,
"num_tokens_during_prefill_total": 5.0,
"per_stream_tokens_during": 5.0,
"effective_tpot_during_ms": 116.8,
"interference_penalty_x": 24.4,
"max_pd_disagg_benefit_ms_per_stream": 560.0
},
{
"decode_batch_size": 1,
"new_prefill_tokens": 32768,
"baseline_tpot_ms": 4.78,
"during_tpot_p50_ms_raw": 4.75,
"during_tpot_p90_ms_raw": 4.9,
"prefill_ttft_ms": 4515.3,
"num_tokens_during_prefill_total": 5.0,
"per_stream_tokens_during": 5.0,
"effective_tpot_during_ms": 903.1,
"interference_penalty_x": 188.8,
"max_pd_disagg_benefit_ms_per_stream": 4491.4
},
{
"decode_batch_size": 1,
"new_prefill_tokens": 65536,
"baseline_tpot_ms": 4.78,
"during_tpot_p50_ms_raw": 4.69,
"during_tpot_p90_ms_raw": 4.97,
"prefill_ttft_ms": 15567.6,
"num_tokens_during_prefill_total": 5.3,
"per_stream_tokens_during": 5.33,
"effective_tpot_during_ms": 2918.9,
"interference_penalty_x": 610.2,
"max_pd_disagg_benefit_ms_per_stream": 15542.0
},
{
"decode_batch_size": 1,
"new_prefill_tokens": 131072,
"baseline_tpot_ms": 4.78,
"during_tpot_p50_ms_raw": 4.71,
"during_tpot_p90_ms_raw": 4.9,
"prefill_ttft_ms": 56765.2,
"num_tokens_during_prefill_total": 5.7,
"per_stream_tokens_during": 5.67,
"effective_tpot_during_ms": 10017.4,
"interference_penalty_x": 2094.5,
"max_pd_disagg_benefit_ms_per_stream": 56738.1
},
{
"decode_batch_size": 4,
"new_prefill_tokens": 2048,
"baseline_tpot_ms": 5.62,
"during_tpot_p50_ms_raw": 22.18,
"during_tpot_p90_ms_raw": 84.85,
"prefill_ttft_ms": 138.3,
"num_tokens_during_prefill_total": 15.5,
"per_stream_tokens_during": 3.88,
"effective_tpot_during_ms": 35.7,
"interference_penalty_x": 6.3,
"max_pd_disagg_benefit_ms_per_stream": 116.6
},
{
"decode_batch_size": 4,
"new_prefill_tokens": 8192,
"baseline_tpot_ms": 6.08,
"during_tpot_p50_ms_raw": 8.45,
"during_tpot_p90_ms_raw": 515.39,
"prefill_ttft_ms": 574.1,
"num_tokens_during_prefill_total": 18.0,
"per_stream_tokens_during": 4.5,
"effective_tpot_during_ms": 127.6,
"interference_penalty_x": 21.0,
"max_pd_disagg_benefit_ms_per_stream": 546.8
},
{
"decode_batch_size": 4,
"new_prefill_tokens": 32768,
"baseline_tpot_ms": 6.09,
"during_tpot_p50_ms_raw": 9.83,
"during_tpot_p90_ms_raw": 1314.87,
"prefill_ttft_ms": 4529.1,
"num_tokens_during_prefill_total": 47.5,
"per_stream_tokens_during": 11.88,
"effective_tpot_during_ms": 381.4,
"interference_penalty_x": 62.7,
"max_pd_disagg_benefit_ms_per_stream": 4456.9
},
{
"decode_batch_size": 4,
"new_prefill_tokens": 65536,
"baseline_tpot_ms": 5.85,
"during_tpot_p50_ms_raw": 6.41,
"during_tpot_p90_ms_raw": 2077.47,
"prefill_ttft_ms": 15586.5,
"num_tokens_during_prefill_total": 79.0,
"per_stream_tokens_during": 19.75,
"effective_tpot_during_ms": 789.2,
"interference_penalty_x": 135.0,
"max_pd_disagg_benefit_ms_per_stream": 15471.0
},
{
"decode_batch_size": 4,
"new_prefill_tokens": 131072,
"baseline_tpot_ms": 6.27,
"during_tpot_p50_ms_raw": 6.3,
"during_tpot_p90_ms_raw": 4405.18,
"prefill_ttft_ms": 56697.1,
"num_tokens_during_prefill_total": 149.5,
"per_stream_tokens_during": 37.38,
"effective_tpot_during_ms": 1517.0,
"interference_penalty_x": 241.8,
"max_pd_disagg_benefit_ms_per_stream": 56462.6
},
{
"decode_batch_size": 8,
"new_prefill_tokens": 2048,
"baseline_tpot_ms": 7.71,
"during_tpot_p50_ms_raw": 8.38,
"during_tpot_p90_ms_raw": 98.98,
"prefill_ttft_ms": 143.1,
"num_tokens_during_prefill_total": 35.7,
"per_stream_tokens_during": 4.46,
"effective_tpot_during_ms": 32.1,
"interference_penalty_x": 4.2,
"max_pd_disagg_benefit_ms_per_stream": 108.8
},
{
"decode_batch_size": 8,
"new_prefill_tokens": 8192,
"baseline_tpot_ms": 7.69,
"during_tpot_p50_ms_raw": 9.34,
"during_tpot_p90_ms_raw": 519.29,
"prefill_ttft_ms": 583.3,
"num_tokens_during_prefill_total": 41.0,
"per_stream_tokens_during": 5.12,
"effective_tpot_during_ms": 113.8,
"interference_penalty_x": 14.8,
"max_pd_disagg_benefit_ms_per_stream": 543.9
},
{
"decode_batch_size": 8,
"new_prefill_tokens": 32768,
"baseline_tpot_ms": 7.42,
"during_tpot_p50_ms_raw": 11.61,
"during_tpot_p90_ms_raw": 1315.48,
"prefill_ttft_ms": 4520.3,
"num_tokens_during_prefill_total": 93.3,
"per_stream_tokens_during": 11.67,
"effective_tpot_during_ms": 387.5,
"interference_penalty_x": 52.2,
"max_pd_disagg_benefit_ms_per_stream": 4433.7
},
{
"decode_batch_size": 8,
"new_prefill_tokens": 65536,
"baseline_tpot_ms": 7.67,
"during_tpot_p50_ms_raw": 19.09,
"during_tpot_p90_ms_raw": 2471.4,
"prefill_ttft_ms": 15615.5,
"num_tokens_during_prefill_total": 165.0,
"per_stream_tokens_during": 20.62,
"effective_tpot_during_ms": 757.1,
"interference_penalty_x": 98.8,
"max_pd_disagg_benefit_ms_per_stream": 15457.4
},
{
"decode_batch_size": 8,
"new_prefill_tokens": 131072,
"baseline_tpot_ms": 7.74,
"during_tpot_p50_ms_raw": 11.51,
"during_tpot_p90_ms_raw": 4895.27,
"prefill_ttft_ms": 56991.4,
"num_tokens_during_prefill_total": 321.3,
"per_stream_tokens_during": 40.17,
"effective_tpot_during_ms": 1418.9,
"interference_penalty_x": 183.3,
"max_pd_disagg_benefit_ms_per_stream": 56680.4
}
]
}

46
analysis/mb1/summary.csv Normal file
View File

@@ -0,0 +1,46 @@
chunk_size,decode_batch_size,new_prefill_tokens,repetition,tpot_baseline_p50_ms,tpot_baseline_p90_ms,tpot_during_prefill_p50_ms,tpot_during_prefill_p90_ms,tpot_after_prefill_p50_ms,prefill_ttft_ms,num_tokens_during_prefill,tpot_penalty_p50_ms,tpot_penalty_ratio
8192,1,131072,0,4.777565016411245,4.900234832894057,4.701301048044115,4.948397364933044,0.0,56719.25117995124,7,-0.07626396836712956,0.9840370632099913
8192,1,131072,1,4.779465030878782,4.883405601140112,4.707481013610959,4.85471700085327,0.0,56696.089847013354,5,-0.07198401726782322,0.9849388965495606
8192,1,131072,2,4.790953011251986,4.880544205661863,4.728371975943446,4.907831805758178,0.0,56880.19039196661,5,-0.06258103530853987,0.9869376645603573
8192,1,2048,0,4.77885699365288,4.894876398611814,41.434570477576926,88.97331730695441,0.0,183.2046649651602,4,36.655713483924046,8.670393471202205
8192,1,2048,1,4.788161953911185,4.949774022679776,41.68213551747613,83.5143867880106,0.0,175.55483896285295,4,36.89397356356494,8.705247633369687
8192,1,2048,2,4.7893429873511195,4.874200583435595,23.186982492916286,67.25202781381086,0.0,131.23180496040732,4,18.397639505565166,4.841370215946989
8192,1,32768,0,4.789774015080184,4.870833398308605,4.738486022688448,4.886626999359578,0.0,4500.839321000967,5,-0.051287992391735315,0.9892921895207875
8192,1,32768,1,4.776834975928068,4.891659819986671,4.729953012429178,4.9245511763729155,0.0,4496.073378017172,5,-0.0468819634988904,0.9901855593221991
8192,1,32768,2,4.784431017469615,4.866032593417913,4.782894975505769,4.8977664206177,0.0,4549.013931944501,5,-0.0015360419638454914,0.9996789499193871
8192,1,65536,0,4.778854956384748,4.9255444086156785,4.633405013009906,4.895579582080245,0.0,15530.37424501963,5,-0.1454499433748424,0.9695638506080803
8192,1,65536,1,4.784283053595573,4.8808404128067195,4.754905996378511,4.985795798711479,0.0,15584.887631004676,5,-0.02937705721706152,0.99385967408534
8192,1,65536,2,4.787993966601789,4.9004736240021884,4.6836750116199255,5.0271204963792115,0.0,15587.390075030271,6,-0.1043189549818635,0.9782123879625725
8192,1,8192,0,4.785028984770179,4.878618801012635,7.490115996915847,324.06569679733366,0.0,573.2795029762201,5,2.7050870121456683,1.565323014919123
8192,1,8192,1,4.778591974172741,4.899543372448534,5.9131429879926145,336.8099076091312,0.0,606.6823820001446,5,1.1345510138198733,1.237423705550061
8192,1,8192,2,4.78826800826937,4.90188361145556,6.276679981965572,324.8370993998833,0.0,571.7499859747477,5,1.488411973696202,1.310845585736994
8192,4,131072,0,6.113810988608748,6.309205386787653,0.0,0.0,0.0,56702.702289039735,0,-6.113810988608748,0.0
8192,4,131072,1,6.630807969486341,7.086459483252838,6.2820459716022015,4400.500871409893,0.0,56807.70832300186,150,-0.3487619978841394,0.9474027902045915
8192,4,131072,2,6.073819473385811,6.344516028184444,6.326125003397465,4409.856556192978,0.0,56580.784838995896,149,0.2523055300116539,1.0415398467335428
8192,4,2048,0,5.402160517405719,5.543816485442221,6.210724503034726,84.62208869168535,6.125201500253752,140.3041940066032,18,0.8085639856290072,1.1496741873966574
8192,4,2048,1,6.067108013667166,6.381415005307645,0.0,0.0,0.0,140.06177097326145,0,-6.067108013667166,0.0
8192,4,2048,2,5.400336522143334,5.536347016459331,38.15686801681295,85.07051098858938,5.25214200024493,134.67552902875468,13,32.756531494669616,7.065646346363043
8192,4,32768,0,6.115561525803059,6.369604001520202,7.216634490760043,1314.6978712815326,5.17624247004278,4522.433568025008,50,1.101072964956984,1.1800444587649532
8192,4,32768,1,6.070095987524837,6.3612310332246125,0.0,0.0,0.0,4508.074064040557,0,-6.070095987524837,0.0
8192,4,32768,2,6.0734800063073635,6.312666402664036,12.442811043001711,1315.0411327951588,4.754714027512819,4556.892123946454,45,6.369331036694348,2.0487119460473635
8192,4,65536,0,5.406292999396101,5.540905491216108,0.0,0.0,0.0,15581.590663990937,0,-5.406292999396101,0.0
8192,4,65536,1,6.076910009142011,6.315114628523588,0.0,0.0,0.0,15574.196094006766,0,-6.076910009142011,0.0
8192,4,65536,2,6.060379033442587,6.384042033459991,6.411670008674264,2077.4700703914277,4.8022730043157935,15603.720718005206,79,0.3512909752316773,1.0579651822589267
8192,4,8192,0,6.110575021011755,6.416070973500609,8.451583969872445,515.3855616226792,5.358011490898207,574.6672929963097,18,2.34100894886069,1.3831077993169092
8192,4,8192,1,6.051429023500532,6.398122606333345,0.0,0.0,0.0,573.6081749782898,0,-6.051429023500532,0.0
8192,4,8192,2,6.064729997888207,6.366449000779539,0.0,0.0,0.0,574.1707819979638,0,-6.064729997888207,0.0
8192,8,131072,0,7.737616979284212,7.99839201499708,10.740376019384712,4742.438135773409,7.792441989295185,57010.66731195897,335,3.0027590401005,1.388072845701685
8192,8,131072,1,7.744895527139306,8.013638522243127,8.647068490972742,5123.228083999129,7.672236970392987,56970.40947602363,310,0.9021729638334364,1.116486137310966
8192,8,131072,2,7.740180502878502,8.016240986762568,15.140031988266855,4820.136589207682,7.68946303287521,56993.02393599646,319,7.3998514853883535,1.9560308680962177
8192,8,2048,0,7.741285488009453,8.022559515666217,8.103576023131609,124.87094267853536,7.6825070136692375,141.97922096354887,30,0.36229053512215614,1.046799789993963
8192,8,2048,1,7.728310010861605,8.021069981623441,8.17067950265482,84.82906777062453,7.745136506855488,144.1582590341568,38,0.4423694917932153,1.0572401328584768
8192,8,2048,2,7.662211020942777,8.034424972720444,8.87883099494502,87.23540699575096,7.592331967316568,143.27958395006135,39,1.216619974002242,1.1587818412566437
8192,8,32768,0,7.295333489309996,7.422819995554164,11.429400008637458,1315.43214758276,7.8034960315562785,4523.641717038117,94,4.134066519327462,1.5666727265292526
8192,8,32768,1,7.278127042809501,7.490781514206901,12.640403030673042,1315.491412486881,7.821676495950669,4519.993302994408,90,5.362275987863541,1.736765922925357
8192,8,32768,2,7.684049021918327,8.047712198458612,10.752685484476388,1315.5166705255397,7.80402502277866,4517.200137954205,96,3.068636462558061,1.3993514947399404
8192,8,65536,0,7.708174001891166,8.017168991500512,26.662671996746212,2496.8427699001018,7.768569514155388,15603.601168957539,160,18.954497994855046,3.459012729889679
8192,8,65536,1,7.594842027174309,7.9874323040712625,13.054963492322713,2459.1690181812737,7.54699349636212,15620.474929979537,174,5.460121465148404,1.7189249553331216
8192,8,65536,2,7.693717983784154,7.933055714238435,17.5579380011186,2458.176895044744,7.808708498487249,15622.32490995666,161,9.864220017334446,2.2821135422594123
8192,8,8192,0,7.636573514901102,7.904737605713308,10.151655005756766,514.8188057704829,7.7977380133233964,575.7745200535282,37,2.515081490855664,1.3293468577167538
8192,8,8192,1,7.687711506150663,7.965393498307094,9.002390026580542,524.0793236298487,7.753994490485638,592.1044679707848,45,1.3146785204298794,1.1710103870804793
8192,8,8192,2,7.756220467854291,8.035426988499239,8.864110975991935,518.9726910321042,7.770269992761314,581.98908099439,41,1.1078905081376433,1.1428389655411813
1 chunk_size decode_batch_size new_prefill_tokens repetition tpot_baseline_p50_ms tpot_baseline_p90_ms tpot_during_prefill_p50_ms tpot_during_prefill_p90_ms tpot_after_prefill_p50_ms prefill_ttft_ms num_tokens_during_prefill tpot_penalty_p50_ms tpot_penalty_ratio
2 8192 1 131072 0 4.777565016411245 4.900234832894057 4.701301048044115 4.948397364933044 0.0 56719.25117995124 7 -0.07626396836712956 0.9840370632099913
3 8192 1 131072 1 4.779465030878782 4.883405601140112 4.707481013610959 4.85471700085327 0.0 56696.089847013354 5 -0.07198401726782322 0.9849388965495606
4 8192 1 131072 2 4.790953011251986 4.880544205661863 4.728371975943446 4.907831805758178 0.0 56880.19039196661 5 -0.06258103530853987 0.9869376645603573
5 8192 1 2048 0 4.77885699365288 4.894876398611814 41.434570477576926 88.97331730695441 0.0 183.2046649651602 4 36.655713483924046 8.670393471202205
6 8192 1 2048 1 4.788161953911185 4.949774022679776 41.68213551747613 83.5143867880106 0.0 175.55483896285295 4 36.89397356356494 8.705247633369687
7 8192 1 2048 2 4.7893429873511195 4.874200583435595 23.186982492916286 67.25202781381086 0.0 131.23180496040732 4 18.397639505565166 4.841370215946989
8 8192 1 32768 0 4.789774015080184 4.870833398308605 4.738486022688448 4.886626999359578 0.0 4500.839321000967 5 -0.051287992391735315 0.9892921895207875
9 8192 1 32768 1 4.776834975928068 4.891659819986671 4.729953012429178 4.9245511763729155 0.0 4496.073378017172 5 -0.0468819634988904 0.9901855593221991
10 8192 1 32768 2 4.784431017469615 4.866032593417913 4.782894975505769 4.8977664206177 0.0 4549.013931944501 5 -0.0015360419638454914 0.9996789499193871
11 8192 1 65536 0 4.778854956384748 4.9255444086156785 4.633405013009906 4.895579582080245 0.0 15530.37424501963 5 -0.1454499433748424 0.9695638506080803
12 8192 1 65536 1 4.784283053595573 4.8808404128067195 4.754905996378511 4.985795798711479 0.0 15584.887631004676 5 -0.02937705721706152 0.99385967408534
13 8192 1 65536 2 4.787993966601789 4.9004736240021884 4.6836750116199255 5.0271204963792115 0.0 15587.390075030271 6 -0.1043189549818635 0.9782123879625725
14 8192 1 8192 0 4.785028984770179 4.878618801012635 7.490115996915847 324.06569679733366 0.0 573.2795029762201 5 2.7050870121456683 1.565323014919123
15 8192 1 8192 1 4.778591974172741 4.899543372448534 5.9131429879926145 336.8099076091312 0.0 606.6823820001446 5 1.1345510138198733 1.237423705550061
16 8192 1 8192 2 4.78826800826937 4.90188361145556 6.276679981965572 324.8370993998833 0.0 571.7499859747477 5 1.488411973696202 1.310845585736994
17 8192 4 131072 0 6.113810988608748 6.309205386787653 0.0 0.0 0.0 56702.702289039735 0 -6.113810988608748 0.0
18 8192 4 131072 1 6.630807969486341 7.086459483252838 6.2820459716022015 4400.500871409893 0.0 56807.70832300186 150 -0.3487619978841394 0.9474027902045915
19 8192 4 131072 2 6.073819473385811 6.344516028184444 6.326125003397465 4409.856556192978 0.0 56580.784838995896 149 0.2523055300116539 1.0415398467335428
20 8192 4 2048 0 5.402160517405719 5.543816485442221 6.210724503034726 84.62208869168535 6.125201500253752 140.3041940066032 18 0.8085639856290072 1.1496741873966574
21 8192 4 2048 1 6.067108013667166 6.381415005307645 0.0 0.0 0.0 140.06177097326145 0 -6.067108013667166 0.0
22 8192 4 2048 2 5.400336522143334 5.536347016459331 38.15686801681295 85.07051098858938 5.25214200024493 134.67552902875468 13 32.756531494669616 7.065646346363043
23 8192 4 32768 0 6.115561525803059 6.369604001520202 7.216634490760043 1314.6978712815326 5.17624247004278 4522.433568025008 50 1.101072964956984 1.1800444587649532
24 8192 4 32768 1 6.070095987524837 6.3612310332246125 0.0 0.0 0.0 4508.074064040557 0 -6.070095987524837 0.0
25 8192 4 32768 2 6.0734800063073635 6.312666402664036 12.442811043001711 1315.0411327951588 4.754714027512819 4556.892123946454 45 6.369331036694348 2.0487119460473635
26 8192 4 65536 0 5.406292999396101 5.540905491216108 0.0 0.0 0.0 15581.590663990937 0 -5.406292999396101 0.0
27 8192 4 65536 1 6.076910009142011 6.315114628523588 0.0 0.0 0.0 15574.196094006766 0 -6.076910009142011 0.0
28 8192 4 65536 2 6.060379033442587 6.384042033459991 6.411670008674264 2077.4700703914277 4.8022730043157935 15603.720718005206 79 0.3512909752316773 1.0579651822589267
29 8192 4 8192 0 6.110575021011755 6.416070973500609 8.451583969872445 515.3855616226792 5.358011490898207 574.6672929963097 18 2.34100894886069 1.3831077993169092
30 8192 4 8192 1 6.051429023500532 6.398122606333345 0.0 0.0 0.0 573.6081749782898 0 -6.051429023500532 0.0
31 8192 4 8192 2 6.064729997888207 6.366449000779539 0.0 0.0 0.0 574.1707819979638 0 -6.064729997888207 0.0
32 8192 8 131072 0 7.737616979284212 7.99839201499708 10.740376019384712 4742.438135773409 7.792441989295185 57010.66731195897 335 3.0027590401005 1.388072845701685
33 8192 8 131072 1 7.744895527139306 8.013638522243127 8.647068490972742 5123.228083999129 7.672236970392987 56970.40947602363 310 0.9021729638334364 1.116486137310966
34 8192 8 131072 2 7.740180502878502 8.016240986762568 15.140031988266855 4820.136589207682 7.68946303287521 56993.02393599646 319 7.3998514853883535 1.9560308680962177
35 8192 8 2048 0 7.741285488009453 8.022559515666217 8.103576023131609 124.87094267853536 7.6825070136692375 141.97922096354887 30 0.36229053512215614 1.046799789993963
36 8192 8 2048 1 7.728310010861605 8.021069981623441 8.17067950265482 84.82906777062453 7.745136506855488 144.1582590341568 38 0.4423694917932153 1.0572401328584768
37 8192 8 2048 2 7.662211020942777 8.034424972720444 8.87883099494502 87.23540699575096 7.592331967316568 143.27958395006135 39 1.216619974002242 1.1587818412566437
38 8192 8 32768 0 7.295333489309996 7.422819995554164 11.429400008637458 1315.43214758276 7.8034960315562785 4523.641717038117 94 4.134066519327462 1.5666727265292526
39 8192 8 32768 1 7.278127042809501 7.490781514206901 12.640403030673042 1315.491412486881 7.821676495950669 4519.993302994408 90 5.362275987863541 1.736765922925357
40 8192 8 32768 2 7.684049021918327 8.047712198458612 10.752685484476388 1315.5166705255397 7.80402502277866 4517.200137954205 96 3.068636462558061 1.3993514947399404
41 8192 8 65536 0 7.708174001891166 8.017168991500512 26.662671996746212 2496.8427699001018 7.768569514155388 15603.601168957539 160 18.954497994855046 3.459012729889679
42 8192 8 65536 1 7.594842027174309 7.9874323040712625 13.054963492322713 2459.1690181812737 7.54699349636212 15620.474929979537 174 5.460121465148404 1.7189249553331216
43 8192 8 65536 2 7.693717983784154 7.933055714238435 17.5579380011186 2458.176895044744 7.808708498487249 15622.32490995666 161 9.864220017334446 2.2821135422594123
44 8192 8 8192 0 7.636573514901102 7.904737605713308 10.151655005756766 514.8188057704829 7.7977380133233964 575.7745200535282 37 2.515081490855664 1.3293468577167538
45 8192 8 8192 1 7.687711506150663 7.965393498307094 9.002390026580542 524.0793236298487 7.753994490485638 592.1044679707848 45 1.3146785204298794 1.1710103870804793
46 8192 8 8192 2 7.756220467854291 8.035426988499239 8.864110975991935 518.9726910321042 7.770269992761314 581.98908099439 41 1.1078905081376433 1.1428389655411813

BIN
figs/mb1_interference.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 122 KiB

BIN
figs/pd_cost_vs_benefit.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 161 KiB

View File

@@ -0,0 +1,98 @@
#!/usr/bin/env python3
"""Aggregate MB1 results: per-(D, P) baseline vs during-prefill effective TPOT.
The driver's `tpot_during_prefill_p50_ms` is computed per-token and can be
misleading: chunked-prefill schedules decode alongside each prefill chunk,
so most decode-token intervals during the prefill burst look "normal" — but
each chunk completion creates a long-stall token. p50 hides this, p90
exposes it, but the BEST single-number summary of "how much was decode
slowed by prefill" is the *effective TPOT during the prefill burst*:
effective_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)
i.e. wall-clock time divided by per-stream tokens emitted in that window.
This captures the true average throughput of each decode stream while a
prefill burst is underway. Compared to baseline_TPOT it gives the
"phase-interference penalty" PD-disagg could in principle recover.
"""
from __future__ import annotations
import argparse
import csv
import json
import statistics
from collections import defaultdict
from pathlib import Path
def main() -> None:
p = argparse.ArgumentParser()
p.add_argument("--summary", type=Path, required=True)
p.add_argument("--out", type=Path, required=True)
args = p.parse_args()
rows = list(csv.DictReader(args.summary.open()))
by_dp: dict[tuple[int, int], list[dict]] = defaultdict(list)
for r in rows:
D = int(r["decode_batch_size"])
P = int(r["new_prefill_tokens"])
by_dp[(D, P)].append(r)
summary = []
for (D, P) in sorted(by_dp):
rs = by_dp[(D, P)]
base = statistics.mean(float(r["tpot_baseline_p50_ms"]) for r in rs)
during_p50_vals = [float(r["tpot_during_prefill_p50_ms"]) for r in rs
if float(r["tpot_during_prefill_p50_ms"]) > 0]
during_p90_vals = [float(r["tpot_during_prefill_p90_ms"]) for r in rs
if float(r["tpot_during_prefill_p90_ms"]) > 0]
ttft_vals = [float(r["prefill_ttft_ms"]) for r in rs]
n_tok_vals = [float(r["num_tokens_during_prefill"]) for r in rs
if float(r["num_tokens_during_prefill"]) > 0]
if not n_tok_vals or D == 0:
continue
ttft = statistics.mean(ttft_vals)
n_tok_total = statistics.mean(n_tok_vals)
per_stream_tokens = n_tok_total / D
eff_tpot_during = ttft / per_stream_tokens if per_stream_tokens > 0 else 0
penalty_x = eff_tpot_during / base if base > 0 else 0
# PD-disagg potential benefit (per stream, ms):
# if decode ran at baseline rate throughout the prefill window,
# it would emit ttft/baseline tokens. Actual is per_stream_tokens.
# Time saved if no interference = ttft - per_stream_tokens * baseline
time_saved_per_stream = ttft - per_stream_tokens * base
summary.append({
"decode_batch_size": D,
"new_prefill_tokens": P,
"baseline_tpot_ms": round(base, 2),
"during_tpot_p50_ms_raw": (round(statistics.mean(during_p50_vals), 2)
if during_p50_vals else None),
"during_tpot_p90_ms_raw": (round(statistics.mean(during_p90_vals), 2)
if during_p90_vals else None),
"prefill_ttft_ms": round(ttft, 1),
"num_tokens_during_prefill_total": round(n_tok_total, 1),
"per_stream_tokens_during": round(per_stream_tokens, 2),
"effective_tpot_during_ms": round(eff_tpot_during, 1),
"interference_penalty_x": round(penalty_x, 1),
"max_pd_disagg_benefit_ms_per_stream": round(time_saved_per_stream, 1),
})
args.out.parent.mkdir(parents=True, exist_ok=True)
args.out.write_text(json.dumps({"summary": summary}, indent=2))
print(f"{'D':>3} {'P':>7} {'base_ms':>9} {'eff_during_ms':>15} "
f"{'penalty':>10} {'pd_benefit_ms':>15}")
for s in summary:
print(f"{s['decode_batch_size']:>3} {s['new_prefill_tokens']:>7} "
f"{s['baseline_tpot_ms']:>9.2f} "
f"{s['effective_tpot_during_ms']:>15.1f} "
f"{s['interference_penalty_x']:>9.1f}x "
f"{s['max_pd_disagg_benefit_ms_per_stream']:>15.0f}")
print(f"\nwrote {args.out}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,422 @@
#!/usr/bin/env python3
"""Prefill-Decode Interference Microbenchmark Driver.
Measures TPOT degradation caused by prefill chunks interfering with ongoing decode batches.
Produces: f(decode_batch_size, new_prefill_tokens, chunk_size) -> TPOT_penalty_ms
Usage:
python driver.py --host 127.0.0.1 --port 8000 \
--decode-batch-sizes 0,1,2,4,6,8,12 \
--prefill-tokens 512,1024,2048,4096,8192,16384,32768 \
--reps 5 --output-dir results/
"""
import argparse
import asyncio
import json
import os
import time
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional
import httpx
import numpy as np
FIXED_SEED_PROMPT = (
"You are a helpful assistant. Please analyze the following document carefully "
"and provide a comprehensive summary covering all key points, main arguments, "
"supporting evidence, and conclusions. The document discusses various aspects "
"of distributed systems, including consensus protocols, fault tolerance mechanisms, "
"and performance optimization strategies for large-scale deployments.\n\n"
) * 50 # ~4k tokens worth of repeated text for prefix cache sharing
WARMUP_TOKENS = 32
MEASURE_WINDOW_TOKENS = 500
@dataclass
class Config:
decode_batch_size: int
new_prefill_tokens: int
chunk_size: int
model: str
repetition: int
@dataclass
class BaselineResult:
tpot_p50_ms: float
tpot_p90_ms: float
tpot_p99_ms: float
tokens_collected: int
@dataclass
class InterferenceResult:
tpot_during_prefill_p50_ms: float
tpot_during_prefill_p90_ms: float
tpot_after_prefill_p50_ms: float
prefill_ttft_ms: float
num_tokens_during_prefill: int
async def stream_tokens(client: httpx.AsyncClient, url: str, payload: dict) -> list[float]:
"""Send a streaming request, return list of timestamps (seconds) for each token."""
timestamps = []
async with client.stream("POST", url, json=payload, timeout=300.0) as resp:
resp.raise_for_status()
async for line in resp.aiter_lines():
if line.startswith("data: "):
data = line[6:]
if data.strip() == "[DONE]":
break
try:
chunk = json.loads(data)
choices = chunk.get("choices", [])
if not choices:
continue
delta = choices[0].get("delta", {})
if "role" in delta:
continue
timestamps.append(time.perf_counter())
except json.JSONDecodeError:
continue
return timestamps
def compute_tpot(timestamps: list[float], skip_first: int = 0) -> np.ndarray:
"""Compute inter-token intervals in ms, skipping first N tokens."""
if len(timestamps) < skip_first + 2:
return np.array([])
ts = np.array(timestamps[skip_first:])
return np.diff(ts) * 1000.0 # seconds → ms
def make_decode_payload(model: str) -> dict:
return {
"model": model,
"messages": [{"role": "user", "content": FIXED_SEED_PROMPT}],
"max_tokens": WARMUP_TOKENS + MEASURE_WINDOW_TOKENS + 50,
"temperature": 0,
"stream": True,
}
def make_prefill_payload(model: str, num_tokens: int) -> dict:
import hashlib
import uuid
# Generate UNIQUE content every call to guarantee zero prefix cache hits.
# Calibration: each "Block N: <32-hex>" → ~35 tokens after tokenization
unique_id = f"{uuid.uuid4().hex}_{time.time_ns()}"
n_parts = max(1, num_tokens // 35)
content_parts = []
for i in range(n_parts):
seed = hashlib.md5(f"{unique_id}_{i}".encode()).hexdigest()
content_parts.append(f"Block {i}: {seed}")
content = " ".join(content_parts)
return {
"model": model,
"messages": [{"role": "user", "content": content}],
"max_tokens": 1,
"temperature": 0,
"stream": True,
}
async def wait_for_steady_state(decode_streams: list[asyncio.Task], min_tokens: int = 32):
"""Wait until all decode streams have emitted at least min_tokens."""
# We don't directly control this — we wait a fixed time based on expected TPOT
# At ~50ms/token, 32 tokens ≈ 1.6s. Wait 3s to be safe.
await asyncio.sleep(3.0)
async def run_baseline(
client: httpx.AsyncClient, url: str, model: str, decode_batch_size: int
) -> Optional[BaselineResult]:
"""Measure decode-only TPOT (no prefill interference)."""
if decode_batch_size == 0:
return BaselineResult(tpot_p50_ms=0, tpot_p90_ms=0, tpot_p99_ms=0, tokens_collected=0)
payloads = [make_decode_payload(model) for _ in range(decode_batch_size)]
tasks = [asyncio.create_task(stream_tokens(client, url, p)) for p in payloads]
all_timestamps = await asyncio.gather(*tasks, return_exceptions=True)
all_tpots = []
for ts in all_timestamps:
if isinstance(ts, Exception):
print(f" [WARN] decode stream error: {ts}")
continue
tpot = compute_tpot(ts, skip_first=WARMUP_TOKENS)
if len(tpot) > 0:
all_tpots.extend(tpot.tolist())
if not all_tpots:
return None
arr = np.array(all_tpots)
return BaselineResult(
tpot_p50_ms=float(np.percentile(arr, 50)),
tpot_p90_ms=float(np.percentile(arr, 90)),
tpot_p99_ms=float(np.percentile(arr, 99)),
tokens_collected=len(arr),
)
async def run_interference(
client: httpx.AsyncClient,
url: str,
model: str,
decode_batch_size: int,
new_prefill_tokens: int,
) -> Optional[InterferenceResult]:
"""Measure TPOT while a prefill request is being processed."""
if decode_batch_size == 0:
# No decode to interfere with; just measure prefill TTFT
prefill_payload = make_prefill_payload(model, new_prefill_tokens)
t_start = time.perf_counter()
ts = await stream_tokens(client, url, prefill_payload)
prefill_ttft = (ts[0] - t_start) * 1000.0 if ts else 0
return InterferenceResult(
tpot_during_prefill_p50_ms=0,
tpot_during_prefill_p90_ms=0,
tpot_after_prefill_p50_ms=0,
prefill_ttft_ms=prefill_ttft,
num_tokens_during_prefill=0,
)
# Phase 1: Start decode streams
decode_payloads = [make_decode_payload(model) for _ in range(decode_batch_size)]
decode_timestamps: list[list[float]] = [[] for _ in range(decode_batch_size)]
prefill_done_event = asyncio.Event()
prefill_ttft_ms = 0.0
prefill_inject_time = 0.0
async def decode_stream_with_tracking(idx: int, payload: dict):
timestamps = await stream_tokens(client, url, payload)
decode_timestamps[idx] = timestamps
async def prefill_after_warmup():
nonlocal prefill_ttft_ms, prefill_inject_time
# Wait for decode streams to stabilize
await asyncio.sleep(1.0)
prefill_inject_time = time.perf_counter()
prefill_payload = make_prefill_payload(model, new_prefill_tokens)
ts = await stream_tokens(client, url, prefill_payload)
if ts:
prefill_ttft_ms = (ts[0] - prefill_inject_time) * 1000.0
prefill_done_event.set()
# Launch all
decode_tasks = [
asyncio.create_task(decode_stream_with_tracking(i, p))
for i, p in enumerate(decode_payloads)
]
prefill_task = asyncio.create_task(prefill_after_warmup())
await asyncio.gather(*decode_tasks, prefill_task, return_exceptions=True)
# Analyze: split decode tokens into "during prefill" and "after prefill"
prefill_end_time = prefill_inject_time + prefill_ttft_ms / 1000.0
tpot_during = []
tpot_after = []
for ts_list in decode_timestamps:
if len(ts_list) < WARMUP_TOKENS + 5:
continue
for i in range(WARMUP_TOKENS + 1, len(ts_list)):
t_prev = ts_list[i - 1]
t_curr = ts_list[i]
interval_ms = (t_curr - t_prev) * 1000.0
if prefill_inject_time <= t_prev <= prefill_end_time:
tpot_during.append(interval_ms)
elif t_curr > prefill_end_time + 0.05: # 50ms after prefill settles
tpot_after.append(interval_ms)
during_arr = np.array(tpot_during) if tpot_during else np.array([0.0])
after_arr = np.array(tpot_after) if tpot_after else np.array([0.0])
return InterferenceResult(
tpot_during_prefill_p50_ms=float(np.percentile(during_arr, 50)),
tpot_during_prefill_p90_ms=float(np.percentile(during_arr, 90)),
tpot_after_prefill_p50_ms=float(np.percentile(after_arr, 50)),
prefill_ttft_ms=prefill_ttft_ms,
num_tokens_during_prefill=len(tpot_during),
)
async def run_single_config(
client: httpx.AsyncClient,
url: str,
model: str,
decode_batch_size: int,
new_prefill_tokens: int,
chunk_size: int,
rep: int,
output_dir: Path,
):
"""Run one (D, P) configuration."""
config = Config(
decode_batch_size=decode_batch_size,
new_prefill_tokens=new_prefill_tokens,
chunk_size=chunk_size,
model=model,
repetition=rep,
)
print(f" [rep {rep}] Running baseline (D={decode_batch_size})...")
baseline = await run_baseline(client, url, model, decode_batch_size)
if baseline is None:
print(f" [rep {rep}] Baseline failed, skipping")
return
# Brief cooldown between baseline and interference
await asyncio.sleep(2.0)
print(f" [rep {rep}] Running interference (D={decode_batch_size}, P={new_prefill_tokens})...")
interference = await run_interference(
client, url, model, decode_batch_size, new_prefill_tokens
)
if interference is None:
print(f" [rep {rep}] Interference measurement failed, skipping")
return
# Compute derived metrics
tpot_penalty_p50 = interference.tpot_during_prefill_p50_ms - baseline.tpot_p50_ms
penalty_ratio = (
interference.tpot_during_prefill_p50_ms / baseline.tpot_p50_ms
if baseline.tpot_p50_ms > 0 else 0
)
result = {
"config": asdict(config),
"baseline": asdict(baseline),
"interference": asdict(interference),
"derived": {
"tpot_penalty_p50_ms": tpot_penalty_p50,
"tpot_penalty_ratio": penalty_ratio,
},
}
# Save
fname = f"D{decode_batch_size}_P{new_prefill_tokens}_rep{rep}.json"
out_path = output_dir / fname
out_path.write_text(json.dumps(result, indent=2))
print(f" [rep {rep}] Done. penalty={tpot_penalty_p50:.1f}ms ratio={penalty_ratio:.2f}")
async def main():
parser = argparse.ArgumentParser(description="Prefill-Decode Interference Microbenchmark")
parser.add_argument("--host", default="127.0.0.1")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--model", default="Qwen3-Coder-30B-A3B-Instruct")
parser.add_argument("--decode-batch-sizes", default="0,1,2,4,6,8,12",
help="Comma-separated decode batch sizes")
parser.add_argument("--prefill-tokens", default="512,1024,2048,4096,8192,16384,32768",
help="Comma-separated prefill token counts")
parser.add_argument("--chunk-size", type=int, default=8192,
help="vLLM max_num_batched_tokens (effective chunk size)")
parser.add_argument("--reps", type=int, default=5)
parser.add_argument("--output-dir", default="results/interference")
args = parser.parse_args()
decode_sizes = [int(x) for x in args.decode_batch_sizes.split(",")]
prefill_tokens = [int(x) for x in args.prefill_tokens.split(",")]
output_dir = Path(args.output_dir) / f"chunk{args.chunk_size}"
output_dir.mkdir(parents=True, exist_ok=True)
url = f"http://{args.host}:{args.port}/v1/chat/completions"
print(f"Target: {url}")
print(f"Model: {args.model}")
print(f"Chunk size: {args.chunk_size}")
print(f"Decode batch sizes: {decode_sizes}")
print(f"Prefill tokens: {prefill_tokens}")
print(f"Repetitions: {args.reps}")
print(f"Output: {output_dir}")
print()
async with httpx.AsyncClient(timeout=httpx.Timeout(600.0)) as client:
# Sanity check: is the server up?
try:
resp = await client.get(f"http://{args.host}:{args.port}/v1/models")
resp.raise_for_status()
models = resp.json()
print(f"Server ready. Models: {[m['id'] for m in models.get('data', [])]}")
except Exception as e:
print(f"ERROR: Cannot reach server at {args.host}:{args.port}: {e}")
return
total_configs = len(decode_sizes) * len(prefill_tokens)
done = 0
for D in decode_sizes:
for P in prefill_tokens:
done += 1
print(f"\n[{done}/{total_configs}] D={D}, P={P}")
for rep in range(args.reps):
try:
await run_single_config(
client, url, args.model, D, P,
args.chunk_size, rep, output_dir,
)
except Exception as e:
print(f" [rep {rep}] ERROR: {e}")
# Cooldown between reps
await asyncio.sleep(1.0)
# Cooldown between configs
await asyncio.sleep(3.0)
print("\n\nDone! Results in:", output_dir)
# Generate summary CSV
await generate_summary(output_dir, args.chunk_size)
async def generate_summary(output_dir: Path, chunk_size: int):
"""Aggregate all per-run JSONs into a summary CSV."""
import csv
rows = []
for f in sorted(output_dir.glob("D*_P*_rep*.json")):
data = json.loads(f.read_text())
cfg = data["config"]
bl = data["baseline"]
itf = data["interference"]
drv = data["derived"]
rows.append({
"chunk_size": cfg["chunk_size"],
"decode_batch_size": cfg["decode_batch_size"],
"new_prefill_tokens": cfg["new_prefill_tokens"],
"repetition": cfg["repetition"],
"tpot_baseline_p50_ms": bl["tpot_p50_ms"],
"tpot_baseline_p90_ms": bl["tpot_p90_ms"],
"tpot_during_prefill_p50_ms": itf["tpot_during_prefill_p50_ms"],
"tpot_during_prefill_p90_ms": itf["tpot_during_prefill_p90_ms"],
"tpot_after_prefill_p50_ms": itf["tpot_after_prefill_p50_ms"],
"prefill_ttft_ms": itf["prefill_ttft_ms"],
"num_tokens_during_prefill": itf["num_tokens_during_prefill"],
"tpot_penalty_p50_ms": drv["tpot_penalty_p50_ms"],
"tpot_penalty_ratio": drv["tpot_penalty_ratio"],
})
if not rows:
return
csv_path = output_dir / "summary.csv"
with open(csv_path, "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(f"Summary CSV written: {csv_path}")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,80 @@
#!/usr/bin/env bash
# Launch a SINGLE vLLM instance on dash1 for MB1 (prefill-decode interference).
# No kv_connector — MB1 measures intra-GPU phase interference, not transfer.
# chunked_prefill is enabled by default in vLLM 0.18.1 (this is the regime
# we want to characterize: how much benefit can PD-disagg buy on top of
# the existing chunked-prefill colocated baseline?).
#
# Usage:
# GPU=0 PORT=8000 CHUNK_TOKENS=8192 bash mb1_launch.sh start
# bash mb1_launch.sh status
# bash mb1_launch.sh stop
set -eo pipefail
FRESH_ROOT="/home/admin/cpfs/wjh/agentic-kv-fresh"
VENV="${FRESH_ROOT}/.venv"
MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
LOGS_DIR="${LOGS_DIR:-${FRESH_ROOT}/mb1_logs}"
GPU="${GPU:-0}"
PORT="${PORT:-8000}"
MASTER="${MASTER:-29500}"
# max_num_batched_tokens — controls the chunked-prefill chunk granularity.
# vLLM 0.18.1 default is 8192; we keep that as the headline run and
# optionally repeat at 32768 to expose the chunk-size effect.
CHUNK_TOKENS="${CHUNK_TOKENS:-8192}"
mkdir -p "${LOGS_DIR}"
stop_local() {
pkill -9 -f "vllm serve.*--port ${PORT} " 2>/dev/null || true
pkill -9 -f "EngineCore" 2>/dev/null || true
sleep 2
}
case "${1:-start}" in
stop)
stop_local; exit 0;;
status)
if curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; then
echo "port ${PORT}: UP"
else
echo "port ${PORT}: DOWN"
fi
exit 0;;
start) ;;
*) echo "Unknown command: $1"; exit 1;;
esac
stop_local
source "${VENV}/bin/activate"
echo "[mb1] launching: gpu=${GPU} port=${PORT} chunk_tokens=${CHUNK_TOKENS} (no kv_connector)"
PYTHONHASHSEED=42 \
CUDA_VISIBLE_DEVICES="${GPU}" \
MASTER_PORT="${MASTER}" \
nohup vllm serve "${MODEL}" \
--host 0.0.0.0 --port "${PORT}" \
--tensor-parallel-size 1 \
--trust-remote-code --enable-prefix-caching \
--dtype auto --gpu-memory-utilization 0.9 \
--max-model-len 200000 \
--max-num-batched-tokens "${CHUNK_TOKENS}" \
--enable-prompt-tokens-details \
> "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" 2>&1 &
disown
echo "[mb1] waiting for /health on port ${PORT}..."
tries=0
while ! curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; do
tries=$((tries+1))
if [ ${tries} -gt 180 ]; then
echo "[mb1] FATAL port ${PORT} did not come up in 6 min"
tail -40 "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" || true
exit 1
fi
sleep 2
done
echo "[mb1] UP on $(hostname -s):${PORT} (GPU ${GPU}, chunk_tokens=${CHUNK_TOKENS})"

View File

@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.
Two outputs:
mb1_interference.png
Effective TPOT during prefill vs prefill size, one line per D.
Log-log. Annotates typical agentic decode duration (~100 ms) as a
horizontal band so reader can spot when decode would be stalled.
pd_cost_vs_benefit.png
The §3.2 headline. X axis: KV size (MiB). Two stacked curves:
- benefit ceiling (MB1) — at most one decode-duration per request
of phase isolation can be recovered. Drawn as a flat 100 ms line.
- cost (MB2) — Mooncake pure_transfer p50 at that size.
Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
structurally loses.
"""
from __future__ import annotations
import argparse
import json
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
def main() -> None:
p = argparse.ArgumentParser()
p.add_argument("--mb1", type=Path, required=True)
p.add_argument("--mb2-intra", type=Path, required=True)
p.add_argument("--mb2-inter", type=Path, default=None)
p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
args = p.parse_args()
mb1 = json.loads(args.mb1.read_text())["summary"]
# ---- mb1_interference.png ----
fig, ax = plt.subplots(figsize=(9, 5.5))
Ds = sorted({s["decode_batch_size"] for s in mb1})
colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
for D in Ds:
rows = [s for s in mb1 if s["decode_batch_size"] == D]
rows.sort(key=lambda s: s["new_prefill_tokens"])
xs = [s["new_prefill_tokens"] for s in rows]
ys = [s["effective_tpot_during_ms"] for s in rows]
ax.plot(xs, ys, "o-", lw=2, markersize=7,
color=colors.get(D, "gray"),
label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")
for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
(100, "agentic decode (~100 ms)"),
(200, "long agentic decode (~200 ms)")]:
ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")
ax.set_xscale("log"); ax.set_yscale("log")
ax.set_xlabel("Prefill burst size (tokens, log)")
ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
"(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
ax.grid(True, which="both", alpha=0.3)
ax.legend(loc="upper left", fontsize=9)
args.out_interf.parent.mkdir(parents=True, exist_ok=True)
fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig)
print(f"wrote {args.out_interf}")
# ---- pd_cost_vs_benefit.png ----
mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
intra_x_mib = [s["kv_mib"] for s in mb2_intra]
intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra]
fig, ax = plt.subplots(figsize=(9, 5.5))
ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
if args.mb2_inter:
mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
inter_x = [s["kv_mib"] for s in mb2_inter]
inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
alpha=0.7, label="MB2 inter-node (same numbers)")
# Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
label="MB1 max benefit ≤ agentic decode (~100 ms)")
ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
label="benefit range (50200 ms decode)")
# Mark agentic-tail request sizes
for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
(3072, "p90\n(~33k tok)"),
(6144, "p95\n(~65k tok)"),
(11500, "p99\n(11.5 GiB)")]:
ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
ha="center", va="bottom")
ax.set_xscale("log"); ax.set_yscale("log")
ax.set_xlim(40, 14000)
ax.set_ylim(1, 12000)
ax.set_xlabel("Per-request KV size (MiB, log)")
ax.set_ylabel("Time per request (ms, log)")
ax.set_title("§3.2 headline — PD-disagg KV transfer cost vs phase-isolation benefit\n"
"(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
ax.grid(True, which="both", alpha=0.3)
ax.legend(loc="upper left", fontsize=9)
# Add explanatory annotation
ax.text(10000, 5000,
"Cost > benefit for ANY KV size above\n"
"the green band (~80 MiB / ~830 tokens).\n"
"Below: cost is marginal (<10 ms) but\n"
"benefit is also small (decode is short).",
fontsize=9, color="#333",
ha="right", va="top",
bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))
fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig)
print(f"wrote {args.out_cb}")
if __name__ == "__main__":
main()