MB1: prefill-decode interference under chunked-prefill default; §3.2 headline
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.
Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
= prefill_ttft / (num_tokens_during_prefill / D). This is the average
time each decode stream takes per token during the burst.
p50 of inter-token intervals is deceptive (chunked-prefill makes most
intervals look normal); the burst-average gives the true cost.
Results (D=8 row, the most agentic-realistic case):
P (tokens) | prefill_ttft | per-stream TPOT during | penalty
2048 | 143 ms | 32 ms | 4×
8192 | 583 ms | 114 ms | 15×
32768 | 4520 ms | 387 ms | 52×
65536 | 15615 ms | 757 ms | 99×
131072 | 56991 ms | 1419 ms | 183×
Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.
Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)
Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
interleave decode more aggressively. Chunk-size sensitivity is
flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
177
analysis/mb1/README.md
Normal file
@@ -0,0 +1,177 @@
# MB1: Prefill–Decode Interference (chunked-prefill on, vLLM 0.18.1 default)

Persistent record of the phase-interference microbench used to put a
quantitative upper bound on **what PD-disaggregation can buy** under the
chunked-prefill-on baseline. Re-runs append a dated section at the
bottom; the **Summary** block is what gets cited.

---

## Summary (latest)

| Headline | Value |
|---|---|
| Baseline single-stream TPOT (D=1, idle GPU) | **4.8 ms** |
| Baseline per-stream TPOT (D=8, no prefill) | **7.7 ms** |
| Effective per-stream TPOT during **8k-token** prefill burst (D=8) | **114 ms (≈15× the D=8 baseline)** |
| Effective per-stream TPOT during **32k-token** prefill burst (D=8) | **387 ms (≈52×)** |
| Effective per-stream TPOT during **131k-token** prefill burst (D=8) | **1419 ms (≈183×)** |
| Maximum PD-disagg benefit per agentic decode | **≤ 50–200 ms** (= decode duration) |

**§3.2 headline (cost vs benefit, this run + MB2)**:

> Under chunked-prefill, every ongoing decode stream is essentially
> **halted while a prefill chunk is in flight**: per-stream effective
> TPOT during the burst ranges from 4× baseline (P = 2k, D = 8) to about
> 2 100× baseline (P = 131k, D = 1), scaling with prefill size. PD-disagg
> can recover this stall, but the recovery is bounded by the
> **decode duration** of the request being protected. For agentic requests,
> decode lasts 50–200 ms (tool-call output). MB2 shows PD-disagg pays
> 300 ms – 10 s of KV-transfer cost per request to do that recovery. The
> cost exceeds the benefit ceiling for any per-request KV ≥ ~80 MiB
> (~830 tokens), well below all agentic operating points. The benefit
> never beats the cost in this workload.

## Setup

| Component | Value |
|---|---|
| Host | dash1, H20 96 GiB, driver 570.133.20 |
| Venv | `/home/admin/cpfs/wjh/agentic-kv-fresh/.venv` |
| vLLM | 0.18.1 official wheel (chunked-prefill default-on, V1 engine) |
| Model | `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct` |
| Launch flags | `--tensor-parallel-size 1 --enable-prefix-caching --gpu-memory-utilization 0.9 --max-model-len 200000 --max-num-batched-tokens 8192` |
| kv_connector | **none** (this measures pure single-GPU phase interference; PD-disagg cost lives in MB2) |

## Method

Adapted from `microbench/interference/driver.py`:

1. Start D streaming decode requests on `/v1/chat/completions` with a
   long `max_tokens` cap. Discard the first 32 tokens as warmup.
2. After 1 s, inject one prefill-only request with `max_tokens=1` and
   an input of `P` synthetic tokens (uuid-seeded for zero prefix-cache
   reuse). Measure the prefill's TTFT.
3. Select the *during-prefill* tokens of each decode stream: those whose
   wall-clock timestamp falls inside
   `[prefill_inject_ts, prefill_inject_ts + prefill_ttft]`. Report the
   inter-token p50 / p90 over that window.
4. Bin a baseline run (D streams, no prefill injection) the same way.
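
The windowing in steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the driver's actual code: the function name `inter_token_percentiles`, the use of seconds for timestamps, and the choice to measure gaps only between consecutive tokens that both fall inside the window are all assumptions.

```python
import statistics


def inter_token_percentiles(token_ts, window_start, window_end):
    """Return p50/p90 of inter-token gaps (ms) for tokens inside the window.

    token_ts: per-token wall-clock timestamps in seconds for one decode stream
              (hypothetical input layout, sorted ascending).
    Returns None when the window holds too few tokens to form two gaps.
    """
    inside = [t for t in token_ts if window_start <= t <= window_end]
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(inside, inside[1:])]
    if len(gaps_ms) < 2:
        return None
    # quantiles(n=10) returns the 9 decile cut points; index 8 is the 90th.
    deciles = statistics.quantiles(gaps_ms, n=10, method="inclusive")
    return {
        "p50": statistics.median(gaps_ms),
        "p90": deciles[8],
        "n_tokens": len(inside),
    }


if __name__ == "__main__":
    # Synthetic stream: 5 ms cadence with one 320 ms stall in the middle.
    ts = [0.000, 0.005, 0.010, 0.330, 0.335, 0.340]
    print(inter_token_percentiles(ts, window_start=0.0, window_end=0.340))
```

In this synthetic stream the median gap stays near 5 ms while p90 is pulled up by the single stall.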

We additionally compute the **effective per-stream TPOT during the
prefill burst** as the single most informative summary:

```
eff_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)
```

This is the average time per token for each decode stream while a
prefill is in flight, i.e. the inverse of its average token rate over the
window. Compared with baseline TPOT it gives the real per-stream
throughput penalty. Chunked-prefill p50 looks deceptively fine because
most decode-token intervals during the burst are at normal speed; p90
sees the stall but is itself noisy; the effective TPOT is the cleanest
"average over the whole burst window" number.
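
As a worked instance of the formula, the sketch below recomputes the D=8, P=8192 row from its aggregated inputs in `breakdown.json` (`prefill_ttft_ms` 583.3, `num_tokens_during_prefill_total` 41.0, `baseline_tpot_ms` 7.69). The helper names `effective_tpot_ms` and `penalty_x` are illustrative and do not appear in the repository.

```python
def effective_tpot_ms(prefill_ttft_ms, num_tokens_during_prefill, d):
    """Wall-clock burst length divided by tokens emitted per stream."""
    per_stream_tokens = num_tokens_during_prefill / d
    if per_stream_tokens <= 0:
        raise ValueError("no decode tokens observed during the prefill window")
    return prefill_ttft_ms / per_stream_tokens


def penalty_x(prefill_ttft_ms, num_tokens_during_prefill, d, baseline_tpot_ms):
    """Ratio of effective TPOT during the burst to baseline TPOT."""
    eff = effective_tpot_ms(prefill_ttft_ms, num_tokens_during_prefill, d)
    return eff / baseline_tpot_ms


if __name__ == "__main__":
    # Inputs copied from the D=8, P=8192 entry of breakdown.json.
    eff = effective_tpot_ms(583.3, 41.0, 8)
    pen = penalty_x(583.3, 41.0, 8, 7.69)
    print(f"effective TPOT during burst: {eff:.1f} ms")  # 113.8 ms
    print(f"penalty vs baseline: {pen:.1f}x")            # 14.8x
```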

## Results: 2026-05-27, dash1 GPU 0, chunk_tokens=8192

3 D × 5 P × 3 reps. Aggregated by `analyze_mb1.py`.

| D | P (tok) | base TPOT (ms) | prefill_ttft (ms) | per-stream tokens during prefill | effective TPOT during prefill (ms) | penalty (× base) | max PD-disagg benefit per stream (ms) |
|--:|--:|--:|--:|--:|--:|--:|--:|
| 1 | 2 048 | 4.79 | 163 | 4.0 | 41 | 8× | 144 |
| 1 | 8 192 | 4.78 | 584 | 5.0 | 117 | 24× | 560 |
| 1 | 32 768 | 4.78 | 4 515 | 5.0 | 903 | 189× | 4 491 |
| 1 | 65 536 | 4.78 | 15 568 | 5.3 | 2 919 | 610× | 15 542 |
| 1 | 131 072 | 4.78 | 56 765 | 5.7 | 10 017 | 2 094× | 56 738 |
| 4 | 2 048 | 5.62 | 138 | 3.9 | 36 | 6× | 117 |
| 4 | 8 192 | 6.08 | 574 | 4.5 | 128 | 21× | 547 |
| 4 | 32 768 | 6.09 | 4 529 | 11.9 | 381 | 63× | 4 457 |
| 4 | 65 536 | 5.85 | 15 587 | 19.8 | 789 | 135× | 15 471 |
| 4 | 131 072 | 6.27 | 56 697 | 37.4 | 1 517 | 242× | 56 463 |
| 8 | 2 048 | 7.71 | 143 | 4.5 | 32 | 4× | 109 |
| 8 | 8 192 | 7.69 | 583 | 5.1 | 114 | 15× | 544 |
| 8 | 32 768 | 7.42 | 4 520 | 11.7 | 387 | 52× | 4 434 |
| 8 | 65 536 | 7.67 | 15 615 | 20.6 | 757 | 99× | 15 457 |
| 8 | 131 072 | 7.74 | 56 991 | 40.2 | 1 419 | 183× | 56 680 |

**Reading the table**:

- *Baseline TPOT* grows mildly with D (4.8 ms → 7.7 ms as D goes 1 → 8).
  Multi-stream decoding has small but nonzero contention even without
  prefill.
- *Effective TPOT during prefill* grows mostly with P: a single 8k prefill
  stalls decode for ~580 ms regardless of D, so each stream emits only a
  handful of tokens during that window and effective per-stream TPOT
  collapses to 100–130 ms. Larger prefill = more chunks = larger stall.
- *Penalty* is the ratio of effective TPOT to baseline TPOT. It is above
  50× for P ≥ 32k and above 500× for D=1 at P ≥ 65k.
- *Max PD-disagg benefit per stream* =
  `prefill_ttft − per_stream_tokens × baseline_TPOT` ≈ `prefill_ttft`
  (since interference essentially halts decode). This is the entire
  prefill duration's worth of decode time that could in principle be
  recovered.
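
The benefit column can be checked by hand. The sketch below applies the formula to the aggregated D=8, P=8192 values from `breakdown.json`; the function names are illustrative, and the "tokens at baseline rate" figure is a derived quantity that is not stored anywhere in the results.

```python
def max_pd_disagg_benefit_ms(prefill_ttft_ms, per_stream_tokens, baseline_tpot_ms):
    """Burst duration minus the time those tokens would have taken at baseline."""
    return prefill_ttft_ms - per_stream_tokens * baseline_tpot_ms


def tokens_at_baseline(prefill_ttft_ms, baseline_tpot_ms):
    """Tokens one stream would emit in the same window with no interference."""
    return prefill_ttft_ms / baseline_tpot_ms


if __name__ == "__main__":
    ttft_ms, total_tokens, d, base_ms = 583.3, 41.0, 8, 7.69
    per_stream = total_tokens / d  # 5.125 tokens per stream
    benefit = max_pd_disagg_benefit_ms(ttft_ms, per_stream, base_ms)
    print(f"max benefit per stream: {benefit:.1f} ms")  # 543.9 ms
    print(f"tokens possible at baseline: {tokens_at_baseline(ttft_ms, base_ms):.1f}")
    print(f"tokens actually emitted:     {per_stream:.1f}")
```

The gap between the two token counts is why the benefit is nearly equal to `prefill_ttft` itself.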

Two big caveats when applying this to **agentic** workloads:

1. **Decode is short** (~50–200 ms for tool-call output). The actual
   recoverable benefit per request is bounded by the decode duration,
   not by `prefill_ttft`. If a decode lasts 100 ms and a 5-second prefill
   collides with it, PD-disagg can save at most 100 ms, not 5 s.
2. **PD-disagg pays KV-transfer cost** (MB2: 300 ms – 10 s per request
   for agentic sizes). For any KV ≥ ~80 MiB the cost already exceeds the
   ~100 ms benefit ceiling. Cost > benefit across the whole agentic
   distribution.
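
The two caveats combine into a simple bound, sketched below under stated assumptions: the recoverable benefit is modelled as `min(decode_duration, prefill_stall)`, and the KV-transfer cost is passed in as a plain number taken from the MB2 range quoted above. Neither function exists in the repository; both names are hypothetical.

```python
def recoverable_benefit_ms(decode_duration_ms, prefill_stall_ms):
    """PD-disagg cannot save more time than the decode itself lasts."""
    return min(decode_duration_ms, prefill_stall_ms)


def net_gain_ms(decode_duration_ms, prefill_stall_ms, kv_transfer_ms):
    """Positive means PD-disagg pays off; negative means it costs more than it saves."""
    return recoverable_benefit_ms(decode_duration_ms, prefill_stall_ms) - kv_transfer_ms


if __name__ == "__main__":
    # Example from the text: 100 ms decode colliding with a 5 s prefill.
    print(recoverable_benefit_ms(100, 5000))             # 100
    # Most favourable corner for PD-disagg within the quoted ranges:
    # longest decode (200 ms), cheapest transfer (300 ms).
    print(net_gain_ms(200, 5000, kv_transfer_ms=300))    # -100
    # Least favourable corner: shortest decode, most expensive transfer.
    print(net_gain_ms(50, 5000, kv_transfer_ms=10_000))  # -9950
```

Even the most favourable corner of the quoted ranges comes out negative, which is the claim the cost-vs-benefit figure makes graphically.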

## §3.2 cost-vs-benefit figure

`figs/pd_cost_vs_benefit.png` overlays the MB1 benefit ceiling (50–200 ms
band, capped by decode duration) on the MB2 transfer-cost curve. The
cost curve crosses the benefit ceiling at around **80 MiB / 830
tokens** of KV, well below the trace-mean per-request KV (192 MiB ≈ 2k
tokens); for reference, agentic requests average 33k tokens, with p99 at
125k. For anything bigger, PD-disagg pays more than it can recover.

## Reproduction

```bash
# vllm pair-free single-instance launch
ssh dash1 'GPU=0 PORT=8000 CHUNK_TOKENS=8192 \
    bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh start'

# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
    python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_driver.py \
        --host 127.0.0.1 --port 8000 \
        --model /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --decode-batch-sizes 1,4,8 --prefill-tokens 2048,8192,32768,65536,131072 \
        --reps 3 --output-dir /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results'

# pull + analyze
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/summary.csv \
    analysis/mb1/summary.csv
.venv/bin/python microbench/fresh_setup/analyze_mb1.py \
    --summary analysis/mb1/summary.csv --out analysis/mb1/breakdown.json
.venv/bin/python microbench/fresh_setup/plot_mb1.py \
    --mb1 analysis/mb1/breakdown.json \
    --mb2-intra analysis/mb2/intra_kvboth_breakdown.json \
    --mb2-inter analysis/mb2/inter_kvboth_breakdown.json

# teardown
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop'
```

## Open questions / next runs

- **Chunk size sensitivity**: this run uses `--max-num-batched-tokens
  8192`. Sarathi-Serve goes smaller (e.g. 1024) and recovers more
  decode interleaving inside each prefill burst. Worth running
  chunk_tokens ∈ {1024, 2048, 4096, 16384} to map the chunk-size axis.
- **Higher D**: 12, 16 streams to see whether the penalty saturates or
  keeps shrinking per-stream.
- **Cross-validate effective_TPOT_during with token-time-series plot**:
  raw per-token timestamps could reveal whether the stall is a few big
  spikes or many small ones (currently inferred from p50/p90 spread).
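
When planning the chunk-size sweep, it may help to tabulate how many chunks each prefill would be split into. The sketch below is only a first-order count: it assumes the whole `--max-num-batched-tokens` budget goes to the prefill and ignores the few slots taken by decode tokens, which this run did not verify. `prefill_chunks` is a hypothetical helper, not part of the repository.

```python
import math


def prefill_chunks(prefill_tokens, chunk_tokens):
    """Number of scheduler steps needed to push a prefill through in chunks."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return math.ceil(prefill_tokens / chunk_tokens)


if __name__ == "__main__":
    prefill_sizes = [2048, 8192, 32768, 65536, 131072]  # P values from this sweep
    chunk_sizes = [1024, 2048, 4096, 8192, 16384]       # 8192 is the current run
    print("P".rjust(8), *(str(c).rjust(7) for c in chunk_sizes))
    for p in prefill_sizes:
        counts = (str(prefill_chunks(p, c)).rjust(7) for c in chunk_sizes)
        print(str(p).rjust(8), *counts)
```

The chunk count says nothing by itself about per-chunk latency, so it is a sizing aid for the sweep rather than a prediction of the penalty.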

## Run log

### 2026-05-27: dash1 GPU 0, chunk_tokens=8192

3 × 5 × 3 sweep. CSV: `analysis/mb1/summary.csv`. Per-config JSONs on
dash1 at `/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/`.
Figures: `figs/mb1_interference.png`, `figs/pd_cost_vs_benefit.png`.
199
analysis/mb1/breakdown.json
Normal file
@@ -0,0 +1,199 @@
{
  "summary": [
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 2048,
      "baseline_tpot_ms": 4.79,
      "during_tpot_p50_ms_raw": 35.43,
      "during_tpot_p90_ms_raw": 79.91,
      "prefill_ttft_ms": 163.3,
      "num_tokens_during_prefill_total": 4.0,
      "per_stream_tokens_during": 4.0,
      "effective_tpot_during_ms": 40.8,
      "interference_penalty_x": 8.5,
      "max_pd_disagg_benefit_ms_per_stream": 144.2
    },
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 8192,
      "baseline_tpot_ms": 4.78,
      "during_tpot_p50_ms_raw": 6.56,
      "during_tpot_p90_ms_raw": 328.57,
      "prefill_ttft_ms": 583.9,
      "num_tokens_during_prefill_total": 5.0,
      "per_stream_tokens_during": 5.0,
      "effective_tpot_during_ms": 116.8,
      "interference_penalty_x": 24.4,
      "max_pd_disagg_benefit_ms_per_stream": 560.0
    },
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 32768,
      "baseline_tpot_ms": 4.78,
      "during_tpot_p50_ms_raw": 4.75,
      "during_tpot_p90_ms_raw": 4.9,
      "prefill_ttft_ms": 4515.3,
      "num_tokens_during_prefill_total": 5.0,
      "per_stream_tokens_during": 5.0,
      "effective_tpot_during_ms": 903.1,
      "interference_penalty_x": 188.8,
      "max_pd_disagg_benefit_ms_per_stream": 4491.4
    },
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 65536,
      "baseline_tpot_ms": 4.78,
      "during_tpot_p50_ms_raw": 4.69,
      "during_tpot_p90_ms_raw": 4.97,
      "prefill_ttft_ms": 15567.6,
      "num_tokens_during_prefill_total": 5.3,
      "per_stream_tokens_during": 5.33,
      "effective_tpot_during_ms": 2918.9,
      "interference_penalty_x": 610.2,
      "max_pd_disagg_benefit_ms_per_stream": 15542.0
    },
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 131072,
      "baseline_tpot_ms": 4.78,
      "during_tpot_p50_ms_raw": 4.71,
      "during_tpot_p90_ms_raw": 4.9,
      "prefill_ttft_ms": 56765.2,
      "num_tokens_during_prefill_total": 5.7,
      "per_stream_tokens_during": 5.67,
      "effective_tpot_during_ms": 10017.4,
      "interference_penalty_x": 2094.5,
      "max_pd_disagg_benefit_ms_per_stream": 56738.1
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 2048,
      "baseline_tpot_ms": 5.62,
      "during_tpot_p50_ms_raw": 22.18,
      "during_tpot_p90_ms_raw": 84.85,
      "prefill_ttft_ms": 138.3,
      "num_tokens_during_prefill_total": 15.5,
      "per_stream_tokens_during": 3.88,
      "effective_tpot_during_ms": 35.7,
      "interference_penalty_x": 6.3,
      "max_pd_disagg_benefit_ms_per_stream": 116.6
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 8192,
      "baseline_tpot_ms": 6.08,
      "during_tpot_p50_ms_raw": 8.45,
      "during_tpot_p90_ms_raw": 515.39,
      "prefill_ttft_ms": 574.1,
      "num_tokens_during_prefill_total": 18.0,
      "per_stream_tokens_during": 4.5,
      "effective_tpot_during_ms": 127.6,
      "interference_penalty_x": 21.0,
      "max_pd_disagg_benefit_ms_per_stream": 546.8
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 32768,
      "baseline_tpot_ms": 6.09,
      "during_tpot_p50_ms_raw": 9.83,
      "during_tpot_p90_ms_raw": 1314.87,
      "prefill_ttft_ms": 4529.1,
      "num_tokens_during_prefill_total": 47.5,
      "per_stream_tokens_during": 11.88,
      "effective_tpot_during_ms": 381.4,
      "interference_penalty_x": 62.7,
      "max_pd_disagg_benefit_ms_per_stream": 4456.9
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 65536,
      "baseline_tpot_ms": 5.85,
      "during_tpot_p50_ms_raw": 6.41,
      "during_tpot_p90_ms_raw": 2077.47,
      "prefill_ttft_ms": 15586.5,
      "num_tokens_during_prefill_total": 79.0,
      "per_stream_tokens_during": 19.75,
      "effective_tpot_during_ms": 789.2,
      "interference_penalty_x": 135.0,
      "max_pd_disagg_benefit_ms_per_stream": 15471.0
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 131072,
      "baseline_tpot_ms": 6.27,
      "during_tpot_p50_ms_raw": 6.3,
      "during_tpot_p90_ms_raw": 4405.18,
      "prefill_ttft_ms": 56697.1,
      "num_tokens_during_prefill_total": 149.5,
      "per_stream_tokens_during": 37.38,
      "effective_tpot_during_ms": 1517.0,
      "interference_penalty_x": 241.8,
      "max_pd_disagg_benefit_ms_per_stream": 56462.6
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 2048,
      "baseline_tpot_ms": 7.71,
      "during_tpot_p50_ms_raw": 8.38,
      "during_tpot_p90_ms_raw": 98.98,
      "prefill_ttft_ms": 143.1,
      "num_tokens_during_prefill_total": 35.7,
      "per_stream_tokens_during": 4.46,
      "effective_tpot_during_ms": 32.1,
      "interference_penalty_x": 4.2,
      "max_pd_disagg_benefit_ms_per_stream": 108.8
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 8192,
      "baseline_tpot_ms": 7.69,
      "during_tpot_p50_ms_raw": 9.34,
      "during_tpot_p90_ms_raw": 519.29,
      "prefill_ttft_ms": 583.3,
      "num_tokens_during_prefill_total": 41.0,
      "per_stream_tokens_during": 5.12,
      "effective_tpot_during_ms": 113.8,
      "interference_penalty_x": 14.8,
      "max_pd_disagg_benefit_ms_per_stream": 543.9
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 32768,
      "baseline_tpot_ms": 7.42,
      "during_tpot_p50_ms_raw": 11.61,
      "during_tpot_p90_ms_raw": 1315.48,
      "prefill_ttft_ms": 4520.3,
      "num_tokens_during_prefill_total": 93.3,
      "per_stream_tokens_during": 11.67,
      "effective_tpot_during_ms": 387.5,
      "interference_penalty_x": 52.2,
      "max_pd_disagg_benefit_ms_per_stream": 4433.7
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 65536,
      "baseline_tpot_ms": 7.67,
      "during_tpot_p50_ms_raw": 19.09,
      "during_tpot_p90_ms_raw": 2471.4,
      "prefill_ttft_ms": 15615.5,
      "num_tokens_during_prefill_total": 165.0,
      "per_stream_tokens_during": 20.62,
      "effective_tpot_during_ms": 757.1,
      "interference_penalty_x": 98.8,
      "max_pd_disagg_benefit_ms_per_stream": 15457.4
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 131072,
      "baseline_tpot_ms": 7.74,
      "during_tpot_p50_ms_raw": 11.51,
      "during_tpot_p90_ms_raw": 4895.27,
      "prefill_ttft_ms": 56991.4,
      "num_tokens_during_prefill_total": 321.3,
      "per_stream_tokens_during": 40.17,
      "effective_tpot_during_ms": 1418.9,
      "interference_penalty_x": 183.3,
      "max_pd_disagg_benefit_ms_per_stream": 56680.4
    }
  ]
}
46
analysis/mb1/summary.csv
Normal file
@@ -0,0 +1,46 @@
chunk_size,decode_batch_size,new_prefill_tokens,repetition,tpot_baseline_p50_ms,tpot_baseline_p90_ms,tpot_during_prefill_p50_ms,tpot_during_prefill_p90_ms,tpot_after_prefill_p50_ms,prefill_ttft_ms,num_tokens_during_prefill,tpot_penalty_p50_ms,tpot_penalty_ratio
8192,1,131072,0,4.777565016411245,4.900234832894057,4.701301048044115,4.948397364933044,0.0,56719.25117995124,7,-0.07626396836712956,0.9840370632099913
8192,1,131072,1,4.779465030878782,4.883405601140112,4.707481013610959,4.85471700085327,0.0,56696.089847013354,5,-0.07198401726782322,0.9849388965495606
8192,1,131072,2,4.790953011251986,4.880544205661863,4.728371975943446,4.907831805758178,0.0,56880.19039196661,5,-0.06258103530853987,0.9869376645603573
8192,1,2048,0,4.77885699365288,4.894876398611814,41.434570477576926,88.97331730695441,0.0,183.2046649651602,4,36.655713483924046,8.670393471202205
8192,1,2048,1,4.788161953911185,4.949774022679776,41.68213551747613,83.5143867880106,0.0,175.55483896285295,4,36.89397356356494,8.705247633369687
8192,1,2048,2,4.7893429873511195,4.874200583435595,23.186982492916286,67.25202781381086,0.0,131.23180496040732,4,18.397639505565166,4.841370215946989
8192,1,32768,0,4.789774015080184,4.870833398308605,4.738486022688448,4.886626999359578,0.0,4500.839321000967,5,-0.051287992391735315,0.9892921895207875
8192,1,32768,1,4.776834975928068,4.891659819986671,4.729953012429178,4.9245511763729155,0.0,4496.073378017172,5,-0.0468819634988904,0.9901855593221991
8192,1,32768,2,4.784431017469615,4.866032593417913,4.782894975505769,4.8977664206177,0.0,4549.013931944501,5,-0.0015360419638454914,0.9996789499193871
8192,1,65536,0,4.778854956384748,4.9255444086156785,4.633405013009906,4.895579582080245,0.0,15530.37424501963,5,-0.1454499433748424,0.9695638506080803
8192,1,65536,1,4.784283053595573,4.8808404128067195,4.754905996378511,4.985795798711479,0.0,15584.887631004676,5,-0.02937705721706152,0.99385967408534
8192,1,65536,2,4.787993966601789,4.9004736240021884,4.6836750116199255,5.0271204963792115,0.0,15587.390075030271,6,-0.1043189549818635,0.9782123879625725
8192,1,8192,0,4.785028984770179,4.878618801012635,7.490115996915847,324.06569679733366,0.0,573.2795029762201,5,2.7050870121456683,1.565323014919123
8192,1,8192,1,4.778591974172741,4.899543372448534,5.9131429879926145,336.8099076091312,0.0,606.6823820001446,5,1.1345510138198733,1.237423705550061
8192,1,8192,2,4.78826800826937,4.90188361145556,6.276679981965572,324.8370993998833,0.0,571.7499859747477,5,1.488411973696202,1.310845585736994
8192,4,131072,0,6.113810988608748,6.309205386787653,0.0,0.0,0.0,56702.702289039735,0,-6.113810988608748,0.0
8192,4,131072,1,6.630807969486341,7.086459483252838,6.2820459716022015,4400.500871409893,0.0,56807.70832300186,150,-0.3487619978841394,0.9474027902045915
8192,4,131072,2,6.073819473385811,6.344516028184444,6.326125003397465,4409.856556192978,0.0,56580.784838995896,149,0.2523055300116539,1.0415398467335428
8192,4,2048,0,5.402160517405719,5.543816485442221,6.210724503034726,84.62208869168535,6.125201500253752,140.3041940066032,18,0.8085639856290072,1.1496741873966574
8192,4,2048,1,6.067108013667166,6.381415005307645,0.0,0.0,0.0,140.06177097326145,0,-6.067108013667166,0.0
8192,4,2048,2,5.400336522143334,5.536347016459331,38.15686801681295,85.07051098858938,5.25214200024493,134.67552902875468,13,32.756531494669616,7.065646346363043
8192,4,32768,0,6.115561525803059,6.369604001520202,7.216634490760043,1314.6978712815326,5.17624247004278,4522.433568025008,50,1.101072964956984,1.1800444587649532
8192,4,32768,1,6.070095987524837,6.3612310332246125,0.0,0.0,0.0,4508.074064040557,0,-6.070095987524837,0.0
8192,4,32768,2,6.0734800063073635,6.312666402664036,12.442811043001711,1315.0411327951588,4.754714027512819,4556.892123946454,45,6.369331036694348,2.0487119460473635
8192,4,65536,0,5.406292999396101,5.540905491216108,0.0,0.0,0.0,15581.590663990937,0,-5.406292999396101,0.0
8192,4,65536,1,6.076910009142011,6.315114628523588,0.0,0.0,0.0,15574.196094006766,0,-6.076910009142011,0.0
8192,4,65536,2,6.060379033442587,6.384042033459991,6.411670008674264,2077.4700703914277,4.8022730043157935,15603.720718005206,79,0.3512909752316773,1.0579651822589267
8192,4,8192,0,6.110575021011755,6.416070973500609,8.451583969872445,515.3855616226792,5.358011490898207,574.6672929963097,18,2.34100894886069,1.3831077993169092
8192,4,8192,1,6.051429023500532,6.398122606333345,0.0,0.0,0.0,573.6081749782898,0,-6.051429023500532,0.0
8192,4,8192,2,6.064729997888207,6.366449000779539,0.0,0.0,0.0,574.1707819979638,0,-6.064729997888207,0.0
8192,8,131072,0,7.737616979284212,7.99839201499708,10.740376019384712,4742.438135773409,7.792441989295185,57010.66731195897,335,3.0027590401005,1.388072845701685
8192,8,131072,1,7.744895527139306,8.013638522243127,8.647068490972742,5123.228083999129,7.672236970392987,56970.40947602363,310,0.9021729638334364,1.116486137310966
8192,8,131072,2,7.740180502878502,8.016240986762568,15.140031988266855,4820.136589207682,7.68946303287521,56993.02393599646,319,7.3998514853883535,1.9560308680962177
8192,8,2048,0,7.741285488009453,8.022559515666217,8.103576023131609,124.87094267853536,7.6825070136692375,141.97922096354887,30,0.36229053512215614,1.046799789993963
8192,8,2048,1,7.728310010861605,8.021069981623441,8.17067950265482,84.82906777062453,7.745136506855488,144.1582590341568,38,0.4423694917932153,1.0572401328584768
8192,8,2048,2,7.662211020942777,8.034424972720444,8.87883099494502,87.23540699575096,7.592331967316568,143.27958395006135,39,1.216619974002242,1.1587818412566437
8192,8,32768,0,7.295333489309996,7.422819995554164,11.429400008637458,1315.43214758276,7.8034960315562785,4523.641717038117,94,4.134066519327462,1.5666727265292526
8192,8,32768,1,7.278127042809501,7.490781514206901,12.640403030673042,1315.491412486881,7.821676495950669,4519.993302994408,90,5.362275987863541,1.736765922925357
8192,8,32768,2,7.684049021918327,8.047712198458612,10.752685484476388,1315.5166705255397,7.80402502277866,4517.200137954205,96,3.068636462558061,1.3993514947399404
8192,8,65536,0,7.708174001891166,8.017168991500512,26.662671996746212,2496.8427699001018,7.768569514155388,15603.601168957539,160,18.954497994855046,3.459012729889679
8192,8,65536,1,7.594842027174309,7.9874323040712625,13.054963492322713,2459.1690181812737,7.54699349636212,15620.474929979537,174,5.460121465148404,1.7189249553331216
8192,8,65536,2,7.693717983784154,7.933055714238435,17.5579380011186,2458.176895044744,7.808708498487249,15622.32490995666,161,9.864220017334446,2.2821135422594123
8192,8,8192,0,7.636573514901102,7.904737605713308,10.151655005756766,514.8188057704829,7.7977380133233964,575.7745200535282,37,2.515081490855664,1.3293468577167538
8192,8,8192,1,7.687711506150663,7.965393498307094,9.002390026580542,524.0793236298487,7.753994490485638,592.1044679707848,45,1.3146785204298794,1.1710103870804793
8192,8,8192,2,7.756220467854291,8.035426988499239,8.864110975991935,518.9726910321042,7.770269992761314,581.98908099439,41,1.1078905081376433,1.1428389655411813
BIN
figs/mb1_interference.png
Normal file
Binary file not shown. Size: 122 KiB.
BIN
figs/pd_cost_vs_benefit.png
Normal file
Binary file not shown. Size: 161 KiB.
98
microbench/fresh_setup/analyze_mb1.py
Normal file
@@ -0,0 +1,98 @@
#!/usr/bin/env python3
"""Aggregate MB1 results: per-(D, P) baseline vs during-prefill effective TPOT.

The driver's `tpot_during_prefill_p50_ms` is computed per-token and can be
misleading: chunked-prefill schedules decode alongside each prefill chunk,
so most decode-token intervals during the prefill burst look "normal", but
each chunk completion creates a long-stall token. p50 hides this, p90
exposes it, but the BEST single-number summary of "how much was decode
slowed by prefill" is the *effective TPOT during the prefill burst*:

    effective_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)

i.e. wall-clock time divided by per-stream tokens emitted in that window.
This captures the true average per-token latency of each decode stream
while a prefill burst is underway. Compared to baseline_TPOT it gives the
"phase-interference penalty" PD-disagg could in principle recover.
"""
from __future__ import annotations

import argparse
import csv
import json
import statistics
from collections import defaultdict
from pathlib import Path


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--summary", type=Path, required=True)
    p.add_argument("--out", type=Path, required=True)
    args = p.parse_args()

    # Read every per-repetition row, closing the CSV file promptly.
    with args.summary.open(newline="") as f:
        rows = list(csv.DictReader(f))

    # Group repetitions by (decode batch size D, prefill tokens P).
    by_dp: dict[tuple[int, int], list[dict]] = defaultdict(list)
    for r in rows:
        D = int(r["decode_batch_size"])
        P = int(r["new_prefill_tokens"])
        by_dp[(D, P)].append(r)

    summary = []
    for (D, P) in sorted(by_dp):
        rs = by_dp[(D, P)]
        base = statistics.mean(float(r["tpot_baseline_p50_ms"]) for r in rs)
        # Repetitions that recorded no decode tokens in the window report 0;
        # they are excluded from the during-prefill averages below.
        during_p50_vals = [float(r["tpot_during_prefill_p50_ms"]) for r in rs
                           if float(r["tpot_during_prefill_p50_ms"]) > 0]
        during_p90_vals = [float(r["tpot_during_prefill_p90_ms"]) for r in rs
                           if float(r["tpot_during_prefill_p90_ms"]) > 0]
        # Note: TTFT is averaged over all repetitions, including those
        # dropped from the token-count average.
        ttft_vals = [float(r["prefill_ttft_ms"]) for r in rs]
        n_tok_vals = [float(r["num_tokens_during_prefill"]) for r in rs
                      if float(r["num_tokens_during_prefill"]) > 0]

        if not n_tok_vals or D == 0:
            continue
        ttft = statistics.mean(ttft_vals)
        n_tok_total = statistics.mean(n_tok_vals)
        per_stream_tokens = n_tok_total / D
        eff_tpot_during = ttft / per_stream_tokens if per_stream_tokens > 0 else 0
        penalty_x = eff_tpot_during / base if base > 0 else 0

        # PD-disagg potential benefit (per stream, ms):
        #   if decode ran at baseline rate throughout the prefill window,
        #   it would emit ttft/baseline tokens. Actual is per_stream_tokens.
        #   Time saved if no interference = ttft - per_stream_tokens * baseline
        time_saved_per_stream = ttft - per_stream_tokens * base

        summary.append({
            "decode_batch_size": D,
            "new_prefill_tokens": P,
            "baseline_tpot_ms": round(base, 2),
            "during_tpot_p50_ms_raw": (round(statistics.mean(during_p50_vals), 2)
                                       if during_p50_vals else None),
            "during_tpot_p90_ms_raw": (round(statistics.mean(during_p90_vals), 2)
                                       if during_p90_vals else None),
            "prefill_ttft_ms": round(ttft, 1),
            "num_tokens_during_prefill_total": round(n_tok_total, 1),
            "per_stream_tokens_during": round(per_stream_tokens, 2),
            "effective_tpot_during_ms": round(eff_tpot_during, 1),
            "interference_penalty_x": round(penalty_x, 1),
            "max_pd_disagg_benefit_ms_per_stream": round(time_saved_per_stream, 1),
        })

    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(json.dumps({"summary": summary}, indent=2))

    # Human-readable table on stdout.
    print(f"{'D':>3} {'P':>7} {'base_ms':>9} {'eff_during_ms':>15} "
          f"{'penalty':>10} {'pd_benefit_ms':>15}")
    for s in summary:
        print(f"{s['decode_batch_size']:>3} {s['new_prefill_tokens']:>7} "
              f"{s['baseline_tpot_ms']:>9.2f} "
              f"{s['effective_tpot_during_ms']:>15.1f} "
              f"{s['interference_penalty_x']:>9.1f}x "
              f"{s['max_pd_disagg_benefit_ms_per_stream']:>15.0f}")
    print(f"\nwrote {args.out}")


if __name__ == "__main__":
    main()
422
microbench/fresh_setup/mb1_driver.py
Normal file
@@ -0,0 +1,422 @@
#!/usr/bin/env python3
"""Prefill-Decode Interference Microbenchmark Driver.

Measures TPOT degradation caused by prefill chunks interfering with ongoing decode batches.
Produces: f(decode_batch_size, new_prefill_tokens, chunk_size) -> TPOT_penalty_ms

Usage:
    python mb1_driver.py --host 127.0.0.1 --port 8000 \
        --decode-batch-sizes 0,1,2,4,6,8,12 \
        --prefill-tokens 512,1024,2048,4096,8192,16384,32768 \
        --reps 5 --output-dir results/
"""

import argparse
import asyncio
import json
import os
import time
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional

import httpx
import numpy as np


FIXED_SEED_PROMPT = (
    "You are a helpful assistant. Please analyze the following document carefully "
    "and provide a comprehensive summary covering all key points, main arguments, "
    "supporting evidence, and conclusions. The document discusses various aspects "
    "of distributed systems, including consensus protocols, fault tolerance mechanisms, "
    "and performance optimization strategies for large-scale deployments.\n\n"
) * 50  # ~4k tokens worth of repeated text for prefix cache sharing

WARMUP_TOKENS = 32
MEASURE_WINDOW_TOKENS = 500


@dataclass
class Config:
    decode_batch_size: int
    new_prefill_tokens: int
    chunk_size: int
    model: str
    repetition: int


@dataclass
class BaselineResult:
    tpot_p50_ms: float
    tpot_p90_ms: float
    tpot_p99_ms: float
    tokens_collected: int


@dataclass
class InterferenceResult:
    tpot_during_prefill_p50_ms: float
    tpot_during_prefill_p90_ms: float
    tpot_after_prefill_p50_ms: float
    prefill_ttft_ms: float
    num_tokens_during_prefill: int


async def stream_tokens(client: httpx.AsyncClient, url: str, payload: dict) -> list[float]:
    """Send a streaming request, return list of timestamps (seconds) for each token."""
    timestamps = []
    async with client.stream("POST", url, json=payload, timeout=300.0) as resp:
        resp.raise_for_status()
        async for line in resp.aiter_lines():
            if line.startswith("data: "):
                data = line[6:]
                if data.strip() == "[DONE]":
                    break
                try:
                    chunk = json.loads(data)
                    choices = chunk.get("choices", [])
                    if not choices:
                        continue
                    delta = choices[0].get("delta", {})
                    if "role" in delta:
                        continue
                    timestamps.append(time.perf_counter())
                except json.JSONDecodeError:
                    continue
    return timestamps


def compute_tpot(timestamps: list[float], skip_first: int = 0) -> np.ndarray:
    """Compute inter-token intervals in ms, skipping first N tokens."""
    if len(timestamps) < skip_first + 2:
        return np.array([])
    ts = np.array(timestamps[skip_first:])
    return np.diff(ts) * 1000.0  # seconds → ms


def make_decode_payload(model: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": FIXED_SEED_PROMPT}],
        "max_tokens": WARMUP_TOKENS + MEASURE_WINDOW_TOKENS + 50,
        "temperature": 0,
        "stream": True,
    }


def make_prefill_payload(model: str, num_tokens: int) -> dict:
    import hashlib
    import uuid
    # Generate UNIQUE content every call to guarantee zero prefix cache hits.
    # Calibration: each "Block N: <32-hex>" → ~35 tokens after tokenization
    unique_id = f"{uuid.uuid4().hex}_{time.time_ns()}"
    n_parts = max(1, num_tokens // 35)
    content_parts = []
    for i in range(n_parts):
        seed = hashlib.md5(f"{unique_id}_{i}".encode()).hexdigest()
        content_parts.append(f"Block {i}: {seed}")
    content = " ".join(content_parts)
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 1,
        "temperature": 0,
        "stream": True,
    }


async def wait_for_steady_state(decode_streams: list[asyncio.Task], min_tokens: int = 32):
    """Approximate steady state by sleeping for a fixed interval.

    The streams are not inspected, so `decode_streams` and `min_tokens` are
    currently unused; the sleep is sized from an expected TPOT instead.
    """
    # At ~50 ms/token, 32 tokens ≈ 1.6 s. Wait 3 s to be safe.
    await asyncio.sleep(3.0)


async def run_baseline(
    client: httpx.AsyncClient, url: str, model: str, decode_batch_size: int
) -> Optional[BaselineResult]:
    """Measure decode-only TPOT (no prefill interference)."""
    if decode_batch_size == 0:
        return BaselineResult(tpot_p50_ms=0, tpot_p90_ms=0, tpot_p99_ms=0, tokens_collected=0)

    payloads = [make_decode_payload(model) for _ in range(decode_batch_size)]
    tasks = [asyncio.create_task(stream_tokens(client, url, p)) for p in payloads]

    all_timestamps = await asyncio.gather(*tasks, return_exceptions=True)

    all_tpots = []
    for ts in all_timestamps:
        if isinstance(ts, Exception):
            print(f" [WARN] decode stream error: {ts}")
            continue
        tpot = compute_tpot(ts, skip_first=WARMUP_TOKENS)
        if len(tpot) > 0:
            all_tpots.extend(tpot.tolist())

    if not all_tpots:
        return None

    arr = np.array(all_tpots)
    return BaselineResult(
        tpot_p50_ms=float(np.percentile(arr, 50)),
        tpot_p90_ms=float(np.percentile(arr, 90)),
        tpot_p99_ms=float(np.percentile(arr, 99)),
        tokens_collected=len(arr),
    )


async def run_interference(
    client: httpx.AsyncClient,
    url: str,
    model: str,
    decode_batch_size: int,
    new_prefill_tokens: int,
) -> Optional[InterferenceResult]:
    """Measure TPOT while a prefill request is being processed."""
    if decode_batch_size == 0:
        # No decode to interfere with; just measure prefill TTFT
        prefill_payload = make_prefill_payload(model, new_prefill_tokens)
        t_start = time.perf_counter()
        ts = await stream_tokens(client, url, prefill_payload)
        prefill_ttft = (ts[0] - t_start) * 1000.0 if ts else 0
        return InterferenceResult(
            tpot_during_prefill_p50_ms=0,
            tpot_during_prefill_p90_ms=0,
            tpot_after_prefill_p50_ms=0,
            prefill_ttft_ms=prefill_ttft,
            num_tokens_during_prefill=0,
        )

    # Phase 1: Start decode streams
    decode_payloads = [make_decode_payload(model) for _ in range(decode_batch_size)]

    decode_timestamps: list[list[float]] = [[] for _ in range(decode_batch_size)]
    prefill_done_event = asyncio.Event()
    prefill_ttft_ms = 0.0
    prefill_inject_time = 0.0

    async def decode_stream_with_tracking(idx: int, payload: dict):
        timestamps = await stream_tokens(client, url, payload)
        decode_timestamps[idx] = timestamps

    async def prefill_after_warmup():
        nonlocal prefill_ttft_ms, prefill_inject_time
        # Wait for decode streams to stabilize
        await asyncio.sleep(1.0)
        prefill_inject_time = time.perf_counter()
        prefill_payload = make_prefill_payload(model, new_prefill_tokens)
        ts = await stream_tokens(client, url, prefill_payload)
        if ts:
            prefill_ttft_ms = (ts[0] - prefill_inject_time) * 1000.0
        prefill_done_event.set()

    # Launch all
    decode_tasks = [
        asyncio.create_task(decode_stream_with_tracking(i, p))
        for i, p in enumerate(decode_payloads)
    ]
    prefill_task = asyncio.create_task(prefill_after_warmup())

    await asyncio.gather(*decode_tasks, prefill_task, return_exceptions=True)

    # Analyze: split decode tokens into "during prefill" and "after prefill"
    prefill_end_time = prefill_inject_time + prefill_ttft_ms / 1000.0

    tpot_during = []
    tpot_after = []

    for ts_list in decode_timestamps:
        if len(ts_list) < WARMUP_TOKENS + 5:
            continue
        for i in range(WARMUP_TOKENS + 1, len(ts_list)):
            t_prev = ts_list[i - 1]
            t_curr = ts_list[i]
            interval_ms = (t_curr - t_prev) * 1000.0

            if prefill_inject_time <= t_prev <= prefill_end_time:
                tpot_during.append(interval_ms)
            elif t_curr > prefill_end_time + 0.05:  # 50ms after prefill settles
                tpot_after.append(interval_ms)

    during_arr = np.array(tpot_during) if tpot_during else np.array([0.0])
    after_arr = np.array(tpot_after) if tpot_after else np.array([0.0])

    return InterferenceResult(
        tpot_during_prefill_p50_ms=float(np.percentile(during_arr, 50)),
        tpot_during_prefill_p90_ms=float(np.percentile(during_arr, 90)),
        tpot_after_prefill_p50_ms=float(np.percentile(after_arr, 50)),
        prefill_ttft_ms=prefill_ttft_ms,
        num_tokens_during_prefill=len(tpot_during),
    )


async def run_single_config(
    client: httpx.AsyncClient,
    url: str,
    model: str,
    decode_batch_size: int,
    new_prefill_tokens: int,
    chunk_size: int,
    rep: int,
    output_dir: Path,
):
    """Run one (D, P) configuration."""
    config = Config(
        decode_batch_size=decode_batch_size,
        new_prefill_tokens=new_prefill_tokens,
        chunk_size=chunk_size,
        model=model,
        repetition=rep,
    )

    print(f" [rep {rep}] Running baseline (D={decode_batch_size})...")
    baseline = await run_baseline(client, url, model, decode_batch_size)
    if baseline is None:
        print(f" [rep {rep}] Baseline failed, skipping")
        return

    # Brief cooldown between baseline and interference
    await asyncio.sleep(2.0)

    print(f" [rep {rep}] Running interference (D={decode_batch_size}, P={new_prefill_tokens})...")
    interference = await run_interference(
        client, url, model, decode_batch_size, new_prefill_tokens
    )
    if interference is None:
        print(f" [rep {rep}] Interference measurement failed, skipping")
        return

    # Compute derived metrics
    tpot_penalty_p50 = interference.tpot_during_prefill_p50_ms - baseline.tpot_p50_ms
    penalty_ratio = (
        interference.tpot_during_prefill_p50_ms / baseline.tpot_p50_ms
        if baseline.tpot_p50_ms > 0 else 0
    )

    result = {
        "config": asdict(config),
        "baseline": asdict(baseline),
        "interference": asdict(interference),
        "derived": {
            "tpot_penalty_p50_ms": tpot_penalty_p50,
            "tpot_penalty_ratio": penalty_ratio,
        },
    }

    # Save
    fname = f"D{decode_batch_size}_P{new_prefill_tokens}_rep{rep}.json"
    out_path = output_dir / fname
    out_path.write_text(json.dumps(result, indent=2))
    print(f" [rep {rep}] Done. penalty={tpot_penalty_p50:.1f}ms ratio={penalty_ratio:.2f}")


async def main():
    parser = argparse.ArgumentParser(description="Prefill-Decode Interference Microbenchmark")
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--model", default="Qwen3-Coder-30B-A3B-Instruct")
    parser.add_argument("--decode-batch-sizes", default="0,1,2,4,6,8,12",
                        help="Comma-separated decode batch sizes")
    parser.add_argument("--prefill-tokens", default="512,1024,2048,4096,8192,16384,32768",
                        help="Comma-separated prefill token counts")
    parser.add_argument("--chunk-size", type=int, default=8192,
                        help="vLLM max_num_batched_tokens (effective chunk size)")
    parser.add_argument("--reps", type=int, default=5)
    parser.add_argument("--output-dir", default="results/interference")
    args = parser.parse_args()

    decode_sizes = [int(x) for x in args.decode_batch_sizes.split(",")]
    prefill_tokens = [int(x) for x in args.prefill_tokens.split(",")]

    output_dir = Path(args.output_dir) / f"chunk{args.chunk_size}"
    output_dir.mkdir(parents=True, exist_ok=True)

    url = f"http://{args.host}:{args.port}/v1/chat/completions"
    print(f"Target: {url}")
    print(f"Model: {args.model}")
    print(f"Chunk size: {args.chunk_size}")
    print(f"Decode batch sizes: {decode_sizes}")
    print(f"Prefill tokens: {prefill_tokens}")
    print(f"Repetitions: {args.reps}")
    print(f"Output: {output_dir}")
    print()

    async with httpx.AsyncClient(timeout=httpx.Timeout(600.0)) as client:
        # Sanity check: is the server up?
        try:
            resp = await client.get(f"http://{args.host}:{args.port}/v1/models")
            resp.raise_for_status()
            models = resp.json()
            print(f"Server ready. Models: {[m['id'] for m in models.get('data', [])]}")
        except Exception as e:
            print(f"ERROR: Cannot reach server at {args.host}:{args.port}: {e}")
            return

        total_configs = len(decode_sizes) * len(prefill_tokens)
        done = 0

        for D in decode_sizes:
            for P in prefill_tokens:
                done += 1
                print(f"\n[{done}/{total_configs}] D={D}, P={P}")

                for rep in range(args.reps):
                    try:
                        await run_single_config(
                            client, url, args.model, D, P,
                            args.chunk_size, rep, output_dir,
                        )
                    except Exception as e:
                        print(f" [rep {rep}] ERROR: {e}")

                    # Cooldown between reps
                    await asyncio.sleep(1.0)

                # Cooldown between configs
                await asyncio.sleep(3.0)

    print("\n\nDone! Results in:", output_dir)
    # Generate summary CSV
    await generate_summary(output_dir, args.chunk_size)


async def generate_summary(output_dir: Path, chunk_size: int):
    """Aggregate all per-run JSONs into a summary CSV."""
    import csv

    rows = []
    for f in sorted(output_dir.glob("D*_P*_rep*.json")):
        data = json.loads(f.read_text())
        cfg = data["config"]
        bl = data["baseline"]
        itf = data["interference"]
        drv = data["derived"]
        rows.append({
            "chunk_size": cfg["chunk_size"],
            "decode_batch_size": cfg["decode_batch_size"],
            "new_prefill_tokens": cfg["new_prefill_tokens"],
            "repetition": cfg["repetition"],
            "tpot_baseline_p50_ms": bl["tpot_p50_ms"],
            "tpot_baseline_p90_ms": bl["tpot_p90_ms"],
            "tpot_during_prefill_p50_ms": itf["tpot_during_prefill_p50_ms"],
            "tpot_during_prefill_p90_ms": itf["tpot_during_prefill_p90_ms"],
            "tpot_after_prefill_p50_ms": itf["tpot_after_prefill_p50_ms"],
            "prefill_ttft_ms": itf["prefill_ttft_ms"],
            "num_tokens_during_prefill": itf["num_tokens_during_prefill"],
            "tpot_penalty_p50_ms": drv["tpot_penalty_p50_ms"],
            "tpot_penalty_ratio": drv["tpot_penalty_ratio"],
        })

    if not rows:
        return

    csv_path = output_dir / "summary.csv"
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    print(f"Summary CSV written: {csv_path}")


if __name__ == "__main__":
    asyncio.run(main())
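The binning step inside `run_interference` can be isolated as a pure function over one stream's timestamps, which makes the window logic easy to check offline. This is a sketch under stated assumptions: `bin_intervals` is a hypothetical name, the parameters mirror the driver's `WARMUP_TOKENS` skip and 50 ms settle margin, and the timestamps in the usage note are synthetic.

```python
def bin_intervals(timestamps, inject_t, end_t, warmup=0, settle_s=0.05):
    """Split inter-token intervals (ms) into 'during prefill' and 'after prefill'.

    An interval is attributed to the prefill window when its START timestamp
    falls inside [inject_t, end_t]; it counts as 'after' only when its END
    timestamp is later than end_t + settle_s. Everything else is dropped.
    """
    during, after = [], []
    # Pair each timestamp with its successor, skipping the warmup tokens.
    for t_prev, t_curr in zip(timestamps[warmup:], timestamps[warmup + 1:]):
        interval_ms = (t_curr - t_prev) * 1000.0
        if inject_t <= t_prev <= end_t:
            during.append(interval_ms)
        elif t_curr > end_t + settle_s:
            after.append(interval_ms)
    return during, after
```

With timestamps `[0.0, 0.25, 0.5, 1.5, 1.75, 2.0, 2.25]`, a window of `[0.375, 1.5]` and `settle_s=0.125`, the one-second gap between 0.5 and 1.5 lands in the "during" bin. Because only a few long intervals occur during a burst, their count divided by the TTFT says more than their median, which is the reason the headline metric is a burst average.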
80
microbench/fresh_setup/mb1_launch.sh
Normal file
@@ -0,0 +1,80 @@
#!/usr/bin/env bash
# Launch a SINGLE vLLM instance on dash1 for MB1 (prefill-decode interference).
# No kv_connector: MB1 measures intra-GPU phase interference, not transfer.
# chunked_prefill is enabled by default in vLLM 0.18.1 (this is the regime
# we want to characterize: how much benefit can PD-disagg buy on top of
# the existing chunked-prefill colocated baseline?).
#
# Usage:
#   GPU=0 PORT=8000 CHUNK_TOKENS=8192 bash mb1_launch.sh start
#   bash mb1_launch.sh status
#   bash mb1_launch.sh stop

set -eo pipefail

FRESH_ROOT="/home/admin/cpfs/wjh/agentic-kv-fresh"
VENV="${FRESH_ROOT}/.venv"
MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
LOGS_DIR="${LOGS_DIR:-${FRESH_ROOT}/mb1_logs}"

GPU="${GPU:-0}"
PORT="${PORT:-8000}"
MASTER="${MASTER:-29500}"
# max_num_batched_tokens: controls the chunked-prefill chunk granularity.
# vLLM 0.18.1 default is 8192; we keep that as the headline run and
# optionally repeat at 32768 to expose the chunk-size effect.
CHUNK_TOKENS="${CHUNK_TOKENS:-8192}"

mkdir -p "${LOGS_DIR}"

stop_local() {
    pkill -9 -f "vllm serve.*--port ${PORT} " 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 2
}

case "${1:-start}" in
    stop)
        stop_local; exit 0;;
    status)
        if curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; then
            echo "port ${PORT}: UP"
        else
            echo "port ${PORT}: DOWN"
        fi
        exit 0;;
    start) ;;
    *) echo "Unknown command: $1"; exit 1;;
esac

stop_local
source "${VENV}/bin/activate"

echo "[mb1] launching: gpu=${GPU} port=${PORT} chunk_tokens=${CHUNK_TOKENS} (no kv_connector)"

PYTHONHASHSEED=42 \
CUDA_VISIBLE_DEVICES="${GPU}" \
MASTER_PORT="${MASTER}" \
nohup vllm serve "${MODEL}" \
    --host 0.0.0.0 --port "${PORT}" \
    --tensor-parallel-size 1 \
    --trust-remote-code --enable-prefix-caching \
    --dtype auto --gpu-memory-utilization 0.9 \
    --max-model-len 200000 \
    --max-num-batched-tokens "${CHUNK_TOKENS}" \
    --enable-prompt-tokens-details \
    > "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" 2>&1 &
disown

echo "[mb1] waiting for /health on port ${PORT}..."
tries=0
while ! curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; do
    tries=$((tries+1))
    if [ ${tries} -gt 180 ]; then
        echo "[mb1] FATAL port ${PORT} did not come up in 6 min"
        tail -40 "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" || true
        exit 1
    fi
    sleep 2
done
echo "[mb1] UP on $(hostname -s):${PORT} (GPU ${GPU}, chunk_tokens=${CHUNK_TOKENS})"
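A driver that starts the server itself may want the same bounded readiness wait without shelling out to `curl`. The following is a rough Python equivalent of the launcher's polling loop, offered as a sketch only. `http_probe` and `wait_for_health` are hypothetical helpers that are not part of this commit. The defaults mirror the script (180 retries, 2 s apart), and `probe` and `sleep` are injectable so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if a GET on url answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx response: not ready yet.
        return False


def wait_for_health(url, probe=http_probe, max_tries=180, interval_s=2.0,
                    sleep=time.sleep):
    """Poll url until probe succeeds; give up after max_tries sleeps."""
    for _ in range(max_tries):
        if probe(url):
            return True
        sleep(interval_s)
    # One last check after the final sleep, as the shell loop does.
    return probe(url)
```

A caller would use it as `wait_for_health("http://127.0.0.1:8000/health")` and abort the run when it returns `False`.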
129
microbench/fresh_setup/plot_mb1.py
Normal file
@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.

Two outputs:

  mb1_interference.png
      Effective TPOT during prefill vs prefill size, one line per D.
      Log-log. Annotates typical agentic decode durations (50, 100, 200 ms)
      as dotted horizontal lines so the reader can spot when decode would
      be stalled.

  pd_cost_vs_benefit.png
      The §3.2 headline. X axis: KV size (MiB). Two stacked curves:
        - benefit ceiling (MB1): at most one decode-duration per request
          of phase isolation can be recovered. Drawn as a flat 100 ms line
          inside a 50-200 ms band.
        - cost (MB2): Mooncake pure_transfer p50 at that size.
      Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
      structurally loses.
"""
from __future__ import annotations

import argparse
import json
from pathlib import Path

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--mb1", type=Path, required=True)
    p.add_argument("--mb2-intra", type=Path, required=True)
    p.add_argument("--mb2-inter", type=Path, default=None)
    p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
    p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
    args = p.parse_args()

    mb1 = json.loads(args.mb1.read_text())["summary"]

    # ---- mb1_interference.png ----
    fig, ax = plt.subplots(figsize=(9, 5.5))
    Ds = sorted({s["decode_batch_size"] for s in mb1})
    colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
    for D in Ds:
        rows = [s for s in mb1 if s["decode_batch_size"] == D]
        rows.sort(key=lambda s: s["new_prefill_tokens"])
        xs = [s["new_prefill_tokens"] for s in rows]
        ys = [s["effective_tpot_during_ms"] for s in rows]
        ax.plot(xs, ys, "o-", lw=2, markersize=7,
                color=colors.get(D, "gray"),
                label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")

    for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
                      (100, "agentic decode (~100 ms)"),
                      (200, "long agentic decode (~200 ms)")]:
        ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
        ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")

    ax.set_xscale("log"); ax.set_yscale("log")
    ax.set_xlabel("Prefill burst size (tokens, log)")
    ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
    ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
    ax.grid(True, which="both", alpha=0.3)
    ax.legend(loc="upper left", fontsize=9)
    args.out_interf.parent.mkdir(parents=True, exist_ok=True)
    fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig)
    print(f"wrote {args.out_interf}")

    # ---- pd_cost_vs_benefit.png ----
    mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
    mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
    intra_x_mib = [s["kv_mib"] for s in mb2_intra]
    intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra]

    fig, ax = plt.subplots(figsize=(9, 5.5))
    ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
            markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
    if args.mb2_inter:
        mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
        mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
        inter_x = [s["kv_mib"] for s in mb2_inter]
        inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
        ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
                alpha=0.7, label="MB2 inter-node (same numbers)")

    # Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
    ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
               label="MB1 max benefit ≤ agentic decode (~100 ms)")
    ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
               label="benefit range (50–200 ms decode)")

    # Mark agentic-tail request sizes
    for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
                        (3072, "p90\n(~33k tok)"),
                        (6144, "p95\n(~65k tok)"),
                        (11500, "p99\n(11.5 GiB)")]:
        ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
        ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
                ha="center", va="bottom")

    ax.set_xscale("log"); ax.set_yscale("log")
    ax.set_xlim(40, 14000)
    ax.set_ylim(1, 12000)
    ax.set_xlabel("Per-request KV size (MiB, log)")
    ax.set_ylabel("Time per request (ms, log)")
    ax.set_title("§3.2 headline: PD-disagg KV transfer cost vs phase-isolation benefit\n"
                 "(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
    ax.grid(True, which="both", alpha=0.3)
    ax.legend(loc="upper left", fontsize=9)

    # Add explanatory annotation
    ax.text(10000, 5000,
            "Cost > benefit for ANY KV size above\n"
            "the green band (~80 MiB / ~830 tokens).\n"
            "Below: cost is marginal (<10 ms) but\n"
            "benefit is also small (decode is short).",
            fontsize=9, color="#333",
            ha="right", va="top",
            bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))

    fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig)
    print(f"wrote {args.out_cb}")


if __name__ == "__main__":
    main()
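The annotation in the figure quotes a crossover point by eye. One way to compute it from the plotted points is to interpolate the cost curve in log-log space and solve for the KV size at which it reaches the benefit ceiling. This is a sketch, not part of the plotting script: `crossover_kv_mib` is a hypothetical helper, it assumes a non-empty list of strictly positive values, and the sample points in the usage note are invented rather than taken from MB2.

```python
import math


def crossover_kv_mib(kv_mib, cost_ms, benefit_ms):
    """Smallest KV size (MiB) at which transfer cost reaches benefit_ms.

    Interpolates linearly between adjacent points in log-log space, matching
    the axes of the figure. Returns the first x if the curve already starts
    at or above the ceiling, and None if it never reaches it.
    """
    points = sorted(zip(kv_mib, cost_ms))
    if points[0][1] >= benefit_ms:
        return points[0][0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 < benefit_ms <= y1:
            # Fraction of the way along this segment, measured in log(cost).
            frac = (math.log(benefit_ms) - math.log(y0)) / (math.log(y1) - math.log(y0))
            return math.exp(math.log(x0) + frac * (math.log(x1) - math.log(x0)))
    return None
```

For the invented points `([64, 256, 1024], [50, 200, 800])` and a 100 ms ceiling, the cost is proportional to size, so the crossover falls at about 128 MiB. Evaluating the function at 50, 100, and 200 ms would give the range of crossovers spanned by the green band.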