Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes (D = 1, 4, 8) × 5 prefill sizes
(P = 2048 to 131072 tokens) × 3 reps = 45 runs.
Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
  effective_tpot = prefill_ttft / (num_tokens_during_prefill / D).
  This is the average time per token that each decode stream sees during
  the burst. The p50 of inter-token intervals is deceptive (chunked-prefill
  makes most intervals look normal); the burst average gives the true cost.
  Example: at D=8, P=131072 the raw p50 is 11.5 ms while the burst
  average is 1419 ms.
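The binning and headline metric above can be sketched as follows. This is a
minimal illustration, not the code in driver.py: the function name, the
argument names and the timestamp layout (one sorted list of token arrival
times per decode stream, in ms) are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def effective_tpot_during_ms(token_times_ms, prefill_start_ms, prefill_ttft_ms):
    """Burst-average TPOT per decode stream while one prefill is in flight.

    token_times_ms: list of D sorted lists, one per decode stream
                    (hypothetical layout, assumed for this sketch).
    """
    d = len(token_times_ms)
    prefill_end_ms = prefill_start_ms + prefill_ttft_ms
    # Count tokens whose timestamps fall inside [start, end] on every stream.
    tokens_during = sum(
        bisect_right(ts, prefill_end_ms) - bisect_left(ts, prefill_start_ms)
        for ts in token_times_ms
    )
    if tokens_during == 0:
        return float("inf")  # no decode progress at all during the burst
    # prefill_ttft / (num_tokens_during_prefill / D)
    return prefill_ttft_ms / (tokens_during / d)


# Two streams, prefill in flight from t=100 ms to t=500 ms.
streams = [
    [90.0, 110.0, 300.0, 490.0, 510.0],
    [95.0, 120.0, 480.0, 505.0],
]
tpot = effective_tpot_during_ms(streams, prefill_start_ms=100.0, prefill_ttft_ms=400.0)
```

Five tokens land inside the window across two streams, so each stream
averages 2.5 tokens over 400 ms.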
Results (D=8 row, the most agentic-realistic case):

  P (tokens) | prefill_ttft (ms) | per-stream TPOT during (ms) | penalty
  -----------+-------------------+-----------------------------+--------
        2048 |               143 |                          32 |      4×
        8192 |               583 |                         114 |     15×
       32768 |              4520 |                         388 |     52×
       65536 |             15615 |                         757 |     99×
      131072 |             56991 |                        1419 |    183×

Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
§3.2 implication: the phase-isolation benefit PD-disagg can deliver per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 puts the KV-transfer cost of PD-disagg at
300 ms – 10 s for agentic-size requests, so cost exceeds benefit for every
KV size above ~80 MiB (well below the trace mean of 192 MiB).
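The cost/benefit argument can be written down as a small sketch. The helper
names are hypothetical and the model is deliberately crude: it assumes the
recoverable benefit is the interference stall capped by the decode duration,
and that the KV transfer sits fully on the critical path.

```python
def benefit_ceiling_ms(stall_ms: float, decode_duration_ms: float) -> float:
    # A request cannot recover more stall time than it spends decoding.
    return min(stall_ms, decode_duration_ms)


def net_gain_ms(stall_ms: float, decode_duration_ms: float, kv_transfer_ms: float) -> float:
    # Positive means PD-disagg pays off for this request; negative means it costs.
    return benefit_ceiling_ms(stall_ms, decode_duration_ms) - kv_transfer_ms


# Most favourable corner of the ranges quoted above: longest tool-call
# decode (200 ms) against the cheapest agentic-size transfer (300 ms),
# with the D=8, P=131072 stall from this sweep as the uncapped benefit.
best_case = net_gain_ms(stall_ms=56680.4, decode_duration_ms=200.0, kv_transfer_ms=300.0)
```

Even in that corner the net gain is -100 ms, because the cap from decode
duration, not the size of the stall, sets the benefit.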
The new figs/pd_cost_vs_benefit.png overlays the MB1 benefit ceiling
(50–200 ms band, capped by decode) onto the MB2 transfer-cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.
Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)
Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged for the next run.
- D ≤ 8 only; at higher D the penalty may saturate or shrink further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
199 lines
6.5 KiB
JSON
{
  "summary": [
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 2048,
      "baseline_tpot_ms": 4.79,
      "during_tpot_p50_ms_raw": 35.43,
      "during_tpot_p90_ms_raw": 79.91,
      "prefill_ttft_ms": 163.3,
      "num_tokens_during_prefill_total": 4.0,
      "per_stream_tokens_during": 4.0,
      "effective_tpot_during_ms": 40.8,
      "interference_penalty_x": 8.5,
      "max_pd_disagg_benefit_ms_per_stream": 144.2
    },
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 8192,
      "baseline_tpot_ms": 4.78,
      "during_tpot_p50_ms_raw": 6.56,
      "during_tpot_p90_ms_raw": 328.57,
      "prefill_ttft_ms": 583.9,
      "num_tokens_during_prefill_total": 5.0,
      "per_stream_tokens_during": 5.0,
      "effective_tpot_during_ms": 116.8,
      "interference_penalty_x": 24.4,
      "max_pd_disagg_benefit_ms_per_stream": 560.0
    },
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 32768,
      "baseline_tpot_ms": 4.78,
      "during_tpot_p50_ms_raw": 4.75,
      "during_tpot_p90_ms_raw": 4.9,
      "prefill_ttft_ms": 4515.3,
      "num_tokens_during_prefill_total": 5.0,
      "per_stream_tokens_during": 5.0,
      "effective_tpot_during_ms": 903.1,
      "interference_penalty_x": 188.8,
      "max_pd_disagg_benefit_ms_per_stream": 4491.4
    },
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 65536,
      "baseline_tpot_ms": 4.78,
      "during_tpot_p50_ms_raw": 4.69,
      "during_tpot_p90_ms_raw": 4.97,
      "prefill_ttft_ms": 15567.6,
      "num_tokens_during_prefill_total": 5.3,
      "per_stream_tokens_during": 5.33,
      "effective_tpot_during_ms": 2918.9,
      "interference_penalty_x": 610.2,
      "max_pd_disagg_benefit_ms_per_stream": 15542.0
    },
    {
      "decode_batch_size": 1,
      "new_prefill_tokens": 131072,
      "baseline_tpot_ms": 4.78,
      "during_tpot_p50_ms_raw": 4.71,
      "during_tpot_p90_ms_raw": 4.9,
      "prefill_ttft_ms": 56765.2,
      "num_tokens_during_prefill_total": 5.7,
      "per_stream_tokens_during": 5.67,
      "effective_tpot_during_ms": 10017.4,
      "interference_penalty_x": 2094.5,
      "max_pd_disagg_benefit_ms_per_stream": 56738.1
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 2048,
      "baseline_tpot_ms": 5.62,
      "during_tpot_p50_ms_raw": 22.18,
      "during_tpot_p90_ms_raw": 84.85,
      "prefill_ttft_ms": 138.3,
      "num_tokens_during_prefill_total": 15.5,
      "per_stream_tokens_during": 3.88,
      "effective_tpot_during_ms": 35.7,
      "interference_penalty_x": 6.3,
      "max_pd_disagg_benefit_ms_per_stream": 116.6
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 8192,
      "baseline_tpot_ms": 6.08,
      "during_tpot_p50_ms_raw": 8.45,
      "during_tpot_p90_ms_raw": 515.39,
      "prefill_ttft_ms": 574.1,
      "num_tokens_during_prefill_total": 18.0,
      "per_stream_tokens_during": 4.5,
      "effective_tpot_during_ms": 127.6,
      "interference_penalty_x": 21.0,
      "max_pd_disagg_benefit_ms_per_stream": 546.8
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 32768,
      "baseline_tpot_ms": 6.09,
      "during_tpot_p50_ms_raw": 9.83,
      "during_tpot_p90_ms_raw": 1314.87,
      "prefill_ttft_ms": 4529.1,
      "num_tokens_during_prefill_total": 47.5,
      "per_stream_tokens_during": 11.88,
      "effective_tpot_during_ms": 381.4,
      "interference_penalty_x": 62.7,
      "max_pd_disagg_benefit_ms_per_stream": 4456.9
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 65536,
      "baseline_tpot_ms": 5.85,
      "during_tpot_p50_ms_raw": 6.41,
      "during_tpot_p90_ms_raw": 2077.47,
      "prefill_ttft_ms": 15586.5,
      "num_tokens_during_prefill_total": 79.0,
      "per_stream_tokens_during": 19.75,
      "effective_tpot_during_ms": 789.2,
      "interference_penalty_x": 135.0,
      "max_pd_disagg_benefit_ms_per_stream": 15471.0
    },
    {
      "decode_batch_size": 4,
      "new_prefill_tokens": 131072,
      "baseline_tpot_ms": 6.27,
      "during_tpot_p50_ms_raw": 6.3,
      "during_tpot_p90_ms_raw": 4405.18,
      "prefill_ttft_ms": 56697.1,
      "num_tokens_during_prefill_total": 149.5,
      "per_stream_tokens_during": 37.38,
      "effective_tpot_during_ms": 1517.0,
      "interference_penalty_x": 241.8,
      "max_pd_disagg_benefit_ms_per_stream": 56462.6
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 2048,
      "baseline_tpot_ms": 7.71,
      "during_tpot_p50_ms_raw": 8.38,
      "during_tpot_p90_ms_raw": 98.98,
      "prefill_ttft_ms": 143.1,
      "num_tokens_during_prefill_total": 35.7,
      "per_stream_tokens_during": 4.46,
      "effective_tpot_during_ms": 32.1,
      "interference_penalty_x": 4.2,
      "max_pd_disagg_benefit_ms_per_stream": 108.8
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 8192,
      "baseline_tpot_ms": 7.69,
      "during_tpot_p50_ms_raw": 9.34,
      "during_tpot_p90_ms_raw": 519.29,
      "prefill_ttft_ms": 583.3,
      "num_tokens_during_prefill_total": 41.0,
      "per_stream_tokens_during": 5.12,
      "effective_tpot_during_ms": 113.8,
      "interference_penalty_x": 14.8,
      "max_pd_disagg_benefit_ms_per_stream": 543.9
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 32768,
      "baseline_tpot_ms": 7.42,
      "during_tpot_p50_ms_raw": 11.61,
      "during_tpot_p90_ms_raw": 1315.48,
      "prefill_ttft_ms": 4520.3,
      "num_tokens_during_prefill_total": 93.3,
      "per_stream_tokens_during": 11.67,
      "effective_tpot_during_ms": 387.5,
      "interference_penalty_x": 52.2,
      "max_pd_disagg_benefit_ms_per_stream": 4433.7
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 65536,
      "baseline_tpot_ms": 7.67,
      "during_tpot_p50_ms_raw": 19.09,
      "during_tpot_p90_ms_raw": 2471.4,
      "prefill_ttft_ms": 15615.5,
      "num_tokens_during_prefill_total": 165.0,
      "per_stream_tokens_during": 20.62,
      "effective_tpot_during_ms": 757.1,
      "interference_penalty_x": 98.8,
      "max_pd_disagg_benefit_ms_per_stream": 15457.4
    },
    {
      "decode_batch_size": 8,
      "new_prefill_tokens": 131072,
      "baseline_tpot_ms": 7.74,
      "during_tpot_p50_ms_raw": 11.51,
      "during_tpot_p90_ms_raw": 4895.27,
      "prefill_ttft_ms": 56991.4,
      "num_tokens_during_prefill_total": 321.3,
      "per_stream_tokens_during": 40.17,
      "effective_tpot_during_ms": 1418.9,
      "interference_penalty_x": 183.3,
      "max_pd_disagg_benefit_ms_per_stream": 56680.4
    }
  ]
}
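The derived fields in each row appear to follow from the raw ones. The sketch
below recomputes them for the D=8, P=8192 row. The effective-TPOT formula is
the one stated in the method recap; the benefit formula (prefill_ttft minus
the time the same tokens would take at baseline TPOT) is inferred from the
numbers and is not confirmed to be what analyze_mb1.py does. The function
name is hypothetical.

```python
def derive_fields(row: dict) -> dict:
    """Recompute the derived columns of one per-(D, P) aggregate row."""
    per_stream = row["num_tokens_during_prefill_total"] / row["decode_batch_size"]
    effective = row["prefill_ttft_ms"] / per_stream
    return {
        "per_stream_tokens_during": per_stream,
        "effective_tpot_during_ms": effective,
        "interference_penalty_x": effective / row["baseline_tpot_ms"],
        # Inferred: burst duration minus the baseline cost of the same tokens.
        "max_pd_disagg_benefit_ms_per_stream": (
            row["prefill_ttft_ms"] - per_stream * row["baseline_tpot_ms"]
        ),
    }


# Raw fields of the D=8, P=8192 row above.
row = {
    "decode_batch_size": 8,
    "baseline_tpot_ms": 7.69,
    "prefill_ttft_ms": 583.3,
    "num_tokens_during_prefill_total": 41.0,
}
derived = derive_fields(row)
```

For this row the recomputed values agree with the stored 113.8 ms, 14.8× and
543.9 ms to within rounding. Rows whose token counts are averages over the
3 reps can differ by a few tenths, since the stored fields were rounded
after averaging.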