Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.

Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
  = prefill_ttft / (num_tokens_during_prefill / D). This is the average
  wall-clock time each decode stream spends per token during the burst.
  p50 of inter-token intervals is deceptive (chunked-prefill makes most
  intervals look normal); the burst-average gives the true cost.
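
The binning and headline metric above can be sketched as follows. This is a minimal illustration, not the code in microbench/interference/driver.py: the function names, the half-open window convention, and the integer-millisecond timestamps are assumptions made for the example, and the stream data is synthetic.

```python
from statistics import median


def effective_tpot_ms(streams, start_ms, end_ms):
    """Burst-average per-stream TPOT: window length / (tokens in window / D).

    streams: one list of token-arrival timestamps (ms) per decode stream.
    The window is (start_ms, end_ms], an assumed convention.
    """
    d = len(streams)
    n_during = sum(1 for ts in streams for t in ts if start_ms < t <= end_ms)
    if d == 0 or n_during == 0:
        return None  # no decode tokens landed in the prefill window
    return (end_ms - start_ms) / (n_during / d)


def p50_interval_ms(streams, start_ms, end_ms):
    """Median inter-token gap, over gaps that end inside the window."""
    gaps = [
        b - a
        for ts in streams
        for a, b in zip(ts, ts[1:])
        if start_ms < b <= end_ms
    ]
    return median(gaps) if gaps else None


# Synthetic stream (hypothetical numbers): short gaps with two long stalls.
stream = [0, 10, 20, 30, 500, 510, 520, 530, 1000]
streams = [stream, stream]  # D = 2 identical streams

print(p50_interval_ms(streams, 0, 1000))    # median gap looks "normal"
print(effective_tpot_ms(streams, 0, 1000))  # burst average reflects the stalls
```

In this synthetic case the median gap is 10 ms while the burst average is 125 ms per token, which is the gap between p50 and the headline metric that the recap describes.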

Results (D=8 row, the most agentic-realistic case):

P (tokens) | prefill_ttft | per-stream TPOT during prefill | penalty
-----------+--------------+--------------------------------+--------
      2048 |       143 ms |                          32 ms |      4×
      8192 |       583 ms |                         114 ms |     15×
     32768 |      4520 ms |                         388 ms |     52×
     65536 |     15615 ms |                         757 ms |     99×
    131072 |     56991 ms |                        1419 ms |    183×

Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst,
each ongoing decode runs ~183× slower (i.e. essentially halted)
for ~57 seconds.
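
As a sanity check on the table, the headline formula can be inverted to recover how many tokens each decode stream emitted during a burst: per-stream tokens = prefill_ttft / effective TPOT. The sketch below uses only the D=8 values from the table and the approximate 7.7 ms baseline; the variable and function names are illustrative, and the baseline column is approximate because 7.7 ms is a rounded figure.

```python
# (P tokens, prefill_ttft ms, effective per-stream TPOT ms) from the D=8 table.
ROWS = [
    (2048, 143, 32),
    (8192, 583, 114),
    (32768, 4520, 388),
    (65536, 15615, 757),
    (131072, 56991, 1419),
]
BASELINE_TPOT_MS = 7.7  # approximate baseline quoted in the message


def tokens_per_stream(ttft_ms, eff_tpot_ms):
    """Invert eff_tpot = ttft / tokens_per_stream."""
    return ttft_ms / eff_tpot_ms


for p, ttft_ms, eff_ms in ROWS:
    emitted = tokens_per_stream(ttft_ms, eff_ms)
    at_baseline = ttft_ms / BASELINE_TPOT_MS
    print(f"P={p:>6}: ~{emitted:.1f} tokens/stream emitted, "
          f"~{at_baseline:.0f} expected at baseline")
```

For the 131072-token row this gives roughly 40 tokens per stream over the ~57 s burst, against roughly 7,400 at the baseline rate.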

§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 puts the KV-transfer cost of PD-disagg at
300 ms – 10 s for agentic-size requests. Cost exceeds benefit for every
KV size above ~80 MiB (well below the trace mean of 192 MiB).

The new figs/pd_cost_vs_benefit.png overlays the MB1 benefit ceiling
(50–200 ms band, capped by decode duration) onto the MB2 transfer-cost
curve and marks the agentic-distribution waypoints (trace mean, p90,
p95, p99) on the x-axis. Across the entire agentic distribution, the
cost curve sits above the benefit band.
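
One way to locate where a cost curve crosses the benefit ceiling is linear interpolation between sampled points. The sketch below is an assumption about how such a check could be done, not the logic in plot_mb1.py: `crossover_kv_mib` is a hypothetical helper, and the sample points are placeholders for illustration only, not MB2 measurements. Only the 200 ms ceiling comes from the text above.

```python
def crossover_kv_mib(curve, ceiling_ms):
    """Return the KV size (MiB) where transfer cost first reaches ceiling_ms.

    curve: (kv_mib, cost_ms) points sorted by kv_mib.
    Returns None if the cost never reaches the ceiling on the sampled range.
    """
    if not curve:
        return None
    if curve[0][1] >= ceiling_ms:
        return curve[0][0]  # already at or above the ceiling at the first sample
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if y0 < ceiling_ms <= y1:
            # Linear interpolation between the two bracketing samples.
            return x0 + (ceiling_ms - y0) * (x1 - x0) / (y1 - y0)
    return None


BENEFIT_CEILING_MS = 200  # upper edge of the 50-200 ms benefit band

# Placeholder points for illustration only; NOT measured MB2 data.
placeholder_curve = [(10, 100.0), (100, 400.0), (1000, 4000.0)]
print(crossover_kv_mib(placeholder_curve, BENEFIT_CEILING_MS))
```

Every KV size to the right of the returned value costs more to transfer than the most optimistic benefit, which is the comparison the figure makes visually.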

Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
  kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
  microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
  per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
  pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)

Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged for the next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
99 lines
4.3 KiB
Python
#!/usr/bin/env python3
"""Aggregate MB1 results: per-(D, P) baseline vs during-prefill effective TPOT.

The driver's `tpot_during_prefill_p50_ms` is computed per-token and can be
misleading: chunked-prefill schedules decode alongside each prefill chunk,
so most decode-token intervals during the prefill burst look "normal", but
each chunk completion creates a long-stall token. p50 hides this, p90
exposes it, but the BEST single-number summary of "how much was decode
slowed by prefill" is the *effective TPOT during the prefill burst*:

    effective_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)

i.e. wall-clock time divided by per-stream tokens emitted in that window.
This captures the true average throughput of each decode stream while a
prefill burst is underway. Compared to baseline_TPOT it gives the
"phase-interference penalty" PD-disagg could in principle recover.
"""
from __future__ import annotations

import argparse
import csv
import json
import statistics
from collections import defaultdict
from pathlib import Path


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--summary", type=Path, required=True)
    p.add_argument("--out", type=Path, required=True)
    args = p.parse_args()

    rows = list(csv.DictReader(args.summary.open()))
    by_dp: dict[tuple[int, int], list[dict]] = defaultdict(list)
    for r in rows:
        D = int(r["decode_batch_size"])
        P = int(r["new_prefill_tokens"])
        by_dp[(D, P)].append(r)

    summary = []
    for (D, P) in sorted(by_dp):
        rs = by_dp[(D, P)]
        base = statistics.mean(float(r["tpot_baseline_p50_ms"]) for r in rs)
        during_p50_vals = [float(r["tpot_during_prefill_p50_ms"]) for r in rs
                           if float(r["tpot_during_prefill_p50_ms"]) > 0]
        during_p90_vals = [float(r["tpot_during_prefill_p90_ms"]) for r in rs
                           if float(r["tpot_during_prefill_p90_ms"]) > 0]
        ttft_vals = [float(r["prefill_ttft_ms"]) for r in rs]
        n_tok_vals = [float(r["num_tokens_during_prefill"]) for r in rs
                      if float(r["num_tokens_during_prefill"]) > 0]

        if not n_tok_vals or D == 0:
            continue
        ttft = statistics.mean(ttft_vals)
        n_tok_total = statistics.mean(n_tok_vals)
        per_stream_tokens = n_tok_total / D
        eff_tpot_during = ttft / per_stream_tokens if per_stream_tokens > 0 else 0
        penalty_x = eff_tpot_during / base if base > 0 else 0

        # PD-disagg potential benefit (per stream, ms):
        # if decode ran at baseline rate throughout the prefill window,
        # it would emit ttft/baseline tokens. Actual is per_stream_tokens.
        # Time saved if no interference = ttft - per_stream_tokens * baseline
        time_saved_per_stream = ttft - per_stream_tokens * base

        summary.append({
            "decode_batch_size": D,
            "new_prefill_tokens": P,
            "baseline_tpot_ms": round(base, 2),
            "during_tpot_p50_ms_raw": (round(statistics.mean(during_p50_vals), 2)
                                       if during_p50_vals else None),
            "during_tpot_p90_ms_raw": (round(statistics.mean(during_p90_vals), 2)
                                       if during_p90_vals else None),
            "prefill_ttft_ms": round(ttft, 1),
            "num_tokens_during_prefill_total": round(n_tok_total, 1),
            "per_stream_tokens_during": round(per_stream_tokens, 2),
            "effective_tpot_during_ms": round(eff_tpot_during, 1),
            "interference_penalty_x": round(penalty_x, 1),
            "max_pd_disagg_benefit_ms_per_stream": round(time_saved_per_stream, 1),
        })

    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(json.dumps({"summary": summary}, indent=2))

    print(f"{'D':>3} {'P':>7} {'base_ms':>9} {'eff_during_ms':>15} "
          f"{'penalty':>10} {'pd_benefit_ms':>15}")
    for s in summary:
        print(f"{s['decode_batch_size']:>3} {s['new_prefill_tokens']:>7} "
              f"{s['baseline_tpot_ms']:>9.2f} "
              f"{s['effective_tpot_during_ms']:>15.1f} "
              f"{s['interference_penalty_x']:>9.1f}x "
              f"{s['max_pd_disagg_benefit_ms_per_stream']:>15.0f}")
    print(f"\nwrote {args.out}")


if __name__ == "__main__":
    main()