agentic-kvc/microbench/fresh_setup/analyze_mb1.py
Gahow Wang 029821c1b6 MB1: prefill-decode interference under chunked-prefill default; §3.2 headline
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.

Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
  = prefill_ttft / (num_tokens_during_prefill / D), i.e. the average
  wall-clock time each decode stream spends per token during the burst.
  p50 of inter-token intervals is deceptive (chunked-prefill makes most
  intervals look normal); the burst-average gives the true cost.
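
The binning and headline metric in the recap above can be sketched as
follows. This is a minimal illustration, not the driver's code: the
function name, its arguments, and the per-stream timestamp layout are
assumptions, and timestamps are taken to be in milliseconds on the same
clock as the prefill start and end.

```python
from typing import Sequence


def effective_tpot_during_ms(
    decode_token_ts_ms: Sequence[Sequence[float]],
    prefill_start_ms: float,
    prefill_end_ms: float,
) -> float:
    """Burst-average per-stream TPOT while one prefill is in flight.

    decode_token_ts_ms holds one list of token arrival times per decode
    stream (hypothetical layout; the real driver may store this differently).
    """
    d = len(decode_token_ts_ms)
    # Count decode tokens, summed over all streams, that landed in the window.
    n_during = sum(
        1
        for stream in decode_token_ts_ms
        for t in stream
        if prefill_start_ms <= t < prefill_end_ms
    )
    if d == 0 or n_during == 0:
        # Sketch-only convention: no decode progress at all during the burst.
        return float("inf")
    prefill_ttft_ms = prefill_end_ms - prefill_start_ms
    # Wall-clock window divided by tokens emitted per stream in that window.
    return prefill_ttft_ms / (n_during / d)


# Two streams, 100 ms prefill window, two tokens per stream inside the window.
example = effective_tpot_during_ms([[10, 60, 120], [20, 70, 130]], 0, 100)
```

Here each stream emits 2 tokens in a 100 ms window, so the burst-average
is 50 ms per token even if the individual inter-token gaps look uneven.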

Results (D=8 row, the most agentic-realistic case):
  P (tokens) | prefill_ttft (ms) | per-stream TPOT during prefill (ms) | penalty vs baseline
        2048 |               143 |                                  32 |                  4×
        8192 |               583 |                                 114 |                 15×
       32768 |              4520 |                                 388 |                 52×
       65536 |             15615 |                                 757 |                 99×
      131072 |             56991 |                                1419 |                183×

Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
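
As a worked check of the 131k row, the arithmetic below uses only the
figures quoted above. The ~7.7 ms baseline is rounded, so the ratio lands
near 184× rather than exactly on the reported 183×; variable names are
illustrative.

```python
baseline_tpot_ms = 7.7     # approximate D=8 baseline quoted in the text
prefill_ttft_ms = 56991.0  # P=131072 row
eff_tpot_ms = 1419.0       # per-stream TPOT during the burst, same row

# Tokens each decode stream actually produced during the burst.
tokens_emitted = prefill_ttft_ms / eff_tpot_ms
# Tokens each stream would have produced at the baseline rate.
tokens_at_baseline = prefill_ttft_ms / baseline_tpot_ms
penalty_x = eff_tpot_ms / baseline_tpot_ms

print(f"{tokens_emitted:.0f} vs {tokens_at_baseline:.0f} tokens per stream, "
      f"~{penalty_x:.0f}x slower")
```

That is roughly 40 tokens per stream in 57 seconds, against about 7,400
at the baseline rate.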

§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
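
The cost-versus-benefit comparison can be sketched with the ranges quoted
above. The helper name is hypothetical, and treating the full decode
duration as the recoverable benefit is the optimistic ceiling, not a
measured saving.

```python
def net_benefit_upper_bound_ms(decode_duration_ms: float,
                               kv_transfer_ms: float) -> float:
    """Optimistic per-request gain: isolation saves at most the whole decode."""
    return decode_duration_ms - kv_transfer_ms


# Ranges quoted in the text for agentic-size requests.
decode_ms = (50.0, 200.0)         # tool-call output decode duration
transfer_ms = (300.0, 10_000.0)   # MB2 KV-transfer cost

# Longest decode paired with cheapest transfer, and the reverse.
best_case = net_benefit_upper_bound_ms(max(decode_ms), min(transfer_ms))
worst_case = net_benefit_upper_bound_ms(min(decode_ms), max(transfer_ms))
print(best_case, worst_case)
```

Even the most favourable pairing is net negative (-100 ms), which is the
point the headline figure makes.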

The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.

Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
  kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
  microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
  per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
  pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit across the
  agentic distribution)

Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:25:09 +08:00


#!/usr/bin/env python3
"""Aggregate MB1 results: per-(D, P) baseline vs during-prefill effective TPOT.

The driver's `tpot_during_prefill_p50_ms` is computed per-token and can be
misleading: chunked-prefill schedules decode alongside each prefill chunk,
so most decode-token intervals during the prefill burst look "normal", but
each chunk completion creates a long-stall token. p50 hides this, p90
exposes it, but the BEST single-number summary of "how much was decode
slowed by prefill" is the *effective TPOT during the prefill burst*:

    effective_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)

i.e. wall-clock time divided by per-stream tokens emitted in that window.
This captures the true average per-token latency of each decode stream while
a prefill burst is underway. Compared to baseline_TPOT it gives the
"phase-interference penalty" PD-disagg could in principle recover.
"""
from __future__ import annotations

import argparse
import csv
import json
import statistics
from collections import defaultdict
from pathlib import Path


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--summary", type=Path, required=True)
    p.add_argument("--out", type=Path, required=True)
    args = p.parse_args()

    # Read the raw per-run rows and close the file handle promptly.
    with args.summary.open(newline="") as f:
        rows = list(csv.DictReader(f))

    # Group the repetitions of each (decode batch size, prefill length) cell.
    by_dp: dict[tuple[int, int], list[dict]] = defaultdict(list)
    for r in rows:
        D = int(r["decode_batch_size"])
        P = int(r["new_prefill_tokens"])
        by_dp[(D, P)].append(r)

    summary = []
    for (D, P) in sorted(by_dp):
        rs = by_dp[(D, P)]
        base = statistics.mean(float(r["tpot_baseline_p50_ms"]) for r in rs)
        # Non-positive values are excluded from the raw percentile averages.
        during_p50_vals = [float(r["tpot_during_prefill_p50_ms"]) for r in rs
                           if float(r["tpot_during_prefill_p50_ms"]) > 0]
        during_p90_vals = [float(r["tpot_during_prefill_p90_ms"]) for r in rs
                           if float(r["tpot_during_prefill_p90_ms"]) > 0]
        ttft_vals = [float(r["prefill_ttft_ms"]) for r in rs]
        n_tok_vals = [float(r["num_tokens_during_prefill"]) for r in rs
                      if float(r["num_tokens_during_prefill"]) > 0]
        if not n_tok_vals or D == 0:
            continue

        # Ratio of per-cell means: mean TTFT over mean per-stream token count.
        ttft = statistics.mean(ttft_vals)
        n_tok_total = statistics.mean(n_tok_vals)
        per_stream_tokens = n_tok_total / D
        eff_tpot_during = ttft / per_stream_tokens if per_stream_tokens > 0 else 0
        penalty_x = eff_tpot_during / base if base > 0 else 0

        # PD-disagg potential benefit (per stream, ms):
        #   if decode ran at baseline rate throughout the prefill window,
        #   it would emit ttft/baseline tokens. Actual is per_stream_tokens.
        #   Time saved if no interference = ttft - per_stream_tokens * baseline
        time_saved_per_stream = ttft - per_stream_tokens * base

        summary.append({
            "decode_batch_size": D,
            "new_prefill_tokens": P,
            "baseline_tpot_ms": round(base, 2),
            "during_tpot_p50_ms_raw": (round(statistics.mean(during_p50_vals), 2)
                                       if during_p50_vals else None),
            "during_tpot_p90_ms_raw": (round(statistics.mean(during_p90_vals), 2)
                                       if during_p90_vals else None),
            "prefill_ttft_ms": round(ttft, 1),
            "num_tokens_during_prefill_total": round(n_tok_total, 1),
            "per_stream_tokens_during": round(per_stream_tokens, 2),
            "effective_tpot_during_ms": round(eff_tpot_during, 1),
            "interference_penalty_x": round(penalty_x, 1),
            "max_pd_disagg_benefit_ms_per_stream": round(time_saved_per_stream, 1),
        })

    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(json.dumps({"summary": summary}, indent=2))

    # Human-readable table on stdout, one line per (D, P) cell.
    print(f"{'D':>3} {'P':>7} {'base_ms':>9} {'eff_during_ms':>15} "
          f"{'penalty':>10} {'pd_benefit_ms':>15}")
    for s in summary:
        print(f"{s['decode_batch_size']:>3} {s['new_prefill_tokens']:>7} "
              f"{s['baseline_tpot_ms']:>9.2f} "
              f"{s['effective_tpot_during_ms']:>15.1f} "
              f"{s['interference_penalty_x']:>9.1f}x "
              f"{s['max_pd_disagg_benefit_ms_per_stream']:>15.0f}")
    print(f"\nwrote {args.out}")


if __name__ == "__main__":
    main()
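
The per-cell arithmetic in the script can be checked by hand on a tiny
input. The sketch below is a standalone illustration: the column names
match the ones the script reads, but the three rows are synthetic values
chosen for easy arithmetic and are not taken from the sweep.

```python
import csv
import io
import statistics

# Synthetic rows for illustration only; not measured data.
SYNTHETIC_CSV = """\
decode_batch_size,new_prefill_tokens,tpot_baseline_p50_ms,prefill_ttft_ms,num_tokens_during_prefill
8,2048,8.0,160.0,40
8,2048,8.0,160.0,40
8,2048,8.0,160.0,40
"""

rows = list(csv.DictReader(io.StringIO(SYNTHETIC_CSV)))
D = int(rows[0]["decode_batch_size"])
base = statistics.mean(float(r["tpot_baseline_p50_ms"]) for r in rows)
ttft = statistics.mean(float(r["prefill_ttft_ms"]) for r in rows)
n_tok = statistics.mean(float(r["num_tokens_during_prefill"]) for r in rows)

per_stream_tokens = n_tok / D                  # 40 / 8 = 5 tokens per stream
eff_tpot = ttft / per_stream_tokens            # 160 / 5 = 32 ms per token
penalty_x = eff_tpot / base                    # 32 / 8 = 4x
time_saved = ttft - per_stream_tokens * base   # 160 - 5 * 8 = 120 ms

print(eff_tpot, penalty_x, time_saved)
```

With these inputs each stream emits 5 tokens in 160 ms, which at the 8 ms
baseline would have taken 40 ms, so the recoverable time is 120 ms per
stream.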