Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.
Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
= prefill_ttft / (num_tokens_during_prefill / D). This is the average
rate at which each decode stream produces tokens during the burst.
p50 of inter-token intervals is deceptive (chunked-prefill makes most
intervals look normal); the burst-average gives the true cost.
Results (D=8 row, the most agentic-realistic case):
P (tokens) | prefill_ttft | per-stream TPOT during | penalty
2048 | 143 ms | 32 ms | 4×
8192 | 583 ms | 114 ms | 15×
32768 | 4520 ms | 388 ms | 52×
65536 | 15615 ms | 757 ms | 99×
131072 | 56991 ms | 1419 ms | 183×
Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.
Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)
Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
interleave decode more aggressively. Chunk-size sensitivity is
flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
130 lines
5.5 KiB
Python
130 lines
5.5 KiB
Python
#!/usr/bin/env python3
|
||
"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.
|
||
|
||
Two outputs:
|
||
|
||
mb1_interference.png
|
||
Effective TPOT during prefill vs prefill size, one line per D.
|
||
Log-log. Annotates typical agentic decode duration (~100 ms) as a
|
||
horizontal band so reader can spot when decode would be stalled.
|
||
|
||
pd_cost_vs_benefit.png
|
||
The §3.2 headline. X axis: KV size (MiB). Two stacked curves:
|
||
- benefit ceiling (MB1) — at most one decode-duration per request
|
||
of phase isolation can be recovered. Drawn as a flat 100 ms line.
|
||
- cost (MB2) — Mooncake pure_transfer p50 at that size.
|
||
Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
|
||
structurally loses.
|
||
"""
|
||
from __future__ import annotations
|
||
|
||
import argparse
|
||
import json
|
||
from pathlib import Path
|
||
|
||
import matplotlib
|
||
matplotlib.use("Agg")
|
||
import matplotlib.pyplot as plt
|
||
import numpy as np
|
||
|
||
|
||
def main() -> None:
|
||
p = argparse.ArgumentParser()
|
||
p.add_argument("--mb1", type=Path, required=True)
|
||
p.add_argument("--mb2-intra", type=Path, required=True)
|
||
p.add_argument("--mb2-inter", type=Path, default=None)
|
||
p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
|
||
p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
|
||
args = p.parse_args()
|
||
|
||
mb1 = json.loads(args.mb1.read_text())["summary"]
|
||
|
||
# ---- mb1_interference.png ----
|
||
fig, ax = plt.subplots(figsize=(9, 5.5))
|
||
Ds = sorted({s["decode_batch_size"] for s in mb1})
|
||
colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
|
||
for D in Ds:
|
||
rows = [s for s in mb1 if s["decode_batch_size"] == D]
|
||
rows.sort(key=lambda s: s["new_prefill_tokens"])
|
||
xs = [s["new_prefill_tokens"] for s in rows]
|
||
ys = [s["effective_tpot_during_ms"] for s in rows]
|
||
ax.plot(xs, ys, "o-", lw=2, markersize=7,
|
||
color=colors.get(D, "gray"),
|
||
label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")
|
||
|
||
for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
|
||
(100, "agentic decode (~100 ms)"),
|
||
(200, "long agentic decode (~200 ms)")]:
|
||
ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
|
||
ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")
|
||
|
||
ax.set_xscale("log"); ax.set_yscale("log")
|
||
ax.set_xlabel("Prefill burst size (tokens, log)")
|
||
ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
|
||
ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
|
||
"(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
|
||
ax.grid(True, which="both", alpha=0.3)
|
||
ax.legend(loc="upper left", fontsize=9)
|
||
args.out_interf.parent.mkdir(parents=True, exist_ok=True)
|
||
fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig)
|
||
print(f"wrote {args.out_interf}")
|
||
|
||
# ---- pd_cost_vs_benefit.png ----
|
||
mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
|
||
mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
|
||
intra_x_mib = [s["kv_mib"] for s in mb2_intra]
|
||
intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra]
|
||
|
||
fig, ax = plt.subplots(figsize=(9, 5.5))
|
||
ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
|
||
markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
|
||
if args.mb2_inter:
|
||
mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
|
||
mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
|
||
inter_x = [s["kv_mib"] for s in mb2_inter]
|
||
inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
|
||
ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
|
||
alpha=0.7, label="MB2 inter-node (same numbers)")
|
||
|
||
# Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
|
||
ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
|
||
label="MB1 max benefit ≤ agentic decode (~100 ms)")
|
||
ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
|
||
label="benefit range (50–200 ms decode)")
|
||
|
||
# Mark agentic-tail request sizes
|
||
for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
|
||
(3072, "p90\n(~33k tok)"),
|
||
(6144, "p95\n(~65k tok)"),
|
||
(11500, "p99\n(11.5 GiB)")]:
|
||
ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
|
||
ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
|
||
ha="center", va="bottom")
|
||
|
||
ax.set_xscale("log"); ax.set_yscale("log")
|
||
ax.set_xlim(40, 14000)
|
||
ax.set_ylim(1, 12000)
|
||
ax.set_xlabel("Per-request KV size (MiB, log)")
|
||
ax.set_ylabel("Time per request (ms, log)")
|
||
ax.set_title("§3.2 headline — PD-disagg KV transfer cost vs phase-isolation benefit\n"
|
||
"(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
|
||
ax.grid(True, which="both", alpha=0.3)
|
||
ax.legend(loc="upper left", fontsize=9)
|
||
|
||
# Add explanatory annotation
|
||
ax.text(10000, 5000,
|
||
"Cost > benefit for ANY KV size above\n"
|
||
"the green band (~80 MiB / ~830 tokens).\n"
|
||
"Below: cost is marginal (<10 ms) but\n"
|
||
"benefit is also small (decode is short).",
|
||
fontsize=9, color="#333",
|
||
ha="right", va="top",
|
||
bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))
|
||
|
||
fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig)
|
||
print(f"wrote {args.out_cb}")
|
||
|
||
|
||
if __name__ == "__main__":
|
||
main()
|