agentic-kvc/microbench/fresh_setup/plot_mb1.py
Gahow Wang 029821c1b6 MB1: prefill-decode interference under chunked-prefill default; §3.2 headline
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps = 45 runs.

Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst
  = prefill_ttft / (num_tokens_during_prefill / D). This is the average
  interval between tokens that each decode stream sees during the burst.
  The p50 of inter-token intervals is deceptive (chunked-prefill makes
  most intervals look normal); the burst average gives the true cost.
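
The binning step and the headline metric above can be sketched as follows. This is a minimal illustration, not the code in microbench/interference/driver.py: the function name, its arguments, and the half-open window convention are assumptions made for the example.

```python
from __future__ import annotations


def effective_tpot_during_ms(
    stream_token_times_s: list[list[float]],
    prefill_start_s: float,
    prefill_end_s: float,
) -> float:
    """Burst-average per-stream TPOT (ms) while one prefill is in flight.

    stream_token_times_s holds one list of token arrival timestamps
    (seconds) per decode stream, so D = len(stream_token_times_s).
    Hypothetical helper: names and window convention are assumptions.
    """
    d = len(stream_token_times_s)
    # Count decode tokens that landed inside [prefill_start, prefill_end).
    n_during = sum(
        1
        for times in stream_token_times_s
        for t in times
        if prefill_start_s <= t < prefill_end_s
    )
    if d == 0 or n_during == 0:
        # No decode progress at all during the burst: fully stalled.
        return float("inf")
    prefill_ttft_ms = (prefill_end_s - prefill_start_s) * 1000.0
    # prefill_ttft / (num_tokens_during_prefill / D)
    return prefill_ttft_ms / (n_during / d)
```

Dividing the token count by D before taking the ratio is what makes this a per-stream figure: a burst in which all D streams together emit D tokens counts as one token per stream.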

Results (D=8 row, the most agentic-realistic case):
  P (tokens) | prefill_ttft | per-stream TPOT during prefill | penalty vs baseline
        2048 |       143 ms |                          32 ms |                  4×
        8192 |       583 ms |                         114 ms |                 15×
       32768 |      4520 ms |                         388 ms |                 52×
       65536 |     15615 ms |                         757 ms |                 99×
      131072 |     56991 ms |                        1419 ms |                183×

Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.

§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
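
One way to locate the break-even KV size is to interpolate the measured cost curve in log-log space and find where it crosses the benefit ceiling. The sketch below assumes a monotonically increasing list of (kv_mib, transfer_ms) samples with positive values; the function name and its inputs are hypothetical and it is not part of analyze_mb1.py.

```python
from __future__ import annotations

import math


def break_even_kv_mib(
    points: list[tuple[float, float]],
    benefit_ms: float,
) -> float | None:
    """KV size (MiB) at which transfer cost first reaches benefit_ms.

    points are (kv_mib, transfer_ms) samples, all values > 0.
    Returns None when the ceiling is not crossed inside the sampled range
    (cost is either always below or always above it).
    """
    pts = sorted(points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if y0 <= benefit_ms <= y1:
            if y1 == y0:
                return x0
            # Linear interpolation on log-log axes, matching the plot scale.
            frac = (math.log(benefit_ms) - math.log(y0)) / (
                math.log(y1) - math.log(y0)
            )
            return math.exp(math.log(x0) + frac * (math.log(x1) - math.log(x0)))
    return None
```

Calling it with the real MB2 summary rows and a ceiling taken from the 50–200 ms decode band would give the break-even size for that ceiling; the result depends entirely on the samples supplied.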

The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.

Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
  kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
  microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
  per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
  pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)

Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
  interleave decode more aggressively. Chunk-size sensitivity is
  flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:25:09 +08:00

130 lines
5.5 KiB
Python

#!/usr/bin/env python3
"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.

Two outputs:

  mb1_interference.png
      Effective TPOT during prefill vs prefill size, one line per D.
      Log-log. Annotates typical agentic decode durations (50, 100 and
      200 ms) as horizontal reference lines so the reader can spot when
      decode would be stalled.

  pd_cost_vs_benefit.png
      The §3.2 headline. X axis: KV size (MiB). Two overlaid curves:
        - benefit ceiling (MB1): at most one decode-duration per request
          of phase isolation can be recovered. Drawn as a flat 100 ms
          line inside a 50–200 ms band.
        - cost (MB2): Mooncake pure_transfer p50 at that size.
      Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
      structurally loses.
"""
from __future__ import annotations

import argparse
import json
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend; select before importing pyplot
import matplotlib.pyplot as plt


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--mb1", type=Path, required=True)
    p.add_argument("--mb2-intra", type=Path, required=True)
    p.add_argument("--mb2-inter", type=Path, default=None)
    p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
    p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
    args = p.parse_args()

    mb1 = json.loads(args.mb1.read_text())["summary"]

    # ---- mb1_interference.png ----
    fig, ax = plt.subplots(figsize=(9, 5.5))
    Ds = sorted({s["decode_batch_size"] for s in mb1})
    colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
    for D in Ds:
        # One line per decode batch size, ordered by prefill size.
        rows = [s for s in mb1 if s["decode_batch_size"] == D]
        rows.sort(key=lambda s: s["new_prefill_tokens"])
        xs = [s["new_prefill_tokens"] for s in rows]
        ys = [s["effective_tpot_during_ms"] for s in rows]
        ax.plot(xs, ys, "o-", lw=2, markersize=7,
                color=colors.get(D, "gray"),
                label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")
    # Reference lines for typical agentic decode durations.
    for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
                      (100, "agentic decode (~100 ms)"),
                      (200, "long agentic decode (~200 ms)")]:
        ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
        ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Prefill burst size (tokens, log)")
    ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
    ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
    ax.grid(True, which="both", alpha=0.3)
    ax.legend(loc="upper left", fontsize=9)
    args.out_interf.parent.mkdir(parents=True, exist_ok=True)
    fig.tight_layout()
    fig.savefig(args.out_interf, dpi=150)
    plt.close(fig)
    print(f"wrote {args.out_interf}")

    # ---- pd_cost_vs_benefit.png ----
    mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
    mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
    intra_x_mib = [s["kv_mib"] for s in mb2_intra]
    intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra]
    fig, ax = plt.subplots(figsize=(9, 5.5))
    ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
            markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
    if args.mb2_inter:
        mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
        mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
        inter_x = [s["kv_mib"] for s in mb2_inter]
        inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
        ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
                alpha=0.7, label="MB2 inter-node (same numbers)")
    # Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
    ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
               label="MB1 max benefit ≤ agentic decode (~100 ms)")
    ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
               label="benefit range (50–200 ms decode)")
    # Mark agentic-tail request sizes
    for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
                        (3072, "p90\n(~33k tok)"),
                        (6144, "p95\n(~65k tok)"),
                        (11500, "p99\n(11.5 GiB)")]:
        ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
        ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
                ha="center", va="bottom")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlim(40, 14000)
    ax.set_ylim(1, 12000)
    ax.set_xlabel("Per-request KV size (MiB, log)")
    ax.set_ylabel("Time per request (ms, log)")
    ax.set_title("§3.2 headline: PD-disagg KV transfer cost vs phase-isolation benefit\n"
                 "(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
    ax.grid(True, which="both", alpha=0.3)
    ax.legend(loc="upper left", fontsize=9)
    # Add explanatory annotation
    ax.text(10000, 5000,
            "Cost > benefit for ANY KV size above\n"
            "the green band (~80 MiB / ~830 tokens).\n"
            "Below: cost is marginal (<10 ms) but\n"
            "benefit is also small (decode is short).",
            fontsize=9, color="#333",
            ha="right", va="top",
            bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))
    # --out-cb may point at a different directory than --out-interf.
    args.out_cb.parent.mkdir(parents=True, exist_ok=True)
    fig.tight_layout()
    fig.savefig(args.out_cb, dpi=150)
    plt.close(fig)
    print(f"wrote {args.out_cb}")


if __name__ == "__main__":
    main()