The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot + pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)", implicitly treating the benefit as per-request and bounded by that request's own decode.

The correct accounting is per-prefill-event across all stalled streams:

    benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                        ≈ D × T_prefill

which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill). The approximation holds when TPOT_during ≫ TPOT_baseline, i.e. when ongoing decodes are essentially halted during the prefill.

Plug MB1 + MB2 numbers in:

    prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
    2k tok       | 0.14 s    | 8 ms       | 1.1 s       | 0.7 %
    33k tok      | 4.5 s     | 320 ms     | 36 s        | 0.9 %
    125k tok     | 57 s      | 1.9 s      | 456 s       | 0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250×, the opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic workloads is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency.
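The table above can be re-derived with a few lines of arithmetic. The following is a minimal sketch, not code from the repository: the names `PrefillPoint`, `benefit_per_prefill`, and `cost_benefit_rows` are hypothetical, and the only inputs are the T_prefill and T_transfer values quoted in the table with D=8.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PrefillPoint:
    """One table row; the class name is illustrative, values come from the table."""
    label: str
    t_prefill_s: float   # T_prefill
    t_transfer_s: float  # T_transfer


def benefit_per_prefill(d: int, t_prefill_s: float,
                        tpot_baseline_ms: float | None = None,
                        tpot_during_ms: float | None = None) -> float:
    """D × T_prefill × (1 − TPOT_baseline/TPOT_during).

    When the TPOT values are not supplied, the correction factor is taken
    as 1, which is the "≈ D × T_prefill" approximation.
    """
    factor = 1.0
    if tpot_baseline_ms is not None and tpot_during_ms is not None:
        factor = 1.0 - tpot_baseline_ms / tpot_during_ms
    return d * t_prefill_s * factor


# Values copied from the MB1 + MB2 table in this commit message.
POINTS = [
    PrefillPoint("2k tok", 0.14, 0.008),
    PrefillPoint("33k tok", 4.5, 0.320),
    PrefillPoint("125k tok", 57.0, 1.9),
]


def cost_benefit_rows(d: int = 8) -> list[tuple[str, float, float]]:
    """Return (label, benefit in seconds, transfer cost / benefit) per row."""
    rows = []
    for pt in POINTS:
        benefit = benefit_per_prefill(d, pt.t_prefill_s)
        rows.append((pt.label, benefit, pt.t_transfer_s / benefit))
    return rows


if __name__ == "__main__":
    for label, benefit, ratio in cost_benefit_rows():
        print(f"{label:>9} | benefit {benefit:7.2f} s | cost/benefit {ratio:.1%}")
```

Passing measured `tpot_baseline_ms` and `tpot_during_ms` values applies the exact correction factor instead of the approximation.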
Changes (all touched files explicitly listed; no `git add -u`):

- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim; they already (correctly) frame §3.2 around the D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
64 lines
2.5 KiB
Python
#!/usr/bin/env python3
"""Plot MB1 phase-interference data.

Single output: figs/mb1_interference.png, the effective per-stream TPOT
during a prefill burst, vs prefill size, one line per concurrent decode
batch size D.

Earlier versions of this script also produced figs/pd_cost_vs_benefit.png
which composed a "max PD-disagg benefit = decode duration (50–200 ms)
band" against the MB2 transfer-cost curve. That accounting was wrong
(see RESULTS_SUMMARY.md §4 correction): phase-isolation benefit is
per-prefill-event, equal to D × T_prefill across stalled streams, not
capped by a single request's decode duration. That figure has been
removed; the math it implied was structurally backwards. The dominant
reason static PD-disagg fails in agentic workloads is **D-side KV capacity**
(see figs/f4b_pdsep_kv_wall.png), not cost-vs-benefit on phase isolation.
"""
from __future__ import annotations

import argparse
import json
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; must be selected before importing pyplot
import matplotlib.pyplot as plt


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--mb1", type=Path, required=True)
    p.add_argument("--out", type=Path, default=Path("figs/mb1_interference.png"))
    args = p.parse_args()

    # The MB1 results file keeps its per-configuration records under "summary".
    mb1 = json.loads(args.mb1.read_text())["summary"]

    fig, ax = plt.subplots(figsize=(9, 5.5))
    Ds = sorted({s["decode_batch_size"] for s in mb1})
    colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
    # One line per decode batch size D, ordered by prefill burst size.
    for D in Ds:
        rows = [s for s in mb1 if s["decode_batch_size"] == D]
        rows.sort(key=lambda s: s["new_prefill_tokens"])
        xs = [s["new_prefill_tokens"] for s in rows]
        ys = [s["effective_tpot_during_ms"] for s in rows]
        ax.plot(xs, ys, "o-", lw=2, markersize=7,
                color=colors.get(D, "gray"),  # gray for any D without a fixed color
                label=f"D={D} (baseline TPOT {rows[0]['baseline_tpot_ms']:.1f} ms)")

    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Prefill burst size (tokens, log)")
    ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
    ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20). "
                 "Per-prefill aggregate stall = D × T_prefill.")
    ax.grid(True, which="both", alpha=0.3)
    ax.legend(loc="upper left", fontsize=9)
    # Create the output directory if needed, then write the figure.
    args.out.parent.mkdir(parents=True, exist_ok=True)
    fig.tight_layout()
    fig.savefig(args.out, dpi=150)
    plt.close(fig)
    print(f"wrote {args.out}")


if __name__ == "__main__":
    main()
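The script assumes every record under `"summary"` in the `--mb1` JSON carries four keys. A minimal sketch of a pre-flight check follows. It is not part of the repository: the helper name `missing_keys` is hypothetical, and the example record uses made-up placeholder values purely to show the expected shape.

```python
from __future__ import annotations

from typing import Any

# Keys that plot_mb1.py reads from the records under "summary".
REQUIRED_KEYS = (
    "decode_batch_size",
    "new_prefill_tokens",
    "effective_tpot_during_ms",
    "baseline_tpot_ms",
)


def missing_keys(summary: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Return (record index, key) pairs for every required key that is absent."""
    return [
        (i, key)
        for i, rec in enumerate(summary)
        for key in REQUIRED_KEYS
        if key not in rec
    ]


if __name__ == "__main__":
    # Placeholder record: values are illustrative, not measured results.
    example = {"summary": [{
        "decode_batch_size": 8,
        "new_prefill_tokens": 2048,
        "effective_tpot_during_ms": 100.0,
        "baseline_tpot_ms": 10.0,
    }]}
    print(missing_keys(example["summary"]))
```

Running such a check before plotting turns a `KeyError` deep inside the plotting loop into an explicit list of which records are incomplete.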