agentic-kvc/microbench/fresh_setup/plot_mb1.py
Gahow Wang da39ab6804 Correct PD-disagg cost/benefit framing across repo
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).
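
  The bound can be reconstructed step by step. This is a sketch, assuming
  every chunk step co-schedules exactly one decode token for each of the D
  streams. The symbols t_0 (baseline TPOT, about 10 ms) and t (step time
  while a prefill chunk is in flight) are shorthand introduced here for
  TPOT_baseline and TPOT_during; L is the prefill length in tokens and N
  the chunk size.

```latex
\begin{align*}
T_{\text{prefill}} &= \frac{L}{N}\, t \\
\text{stall per stream} &= \frac{L}{N}\,(t - t_0)
  = T_{\text{prefill}}\left(1 - \frac{t_0}{t}\right) \\
\text{benefit\_per\_prefill} &= D \cdot T_{\text{prefill}}\left(1 - \frac{t_0}{t}\right)
  \approx D \cdot T_{\text{prefill}} \qquad (t \gg t_0)
\end{align*}
```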

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.
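
The cost/benefit column can be re-derived from the quoted figures. The
snippet below is a minimal sketch, assuming the upper bound
D × T_prefill (that is, the (1 − TPOT_baseline/TPOT_during) factor is
taken as 1). The function and variable names are illustrative and not
part of the repo.

```python
# (label, T_prefill in seconds, T_transfer in seconds), copied from the table above.
ROWS = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.320),
    ("125k tok", 57.0, 1.9),
]
D = 8  # concurrent decode streams stalled by the prefill


def benefit_upper_bound(d: int, t_prefill_s: float) -> float:
    """Aggregate stall avoided per prefill event, assuming decodes fully halt."""
    return d * t_prefill_s


def cost_over_benefit(d: int, t_prefill_s: float, t_transfer_s: float) -> float:
    """KV transfer cost as a fraction of the aggregate stall avoided."""
    return t_transfer_s / benefit_upper_bound(d, t_prefill_s)


RATIOS = {label: cost_over_benefit(D, tp, tt) for label, tp, tt in ROWS}

for label, tp, tt in ROWS:
    ratio = RATIOS[label]
    print(f"{label:>9} | benefit {benefit_upper_bound(D, tp):7.2f} s "
          f"| cost/benefit {100 * ratio:.1f} % | benefit/cost {1 / ratio:.0f}x")
```

The inverse ratios land between roughly 110× and 240×, consistent with
the 100×–250× range quoted above.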

The actual dominant reason static PD-disagg fails in agentic workloads is
the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99
single-request KV is 11.5 GiB and the per-D-instance pool is 38 GiB, so
4P+4D halves system decode capacity. A colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate dropping from 99.5% to 52%, driven by
pool overflow and queueing, not by transfer latency.
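
The capacity argument is simple arithmetic. The sketch below assumes
8 instances in total and a colocated baseline in which all 8 serve
decode with the same 38 GiB pool; neither assumption is stated in this
commit, and all names are hypothetical.

```python
import math

P99_REQUEST_KV_GIB = 11.5   # p99 single-request KV footprint (from the text)
POOL_PER_INSTANCE_GIB = 38  # per-D-instance KV pool (from the text)
TOTAL_INSTANCES = 8         # ASSUMPTION: 4P+4D is carved out of 8 instances


def p99_slots(pool_gib: float, request_gib: float) -> int:
    """How many p99-sized requests fit in one instance's KV pool."""
    return math.floor(pool_gib / request_gib)


slots_per_instance = p99_slots(POOL_PER_INSTANCE_GIB, P99_REQUEST_KV_GIB)

# ASSUMPTION: colocated serving lets every instance hold decode KV.
colocated_slots = TOTAL_INSTANCES * slots_per_instance
# Static 4P+4D: only the 4 decode instances hold decode KV.
pd_slots = 4 * slots_per_instance

print(f"p99 requests per instance: {slots_per_instance}")
print(f"colocated: {colocated_slots} slots, 4P+4D: {pd_slots} slots")
```

Under these assumptions only three p99-sized requests fit per decode
instance, so dedicating half the instances to prefill cuts the number
of resident long-context decodes in half.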

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:04:49 +08:00


#!/usr/bin/env python3
"""Plot MB1 phase-interference data.

Single output: figs/mb1_interference.png, the effective per-stream TPOT
during a prefill burst vs. prefill size, one line per concurrent decode
batch size D.

Earlier versions of this script also produced figs/pd_cost_vs_benefit.png,
which composed a "max PD-disagg benefit = decode duration (50–200 ms)
band" against the MB2 transfer-cost curve. That accounting was wrong
(see RESULTS_SUMMARY.md §4 correction): phase-isolation benefit is
per-prefill-event, equal to D × T_prefill across stalled streams, not
capped by a single request's decode duration. That figure has been
removed; the math it implied was structurally backwards. The dominant
reason static PD-disagg fails in agentic workloads is **D-side KV capacity**
(see figs/f4b_pdsep_kv_wall.png), not cost-vs-benefit on phase isolation.
"""
from __future__ import annotations

import argparse
import json
from pathlib import Path

import matplotlib

# Select the non-interactive Agg backend so the script runs without a display.
matplotlib.use("Agg")
import matplotlib.pyplot as plt


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--mb1", type=Path, required=True)
    p.add_argument("--out", type=Path, default=Path("figs/mb1_interference.png"))
    args = p.parse_args()

    # The MB1 results file keeps its per-configuration rows under "summary".
    mb1 = json.loads(args.mb1.read_text())["summary"]

    fig, ax = plt.subplots(figsize=(9, 5.5))
    Ds = sorted({s["decode_batch_size"] for s in mb1})
    colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}

    # One line per decode batch size D, with points ordered by prefill size.
    for D in Ds:
        rows = [s for s in mb1 if s["decode_batch_size"] == D]
        rows.sort(key=lambda s: s["new_prefill_tokens"])
        xs = [s["new_prefill_tokens"] for s in rows]
        ys = [s["effective_tpot_during_ms"] for s in rows]
        ax.plot(xs, ys, "o-", lw=2, markersize=7,
                color=colors.get(D, "gray"),
                label=f"D={D} (baseline TPOT {rows[0]['baseline_tpot_ms']:.1f} ms)")

    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Prefill burst size (tokens, log)")
    ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
    ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20). "
                 "Per-prefill aggregate stall = D × T_prefill.")
    ax.grid(True, which="both", alpha=0.3)
    ax.legend(loc="upper left", fontsize=9)

    # Create the output directory if needed, then write the figure.
    args.out.parent.mkdir(parents=True, exist_ok=True)
    fig.tight_layout()
    fig.savefig(args.out, dpi=150)
    plt.close(fig)
    print(f"wrote {args.out}")


if __name__ == "__main__":
    main()
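
The input layout the script expects can be inferred from the keys it
reads. The sketch below writes a small file in that shape; the numbers
are synthetic placeholders rather than MB1 measurements, and the file
name mb1_synthetic.json and the helper make_row are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Keys that plot_mb1.py reads from each row under "summary".
REQUIRED_KEYS = {
    "decode_batch_size",
    "new_prefill_tokens",
    "effective_tpot_during_ms",
    "baseline_tpot_ms",
}


def make_row(d: int, tokens: int, baseline_ms: float, during_ms: float) -> dict:
    """Build one summary row in the shape the plotting script consumes."""
    return {
        "decode_batch_size": d,
        "new_prefill_tokens": tokens,
        "baseline_tpot_ms": baseline_ms,
        "effective_tpot_during_ms": during_ms,
    }


# Synthetic placeholder values, NOT measurements.
summary = [
    make_row(1, 2048, 10.0, 100.0),
    make_row(1, 32768, 10.0, 1000.0),
    make_row(8, 2048, 12.0, 120.0),
    make_row(8, 32768, 12.0, 1200.0),
]

path = Path(tempfile.mkdtemp()) / "mb1_synthetic.json"
path.write_text(json.dumps({"summary": summary}, indent=2))

# Read it back the same way the script does and check the schema.
loaded = json.loads(path.read_text())["summary"]
missing = [REQUIRED_KEYS - set(row) for row in loaded]
batch_sizes = sorted({row["decode_batch_size"] for row in loaded})
print(f"wrote {path} with D values {batch_sizes}")
```

A file like this can be passed to the script via `--mb1`; `--out`
defaults to figs/mb1_interference.png.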