Correct PD-disagg cost/benefit framing across repo
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot + pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)", implicitly treating the benefit as per-request and bounded by that request's own decode.

The correct accounting is per-prefill-event across all stalled streams:

    benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                        ≈ D × T_prefill

which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

    prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
    2k tok       | 0.14 s    | 8 ms       | 1.1 s       | 0.7 %
    33k tok      | 4.5 s     | 320 ms     | 36 s        | 0.9 %
    125k tok     | 57 s      | 1.9 s      | 456 s       | 0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250×, the opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency.
Changes (all touched files explicitly listed; no `git add -u`):

- figs/pd_cost_vs_benefit.png: DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py: drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png: regenerated, no misleading band annotation
- analysis/mb1/README.md: Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong
- analysis/mb2/README.md: Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6: §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim; they already (correctly) frame §3.2 around the D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
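The arithmetic in the commit message can be re-derived in a few lines. The sketch below is not part of the commit. It uses the approximation benefit ≈ D × T_prefill with the T_prefill and T_transfer values from the table above, and adds a rough capacity check for the D-side pool. The names `CASES`, `cost_benefit_table`, and `requests_per_pool` are hypothetical.

```python
import math

# (label, T_prefill in s, T_transfer in s), copied from the commit-message table.
CASES = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.320),
    ("125k tok", 57.0, 1.9),
]


def cost_benefit_table(d: int = 8) -> list[tuple[str, float, float]]:
    """Return (label, benefit_s, cost/benefit) using benefit ≈ D × T_prefill."""
    rows = []
    for label, t_prefill_s, t_transfer_s in CASES:
        benefit_s = d * t_prefill_s          # aggregate stall across D decode streams
        rows.append((label, benefit_s, t_transfer_s / benefit_s))
    return rows


def requests_per_pool(pool_gib: float, kv_per_request_gib: float) -> int:
    """How many requests of a given KV size fit in one decode instance's pool."""
    return math.floor(pool_gib / kv_per_request_gib)


if __name__ == "__main__":
    for label, benefit_s, ratio in cost_benefit_table(d=8):
        print(f"{label:>8}: benefit {benefit_s:6.1f} s, cost/benefit {ratio:.1%}")
    # Pool and p99 KV sizes are the figures quoted in the commit message.
    print("p99 requests per 38 GiB pool:", requests_per_pool(38.0, 11.5))
```

The ratios stay below 1 % at every prefill size, which is why the transfer cost cannot be the reason static PD-disagg fails. The capacity check points at the real constraint: only a handful of p99-sized requests fit in one D-side pool.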
@@ -1,20 +1,19 @@
 #!/usr/bin/env python3
-"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.
+"""Plot MB1 phase-interference data.
 
-Two outputs:
+Single output: figs/mb1_interference.png — effective per-stream TPOT
+during a prefill burst, vs prefill size, one line per concurrent decode
+batch size D.
 
-mb1_interference.png
-Effective TPOT during prefill vs prefill size, one line per D.
-Log-log. Annotates typical agentic decode duration (~100 ms) as a
-horizontal band so reader can spot when decode would be stalled.
-
-pd_cost_vs_benefit.png
-The §3.2 headline. X axis: KV size (MiB). Two stacked curves:
-- benefit ceiling (MB1) — at most one decode-duration per request
-of phase isolation can be recovered. Drawn as a flat 100 ms line.
-- cost (MB2) — Mooncake pure_transfer p50 at that size.
-Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
-structurally loses.
+Earlier versions of this script also produced figs/pd_cost_vs_benefit.png
+which composed a "max PD-disagg benefit = decode duration (50–200 ms)
+band" against the MB2 transfer-cost curve. That accounting was wrong
+(see RESULTS_SUMMARY.md §4 correction): phase-isolation benefit is
+per-prefill-event, equal to D × T_prefill across stalled streams, not
+capped by a single request's decode duration. That figure has been
+removed; the math it implied was structurally backwards. The dominant
+reason static PD-disagg fails in agentic is **D-side KV capacity**
+(see figs/f4b_pdsep_kv_wall.png), not cost-vs-benefit on phase isolation.
 """
 from __future__ import annotations
 
@@ -25,21 +24,16 @@ from pathlib import Path
 import matplotlib
 matplotlib.use("Agg")
 import matplotlib.pyplot as plt
-import numpy as np
 
 
 def main() -> None:
     p = argparse.ArgumentParser()
     p.add_argument("--mb1", type=Path, required=True)
-    p.add_argument("--mb2-intra", type=Path, required=True)
-    p.add_argument("--mb2-inter", type=Path, default=None)
-    p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
-    p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
+    p.add_argument("--out", type=Path, default=Path("figs/mb1_interference.png"))
     args = p.parse_args()
 
     mb1 = json.loads(args.mb1.read_text())["summary"]
 
-    # ---- mb1_interference.png ----
     fig, ax = plt.subplots(figsize=(9, 5.5))
     Ds = sorted({s["decode_batch_size"] for s in mb1})
     colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
@@ -50,79 +44,19 @@ def main() -> None:
         ys = [s["effective_tpot_during_ms"] for s in rows]
         ax.plot(xs, ys, "o-", lw=2, markersize=7,
                 color=colors.get(D, "gray"),
-                label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")
-
-    for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
-                      (100, "agentic decode (~100 ms)"),
-                      (200, "long agentic decode (~200 ms)")]:
-        ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
-        ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")
+                label=f"D={D} (baseline TPOT {rows[0]['baseline_tpot_ms']:.1f} ms)")
 
     ax.set_xscale("log"); ax.set_yscale("log")
     ax.set_xlabel("Prefill burst size (tokens, log)")
     ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
     ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
-                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
+                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20). "
+                 "Per-prefill aggregate stall = D × T_prefill.")
     ax.grid(True, which="both", alpha=0.3)
     ax.legend(loc="upper left", fontsize=9)
-    args.out_interf.parent.mkdir(parents=True, exist_ok=True)
-    fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig)
-    print(f"wrote {args.out_interf}")
-
-    # ---- pd_cost_vs_benefit.png ----
-    mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
-    mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
-    intra_x_mib = [s["kv_mib"] for s in mb2_intra]
-    intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra]
-
-    fig, ax = plt.subplots(figsize=(9, 5.5))
-    ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
-            markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
-    if args.mb2_inter:
-        mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
-        mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
-        inter_x = [s["kv_mib"] for s in mb2_inter]
-        inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
-        ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
-                alpha=0.7, label="MB2 inter-node (same numbers)")
-
-    # Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
-    ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
-               label="MB1 max benefit ≤ agentic decode (~100 ms)")
-    ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
-               label="benefit range (50–200 ms decode)")
-
-    # Mark agentic-tail request sizes
-    for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
-                        (3072, "p90\n(~33k tok)"),
-                        (6144, "p95\n(~65k tok)"),
-                        (11500, "p99\n(11.5 GiB)")]:
-        ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
-        ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
-                ha="center", va="bottom")
-
-    ax.set_xscale("log"); ax.set_yscale("log")
-    ax.set_xlim(40, 14000)
-    ax.set_ylim(1, 12000)
-    ax.set_xlabel("Per-request KV size (MiB, log)")
-    ax.set_ylabel("Time per request (ms, log)")
-    ax.set_title("§3.2 headline — PD-disagg KV transfer cost vs phase-isolation benefit\n"
-                 "(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
-    ax.grid(True, which="both", alpha=0.3)
-    ax.legend(loc="upper left", fontsize=9)
-
-    # Add explanatory annotation
-    ax.text(10000, 5000,
-            "Cost > benefit for ANY KV size above\n"
-            "the green band (~80 MiB / ~830 tokens).\n"
-            "Below: cost is marginal (<10 ms) but\n"
-            "benefit is also small (decode is short).",
-            fontsize=9, color="#333",
-            ha="right", va="top",
-            bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))
-
-    fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig)
-    print(f"wrote {args.out_cb}")
+    args.out.parent.mkdir(parents=True, exist_ok=True)
+    fig.tight_layout(); fig.savefig(args.out, dpi=150); plt.close(fig)
+    print(f"wrote {args.out}")
 
 
 if __name__ == "__main__":
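The new plot title states that the per-prefill aggregate stall is D × T_prefill. That is the limit of the commit message's exact expression when TPOT during the burst is much larger than the baseline TPOT. A minimal sketch of checking how close a given MB1 summary row is to that limit follows. It reuses the field names visible in the diff (`decode_batch_size`, `baseline_tpot_ms`, `effective_tpot_during_ms`). The helper names are hypothetical, and T_prefill is passed in separately because the diff does not show which field holds it.

```python
def stall_fraction(row: dict) -> float:
    """Fraction of decode progress lost during the burst: 1 - TPOT_baseline / TPOT_during."""
    return 1.0 - row["baseline_tpot_ms"] / row["effective_tpot_during_ms"]


def aggregate_stall_s(row: dict, t_prefill_s: float) -> float:
    """Exact form of the benefit: D × T_prefill × (1 - TPOT_baseline / TPOT_during)."""
    return row["decode_batch_size"] * t_prefill_s * stall_fraction(row)


if __name__ == "__main__":
    # Illustrative row only; these values are made up, not MB1 measurements.
    row = {
        "decode_batch_size": 8,
        "baseline_tpot_ms": 10.0,
        "effective_tpot_during_ms": 1000.0,
    }
    exact = aggregate_stall_s(row, t_prefill_s=4.5)
    approx = row["decode_batch_size"] * 4.5
    print(f"exact {exact:.2f} s vs approximation {approx:.2f} s")
```

When the measured TPOT during the burst is only a few times the baseline, the exact form gives a noticeably smaller benefit than D × T_prefill, so the approximation is best read as an upper bound.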