diff --git a/analysis/v2/PD_DISAGG_LMETRIC.md b/analysis/v2/PD_DISAGG_LMETRIC.md
index 125a637..12546fd 100644
--- a/analysis/v2/PD_DISAGG_LMETRIC.md
+++ b/analysis/v2/PD_DISAGG_LMETRIC.md
@@ -17,7 +17,9 @@ same wall-clock**). Every static **PD-disagg ratio fails** (14–65 % completion
 failure mode rotates predictably with the split — **no static partition has a working
 operating point on this workload**. LMetric improves colo dramatically; it does *not*
 rescue PD-disagg, confirming the bottleneck is structural (D-pool admission capacity +
-multi-turn KV accumulation), not routing.
+multi-turn KV accumulation), not routing. A follow-up linear-policy run with PD-disagg
+**wall-capped at 2× the colo wall** (see end of doc) hits the **same** success-rate
+ceiling to within 1 pp, confirming the cap is structural, not policy-driven.
 
 ## Setup
 
@@ -87,6 +89,52 @@ draining concurrently behind the multi-turn session causality.
 pool capacity, and the decode-pool admission ceiling tips earlier. **PD-disagg is
 worse on agentic than §3 advertised, not better.**
 
+## Linear-policy + wall-cap follow-up (2026-06-01): the success ceiling is policy-invariant
+
+To check whether the LMetric routing was secretly handicapping PD-disagg, we re-ran
+first600s with the **default `--policy linear`** (cache-aware load score + sticky
+session affinity, the original baseline of the cache_aware_proxy stack) and
+**wall-capped each PD-disagg arm at 2 × the colo wall** (kill bench.sh + clean up
+GPUs once the cap is exceeded, record `records_at_cap`).
+
+| arm | linear success | linear wall | linear capped? | lmetric success | lmetric wall |
+|---|---|---|---|---|---|
+| **colo** | 807/807 = **100 %** | 964 s | — | 807/807 = **100 %** | 1021 s |
+| **pd6 (6:2)** | **472/807 = 58 %** | 2280 s | ⊗ cap (706 dispatched) | 474/807 = 59 % | 3325 s |
+| **pd4 (4:4)** | **349/807 = 43 %** | 2281 s | ⊗ cap (577 dispatched) | 348/807 = 43 % | 6850 s |
+| **pd2 (2:6)** | **176/807 = 22 %** | 2280 s | ⊗ cap (521 dispatched) | 180/807 = 22 % | 19275 s |
+
+→ Figure: [`figs/v2/fig4_linear_vs_lmetric.png`](../../figs/v2/fig4_linear_vs_lmetric.png) ·
+data: [`fig4r_linear.json`](fig4r_linear.json)
+
+**Three clean conclusions from the wall-cap experiment:**
+
+1. **The success-rate ceiling is structural, not a routing artifact.** Linear and
+   LMetric, two very different scoring policies (one session-sticky and cache-aware,
+   the other non-sticky pure load), converge on **near-identical success rates**
+   (58–59 / 43 / 22 %) for every PD-disagg ratio. Routing has *no measurable* effect
+   on the completion ceiling. The bottleneck is the static P:D split itself.
+
+2. **LMetric's longer wall was time *wasted on requests that will never succeed*.**
+   When the cap is enforced, linear hits the same ceiling in 2280 s that LMetric
+   reached in 3300–19000 s; the extra wall just slowly times out the unreachable
+   requests at 600 s each.
+
+3. **The wall-cap is the right way to bench PD-disagg.** Reporting "completion %"
+   without a wall budget is misleading (the bench eventually finishes if you wait
+   long enough, but only by counting timeouts as failures over hours). The honest
+   metric is **success-in-2×-colo-wall**, which gives the same answer for both
+   routings and matches what an end user would observe under a real SLO.
+
+This **strengthens** the §5 D-pool capacity-ceiling thesis: even with
+session-affinity routing serving every request to a warm prefix cache (which
+*should* maximise PD's throughput), the static D-pool can't admit more than
+~22 / 43 / 58 % of the agentic trace within 2× the colo budget. Colo wins not
+because routing is smarter, but because its **elastic pool** absorbs whichever
+phase is hot, so there's no cap to hit.
+
+---
+
 ## Reproduce
 
 ```bash
@@ -101,8 +149,17 @@ for TR in w600_r0.0015_st30.jsonl w600_r0.0015_st30_first600s.jsonl; do
 done
 
 python microbench/fresh_setup/plot_fig4l_lmetric.py
+python microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
 ```
 
+For the linear + 2× wall-cap variant, run colo first to get `wall_clock_s`,
+compute `CAP=2*wall`, then launch each PD-disagg arm in the background and
+`SIGTERM` it (so bench.sh's cleanup trap fires) once `date +%s` minus the
+arm's start time exceeds `CAP`. The capped runs lack `metrics.summary.json`
+(the replayer was killed before it could write it); compute the summary directly
+from `metrics.jsonl` (see the inline collector used to build
+`analysis/v2/fig4r_linear.json`).
+
 Source `bench.sh` cleans GPUs before each arm and writes `metrics.jsonl` +
 `metrics.summary.json` per tag. Aggregation script: see the inline JSON dump used to
 build `analysis/v2/fig4l_lmetric.json`.
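Reviewer note: the wall-cap procedure in the Reproduce text (launch an arm, then `SIGTERM` it once elapsed wall-clock exceeds `CAP`) could be sketched roughly as below. This is a minimal illustration, not the harness used for these runs: `run_with_wall_cap`, the polling interval, and the grace period are hypothetical names and values, and the real `bench.sh` invocation is not shown in this diff, so a sleeping Python child stands in for it.

```python
import signal
import subprocess
import sys
import time


def run_with_wall_cap(cmd, cap_s, poll_s=0.1, grace_s=5.0):
    """Run `cmd`; send SIGTERM once wall-clock exceeds `cap_s`.

    Returns (wall_seconds, capped). All names here are illustrative.
    """
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    capped = False
    while proc.poll() is None:
        if time.monotonic() - start > cap_s:
            capped = True
            # SIGTERM (rather than SIGKILL) gives a shell `trap` the chance
            # to run its cleanup, as the Reproduce text requires.
            proc.send_signal(signal.SIGTERM)
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                # Assumed fallback: force-kill if cleanup hangs.
                proc.kill()
                proc.wait()
            break
        time.sleep(poll_s)
    return time.monotonic() - start, capped


if __name__ == "__main__":
    # Stand-in for a PD-disagg arm that would otherwise run far past the cap.
    stand_in = [sys.executable, "-c", "import time; time.sleep(30)"]
    wall, capped = run_with_wall_cap(stand_in, cap_s=0.5)
    print(f"wall={wall:.1f}s capped={capped}")
```

In a real run, `cap_s` would be twice the colo arm's measured wall, and the caller would record `capped` alongside the per-arm summary, as the `"capped"` field in `fig4r_linear.json` does.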
diff --git a/analysis/v2/fig4r_linear.json b/analysis/v2/fig4r_linear.json
new file mode 100644
index 0000000..cfdca10
--- /dev/null
+++ b/analysis/v2/fig4r_linear.json
@@ -0,0 +1 @@
+[{"tag": "fig4r_linear_colo_first600s", "arm": "colo", "trace": "first600s", "policy": "linear", "n": 807, "req": 807, "dispatched": 807, "e2e": {"count": 807.0, "mean": 8.436370009274967, "p50": 2.5224755640374497, "p90": 22.65510415879542, "p99": 75.54369598095519}, "ttft": {"count": 807.0, "mean": 4.2332503390957195, "p50": 0.8872958200518042, "p90": 11.684667797433207, "p99": 44.98891795879462}, "tpot": {"count": 807.0, "mean": 0.020958194728517718, "p50": 0.00851320761584622, "p90": 0.026440129078245465, "p99": 0.30344440533287176}, "wall": 963.6191155100241, "tps": 239.4857016486815, "capped": false}, {"tag": "fig4r_linear_pd2_first600s", "arm": "2P+6D", "trace": "first600s", "policy": "linear", "n": 176, "req": 807, "dispatched": 521, "e2e": {"count": 176, "mean": 378.5561210460834, "p50": 536.7719694490079, "p90": 583.832092280034, "p99": 601.3415494390065}, "ttft": {"count": 176, "mean": 377.12570991374446, "p50": 536.1157373189926, "p90": 580.3465002350276, "p99": 598.0943597999867}, "tpot": {"count": 176, "mean": 0.007864906140929698, "p50": 0.007212154543958604, "p90": 0.011962352272927423, "p99": 0.017870794738764347}, "wall": 2280, "tps": 14.419736842105262, "capped": true}, {"tag": "fig4r_linear_pd4_first600s", "arm": "4P+4D", "trace": "first600s", "policy": "linear", "n": 349, "req": 807, "dispatched": 577, "e2e": {"count": 349, "mean": 264.8537863784421, "p50": 306.6853819829412, "p90": 488.64622142596636, "p99": 596.5830293919425}, "ttft": {"count": 349, "mean": 262.3163347712099, "p50": 299.75751709297765, "p90": 485.475125996978, "p99": 596.4081599479541}, "tpot": {"count": 349, "mean": 0.010442244895290958, "p50": 0.008213572105774598, "p90": 0.019443845545703716, "p99": 0.028178529054794}, "wall": 2281, "tps": 38.306882946076286, "capped": true}, {"tag": "fig4r_linear_pd6_first600s", "arm": "6P+2D", "trace": "first600s", "policy": "linear", "n": 472, "req": 807, "dispatched": 706, "e2e": {"count": 472, "mean": 118.632779156234, "p50": 12.702161715948023, "p90": 458.1609142010566, "p99": 526.5488834320568}, "ttft": {"count": 472, "mean": 115.80202843308507, "p50": 9.745031949947588, "p90": 455.81679951993283, "p99": 516.5850186559837}, "tpot": {"count": 472, "mean": 0.00950947083585719, "p50": 0.008435572332624966, "p90": 0.015233499645638644, "p99": 0.023447183093280886}, "wall": 2280, "tps": 61.69210526315789, "capped": true}]
diff --git a/figs/v2/fig4_linear_vs_lmetric.png b/figs/v2/fig4_linear_vs_lmetric.png
new file mode 100644
index 0000000..24f67b2
Binary files /dev/null and b/figs/v2/fig4_linear_vs_lmetric.png differ
diff --git a/microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py b/microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
new file mode 100644
index 0000000..9eaae0e
--- /dev/null
+++ b/microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
@@ -0,0 +1,104 @@
+"""Linear vs LMetric routing on the real agentic trace (first600s).
+
+Visualizes the wall-cap finding: with the 2x-colo-wall cap on PD-disagg arms,
+linear and LMetric reach the *same* success-rate ceiling -- the static P:D
+split has a structural completion ceiling that does not depend on the routing
+policy or on how long you keep retrying. Routing affects only how much wall
+time is wasted on requests that will never succeed.
+
+Inputs : analysis/v2/fig4l_lmetric.json   (8 arms, both traces; we use first600s)
+         analysis/v2/fig4r_linear.json    (4 arms, first600s, PD wall-capped)
+Output : figs/v2/fig4_linear_vs_lmetric.png
+"""
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")  # headless backend: the script only writes a PNG
+import matplotlib.pyplot as plt
+import numpy as np
+
+ROOT = Path(__file__).resolve().parents[2]
+DATA = ROOT / "analysis" / "v2"
+OUT = ROOT / "figs" / "v2" / "fig4_linear_vs_lmetric.png"
+
+ARMS = ["colo", "6P+2D", "4P+4D", "2P+6D"]
+POLICY_COLOR = {"linear": "#9467bd", "lmetric": "#2ca02c"}
+POLICY_LABEL = {"linear": "linear (cache-aware + session-affinity)",
+                "lmetric": "LMetric (P_tokens × BS)"}
+
+
+def pick(rows, arm, trace="first600s"):
+    for r in rows:  # first row matching both arm and trace
+        if r["arm"] == arm and r["trace"] == trace:
+            return r
+    return None
+
+
+def main():
+    lin = json.loads((DATA / "fig4r_linear.json").read_text())
+    lme = json.loads((DATA / "fig4l_lmetric.json").read_text())
+
+    # colo wall (linear) sets the 2x cap reference
+    colo_lin_wall = pick(lin, "colo")["wall"]
+    cap = 2 * colo_lin_wall
+
+    fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))
+    x = np.arange(len(ARMS))
+    w = 0.38  # bar width; the two policies sit side by side per arm
+
+    # (a) success rate
+    ax = axes[0]
+    for i, (pol, rows) in enumerate([("linear", lin), ("lmetric", lme)]):
+        vals = [pick(rows, a)["n"] / pick(rows, a)["req"] * 100 for a in ARMS]
+        ax.bar(x + (i - 0.5) * w, vals, w, color=POLICY_COLOR[pol], label=POLICY_LABEL[pol])
+        for bx, bv in zip(x + (i - 0.5) * w, vals):
+            ax.annotate(f"{bv:.0f}%", (bx, bv + 1.5), ha="center", fontsize=8)
+    ax.axhline(100, color="grey", ls=":", lw=1)
+    ax.set_xticks(x); ax.set_xticklabels(ARMS)
+    ax.set_ylabel("success rate (% of trace)"); ax.set_ylim(0, 115)
+    ax.set_title("(a) success ceiling is policy-invariant")
+    ax.legend(fontsize=8, loc="upper right"); ax.grid(alpha=.3, axis="y")
+
+    # (b) wall (log y) with cap line
+    ax = axes[1]
+    for i, (pol, rows) in enumerate([("linear", lin), ("lmetric", lme)]):
+        vals = [pick(rows, a)["wall"] for a in ARMS]
+        ax.bar(x + (i - 0.5) * w, vals, w, color=POLICY_COLOR[pol],
+               label=POLICY_LABEL[pol])
+        for bx, bv, r in zip(x + (i - 0.5) * w, vals,
+                             [pick(rows, a) for a in ARMS]):
+            mark = " ⊗" if r.get("capped") else ""
+            ax.annotate(f"{bv:.0f}s{mark}", (bx, bv * 1.05), ha="center", fontsize=7)
+    ax.axhline(cap, color="red", ls="--", lw=1.5,
+               label=f"2× colo wall cap = {cap:.0f}s")
+    ax.set_xticks(x); ax.set_xticklabels(ARMS)
+    ax.set_ylabel("wall-clock (s, log)"); ax.set_yscale("log")
+    ax.set_title("(b) linear w/ cap vs lmetric w/o cap — ⊗ = cap-killed")
+    ax.legend(fontsize=8, loc="upper left"); ax.grid(alpha=.3, which="both", axis="y")
+
+    # (c) goodput per minute of wall (successful requests / wall × 60)
+    ax = axes[2]
+    for i, (pol, rows) in enumerate([("linear", lin), ("lmetric", lme)]):
+        vals = [pick(rows, a)["n"] / pick(rows, a)["wall"] * 60 for a in ARMS]
+        ax.bar(x + (i - 0.5) * w, vals, w, color=POLICY_COLOR[pol], label=POLICY_LABEL[pol])
+        for bx, bv in zip(x + (i - 0.5) * w, vals):
+            ax.annotate(f"{bv:.1f}", (bx, bv + max(vals) * 0.02),
+                        ha="center", fontsize=8)
+    ax.set_xticks(x); ax.set_xticklabels(ARMS)
+    ax.set_ylabel("goodput (successful req / min)")
+    ax.set_title("(c) linear+cap is ~1.5–8× more wall-efficient on PD")
+    ax.legend(fontsize=8, loc="upper right"); ax.grid(alpha=.3, axis="y")
+
+    fig.suptitle("Fig 4r — Linear vs LMetric on the real agentic trace (first600s, "
+                 "PD-disagg wall-capped at 2× colo)",
+                 fontsize=12, y=1.0)
+    fig.tight_layout()
+    fig.savefig(OUT, dpi=130, bbox_inches="tight")
+    print(f"wrote {OUT}")
+
+
+if __name__ == "__main__":
+    main()
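Reviewer note: the success-rate and goodput figures plotted in panels (a) and (c) can be cross-checked by hand from the table in `PD_DISAGG_LMETRIC.md`. The sketch below hard-codes the `(successes, wall seconds)` pairs from that table; the `RUNS` layout and the helper names are illustrative only and are not part of the plot script. It assumes `fig4l_lmetric.json` carries the same LMetric values as the table.

```python
TOTAL_REQUESTS = 807

# (successful requests, wall seconds), copied from the Markdown table.
RUNS = {
    "6P+2D": {"linear": (472, 2280), "lmetric": (474, 3325)},
    "4P+4D": {"linear": (349, 2281), "lmetric": (348, 6850)},
    "2P+6D": {"linear": (176, 2280), "lmetric": (180, 19275)},
}


def success_pct(n_ok, total=TOTAL_REQUESTS):
    """Share of the trace that completed successfully, in percent."""
    return 100.0 * n_ok / total


def goodput_per_min(n_ok, wall_s):
    """Successful requests per minute of wall-clock (panel (c) metric)."""
    return n_ok / wall_s * 60.0


def summarize(runs):
    """Per arm: success-rate gap (pp) and linear/LMetric goodput ratio."""
    out = {}
    for arm, by_policy in runs.items():
        lin_ok, lin_wall = by_policy["linear"]
        lme_ok, lme_wall = by_policy["lmetric"]
        out[arm] = {
            "gap_pp": abs(success_pct(lin_ok) - success_pct(lme_ok)),
            "goodput_ratio": goodput_per_min(lin_ok, lin_wall)
            / goodput_per_min(lme_ok, lme_wall),
        }
    return out


if __name__ == "__main__":
    for arm, stats in summarize(RUNS).items():
        print(f"{arm}: gap={stats['gap_pp']:.2f} pp, "
              f"goodput ratio={stats['goodput_ratio']:.2f}x")
```

With the table values, the success-rate gap between the two policies stays under 1 pp for every arm, and the linear-to-LMetric goodput ratio is roughly 1.5× for 6P+2D, 3.0× for 4P+4D, and 8.3× for 2P+6D.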