v2: linear (default cache-aware) baseline + 2x wall-cap on first600s
Follow-up to the LMetric sweep: rerun with --policy linear (cache-aware load +
sticky session affinity, the cache_aware_proxy default) and cap each PD-disagg
arm at 2x the colo bench wall (SIGTERM bench.sh once the cap is exceeded; the
cleanup trap clears vLLM and proxy; capped runs lack metrics.summary.json, so
the analysis script computes from raw metrics.jsonl).

Headline: the success-rate ceiling is policy-invariant.

  arm         linear (capped at 2x)       lmetric (uncapped)
  colo        807/807 = 100%,  964s       807/807 = 100%,  1021s
  pd6 (6:2)   472/807 =  58%, 2280s ⊗     474/807 =  59%,  3325s
  pd4 (4:4)   349/807 =  43%, 2281s ⊗     348/807 =  43%,  6850s
  pd2 (2:6)   176/807 =  22%, 2280s ⊗     180/807 =  22%, 19275s

Routing affects only how much wall is wasted timing out unreachable requests
at 600s each. Linear hits the same ceiling in 2280s as LMetric does in
3300-19000s. This *strengthens* the §5 D-pool capacity-ceiling thesis -- the
cap is structural, not a routing artifact.

Artifacts:
  analysis/v2/fig4r_linear.json        -- 4-arm linear summary
  analysis/v2/PD_DISAGG_LMETRIC.md     -- extended with wall-cap section
  figs/v2/fig4_linear_vs_lmetric.png   -- 3-panel side-by-side comparison
  microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
@@ -17,7 +17,9 @@ same wall-clock**). Every static **PD-disagg ratio fails** (14–65 % completion
failure mode rotates predictably with the split: **no static partition has a working
operating point on this workload**. LMetric improves colo dramatically; it does *not*
rescue PD-disagg, confirming the bottleneck is structural (D-pool admission capacity +
multi-turn KV accumulation), not routing. A follow-up linear-policy run with PD-disagg
**wall-capped at 2× the colo wall** (see end of doc) hits the **same** success-rate
ceiling (within 1 percentage point), confirming the cap is structural, not policy-driven.

## Setup

@@ -87,6 +89,52 @@ draining concurrently behind the multi-turn session causality.
pool capacity, and the decode-pool admission ceiling tips earlier. **PD-disagg is
worse on agentic than §3 advertised, not better.**

## Linear-policy + wall-cap follow-up (2026-06-01): the success ceiling is policy-invariant

To check whether the LMetric routing was secretly handicapping PD-disagg, we re-ran
first600s with the **default `--policy linear`** (cache-aware load score + sticky
session affinity, the original baseline of the cache_aware_proxy stack) and
**wall-capped each PD-disagg arm at 2 × the colo wall** (kill bench.sh and clean up
the GPUs once the cap is exceeded, record `records_at_cap`).

| arm | linear success | linear wall | linear capped? (dispatched at cap) | lmetric success | lmetric wall |
|---|---|---|---|---|---|
| **colo** | 807/807 = **100 %** | 964 s | no | 807/807 = **100 %** | 1021 s |
| **pd6 (6:2)** | **472/807 = 58 %** | 2280 s | ⊗ cap (706 dispatched) | 474/807 = 59 % | 3325 s |
| **pd4 (4:4)** | **349/807 = 43 %** | 2281 s | ⊗ cap (577 dispatched) | 348/807 = 43 % | 6850 s |
| **pd2 (2:6)** | **176/807 = 22 %** | 2280 s | ⊗ cap (521 dispatched) | 180/807 = 22 % | 19275 s |

→ Figure: [`figs/v2/fig4_linear_vs_lmetric.png`](../../figs/v2/fig4_linear_vs_lmetric.png) ·
data: [`fig4r_linear.json`](fig4r_linear.json)

**Three clean conclusions from the wall-cap experiment:**

1. **The success-rate ceiling is structural, not a routing artifact.** Linear and
   LMetric, two very different scoring policies (one session-sticky cache-aware,
   the other non-sticky pure load), converge on **near-identical success rates**
   (58 / 43 / 22 % vs 59 / 43 / 22 %) for every PD-disagg ratio. Routing has no
   measurable effect on the completion ceiling. The bottleneck is the static P:D
   split itself.

2. **LMetric's longer wall was time *wasted on requests that will never succeed*.**
   With the cap enforced, linear hits the same ceiling in 2280 s that LMetric took
   3325–19275 s to reach; the extra wall just slowly times out the unreachable
   requests at 600 s each.

3. **The wall-cap is the right way to bench PD-disagg.** Reporting "completion %"
   without a wall budget is misleading (the bench eventually finishes if you wait
   long enough, but only by counting timeouts as failures over hours). The honest
   metric is **success-in-2×-colo-wall**, which gives the same answer for both
   routings and matches what an end user would observe on a real SLO.
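
The success gap and the wall-efficiency claim can be checked with a few lines of arithmetic. The sketch below is illustrative only: the counts and walls are copied from the results table, while `RESULTS`, `success_pct`, and `goodput_per_min` are names made up for this example and are not part of the repo.

```python
# Success counts and wall-clock seconds copied from the results table.
# RESULTS, success_pct and goodput_per_min are illustrative names only.
TOTAL_REQUESTS = 807

RESULTS = {
    # arm: {policy: (successes, wall_seconds)}
    "colo": {"linear": (807, 964), "lmetric": (807, 1021)},
    "pd6": {"linear": (472, 2280), "lmetric": (474, 3325)},
    "pd4": {"linear": (349, 2281), "lmetric": (348, 6850)},
    "pd2": {"linear": (176, 2280), "lmetric": (180, 19275)},
}


def success_pct(successes: int, total: int = TOTAL_REQUESTS) -> float:
    """Share of the trace that completed successfully, in percent."""
    return successes / total * 100.0


def goodput_per_min(successes: int, wall_s: float) -> float:
    """Successful requests per minute of wall-clock time."""
    return successes / wall_s * 60.0


for arm, by_policy in RESULTS.items():
    lin_ok, lin_wall = by_policy["linear"]
    lme_ok, lme_wall = by_policy["lmetric"]
    gap = abs(success_pct(lin_ok) - success_pct(lme_ok))
    ratio = goodput_per_min(lin_ok, lin_wall) / goodput_per_min(lme_ok, lme_wall)
    print(f"{arm}: success gap {gap:.1f} pt, linear/lmetric goodput ratio {ratio:.2f}x")
```

On these numbers, every PD-disagg success gap is under 1 percentage point, and the linear-to-LMetric goodput ratio runs from roughly 1.5× (pd6) to roughly 8.3× (pd2). The walls are used as reported: the capped arms list about 2280 s, whereas 2 × 964 s is about 1928 s, and the source does not say what accounts for the difference.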

This **strengthens** the §5 D-pool capacity-ceiling thesis: even with
session-affinity routing serving every request to a warm prefix cache (which
*should* maximise PD's throughput), the static D-pool can't admit more than
~22 / 43 / 58 % of the agentic trace within 2× the colo budget. Colo wins not
because routing is smarter, but because its **elastic pool** absorbs whichever
phase is hot, so there is no cap to hit.

---

## Reproduce

```bash
@@ -101,8 +149,17 @@ for TR in w600_r0.0015_st30.jsonl w600_r0.0015_st30_first600s.jsonl; do
done

python microbench/fresh_setup/plot_fig4l_lmetric.py
python microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
```

For the linear + 2× wall-cap variant, run colo first to get `wall_clock_s`,
compute `CAP=2*wall`, then launch each PD-disagg arm in the background and
`SIGTERM` it (so bench.sh's cleanup trap fires) once `date +%s` minus the
arm's start time exceeds `CAP`. The capped runs lack `metrics.summary.json`
(the replayer was killed before it could write it); compute the summary directly
from `metrics.jsonl` (see the inline collector used to build
`analysis/v2/fig4r_linear.json`).
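
The cap procedure described above could be scripted along these lines. This is a minimal sketch, not the launcher actually used for these runs: `run_with_wall_cap` and `grace_s` are hypothetical names, and the child command is a stand-in because the real `bench.sh` arguments are not shown in this document.

```python
from __future__ import annotations

import signal
import subprocess
import sys
import time


def run_with_wall_cap(cmd: list[str], cap_s: float, grace_s: float = 60.0) -> dict:
    """Run cmd and send SIGTERM once cap_s seconds of wall-clock have elapsed.

    SIGTERM (rather than SIGKILL) is sent first so that a shell `trap` in the
    child gets a chance to run its cleanup.
    """
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    capped = False
    try:
        proc.wait(timeout=cap_s)
    except subprocess.TimeoutExpired:
        capped = True
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=grace_s)  # give the cleanup trap time to finish
        except subprocess.TimeoutExpired:
            proc.kill()  # last resort if cleanup hangs
            proc.wait()
    return {
        "wall": time.monotonic() - start,
        "capped": capped,
        "returncode": proc.returncode,
    }


if __name__ == "__main__":
    # Stand-in child so the sketch runs anywhere. In practice cmd would be the
    # bench.sh invocation for one PD-disagg arm, and colo_wall_s would be the
    # wall_clock_s measured by the colo run.
    colo_wall_s = 0.25  # placeholder value
    result = run_with_wall_cap(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        cap_s=2 * colo_wall_s,
    )
    print(result)
```

The returned `capped` flag corresponds to the `capped` field in `fig4r_linear.json`; whether the original launcher recorded it this way is an assumption.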

Source `bench.sh` cleans GPUs before each arm and writes `metrics.jsonl` +
`metrics.summary.json` per tag. Aggregation script: see the inline JSON dump used
to build `analysis/v2/fig4l_lmetric.json`.
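
For capped arms that only have `metrics.jsonl`, a collector could look roughly like the sketch below. The per-record schema is an assumption, since the source does not show it: each line is taken to be a JSON object with a boolean `success` and float `e2e`, `ttft`, and `tpot` values in seconds, and every record in the file is counted as dispatched. The output keys mirror those visible in `fig4r_linear.json` (`tps` is omitted because its definition is not given). The function names are hypothetical.

```python
from __future__ import annotations

import json
import statistics


def percentile(sorted_vals: list[float], q: float) -> float:
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        raise ValueError("no values")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def dist(vals: list[float]) -> dict:
    """count / mean / p50 / p90 / p99, matching the summary layout."""
    s = sorted(vals)
    return {
        "count": len(s),
        "mean": statistics.fmean(s),
        "p50": percentile(s, 50),
        "p90": percentile(s, 90),
        "p99": percentile(s, 99),
    }


def summarize(lines, req: int, wall: float, capped: bool) -> dict:
    """Build one summary row from raw JSONL lines.

    ASSUMED schema (not confirmed by the source): one JSON object per line
    with "success" (bool) and "e2e", "ttft", "tpot" (seconds).
    """
    records = [json.loads(ln) for ln in lines if ln.strip()]
    ok = [r for r in records if r.get("success")]
    row = {"n": len(ok), "req": req, "dispatched": len(records),
           "wall": wall, "capped": capped}
    if ok:
        for key in ("e2e", "ttft", "tpot"):
            row[key] = dist([r[key] for r in ok])
    return row


# Tiny synthetic example; real input would be the lines of metrics.jsonl.
sample = [
    '{"success": true, "e2e": 2.0, "ttft": 0.5, "tpot": 0.01}',
    '{"success": true, "e2e": 4.0, "ttft": 1.5, "tpot": 0.03}',
    '{"success": false}',
]
row = summarize(sample, req=807, wall=2280, capped=True)
print(row["n"], row["dispatched"], row["e2e"]["p50"])
```

Passing `req`, `wall`, and `capped` in explicitly keeps the sketch independent of anything the killed replayer would have written.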
1
analysis/v2/fig4r_linear.json
Normal file
@@ -0,0 +1 @@
[{"tag": "fig4r_linear_colo_first600s", "arm": "colo", "trace": "first600s", "policy": "linear", "n": 807, "req": 807, "dispatched": 807, "e2e": {"count": 807.0, "mean": 8.436370009274967, "p50": 2.5224755640374497, "p90": 22.65510415879542, "p99": 75.54369598095519}, "ttft": {"count": 807.0, "mean": 4.2332503390957195, "p50": 0.8872958200518042, "p90": 11.684667797433207, "p99": 44.98891795879462}, "tpot": {"count": 807.0, "mean": 0.020958194728517718, "p50": 0.00851320761584622, "p90": 0.026440129078245465, "p99": 0.30344440533287176}, "wall": 963.6191155100241, "tps": 239.4857016486815, "capped": false}, {"tag": "fig4r_linear_pd2_first600s", "arm": "2P+6D", "trace": "first600s", "policy": "linear", "n": 176, "req": 807, "dispatched": 521, "e2e": {"count": 176, "mean": 378.5561210460834, "p50": 536.7719694490079, "p90": 583.832092280034, "p99": 601.3415494390065}, "ttft": {"count": 176, "mean": 377.12570991374446, "p50": 536.1157373189926, "p90": 580.3465002350276, "p99": 598.0943597999867}, "tpot": {"count": 176, "mean": 0.007864906140929698, "p50": 0.007212154543958604, "p90": 0.011962352272927423, "p99": 0.017870794738764347}, "wall": 2280, "tps": 14.419736842105262, "capped": true}, {"tag": "fig4r_linear_pd4_first600s", "arm": "4P+4D", "trace": "first600s", "policy": "linear", "n": 349, "req": 807, "dispatched": 577, "e2e": {"count": 349, "mean": 264.8537863784421, "p50": 306.6853819829412, "p90": 488.64622142596636, "p99": 596.5830293919425}, "ttft": {"count": 349, "mean": 262.3163347712099, "p50": 299.75751709297765, "p90": 485.475125996978, "p99": 596.4081599479541}, "tpot": {"count": 349, "mean": 0.010442244895290958, "p50": 0.008213572105774598, "p90": 0.019443845545703716, "p99": 0.028178529054794}, "wall": 2281, "tps": 38.306882946076286, "capped": true}, {"tag": "fig4r_linear_pd6_first600s", "arm": "6P+2D", "trace": "first600s", "policy": "linear", "n": 472, "req": 807, "dispatched": 706, "e2e": {"count": 472, "mean": 118.632779156234, "p50": 
12.702161715948023, "p90": 458.1609142010566, "p99": 526.5488834320568}, "ttft": {"count": 472, "mean": 115.80202843308507, "p50": 9.745031949947588, "p90": 455.81679951993283, "p99": 516.5850186559837}, "tpot": {"count": 472, "mean": 0.00950947083585719, "p50": 0.008435572332624966, "p90": 0.015233499645638644, "p99": 0.023447183093280886}, "wall": 2280, "tps": 61.69210526315789, "capped": true}]
BIN
figs/v2/fig4_linear_vs_lmetric.png
Normal file
Binary file not shown. Size: 109 KiB
104
microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
Normal file
@@ -0,0 +1,104 @@
"""Linear vs LMetric routing on the real agentic trace (first600s).

Visualizes the wall-cap finding: with the 2x-colo-wall cap on PD-disagg arms,
linear and LMetric reach the *same* success-rate ceiling -- the static P:D
split has a structural completion ceiling that does not depend on the routing
policy or on how long you keep retrying. Routing affects only how much wall
time is wasted on requests that will never succeed.

Inputs : analysis/v2/fig4l_lmetric.json  (8 arms, both traces; we use first600s)
         analysis/v2/fig4r_linear.json   (4 arms, first600s, PD wall-capped)
Output : figs/v2/fig4_linear_vs_lmetric.png
"""
from __future__ import annotations

import json
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend; must be selected before importing pyplot
import matplotlib.pyplot as plt
import numpy as np

ROOT = Path(__file__).resolve().parents[2]
DATA = ROOT / "analysis" / "v2"
OUT = ROOT / "figs" / "v2" / "fig4_linear_vs_lmetric.png"

ARMS = ["colo", "6P+2D", "4P+4D", "2P+6D"]
POLICY_COLOR = {"linear": "#9467bd", "lmetric": "#2ca02c"}
POLICY_LABEL = {"linear": "linear (cache-aware + session-affinity)",
                "lmetric": "LMetric (P_tokens × BS)"}


def pick(rows, arm, trace="first600s"):
    """Return the summary row for (arm, trace); fail loudly if it is missing."""
    for r in rows:
        if r["arm"] == arm and r["trace"] == trace:
            return r
    raise KeyError(f"no row for arm={arm!r}, trace={trace!r}")


def load(name):
    """Load one summary JSON from analysis/v2, closing the file afterwards."""
    with open(DATA / name) as f:
        return json.load(f)


def main():
    lin = load("fig4r_linear.json")
    lme = load("fig4l_lmetric.json")
    policies = [("linear", lin), ("lmetric", lme)]

    # colo wall (linear) sets the 2x cap reference
    cap = 2 * pick(lin, "colo")["wall"]

    fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))
    x = np.arange(len(ARMS))
    w = 0.38

    # (a) success rate
    ax = axes[0]
    for i, (pol, rows) in enumerate(policies):
        picked = [pick(rows, a) for a in ARMS]
        vals = [r["n"] / r["req"] * 100 for r in picked]
        xs = x + (i - 0.5) * w
        ax.bar(xs, vals, w, color=POLICY_COLOR[pol], label=POLICY_LABEL[pol])
        for bx, bv in zip(xs, vals):
            ax.annotate(f"{bv:.0f}%", (bx, bv + 1.5), ha="center", fontsize=8)
    ax.axhline(100, color="grey", ls=":", lw=1)
    ax.set_xticks(x); ax.set_xticklabels(ARMS)
    ax.set_ylabel("success rate (% of trace)"); ax.set_ylim(0, 115)
    ax.set_title("(a) success ceiling is policy-invariant")
    ax.legend(fontsize=8, loc="upper right"); ax.grid(alpha=.3, axis="y")

    # (b) wall (log y) with cap line
    ax = axes[1]
    for i, (pol, rows) in enumerate(policies):
        picked = [pick(rows, a) for a in ARMS]
        vals = [r["wall"] for r in picked]
        xs = x + (i - 0.5) * w
        ax.bar(xs, vals, w, color=POLICY_COLOR[pol], label=POLICY_LABEL[pol])
        for bx, bv, r in zip(xs, vals, picked):
            mark = " ⊗" if r.get("capped") else ""
            ax.annotate(f"{bv:.0f}s{mark}", (bx, bv * 1.05), ha="center", fontsize=7)
    ax.axhline(cap, color="red", ls="--", lw=1.5,
               label=f"2× colo wall cap = {cap:.0f}s")
    ax.set_xticks(x); ax.set_xticklabels(ARMS)
    ax.set_ylabel("wall-clock (s, log)"); ax.set_yscale("log")
    ax.set_title("(b) linear w/ cap vs lmetric w/o cap (⊗ = cap-killed)")
    ax.legend(fontsize=8, loc="upper left"); ax.grid(alpha=.3, which="both", axis="y")

    # (c) goodput per minute of wall (successful requests / wall × 60)
    ax = axes[2]
    for i, (pol, rows) in enumerate(policies):
        picked = [pick(rows, a) for a in ARMS]
        vals = [r["n"] / r["wall"] * 60 for r in picked]
        xs = x + (i - 0.5) * w
        ax.bar(xs, vals, w, color=POLICY_COLOR[pol], label=POLICY_LABEL[pol])
        for bx, bv in zip(xs, vals):
            ax.annotate(f"{bv:.1f}", (bx, bv + max(vals) * 0.02),
                        ha="center", fontsize=8)
    ax.set_xticks(x); ax.set_xticklabels(ARMS)
    ax.set_ylabel("goodput (successful req / min)")
    # Range follows from the table walls: about 1.5x (pd6) up to about 8.3x (pd2).
    ax.set_title("(c) linear+cap is ~1.5–8× more wall-efficient on PD")
    ax.legend(fontsize=8, loc="upper right"); ax.grid(alpha=.3, axis="y")

    fig.suptitle("Fig 4r: Linear vs LMetric on the real agentic trace (first600s, "
                 "PD-disagg wall-capped at 2× colo)",
                 fontsize=12, y=1.0)
    fig.tight_layout()
    fig.savefig(OUT, dpi=130, bbox_inches="tight")
    print(f"wrote {OUT}")


if __name__ == "__main__":
    main()