v2: linear (default cache-aware) baseline + 2x wall-cap on first600s

Follow-up to the LMetric sweep: rerun with --policy linear (cache-aware
load + sticky session affinity, the cache_aware_proxy default) and cap
each PD-disagg arm at 2x the colo bench wall (SIGTERM bench.sh once cap
is exceeded; the cleanup trap clears vLLM and proxy; capped runs lack
metrics.summary.json so the analysis script computes from raw
metrics.jsonl).

Headline: the success-rate ceiling is policy-invariant.

  arm        linear (capped at 2x)    lmetric (uncapped)
  colo       807/807 = 100%, 964s     807/807 = 100%, 1021s
  pd6 (6:2)  472/807 =  58%, 2280s ⊗  474/807 =  59%, 3325s
  pd4 (4:4)  349/807 =  43%, 2281s ⊗  348/807 =  43%, 6850s
  pd2 (2:6)  176/807 =  22%, 2280s ⊗  180/807 =  22%, 19275s

Routing affects only how much wall is wasted timing out unreachable
requests at 600s each. Linear hits the same ceiling in 2280s as
LMetric does in 3300-19000s. This *strengthens* the §5 D-pool
capacity-ceiling thesis -- the cap is structural, not a routing
artifact.

Artifacts:
  analysis/v2/fig4r_linear.json          -- 4-arm linear summary
  analysis/v2/PD_DISAGG_LMETRIC.md       -- extended with wall-cap section
  figs/v2/fig4_linear_vs_lmetric.png     -- 3-panel side-by-side comparison
  microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
This commit is contained in:
2026-06-01 00:55:40 +08:00
parent 7529284cee
commit 32f7f55990
4 changed files with 163 additions and 1 deletions

Binary file not shown.

After

Width:  |  Height:  |  Size: 109 KiB