Follow-up to the LMetric sweep: rerun with --policy linear (cache-aware
load + sticky session affinity, the cache_aware_proxy default) and cap
each PD-disagg arm at 2x the colo bench wall (SIGTERM bench.sh once cap
is exceeded; the cleanup trap clears vLLM and proxy; capped runs lack
metrics.summary.json so the analysis script computes from raw
metrics.jsonl).
Headline: the success-rate ceiling is policy-invariant.
arm linear (capped at 2x) lmetric (uncapped)
colo 807/807 = 100%, 964s 807/807 = 100%, 1021s
pd6 (6:2) 472/807 = 58%, 2280s ⊗ 474/807 = 59%, 3325s
pd4 (4:4) 349/807 = 43%, 2281s ⊗ 348/807 = 43%, 6850s
pd2 (2:6) 176/807 = 22%, 2280s ⊗ 180/807 = 22%, 19275s
Routing affects only how much wall is wasted timing out unreachable
requests at 600s each. Linear hits the same ceiling in 2280s as
LMetric does in 3300-19000s. This *strengthens* the §5 D-pool
capacity-ceiling thesis -- the cap is structural, not a routing
artifact.
Artifacts:
analysis/v2/fig4r_linear.json -- 4-arm linear summary
analysis/v2/PD_DISAGG_LMETRIC.md -- extended with wall-cap section
figs/v2/fig4_linear_vs_lmetric.png -- 3-panel side-by-side comparison
microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
Anchor experiment for the clean-stack PD comparison using the canonical
cache-aware proxy with --policy lmetric (scripts/bench.sh harness). Two
traces x four arms = eight runs on dash1.
Headline: with the right routing baseline (LMetric), PD-colo holds 100%
completion on both traces while every static PD-disagg ratio fails
(14-65% completion), and the failure mode rotates with the split --
no static partition has a working operating point on this workload.
LMetric improves colo dramatically (TTFT p50 1.0s vs original §3 RR
7.0s; 7x) but does NOT rescue PD-disagg, confirming the bottleneck is
structural (D-pool admission + multi-turn KV accumulation), not routing.
Completion matrix:
first600s full
colo 100% 100%
pd6 (6:2) 58.7% 65.3% (decode-bound)
pd4 (4:4) 43.1% 43.9% (both bottlenecks)
pd2 (2:6) 22.3% 13.9% (prefill-bound)
The original §3 RR "100% PD completion" appears to be a measurement
artifact of e13391e: producer-KV eviction acted as a relief valve,
letting more requests squeeze under the 600s timeout at the (uncosted)
price of cross-turn re-prefill. With the eviction off, PD-disagg is
worse than §3 advertised, not better.
Artifacts:
analysis/v2/fig4l_lmetric.json -- 8-arm summary data
analysis/v2/PD_DISAGG_LMETRIC.md -- writeup + reproduce recipe
figs/v2/fig4_lmetric_pd_vs_colo.png -- 4-panel comparison figure
microbench/fresh_setup/plot_fig4l_lmetric.py -- plot script