agentic-kvc


Gahow Wang

7529284cee

v2: LMetric PD-colo vs PD-disagg on the real agentic trace

Anchor experiment for the clean-stack PD comparison using the canonical
cache-aware proxy with --policy lmetric (scripts/bench.sh harness). Two
traces x four arms = eight runs on dash1.

Headline: with the right routing baseline (LMetric), PD-colo holds 100%
completion on both traces while every static PD-disagg ratio fails
(14-65% completion), and the failure mode rotates with the split --
no static partition has a working operating point on this workload.
LMetric improves colo dramatically (TTFT p50 1.0s vs original §3 RR
7.0s; 7x) but does NOT rescue PD-disagg, confirming the bottleneck is
structural (D-pool admission + multi-turn KV accumulation), not routing.

Completion matrix:
  arm (P:D)          first600s   full    bottleneck
  colo                 100%     100%
  pd6 (6:2)            58.7%    65.3%    decode-bound
  pd4 (4:4)            43.1%    43.9%    both bottlenecks
  pd2 (2:6)            22.3%    13.9%    prefill-bound
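
As a quick sanity check on the headline numbers, the sketch below re-derives the quoted completion range and the best static split per trace from the matrix above. The dict layout and variable names are illustrative assumptions and are not the schema of analysis/v2/fig4l_lmetric.json; the percentages and TTFT values are copied from this message.

```python
# Completion percentages copied from the matrix in this commit message.
# The layout (arm -> trace -> percent) is an assumption for illustration,
# not the schema of analysis/v2/fig4l_lmetric.json.
completion = {
    "colo": {"first600s": 100.0, "full": 100.0},
    "pd6": {"first600s": 58.7, "full": 65.3},
    "pd4": {"first600s": 43.1, "full": 43.9},
    "pd2": {"first600s": 22.3, "full": 13.9},
}
traces = ("first600s", "full")

# Static PD-disagg arms only (everything except the colocated baseline).
disagg = {arm: row for arm, row in completion.items() if arm != "colo"}

# Range of completion across all six disagg runs (the "14-65%" headline).
values = [row[t] for row in disagg.values() for t in traces]
disagg_min, disagg_max = min(values), max(values)

# Best static split per trace, and whether any split reaches 100%.
best_static = {t: max(disagg, key=lambda arm: disagg[arm][t]) for t in traces}
any_static_complete = any(v >= 100.0 for v in values)

# TTFT p50 improvement for colo: original §3 RR (7.0s) vs LMetric (1.0s).
ttft_speedup = 7.0 / 1.0

print(f"disagg completion range: {disagg_min}%-{disagg_max}%")
print(f"best static split per trace: {best_static}")
print(f"any static split at 100%: {any_static_complete}")
print(f"colo TTFT p50 speedup: {ttft_speedup:.0f}x")
```

Even the best static split (pd6 on both traces) stays well below the colo baseline, which is the basis for the "no working operating point" claim.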

The original §3 RR "100% PD completion" appears to be a measurement
artifact of e13391e: producer-KV eviction acted as a relief valve,
letting more requests squeeze under the 600s timeout at the (uncosted)
price of cross-turn re-prefill. With eviction off, PD-disagg performs
worse than §3 reported, not better.

Artifacts:
  analysis/v2/fig4l_lmetric.json     -- 8-arm summary data
  analysis/v2/PD_DISAGG_LMETRIC.md   -- writeup + reproduce recipe
  figs/v2/fig4_lmetric_pd_vs_colo.png -- 4-panel comparison figure
  microbench/fresh_setup/plot_fig4l_lmetric.py -- plot script

2026-05-31 20:15:10 +08:00
