agentic-kvc/analysis/v2/PD_DISAGG_LMETRIC.md
Gahow Wang 32f7f55990 v2: linear (default cache-aware) baseline + 2x wall-cap on first600s
Follow-up to the LMetric sweep: rerun with --policy linear (cache-aware
load + sticky session affinity, the cache_aware_proxy default) and cap
each PD-disagg arm at 2x the colo bench wall (SIGTERM bench.sh once cap
is exceeded; the cleanup trap clears vLLM and proxy; capped runs lack
metrics.summary.json so the analysis script computes from raw
metrics.jsonl).

Headline: the success-rate ceiling is policy-invariant.

  arm        linear (capped at 2x)    lmetric (uncapped)
  colo       807/807 = 100%, 964s     807/807 = 100%, 1021s
  pd6 (6:2)  472/807 =  58%, 2280s ⊗  474/807 =  59%, 3325s
  pd4 (4:4)  349/807 =  43%, 2281s ⊗  348/807 =  43%, 6850s
  pd2 (2:6)  176/807 =  22%, 2280s ⊗  180/807 =  22%, 19275s

Routing affects only how much wall is wasted timing out unreachable
requests at 600s each. Linear hits the same ceiling in 2280s as
LMetric does in 3300-19000s. This *strengthens* the §5 D-pool
capacity-ceiling thesis -- the cap is structural, not a routing
artifact.

Artifacts:
  analysis/v2/fig4r_linear.json          -- 4-arm linear summary
  analysis/v2/PD_DISAGG_LMETRIC.md       -- extended with wall-cap section
  figs/v2/fig4_linear_vs_lmetric.png     -- 3-panel side-by-side comparison
  microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
2026-06-01 00:55:40 +08:00


PD-colo vs PD-disagg on the real agentic trace — LMetric (cache-aware) clean-stack anchor

Figure: figs/v2/fig4_lmetric_pd_vs_colo.png
Data: analysis/v2/fig4l_lmetric.json
Date: 2026-05-31 · Hardware: dash1, 8×H20 · Model: Qwen3-Coder-30B-A3B-Instruct · vLLM 0.18.1 (V1, chunked-prefill on, e13391e eviction gated off) · Mooncake 0.3.11 · Routing: cache-aware proxy with --policy lmetric · Replayer per-request timeout 600 s.

TL;DR

On the production agentic trace with the right routing baseline (LMetric, cache-aware), PD-colo (8× kv_both) keeps 100 % completion on both traces and matches the daily-bench expectation: ~17 min for the high-load first600s and ~50 min for the full trace, with E2E p50 ~3 s and TTFT p50 ~1 s, which is 3.5-7× better than the original §3 round-robin baseline at the same wall-clock. Every static PD-disagg ratio fails (14-65 % completion), and the failure mode rotates predictably with the split: no static partition has a working operating point on this workload. LMetric improves colo dramatically; it does not rescue PD-disagg, confirming the bottleneck is structural (D-pool admission capacity + multi-turn KV accumulation), not routing. A follow-up linear-policy run with PD-disagg wall-capped at 2× the colo wall (see end of doc) hits the same success-rate ceiling to within one percentage point, confirming the cap is structural, not policy-driven.

Setup

  • Trace: w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions, agentic multi-turn, contexts up to ~112 k tokens; "first600s" variant = same heavy sessions compressed into 600 s → 807 reqs at 3.2× higher arrival rate).
  • 8 instances on 8 GPUs.
  • --mode baseline for colo (plain vLLM); --mode pdsep --pd-ratio P:D for the three PD splits, all with Mooncake KV transfer.
  • Cache-aware proxy with LMetric scoring (P_tokens × num_requests) + session affinity for multi-turn (the colleague's canonical baseline).
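
The routing rule in the last bullet can be sketched as below. This is a minimal illustration, not the proxy's actual code: only the score formula (P_tokens × num_requests) and the use of session affinity come from the setup above. The `Instance` and `Router` names, the reading of P_tokens as prefill tokens queued on an instance, and the first-minimum tie-break are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    p_tokens: int = 0      # assumed meaning: prefill tokens queued on this instance
    num_requests: int = 0  # assumed meaning: requests currently in flight


def lmetric_score(inst: Instance) -> int:
    # Score from the setup description: P_tokens x num_requests (lower is better).
    return inst.p_tokens * inst.num_requests


class Router:
    """Hypothetical router: sticky per session, LMetric score for new sessions."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.affinity = {}  # session_id -> Instance

    def route(self, session_id: str, prompt_tokens: int) -> Instance:
        inst = self.affinity.get(session_id)
        if inst is None:
            # New session: pick the lowest-scoring instance (first one wins ties).
            inst = min(self.instances, key=lmetric_score)
            self.affinity[session_id] = inst
        # Account for the newly dispatched request.
        inst.p_tokens += prompt_tokens
        inst.num_requests += 1
        return inst
```

A real proxy would also decrement these counters when a request finishes and might break affinity under overload; both are omitted here.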

Results

first600s (1.35 req/s, high-load stress)

| arm | success | E2E mean / p50 / p90 / p99 | TTFT p90 | TPOT p99 | TPS | wall |
|---|---|---|---|---|---|---|
| colo (8C) | 807/807 = 100 % | 11.1 / 3.27 / 28.6 / 95.9 s | 14.5 s | 388 ms | 226 | 17.0 min |
| pd6 (6:2) | 474/807 = 58.7 % | 83.2 / 6.75 / 382 / 542 s | 380 s | 19 ms | 40 | 55 min |
| pd4 (4:4) | 348/807 = 43.1 % | 203 / 215 / 477 / 575 s | 475 s | 25 ms | 15 | 114 min |
| pd2 (2:6) | 180/807 = 22.3 % | 380 / 536 / 579 / 602 s | 577 s | 18 ms | 34 | 321 min* |

Full trace (0.42 req/s, original §3 anchor load)

| arm | success | E2E mean / p50 / p90 / p99 | TTFT p90 | TPOT p99 | TPS | wall |
|---|---|---|---|---|---|---|
| colo (8C) | 1214/1214 = 100 % | 10.9 / 3.13 / 29.6 / 93.7 s | 16.9 s | 254 ms | 125 | 49.9 min |
| pd6 (6:2) | 793/1214 = 65.3 % | 61.9 / 3.70 / 307 / 477 s | 300 s | 18 ms | 46 | 94 min |
| pd4 (4:4) | 533/1214 = 43.9 % | 131 / 8.22 / 468 / 531 s | 467 s | 21 ms | 13 | 231 min |
| pd2 (2:6) | 169/1214 = 13.9 % | 195 / 6.82 / 552 / 593 s | 549 s | 13 ms | 1 | 563 min |

* The pd2 wall-clock is dominated by per-request timeouts (request_timeout=600 s) draining concurrently behind the multi-turn session causality.

Five clean findings

  1. LMetric+colo is the right baseline. Full-trace colo wall 49.9 min ≈ the original §3 RR's 49.9 min, but E2E p50 3.13 s vs §3's 10.8 s (3.5×) and TTFT p50 1.02 s vs §3's 7.0 s (7×). Same throughput envelope, far better latency — by virtue of cache-aware routing concentrating each session's turns onto one instance for prefix-cache reuse. The original §3 RR was an unfairly weak colo baseline.

  2. Every static PD-disagg ratio fails on the agentic workload. Completion drops to 14-65 %, on both traces. The drop is not a high-load artifact (it holds at the original §3 arrival rate of 0.42 req/s); it is structural.

  3. Failure mode rotates predictably with the P:D split:

    • pd2 (2 producers) → prefill-bound → 78-86 % TTFT timeouts.
    • pd6 (2 decode) → decode-admission-bound → 35-41 % TTFT timeouts.
    • pd4 (4P+4D) → both bottlenecks hit → 57 % TTFT timeouts.
    • No static ratio works. Colo's elastic 8-GPU pool absorbs whichever phase is hot at the moment.
  4. Decode isolation works, but doesn't matter under failure. TPOT p99 on every PD arm is 13-25 ms, an order of magnitude better than colo's 254-388 ms, but the win applies only to the 14-65 % of requests that get admitted. The other 35-86 % time out before ever seeing a first token, so the TPOT win is invisible to the end user.

  5. The §3 RR "100 % PD completion" was a measurement artifact. Original §3 (contaminated stack, RR routing) reported 100 % completion for pd6/pd4. LMetric on the clean stack shows 44-65 %. Most plausible mechanism: e13391e's eviction of producer KV on every transfer acted as a relief valve, reducing producer-pool pressure and letting more requests squeeze under the 600 s timeout, at the (uncosted) price of cross-turn re-prefill. With the eviction off, producers retain prefix correctly → cache works on PD too → but the cache itself contends for producer pool capacity, and the decode-pool admission ceiling tips earlier. PD-disagg is worse on agentic than §3 advertised, not better.

Linear-policy + wall-cap follow-up (2026-06-01) — the success ceiling is policy-invariant

To check whether the LMetric routing was secretly handicapping PD-disagg, we re-ran first600s with the default --policy linear (cache-aware load score + sticky session affinity — the original baseline of the cache_aware_proxy stack) and wall-capped each PD-disagg arm at 2 × the colo wall (kill bench.sh + cleanup GPUs once cap is exceeded, record records_at_cap).

| arm | linear success | linear wall | linear @-cap? | lmetric success | lmetric wall |
|---|---|---|---|---|---|
| colo | 807/807 = 100 % | 964 s | | 807/807 = 100 % | 1021 s |
| pd6 (6:2) | 472/807 = 58 % | 2280 s | ⊗ cap (706 dispatched) | 474/807 = 59 % | 3325 s |
| pd4 (4:4) | 349/807 = 43 % | 2281 s | ⊗ cap (577 dispatched) | 348/807 = 43 % | 6850 s |
| pd2 (2:6) | 176/807 = 22 % | 2280 s | ⊗ cap (521 dispatched) | 180/807 = 22 % | 19275 s |

→ Figure: figs/v2/fig4_linear_vs_lmetric.png · data: fig4r_linear.json

Three clean conclusions from the wall-cap experiment:

  1. The success-rate ceiling is structural, not a routing artifact. Linear and LMetric, two very different scoring policies (one session-sticky cache-aware, the other non-sticky pure load), converge on near-identical success rates (58 / 43 / 22 % vs 59 / 43 / 22 %, at most 4 of 807 requests apart) for every PD-disagg ratio. Routing has no measurable effect on the completion ceiling. The bottleneck is the static P:D split itself.

  2. LMetric's longer wall was spent on requests that would never succeed. With the cap enforced, linear hits the same ceiling in 2280 s that LMetric reached in 3300-19000 s; the extra wall just slowly times out the unreachable requests at 600 s each.

  3. The wall-cap is the right way to bench PD-disagg. Reporting "completion %" without a wall budget is misleading: an uncapped bench does eventually finish, but only by spending hours timing out requests that then count as failures anyway. The honest metric is success-in-2×-colo-wall, which gives the same answer for both routings and matches what an end user would observe on a real SLO.
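
As a rough sketch of the success-in-budget metric from the third conclusion, the function below counts requests that succeeded within a wall budget and divides by the full trace size (807 for first600s), not by the number dispatched, which matches how the table above reports capped arms. The record fields `success` and `finish_s` are hypothetical; the real metrics.jsonl schema is not shown in this document.

```python
import json


def success_within_budget(lines, total_requests, budget_s):
    """Return (count, rate) of requests that succeeded within budget_s.

    `lines` is an iterable of JSONL strings. Each record is assumed to carry
    "success" (bool) and "finish_s" (seconds since bench start); both field
    names are placeholders for whatever the replayer actually writes.
    """
    ok = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        if rec.get("success") and rec.get("finish_s", float("inf")) <= budget_s:
            ok += 1
    return ok, ok / total_requests


if __name__ == "__main__":
    demo = [
        '{"success": true, "finish_s": 10.0}',
        '{"success": false, "finish_s": 600.0}',
        '{"success": true, "finish_s": 50.0}',
    ]
    # Toy data only: 1 of 4 trace requests succeeded within a 20 s budget.
    print(success_within_budget(demo, total_requests=4, budget_s=20.0))
```

Requests that were never dispatched before the cap have no record at all, so they count as failures through the fixed denominator.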

This strengthens the §5 D-pool capacity-ceiling thesis: even with session-affinity routing serving every request to a warm prefix cache (which should maximise PD's throughput), the static D-pool can't admit more than ~22 / 43 / 58 % of the agentic trace within 2× the colo budget. Colo wins not because routing is smarter, but because its elastic pool absorbs whichever phase is hot — there's no cap to hit.


Reproduce

# On dash1, from the main repo /home/admin/cpfs/wjh/agentic-kv:
for TR in w600_r0.0015_st30.jsonl w600_r0.0015_st30_first600s.jsonl; do
  TRACE=traces/$TR bash scripts/bench.sh --tag fig4l_lmetric_colo_${TR%.*} \
    --mode baseline --policy lmetric
  for r in 6:2 4:4 2:6; do
    TRACE=traces/$TR bash scripts/bench.sh --tag fig4l_lmetric_${r/:/p}_${TR%.*} \
      --mode pdsep --pd-ratio $r --policy lmetric
  done
done

python microbench/fresh_setup/plot_fig4l_lmetric.py
python microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py

For the linear + 2× wall-cap variant, run colo first to get wall_clock_s, compute CAP=2*wall, then launch each PD-disagg arm in the background and SIGTERM it (so bench.sh's cleanup trap fires) once date +%s minus the arm's start time exceeds CAP. The capped runs lack metrics.summary.json (replayer was killed before it could write); compute the summary directly from metrics.jsonl (see the inline collector used to build analysis/v2/fig4r_linear.json).
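
The capping procedure described above could be driven by a small watchdog along these lines. It is a sketch under assumptions, not the script used for these runs: `run_with_wall_cap` and its parameters are hypothetical names, and it relies on bench.sh handling SIGTERM through its cleanup trap as stated in this document.

```python
import signal
import subprocess
import time


def run_with_wall_cap(cmd, cap_s, env=None, poll_s=1.0, grace_s=60.0):
    """Run cmd and send SIGTERM once cap_s seconds have elapsed.

    Returns (returncode, capped). SIGTERM is used rather than SIGKILL so
    that a shell cleanup trap in the child gets a chance to run.
    """
    start = time.monotonic()
    proc = subprocess.Popen(cmd, env=env)
    capped = False
    while proc.poll() is None:
        if time.monotonic() - start > cap_s:
            capped = True
            proc.send_signal(signal.SIGTERM)
            try:
                proc.wait(timeout=grace_s)  # give cleanup time to finish
            except subprocess.TimeoutExpired:
                proc.kill()  # last resort if cleanup hangs
                proc.wait()
            break
        time.sleep(poll_s)
    return proc.returncode, capped
```

In use, `cmd` would be the bench.sh command line for one PD-disagg arm, `env` would carry the `TRACE` variable, and `cap_s` would be twice the colo arm's wall_clock_s. A run that comes back with `capped` set is one whose summary has to be rebuilt from metrics.jsonl.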

Source: bench.sh cleans GPUs before each arm and writes metrics.jsonl + metrics.summary.json per tag. Aggregation script: see the inline JSON dump used to build analysis/v2/fig4l_lmetric.json.