Anchor experiment for the clean-stack PD comparison using the canonical
cache-aware proxy with --policy lmetric (scripts/bench.sh harness). Two
traces x four arms = eight runs on dash1.
Headline: with the right routing baseline (LMetric), PD-colo holds 100%
completion on both traces while every static PD-disagg ratio fails
(14-65% completion), and the failure mode rotates with the split --
no static partition has a working operating point on this workload.
LMetric improves colo dramatically (TTFT p50 1.0s vs original §3 RR
7.0s; 7x) but does NOT rescue PD-disagg, confirming the bottleneck is
structural (D-pool admission + multi-turn KV accumulation), not routing.
Completion matrix:
             first600s   full
  colo       100%        100%
  pd6 (6:2)  58.7%       65.3%   (decode-bound)
  pd4 (4:4)  43.1%       43.9%   (both bottlenecks)
  pd2 (2:6)  22.3%       13.9%   (prefill-bound)
The original §3 RR "100% PD completion" appears to be a measurement
artifact of e13391e: producer-KV eviction acted as a relief valve,
letting more requests squeeze under the 600s timeout at the (uncosted)
price of cross-turn re-prefill. With the eviction off, PD-disagg is
worse than §3 advertised, not better.
Artifacts:
analysis/v2/fig4l_lmetric.json -- 8-arm summary data
analysis/v2/PD_DISAGG_LMETRIC.md -- writeup + reproduce recipe
figs/v2/fig4_lmetric_pd_vs_colo.png -- 4-panel comparison figure
microbench/fresh_setup/plot_fig4l_lmetric.py -- plot script
PD-colo vs PD-disagg on the real agentic trace — LMetric (cache-aware) clean-stack anchor
Figure: figs/v2/fig4_lmetric_pd_vs_colo.png
Data: analysis/v2/fig4l_lmetric.json
Date: 2026-05-31 · Hardware: dash1, 8×H20 · Model: Qwen3-Coder-30B-A3B-Instruct
· vLLM 0.18.1 (V1, chunked-prefill on, e13391e eviction gated off)
· Mooncake 0.3.11 · Routing: cache-aware proxy with --policy lmetric
· Replayer per-request timeout 600 s.
TL;DR
On the production agentic trace with the right routing baseline (LMetric, cache-aware), PD-colo (8× kv_both) keeps 100 % completion on both traces and matches the daily-bench expectation: ~17 min wall for the high-load first600s and ~50 min for the full trace, with E2E p50 ~3 s and TTFT p50 ~1 s. That is 3.5× (E2E p50) to 7× (TTFT p50) better than the original §3 round-robin baseline at the same wall-clock. Every static PD-disagg ratio fails (14–65 % completion), and the failure mode rotates predictably with the split, so no static partition has a working operating point on this workload. LMetric improves colo dramatically but does not rescue PD-disagg, confirming that the bottleneck is structural (D-pool admission capacity + multi-turn KV accumulation), not routing.
Setup
- Trace: w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions, agentic multi-turn, contexts up to ~112 k tokens; "first600s" variant = same heavy sessions compressed into 600 s → 807 reqs at 3.2× higher arrival rate).
- 8 instances on 8 GPUs. --mode baseline for colo (plain vLLM); --mode pdsep --pd-ratio P:D for the three PD splits, all with Mooncake KV transfer.
- Cache-aware proxy with LMetric scoring (P_tokens × num_requests) + session affinity for multi-turn (the colleague's canonical baseline).
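
The routing rule in the last bullet can be illustrated with a minimal sketch. It is not the proxy's actual code: it assumes P_tokens means the prefill tokens currently queued on an instance, that the lowest score wins, and that a session sticks to its first instance. The names `Instance`, `Router`, and `route` are hypothetical, and the sketch omits the bookkeeping a real proxy needs when a request finishes.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int = 0  # assumed meaning of P_tokens
    num_requests: int = 0            # requests currently in flight


@dataclass
class Router:
    instances: list[Instance]
    # session_id -> instance name (session affinity for multi-turn traffic)
    affinity: dict[str, str] = field(default_factory=dict)

    def score(self, inst: Instance) -> int:
        # LMetric-style load score as described in the text: P_tokens x num_requests.
        return inst.pending_prefill_tokens * inst.num_requests

    def route(self, session_id: str, prompt_tokens: int) -> Instance:
        name = self.affinity.get(session_id)
        if name is None:
            # New session: lowest score wins (assumption); ties go to the
            # instance with fewer requests, then by name for determinism.
            chosen = min(
                self.instances,
                key=lambda i: (self.score(i), i.num_requests, i.name),
            )
            self.affinity[session_id] = chosen.name
        else:
            # Known session: stay on the same instance to reuse its prefix cache.
            chosen = next(i for i in self.instances if i.name == name)
        chosen.pending_prefill_tokens += prompt_tokens
        chosen.num_requests += 1
        return chosen


router = Router([Instance("gpu0"), Instance("gpu1")])
first = router.route("session-a", 1000)
second = router.route("session-b", 500)
followup = router.route("session-a", 200)
print(first.name, second.name, followup.name)
```

In this toy run the follow-up turn of session-a lands on the same instance as its first turn, which is the behaviour that lets a colo instance reuse the session's prefix cache.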
Results
first600s (1.35 req/s, high-load stress)
| arm | success | E2E mean / p50 / p90 / p99 | TTFT p90 | TPOT p99 | TPS | wall |
|---|---|---|---|---|---|---|
| colo (8C) | 807/807 = 100 % | 11.1 / 3.27 / 28.6 / 95.9 s | 14.5 s | 388 ms | 226 | 17.0 min |
| pd6 (6:2) | 474/807 = 58.7 % | 83.2 / 6.75 / 382 / 542 s | 380 s | 19 ms | 40 | 55 min |
| pd4 (4:4) | 348/807 = 43.1 % | 203 / 215 / 477 / 575 s | 475 s | 25 ms | 15 | 114 min |
| pd2 (2:6) | 180/807 = 22.3 % | 380 / 536 / 579 / 602 s | 577 s | 18 ms | 34 | 321 min* |
Full trace (0.42 req/s, original §3 anchor load)
| arm | success | E2E mean / p50 / p90 / p99 | TTFT p90 | TPOT p99 | TPS | wall |
|---|---|---|---|---|---|---|
| colo (8C) | 1214/1214 = 100 % | 10.9 / 3.13 / 29.6 / 93.7 s | 16.9 s | 254 ms | 125 | 49.9 min |
| pd6 (6:2) | 793/1214 = 65.3 % | 61.9 / 3.70 / 307 / 477 s | 300 s | 18 ms | 46 | 94 min |
| pd4 (4:4) | 533/1214 = 43.9 % | 131 / 8.22 / 468 / 531 s | 467 s | 21 ms | 13 | 231 min |
| pd2 (2:6) | 169/1214 = 13.9 % | 195 / 6.82 / 552 / 593 s | 549 s | 13 ms | 1 | 563 min |
* The pd2 wall-clock is dominated by per-request timeouts (request_timeout=600 s). Sessions drain concurrently, but within a session each turn must wait for the previous turn to finish or time out (multi-turn session causality).
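
As a cross-check, the completion percentages can be recomputed from the success counts in the two tables. The snippet below is a small sketch that uses only numbers copied from those tables; the variable and function names are illustrative.

```python
# Request totals and success counts copied from the two results tables.
TOTALS = {"first600s": 807, "full": 1214}
SUCCESS = {
    "colo":      {"first600s": 807, "full": 1214},
    "pd6 (6:2)": {"first600s": 474, "full": 793},
    "pd4 (4:4)": {"first600s": 348, "full": 533},
    "pd2 (2:6)": {"first600s": 180, "full": 169},
}


def completion_pct(ok: int, total: int) -> float:
    """Share of requests that completed, as a percentage with one decimal."""
    return round(100.0 * ok / total, 1)


matrix = {
    arm: {trace: completion_pct(ok, TOTALS[trace]) for trace, ok in row.items()}
    for arm, row in SUCCESS.items()
}

# Requests that did not complete are the complement of the completion share.
failed = {
    arm: {trace: round(100.0 - pct, 1) for trace, pct in row.items()}
    for arm, row in matrix.items()
}

for arm, row in matrix.items():
    print(f"{arm:<10} {row['first600s']:>6.1f}% {row['full']:>6.1f}%")
```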
Five clean findings
- LMetric+colo is the right baseline. Full-trace colo wall 49.9 min ≈ the original §3 RR's 49.9 min, but E2E p50 3.13 s vs §3's 10.8 s (3.5×) and TTFT p50 1.02 s vs §3's 7.0 s (7×). Same throughput envelope, far better latency, because cache-aware routing concentrates each session's turns onto one instance for prefix-cache reuse. The original §3 RR was an unfairly weak colo baseline.
- Every static PD-disagg ratio fails on the agentic workload. Completion drops to 14–65 % on both traces. The drop is not a high-load artifact (it holds at the original §3 arrival rate of 0.42 req/s); it is structural.
- Failure mode rotates predictably with the P:D split:
  - pd2 (2 producers) → prefill-bound → 78–86 % TTFT timeouts.
  - pd6 (2 decode) → decode-admission-bound → 35–41 % TTFT timeouts.
  - pd4 (4P+4D) → both bottlenecks hit → 57 % TTFT timeouts.
  - No static ratio works. Colo's elastic 8-GPU pool absorbs whichever phase is hot at the moment.
- Decode isolation works, but doesn't matter under failure. TPOT p99 on every PD arm is 13–25 ms, an order of magnitude better than colo's 254–388 ms, but the win applies only to the 14–65 % of requests that get admitted. The other 35–86 % time out before ever seeing a first token, so the TPOT win is invisible to the end user.
- The §3 RR "100 % PD completion" was a measurement artifact. Original §3 (contaminated stack, RR routing) reported 100 % completion for pd6/pd4. LMetric on the clean stack shows 44–65 % on the full trace. Most plausible mechanism: e13391e's eviction of producer KV on every transfer acted as a relief valve, reducing producer-pool pressure and letting more requests squeeze under the 600 s timeout, at the (uncosted) price of cross-turn re-prefill. With the eviction off, producers retain prefixes correctly, so the cache works on PD too, but the cache itself contends for producer pool capacity, and the decode-pool admission ceiling tips earlier. PD-disagg is worse on agentic than §3 advertised, not better.
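
The ratios quoted in the findings follow directly from the reported numbers. The short sketch below redoes that arithmetic, using only figures stated above; the dictionary keys are illustrative.

```python
# Full-trace colo latency in seconds: LMetric (this run) vs the original §3 RR baseline.
colo_lmetric = {"e2e_p50": 3.13, "ttft_p50": 1.02}
colo_rr = {"e2e_p50": 10.8, "ttft_p50": 7.0}

# Speedup of LMetric colo over RR colo, per metric.
speedup = {metric: colo_rr[metric] / colo_lmetric[metric] for metric in colo_lmetric}

# TPOT p99 ranges in milliseconds across both traces.
tpot_colo_ms = (254, 388)
tpot_pd_ms = (13, 25)

# Most conservative comparison: best colo value against worst PD value.
min_tpot_gap = min(tpot_colo_ms) / max(tpot_pd_ms)
# Most favourable comparison: worst colo value against best PD value.
max_tpot_gap = max(tpot_colo_ms) / min(tpot_pd_ms)

print(f"E2E p50 speedup:  {speedup['e2e_p50']:.2f}x")
print(f"TTFT p50 speedup: {speedup['ttft_p50']:.2f}x")
print(f"TPOT p99 gap:     {min_tpot_gap:.1f}x to {max_tpot_gap:.1f}x")
```

Even the most conservative TPOT comparison stays above 10×, which is what "an order of magnitude" refers to; the gap applies only to requests that completed.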
Reproduce
# On dash1, from the main repo /home/admin/cpfs/wjh/agentic-kv:
for TR in w600_r0.0015_st30.jsonl w600_r0.0015_st30_first600s.jsonl; do
    TRACE=traces/$TR bash scripts/bench.sh --tag fig4l_lmetric_colo_${TR%.*} \
        --mode baseline --policy lmetric
    for r in 6:2 4:4 2:6; do
        TRACE=traces/$TR bash scripts/bench.sh --tag fig4l_lmetric_${r/:/p}_${TR%.*} \
            --mode pdsep --pd-ratio $r --policy lmetric
    done
done
python microbench/fresh_setup/plot_fig4l_lmetric.py
scripts/bench.sh cleans GPUs before each arm and writes metrics.jsonl and
metrics.summary.json per tag. Aggregation script: see the inline JSON dump used
to build analysis/v2/fig4l_lmetric.json.
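
For readers without the inline dump, one way to collect the per-tag summaries into a single file might look like the sketch below. It is an assumption, not the script that produced analysis/v2/fig4l_lmetric.json: the layout `<results_root>/<tag>/metrics.summary.json` is hypothetical, as are the function name and its parameters, and the real file may use a different schema.

```python
import json
from pathlib import Path


def aggregate_summaries(results_root: str, out_path: str,
                        prefix: str = "fig4l_lmetric_") -> dict:
    """Merge every <results_root>/<tag>/metrics.summary.json into one JSON file.

    The directory layout is an assumption; adjust the glob to wherever
    bench.sh actually writes its per-tag output.
    """
    combined = {}
    pattern = f"{prefix}*/metrics.summary.json"
    for summary in sorted(Path(results_root).glob(pattern)):
        tag = summary.parent.name  # directory name doubles as the run tag
        with summary.open() as fh:
            combined[tag] = json.load(fh)

    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(combined, indent=2, sort_keys=True))
    return combined
```

Calling `aggregate_summaries("results", "analysis/v2/fig4l_lmetric.json")` would then produce one entry per tag, with the eight arms distinguished by the tag names used in the loop above.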