# PD-colo vs PD-disagg on the real agentic trace — LMetric (cache-aware) clean-stack anchor

**Figure:** [`figs/v2/fig4_lmetric_pd_vs_colo.png`](../../figs/v2/fig4_lmetric_pd_vs_colo.png)
**Data:**   [`analysis/v2/fig4l_lmetric.json`](fig4l_lmetric.json)
**Date:**   2026-05-31 · Hardware: dash1, 8×H20 · Model: Qwen3-Coder-30B-A3B-Instruct
· vLLM 0.18.1 (V1, chunked-prefill on, `e13391e` eviction gated **off**)
· Mooncake 0.3.11 · Routing: cache-aware proxy with **`--policy lmetric`**
· Replayer per-request timeout 600 s.

## TL;DR

On the production agentic trace with the *right* routing baseline (LMetric, cache-aware),
**PD-colo (8× kv_both) keeps 100 % completion on both traces** and matches the daily-bench
expectation (~17 min for the high-load first600s, ~50 min for the full trace, with E2E p50
~3 s and TTFT p50 ~1 s — **3.5–7× better than the original §3 round-robin baseline at the
same wall-clock**). Every static **PD-disagg ratio fails** (14–65 % completion), and the
failure mode rotates predictably with the split — **no static partition has a working
operating point on this workload**. LMetric improves colo dramatically; it does *not*
rescue PD-disagg, confirming the bottleneck is structural (D-pool admission capacity +
multi-turn KV accumulation), not routing. A follow-up linear-policy run with PD-disagg
**wall-capped at 2× the colo wall** (see the follow-up section below) hits the **same**
success-rate ceiling (within 0.5 percentage points on every arm), confirming the ceiling
is structural, not policy-driven.

## Setup

- Trace: `w600_r0.0015_st30.jsonl` (1214 reqs, 274 sessions, agentic multi-turn,
  contexts up to ~112 k tokens; "first600s" variant = same heavy sessions compressed
  into 600 s → 807 reqs at 3.2× higher arrival rate).
- 8 instances on 8 GPUs.
- `--mode baseline` for colo (plain vLLM); `--mode pdsep --pd-ratio P:D` for the three PD
  splits, all with Mooncake KV transfer.
- Cache-aware proxy with LMetric scoring (`P_tokens × num_requests`) + session affinity
  for multi-turn (the colleague's canonical baseline).
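
The routing rule in the last bullet can be sketched as follows. This is a minimal illustration, not the proxy's actual code: it assumes `P_tokens` means the prefill tokens currently pending on an instance, that a lower score is better, and that ties go to the first instance. The `Instance` and `Router` names are hypothetical, and the prefix-cache matching of the real proxy is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int = 0  # assumed meaning of P_tokens
    num_requests: int = 0            # requests currently in flight


def lmetric_score(inst: Instance) -> int:
    """Score from the text: P_tokens x num_requests (assumed: lower is better)."""
    return inst.pending_prefill_tokens * inst.num_requests


@dataclass
class Router:
    instances: list[Instance]
    # session_id -> instance name; this is the session-affinity table
    affinity: dict[str, str] = field(default_factory=dict)

    def route(self, session_id: str, prompt_tokens: int) -> Instance:
        by_name = {inst.name: inst for inst in self.instances}
        if session_id in self.affinity:
            # Later turns of a session stay on the instance holding its prefix.
            chosen = by_name[self.affinity[session_id]]
        else:
            # First turn: pick the instance with the lowest score.
            chosen = min(self.instances, key=lmetric_score)
            self.affinity[session_id] = chosen.name
        chosen.pending_prefill_tokens += prompt_tokens
        chosen.num_requests += 1
        return chosen

    def finish(self, inst: Instance, prompt_tokens: int) -> None:
        """Release the load accounted to a request once it completes."""
        inst.pending_prefill_tokens = max(0, inst.pending_prefill_tokens - prompt_tokens)
        inst.num_requests = max(0, inst.num_requests - 1)
```

With two idle instances, the first session lands on the first instance, a second session goes to the still-idle one, and a later turn of the first session returns to its original instance regardless of load.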

## Results

### first600s (1.35 req/s, high-load stress)

| arm | success | E2E mean / p50 / p90 / p99 | TTFT p90 | TPOT p99 | TPS | wall |
|---|---|---|---|---|---|---|
| **colo (8C)** | **807/807 = 100 %** | 11.1 / 3.27 / 28.6 / 95.9 s | 14.5 s | 388 ms | 226 | 17.0 min |
| pd6 (6:2) | 474/807 = **58.7 %** | 83.2 / 6.75 / 382 / 542 s | 380 s | 19 ms | 40 | 55 min |
| pd4 (4:4) | 348/807 = **43.1 %** | 203 / 215 / 477 / 575 s | 475 s | 25 ms | 15 | 114 min |
| pd2 (2:6) | 180/807 = **22.3 %** | 380 / 536 / 579 / 602 s | 577 s | 18 ms | 34 | 321 min* |

### Full trace (0.42 req/s, original §3 anchor load)

| arm | success | E2E mean / p50 / p90 / p99 | TTFT p90 | TPOT p99 | TPS | wall |
|---|---|---|---|---|---|---|
| **colo (8C)** | **1214/1214 = 100 %** | 10.9 / 3.13 / 29.6 / 93.7 s | 16.9 s | 254 ms | 125 | 49.9 min |
| pd6 (6:2) | 793/1214 = **65.3 %** | 61.9 / 3.70 / 307 / 477 s | 300 s | 18 ms | 46 | 94 min |
| pd4 (4:4) | 533/1214 = **43.9 %** | 131 / 8.22 / 468 / 531 s | 467 s | 21 ms | 13 | 231 min |
| pd2 (2:6) | 169/1214 = **13.9 %** | 195 / 6.82 / 552 / 593 s | 549 s | 13 ms | 1 | 563 min |

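\* The pd2 wall-clock in the first600s table is dominated by per-request timeouts
(`request_timeout=600 s`): multi-turn session causality holds each turn back until the
previous one finishes or times out, so timeouts drain serially within a session and
concurrently only across sessions.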

## Five clean findings

1. **LMetric+colo is the right baseline.** Full-trace colo wall **49.9 min ≈ the original
   §3 RR's 49.9 min**, but E2E p50 **3.13 s vs §3's 10.8 s (3.5×)** and TTFT p50
   **1.02 s vs §3's 7.0 s (7×)**. Same throughput envelope, far better latency — by virtue
   of cache-aware routing concentrating each session's turns onto one instance for
   prefix-cache reuse. The original §3 RR was an *unfairly weak* colo baseline.

2. **Every static PD-disagg ratio fails on the agentic workload.** Completion drops to
   14–65 %, on *both* traces. The drop is not a high-load artifact (it holds at the
   original §3 arrival rate of 0.42 req/s); it is structural.

3. **Failure mode rotates predictably with the P:D split:**
   - **pd2 (2 producers)** → prefill-bound → 78–86 % TTFT timeouts.
   - **pd6 (2 decode)** → decode-admission-bound → 35–41 % TTFT timeouts.
   - **pd4 (4P+4D)** → both bottlenecks hit → 57 % TTFT timeouts.
   - **No static ratio works.** Colo's elastic 8-GPU pool absorbs whichever phase is
     hot at the moment.

4. **Decode isolation works, but doesn't matter under failure.** TPOT p99 on every PD
   arm is **13–25 ms** — an order of magnitude better than colo's 254–388 ms — but the
   win applies only to the 14–65 % of requests that get admitted. The other 35–86 %
   time out before ever seeing a first token, so the TPOT win is invisible to the end user.

5. **The §3 RR "100 % PD completion" was a measurement artifact.** Original §3
   (contaminated stack, RR routing) reported 100 % completion for pd6/pd4. LMetric on
   the clean stack shows 44–65 %. Most plausible mechanism: `e13391e`'s eviction of
   producer KV on every transfer acted as a **relief valve**, reducing producer-pool
   pressure and letting more requests squeeze under the 600 s timeout — at the (uncosted)
   price of cross-turn re-prefill. With the eviction off, producers retain prefix
   correctly → cache works on PD too → but the cache itself contends for producer
   pool capacity, and the decode-pool admission ceiling tips earlier. **PD-disagg is
   worse on agentic than §3 advertised, not better.**
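
The derived figures quoted in findings 1 and 3 can be re-checked from the result tables with a few lines of arithmetic. This sketch assumes every non-completed request counts as a timeout, which is how finding 3 labels them; the success counts and latencies are copied from the tables and text above.

```python
def failure_pct(ok: int, total: int) -> float:
    """Share of requests that did not complete, in percent (one decimal)."""
    return round(100.0 * (total - ok) / total, 1)


# (successes, total) per arm and trace, copied from the Results tables.
arms = {
    "pd6": {"first600s": (474, 807), "full": (793, 1214)},
    "pd4": {"first600s": (348, 807), "full": (533, 1214)},
    "pd2": {"first600s": (180, 807), "full": (169, 1214)},
}

failures = {
    arm: {trace: failure_pct(ok, total) for trace, (ok, total) in traces.items()}
    for arm, traces in arms.items()
}

# Finding 1: colo latency versus the original §3 round-robin baseline.
e2e_p50_speedup = 10.8 / 3.13   # §3 RR E2E p50 / LMetric colo E2E p50
ttft_p50_speedup = 7.0 / 1.02   # §3 RR TTFT p50 / LMetric colo TTFT p50

for arm, by_trace in failures.items():
    print(arm, by_trace)
print(f"E2E p50 speedup {e2e_p50_speedup:.2f}x, TTFT p50 speedup {ttft_p50_speedup:.2f}x")
```

The failure shares come out at roughly 35–41 % for pd6, 56–57 % for pd4 and 78–86 % for pd2, and the speedups at about 3.5× and 6.9×, in line with the rounded values in the findings.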

## Linear-policy + wall-cap follow-up (2026-06-01) — the success ceiling is policy-invariant

To check whether the LMetric routing was secretly handicapping PD-disagg, we re-ran
first600s with the **default `--policy linear`** (cache-aware load score + sticky
session affinity; the original baseline of the cache_aware_proxy stack) and
**wall-capped each PD-disagg arm at 2 × the colo wall** (kill `bench.sh` and clean up
the GPUs once the cap is exceeded, then record `records_at_cap`).

| arm | linear success | linear wall | linear @-cap? | lmetric success | lmetric wall |
|---|---|---|---|---|---|
| **colo** | 807/807 = **100 %** | 964 s | — | 807/807 = **100 %** | 1021 s |
| **pd6 (6:2)** | **472/807 = 58 %** | 2280 s | ⊗ cap (706 dispatched) | 474/807 = 59 % | 3325 s |
| **pd4 (4:4)** | **349/807 = 43 %** | 2281 s | ⊗ cap (577 dispatched) | 348/807 = 43 % | 6850 s |
| **pd2 (2:6)** | **176/807 = 22 %** | 2280 s | ⊗ cap (521 dispatched) | 180/807 = 22 % | 19275 s |

→ Figure: [`figs/v2/fig4_linear_vs_lmetric.png`](../../figs/v2/fig4_linear_vs_lmetric.png) ·
data: [`fig4r_linear.json`](fig4r_linear.json)

**Three clean conclusions from the wall-cap experiment:**

1. **The success-rate ceiling is structural, not a routing artifact.** Linear and
   LMetric score load differently (both sit behind the same cache-aware proxy with
   session affinity, per the descriptions above), yet they converge on **the same
   success rates to within 0.5 percentage points** (58 / 43 / 22 %) for every
   PD-disagg ratio. Routing has *no measurable* effect on the completion ceiling.
   The bottleneck is the static P:D split itself.

2. **The uncapped LMetric runs' longer wall-clock was *spent on requests that would
   never succeed*.** With the cap enforced, linear reaches the same ceiling in 2280 s
   that LMetric took 3325–19275 s to reach; the extra wall-clock just slowly times
   out the unreachable requests at 600 s each.

3. **The wall-cap is the right way to bench PD-disagg.** Reporting "completion %"
   without a wall budget is misleading: the bench does eventually finish, but only
   after spending hours timing out requests that count as failures anyway. The honest
   metric is **success-in-2×-colo-wall**, which gives the same answer for both
   routings and matches what an end user would observe under a real SLO.

This **strengthens** the §5 D-pool capacity-ceiling thesis: even with
session-affinity routing sending every request to a warm prefix cache (which
*should* maximise PD's throughput), the static D-pool can't admit more than
~22 / 43 / 58 % (pd2 / pd4 / pd6) of the agentic trace within 2× the colo budget.
Colo wins not because routing is smarter, but because its **elastic pool** absorbs
whichever phase is hot, so there is no fixed per-phase ceiling to hit.
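
The success-in-2×-colo-wall metric can be written down in a few lines. This is a sketch under assumptions: the record fields `success` and `finish_s` (seconds since the start of the bench) are hypothetical names, not the replayer's actual schema. The denominator is the full trace size rather than the number of records, because requests never dispatched before the cap also count as misses (for example 472/807 for linear pd6, with only 706 dispatched).

```python
def success_within_cap(records, total_requests, colo_wall_s, cap_factor=2.0):
    """Fraction of the whole trace that completed successfully before the wall cap.

    records        -- iterable of dicts with hypothetical keys `success` and `finish_s`
    total_requests -- size of the trace, including requests never dispatched
    colo_wall_s    -- wall-clock of the colo arm on the same trace, in seconds
    """
    cap_s = cap_factor * colo_wall_s
    completed = sum(1 for r in records if r["success"] and r["finish_s"] <= cap_s)
    return completed / total_requests


# Hypothetical records: one success inside the cap, one success after it, one timeout.
demo = [
    {"success": True, "finish_s": 10.0},
    {"success": True, "finish_s": 2500.0},
    {"success": False, "finish_s": 600.0},
]
print(success_within_cap(demo, total_requests=4, colo_wall_s=964.0))
```

In the demo only the first record counts: the second finishes after the 1928 s cap, the third failed, and the fourth request of the trace was never dispatched.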

---

## Reproduce

```bash
# On dash1, from the main repo /home/admin/cpfs/wjh/agentic-kv:
for TR in w600_r0.0015_st30.jsonl w600_r0.0015_st30_first600s.jsonl; do
  # Colo arm: plain vLLM instances behind the LMetric proxy.
  TRACE="traces/$TR" bash scripts/bench.sh --tag "fig4l_lmetric_colo_${TR%.*}" \
    --mode baseline --policy lmetric
  # PD-disagg arms, one per P:D ratio; ${r/:/p} turns "6:2" into "6p2" for the tag.
  for r in 6:2 4:4 2:6; do
    TRACE="traces/$TR" bash scripts/bench.sh --tag "fig4l_lmetric_${r/:/p}_${TR%.*}" \
      --mode pdsep --pd-ratio "$r" --policy lmetric
  done
done

# Plot the figures.
python microbench/fresh_setup/plot_fig4l_lmetric.py
python microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
```

For the linear + 2× wall-cap variant, run colo first to get `wall_clock_s`,
compute `CAP=2*wall`, then launch each PD-disagg arm in the background and
`SIGTERM` it (so bench.sh's cleanup trap fires) once `date +%s` minus the
arm's start time exceeds `CAP`. The capped runs lack `metrics.summary.json`
(replayer was killed before it could write); compute the summary directly from
`metrics.jsonl` (see the inline collector used to build
`analysis/v2/fig4r_linear.json`).
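
The wall-cap procedure described above could be driven from Python roughly as follows. This is a sketch, not the script that produced the results: `run_with_wall_cap` and its `grace_s` escalation to `SIGKILL` are assumptions added for illustration, and the command line is whatever `bench.sh` invocation the arm needs.

```python
import signal
import subprocess
import time


def run_with_wall_cap(cmd, cap_s, grace_s=60.0, env=None):
    """Run `cmd`; if it is still alive after `cap_s` seconds, send SIGTERM.

    SIGTERM (rather than SIGKILL) gives a shell `trap` handler the chance to
    run its cleanup. If the process ignores it for `grace_s` seconds, it is
    killed outright. `grace_s` is an assumption of this sketch.
    """
    start = time.monotonic()
    proc = subprocess.Popen(cmd, env=env)
    capped = False
    try:
        proc.wait(timeout=cap_s)
    except subprocess.TimeoutExpired:
        capped = True
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return {
        "returncode": proc.returncode,
        "wall_s": time.monotonic() - start,
        "capped": capped,
    }


# Hypothetical usage, with colo_wall_s read from the colo arm's summary:
#   cap_s = 2 * colo_wall_s
#   run_with_wall_cap(["bash", "scripts/bench.sh", "--mode", "pdsep",
#                      "--pd-ratio", "6:2", "--policy", "linear"], cap_s)
```

Using `time.monotonic()` instead of `date +%s` keeps the measured wall-clock immune to system clock adjustments during a multi-hour run.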

`bench.sh` cleans the GPUs before each arm and writes `metrics.jsonl` +
`metrics.summary.json` per tag. Aggregation script: see the inline JSON dump used
to build `analysis/v2/fig4l_lmetric.json`.
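
For runs that have only `metrics.jsonl`, a summary can be rebuilt along these lines. This is a sketch under assumptions: the per-record fields `success` and `e2e_s` are hypothetical names, the percentiles use the nearest-rank method, and latencies are taken over successful requests only, which may differ from how the original collector computed the table values.

```python
import json
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(jsonl_path, total_requests):
    """Rebuild a small summary dict from a per-request JSONL file."""
    ok_latencies = []
    with open(jsonl_path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # a killed replayer may leave a truncated last line
            if rec.get("success"):
                ok_latencies.append(float(rec["e2e_s"]))

    summary = {
        "success": len(ok_latencies),
        "total": total_requests,
        "success_pct": 100.0 * len(ok_latencies) / total_requests,
    }
    if ok_latencies:
        summary["e2e_mean_s"] = sum(ok_latencies) / len(ok_latencies)
        for q in (50, 90, 99):
            summary[f"e2e_p{q}_s"] = percentile(ok_latencies, q)
    return summary
```

Passing the trace size as `total_requests` (807 or 1214 here) keeps the success percentage comparable between capped and uncapped runs, since a capped run's file holds fewer records than the trace has requests.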