agentic-kvc/analysis/mb5_pd_ablation/README.md

# PD-disagg vs colocation — controlled reuse & concurrency axes (v2)

Self-contained results for the **controlled-variable** redo of the MB5 PD-vs-colo
ablation. Supersedes the confounded first cut (held input fixed and sliced the
prefix, so "more reuse" was entangled with "less prefill"). All arms route through
the proxy at fair **APC parity** (session-routed producers reach the same prefix-cache
hit rate as colo), so PD loses on *structure*, not on broken cache.

- **Config arms:** `colo` = 8×kv_both (8C-proxy, session-affinity); PD = `6P+2D / 4P+4D / 2P+6D`.
- **Driver:** closed-loop N (`REPLAY_MAX_INFLIGHT`) + fixed think-time; `gen_synthetic_trace.py --mode regular`.
- **PD-arm wall-cap:** collapsed PD arms drain pathologically slowly, so PD arms run with a
  wall-deadline (`REPLAY_MAX_DURATION`; un-run turns counted as failures → honest completion%);
  **colo is uncapped** so the reference is always fully measured.
- **Hardware:** run on **dash2** (8×H20). dash0's RDMA NICs were faulted for Mooncake during
  this work (could not init the transfer engine; needs an admin reset — no sudo); dash2's NICs
  are healthy. cpfs/venv/data are shared across the boxes.

---

## 1. Reuse / APC axis — fixed real prefill, vary cached prefix

N=8. Hold the **real new-prefill work per turn constant** (`--delta-len`) and grow the cached
prefix → reuse = prefix/(prefix+delta). Three shapes isolate output vs delta:

| | delta (real prefill/turn) | output | role |
|---|---|---|---|
| **A** | 2048 | 256 | original |
| **C** | 2048 | 128 | A vs C = pure **output** 256→128 |
| **B** | 1024 | 128 | C vs B = pure **delta** 2048→1024 |

**PD-best advantage** = colo E2E-p90 / best-PD E2E-p90 (>1 ⇒ PD wins):

| reuse% | A d2048/o256 | C d2048/o128 | B d1024/o128 |
|---|---|---|---|
| 20 | 1.34 | 1.41 | — |
| 50 | 1.36 | 1.37 | — |
| 67 | **1.47** | **1.49** | **1.27** |
| 80 | 1.31 | 1.23 | 1.25 |
| 90 | 1.15 | 1.01 | — |
| 95 | 0.87 | 0.89 | 0.89 |

![reuse 3-way](../../figs/mb5_pd_ablation/reuse_compare_ABC.png)

**Findings:**
1. **Output length is ~negligible.** A and C (same delta) track each other across the whole
   range — halving output barely moves PD's advantage.
2. **Delta (real prefill/turn) is the dominant shape factor.** B (delta=1024) sits clearly
   below A/C at mid reuse (67%: 1.27 vs ~1.48). More real prefill per turn → bigger PD win,
   because PD-disagg's benefit is isolating prefill from decode — more prefill to isolate.
3. **Crossover to colo at reuse ~90–95% is robust** across all three shapes: PD always loses
   the high-reuse / large-resident-context corner (it must KV-transfer the whole resident
   context every turn for a few hundred new tokens; colo keeps it local).

*Caveat:* the clean, uncapped, 100%-completion comparison region is reuse **20–80%** (carries
findings 1–2). At reuse 90/95% the PD arms collapse and C's points are capped-completion, while
A/B are full-drain — comparable in direction, not in exact PD completion%.

Data: `fig1_reuse_fixed.json` (A), `fig1_reuse_d2048_o128.json` (C), `fig1_reuse_d1024_o128.json` (B).

---

## 2. Concurrency axis — agentic corner, sweep N

in=32768 (prefix 32256 + delta 512, **reuse 0.984**), out=128; closed-loop N ∈ {8,16,32,48,64,96,128};
PD arms capped 600s, colo uncapped.

| N | **colo** completion · E2E-mean · TPS | best PD-arm completion |
|---|---|---|
| 8 | **256/256** · 2.4s · 326 | 6P+2D 256/256 |
| 16 | **512/512** · 3.5s · 462 | 6P+2D 439/512 (86%) |
| 32 | **1024/1024** · 13.3s · 190 | all PD **<27%** |
| 48 | **1536/1536** · 24.9s · 168 | all PD <32% |
| 64 | **2048/2048** · 38.4s · 166 | all PD <31% |
| 96 | **3072/3072** · 60.0s · 171 | PD **2–7%** |
| 128 | **4096/4096** · 80.8s · 181 | 4P+4D 6%, 2P+6D <1% |

![concurrency](../../figs/mb5_pd_ablation/fig3_concurrency_axis.png)

**Finding:** **colo completes 100% of requests at every concurrency level** — it degrades
*gracefully* (latency rises 2.4s→81s, nothing dropped). **Every static PD split collapses, and
progressively earlier as N rises**: PD is viable only at N≤8–16; by N≥32 it drops 70–99% of
requests while its prefix-cache hit-rate craters to ~0%. colo's elastic pool absorbs the
time-varying P/D demand; the static partition + per-turn 32k KV-transfer cannot. (Latency
percentiles count successes only, so they *understate* PD — read them with the completion column.)

Data: `fig3_conc32k.json`.

*Caveat:* N=128 6P+2D is missing (one transient vLLM/Mooncake startup flake at the end); does
not change the picture (all PD arms are already collapsed by N=128). The SLO auto-stop in the
driver is a no-op (a stdout-capture bug), so the full grid ran — more points, not fewer.

---

## 3. Reproduce

```bash
# on a box with healthy Mooncake/RDMA NICs (dash2), cpfs mounted:
R=/home/admin/cpfs/wjh/agentic-kv-fresh
# reuse axis (three shapes): DELTA/OL pick the shape; tag carries _o${OL}
ssh dash2 "cd $R && DELTA=2048 OL=256 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=2048 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=1024 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
# concurrency axis (capped):
ssh dash2 "cd $R && NLIST='8 16 32 48 64 96 128' CONC_PD_MAXDUR=600 bash microbench/fresh_setup/run_conc.sh"
# render (reads the *.json in this dir):
python microbench/fresh_setup/plot_pd_crossover.py
```