# PD-disagg vs colocation: controlled reuse & concurrency axes (v2)

Self-contained results for the **controlled-variable** redo of the MB5 PD-vs-colo
ablation. It supersedes the confounded first cut, which held input fixed and sliced the
prefix, so "more reuse" was entangled with "less prefill". All arms route through
the proxy at fair **APC parity** (session-routed producers reach the same prefix-cache
hit rate as colo), so PD loses on *structure*, not because of a broken cache.

- **Config arms:** `colo` = 8×kv_both (8C-proxy, session-affinity); PD = `6P+2D / 4P+4D / 2P+6D`.
- **Driver:** closed-loop N (`REPLAY_MAX_INFLIGHT`) + fixed think-time; `gen_synthetic_trace.py --mode regular`.
- **PD-arm wall-cap:** collapsed PD arms drain pathologically slowly, so PD arms run with a
  wall-deadline (`REPLAY_MAX_DURATION`; un-run turns are counted as failures, which keeps
  completion% honest); **colo is uncapped** so the reference is always fully measured.
- **Hardware:** run on **dash2** (8×H20). dash0's RDMA NICs were faulted for Mooncake during
  this work (the transfer engine could not init; the fix needs an admin reset, and no sudo
  was available); dash2's NICs are healthy. cpfs/venv/data are shared across the boxes.
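The closed-loop driver and the wall-cap accounting can be pictured with a minimal sketch. This is not the repository's replay driver: `closed_loop`, `run_session`, and the `send` callback are hypothetical names, and stopping at the deadline between turns (rather than cancelling in-flight requests) is an assumption. It only illustrates why N sessions keep at most N requests in flight, and how turns that never ran still count against completion%.

```python
import asyncio
import time


async def run_session(turns, think_s, deadline, send, outcomes):
    """One closed-loop session: send a turn, await it, think, repeat."""
    for turn in range(turns):
        # Assumed cap behaviour: stop issuing new turns once the wall deadline passes.
        if deadline is not None and time.monotonic() >= deadline:
            return
        outcomes.append(await send(turn))  # True on success, False on failure
        await asyncio.sleep(think_s)       # fixed think-time between turns


async def closed_loop(n_sessions, turns, think_s, send, max_duration_s=None):
    """Run N concurrent sessions; return (completed, planned) turn counts."""
    deadline = None if max_duration_s is None else time.monotonic() + max_duration_s
    outcomes = []
    await asyncio.gather(
        *(run_session(turns, think_s, deadline, send, outcomes) for _ in range(n_sessions))
    )
    planned = n_sessions * turns  # un-run turns stay in the denominator
    completed = sum(outcomes)     # failed turns contribute 0
    return completed, planned


async def fake_send(turn):
    """Stand-in for a real request (hypothetical): succeeds after 50 ms."""
    await asyncio.sleep(0.05)
    return True


if __name__ == "__main__":
    done, planned = asyncio.run(closed_loop(4, 3, 0.0, fake_send, max_duration_s=0.01))
    print(f"completion: {done}/{planned} = {100 * done / planned:.0f}%")
```

With a cap shorter than a single request, each session finishes only its first turn, so the capped run reports a completion% well below 100 instead of silently shrinking the denominator.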

---

## 1. Reuse / APC axis: fixed real prefill, vary cached prefix

N=8. Hold the **real new-prefill work per turn constant** (`--delta-len`) and grow the cached
prefix, so that reuse = prefix/(prefix+delta). Three shapes isolate output vs delta:

| shape | delta (real prefill/turn) | output | role |
|---|---|---|---|
| **A** | 2048 | 256 | original |
| **C** | 2048 | 128 | A vs C = pure **output** 256→128 |
| **B** | 1024 | 128 | C vs B = pure **delta** 2048→1024 |
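Since reuse = prefix/(prefix+delta), fixing delta and a target reuse pins the cached prefix length. A minimal sketch of that arithmetic follows; `reuse_ratio` and `prefix_for_reuse` are hypothetical helper names, and rounding to the nearest token is an assumption, not necessarily what `gen_synthetic_trace.py` does.

```python
def reuse_ratio(prefix_len: int, delta_len: int) -> float:
    """reuse = prefix / (prefix + delta), as defined in this section."""
    return prefix_len / (prefix_len + delta_len)


def prefix_for_reuse(delta_len: int, reuse: float) -> int:
    """Invert the definition: prefix = delta * reuse / (1 - reuse)."""
    if not 0.0 <= reuse < 1.0:
        raise ValueError("reuse must be in [0, 1)")
    return round(delta_len * reuse / (1.0 - reuse))  # nearest token (assumed)


if __name__ == "__main__":
    # Shape A/C (delta=2048) at a few of the swept reuse levels.
    for pct in (20, 50, 80, 95):
        print(pct, prefix_for_reuse(2048, pct / 100))
    # The section 2 configuration: prefix 32256 + delta 512.
    print(reuse_ratio(32256, 512))
```

The inverse grows quickly near 1: at fixed delta, going from 80% to 95% reuse multiplies the resident prefix by almost five, which is why the high-reuse corner is also the large-resident-context corner.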

**PD-best advantage** = colo E2E-p90 / best-PD E2E-p90 (>1 ⇒ PD wins):

| reuse% | A d2048/o256 | C d2048/o128 | B d1024/o128 |
|---|---|---|---|
| 20 | 1.34 | 1.41 | n/a |
| 50 | 1.36 | 1.37 | n/a |
| 67 | **1.47** | **1.49** | **1.27** |
| 80 | 1.31 | 1.23 | 1.25 |
| 90 | 1.15 | 1.01 | n/a |
| 95 | 0.87 | 0.89 | 0.89 |
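As a hedged illustration of how one cell of this table is formed: take the PD arm with the lowest E2E-p90 and divide colo's E2E-p90 by it. The function name is hypothetical and the latencies below are made-up placeholders, not values from the result JSONs (whose schema is not shown here).

```python
def pd_best_advantage(colo_p90_s, pd_p90_s_by_arm):
    """Return (best PD arm, colo p90 / best-PD p90); a ratio > 1 means PD wins."""
    best_arm = min(pd_p90_s_by_arm, key=pd_p90_s_by_arm.get)
    return best_arm, colo_p90_s / pd_p90_s_by_arm[best_arm]


if __name__ == "__main__":
    # Placeholder latencies in seconds, for illustration only.
    arm, ratio = pd_best_advantage(3.0, {"6P+2D": 2.0, "4P+4D": 2.5, "2P+6D": 4.0})
    print(arm, ratio)
```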

![reuse axis, 3-way](fig1_reuse_axis.png)

**Findings:**

1. **Output length is ~negligible.** A and C (same delta) track each other across the whole
   range: halving output barely moves PD's advantage.
2. **Delta (real prefill/turn) is the dominant shape factor.** B (delta=1024) sits clearly
   below A/C at 67% reuse (1.27 vs ~1.48); the gap has largely closed by 80% (1.25 vs
   1.31/1.23). More real prefill per turn → bigger PD win, because PD-disagg's benefit is
   isolating prefill from decode, and there is more prefill to isolate.
3. **Crossover to colo at reuse ~90–95% is robust** across all three shapes: PD always loses
   the high-reuse / large-resident-context corner (it must KV-transfer the whole resident
   context every turn for a few hundred new tokens; colo keeps it local).
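The crossover claim can be checked directly against the advantage table by finding where each shape's ratio falls through 1.0. The sketch below uses the table's values; linear interpolation between measured reuse levels is an assumption, and `crossover_reuse` is a hypothetical helper. B has no 90% point, so its bracket is the wider 80–95% interval.

```python
# PD-best advantage by reuse%, copied from the table above (missing cells omitted).
ADVANTAGE = {
    "A d2048/o256": {20: 1.34, 50: 1.36, 67: 1.47, 80: 1.31, 90: 1.15, 95: 0.87},
    "C d2048/o128": {20: 1.41, 50: 1.37, 67: 1.49, 80: 1.23, 90: 1.01, 95: 0.89},
    "B d1024/o128": {67: 1.27, 80: 1.25, 95: 0.89},
}


def crossover_reuse(points):
    """Reuse% where the advantage drops through 1.0 (linear interpolation)."""
    levels = sorted(points)
    for lo, hi in zip(levels, levels[1:]):
        a_lo, a_hi = points[lo], points[hi]
        if a_lo >= 1.0 > a_hi:
            return lo + (hi - lo) * (a_lo - 1.0) / (a_lo - a_hi)
    return None  # no crossing inside the measured range


if __name__ == "__main__":
    for shape, points in ADVANTAGE.items():
        print(f"{shape}: crossover near {crossover_reuse(points):.1f}% reuse")
```

Under that assumption all three shapes cross between 90% and 95% reuse (roughly 92.7% for A and 90.4% for C and B).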

*Caveat:* the clean, uncapped, 100%-completion comparison region is reuse **20–80%** (it carries
findings 1–2). At reuse 90/95% the PD arms collapse; C's points are capped-completion while
A/B are full-drain, so they are comparable in direction, not in exact PD completion%.

Data: `fig1_reuse_fixed.json` (A), `fig1_reuse_d2048_o128.json` (C), `fig1_reuse_d1024_o128.json` (B).

---

## 2. Concurrency axis: agentic corner, sweep N

in=32768 (prefix 32256 + delta 512, **reuse 0.984**), out=128; closed-loop N ∈ {8,16,32,48,64,96,128};
PD arms capped at 600s, colo uncapped.

| N | colo completion | colo E2E-mean | colo TPS | best PD-arm completion |
|---|---|---|---|---|
| 8 | **256/256** | 2.4s | 326 | 6P+2D 256/256 |
| 16 | **512/512** | 3.5s | 462 | 6P+2D 439/512 (86%) |
| 32 | **1024/1024** | 13.3s | 190 | all PD **<27%** |
| 48 | **1536/1536** | 24.9s | 168 | all PD <32% |
| 64 | **2048/2048** | 38.4s | 166 | all PD <31% |
| 96 | **3072/3072** | 60.0s | 171 | PD **2–7%** |
| 128 | **4096/4096** | 80.8s | 181 | 4P+4D 6%, 2P+6D <1% (6P+2D missing) |

![concurrency axis](fig3_concurrency_axis.png)

**Finding:** **colo completes 100% of requests at every concurrency level**: it degrades
*gracefully* (mean E2E latency rises 2.4s→81s, nothing dropped). **Every static PD split collapses, and
progressively earlier as N rises**: the best PD arm completes 100% only at N=8 and 86% at N=16;
by N≥32 every PD arm drops roughly 70–99% of requests while its prefix-cache hit-rate craters
to ~0%. colo's elastic pool absorbs the time-varying P/D demand; the static partition +
per-turn 32k KV-transfer cannot. (Latency percentiles count successes only, so they
*understate* PD latency; read them with the completion column.)
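Because percentiles over successes ignore dropped requests, a collapsed arm can look fast. One hedged way to make that visible is to count every failed or un-run request as infinitely slow. The sketch below is illustrative only: the function names are hypothetical, the latencies are made-up placeholders rather than values from `fig3_conc32k.json`, and nearest-rank is just one percentile convention.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def p90_success_only(latencies_s):
    """p90 over successful requests only (the convention noted above)."""
    return percentile_nearest_rank(latencies_s, 90)


def p90_with_failures(latencies_s, n_failed):
    """p90 when each failed or un-run request counts as infinite latency."""
    return percentile_nearest_rank(list(latencies_s) + [math.inf] * n_failed, 90)


if __name__ == "__main__":
    ok_latencies = [2.0, 3.0, 4.0]  # placeholder: 3 of 10 planned requests succeeded
    print(p90_success_only(ok_latencies))      # finite, looks healthy
    print(p90_with_failures(ok_latencies, 7))  # inf once >10% of requests are lost
```

Any arm that loses more than 10% of its requests gets an infinite p90 under this convention, which is the information the completion column carries in the table.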

Data: `fig3_conc32k.json`.

*Caveat:* the N=128 6P+2D point is missing (one transient vLLM/Mooncake startup flake at the
end of the sweep); this does not change the picture (all PD arms have collapsed well before
N=128). The SLO auto-stop in the driver is a no-op (a stdout-capture bug), so the full grid
ran: more points, not fewer.

---

## 3. Reproduce

```bash
# on a box with healthy Mooncake/RDMA NICs (dash2), cpfs mounted:
R=/home/admin/cpfs/wjh/agentic-kv-fresh
# reuse axis (three shapes): DELTA/OL pick the shape; tag carries _o${OL}
ssh dash2 "cd $R && DELTA=2048 OL=256 bash microbench/fresh_setup/run_reuse_fixed.sh"  # shape A
ssh dash2 "cd $R && DELTA=2048 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"  # shape C
ssh dash2 "cd $R && DELTA=1024 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"  # shape B
# concurrency axis (capped):
ssh dash2 "cd $R && NLIST='8 16 32 48 64 96 128' CONC_PD_MAXDUR=600 bash microbench/fresh_setup/run_conc.sh"
# render (reads the *.json in this dir):
python microbench/fresh_setup/plot_pd_crossover.py
```