MB5 PD ablation v2 results: concurrency axis + reuse 3-way writeup

- fig3_conc32k.json + fig3_concurrency_axis.png: agentic-corner concurrency
  sweep (in=32768, reuse=0.984, out=128), N 8->128, PD capped 600s / colo
  uncapped. colo completes 100% at every N (graceful, E2E 2.4s->81s); every
  static PD split collapses, earlier as N rises (viable only N<=16; <27% by N=32).
- analysis/mb5_pd_ablation/README.md: self-contained v2 writeup. Reuse axis
  3-way (A=d2048/o256, C=d2048/o128, B=d1024/o128) decomposes shape: output
  ~negligible, delta (real prefill/turn) dominant; crossover to colo at reuse
  ~90-95% robust. Run on dash2 (dash0 NICs faulted for Mooncake).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 09:35:25 +08:00
parent 3f997fda14
commit 54e1f5266a
3 changed files with 108 additions and 0 deletions


@@ -0,0 +1,107 @@
# PD-disagg vs colocation — controlled reuse & concurrency axes (v2)
Self-contained results for the **controlled-variable** redo of the MB5 PD-vs-colo
ablation. Supersedes the confounded first cut, which held input fixed and sliced the
prefix, so "more reuse" was entangled with "less prefill". All arms route through
the proxy at fair **APC parity** (session-routed producers reach the same prefix-cache
hit rate as colo), so where PD loses, it loses on *structure*, not on a broken cache.
- **Config arms:** `colo` = 8×kv_both (8C-proxy, session-affinity); PD = `6P+2D / 4P+4D / 2P+6D`.
- **Driver:** closed-loop N (`REPLAY_MAX_INFLIGHT`) + fixed think-time; `gen_synthetic_trace.py --mode regular`.
- **PD-arm wall-cap:** collapsed PD arms drain pathologically slowly, so PD arms run with a
wall-deadline (`REPLAY_MAX_DURATION`; un-run turns counted as failures → honest completion%);
**colo is uncapped** so the reference is always fully measured.
- **Hardware:** run on **dash2** (8×H20). dash0's RDMA NICs were faulted for Mooncake during
this work (could not init the transfer engine; needs an admin reset — no sudo); dash2's NICs
are healthy. cpfs/venv/data are shared across the boxes.
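
The driver and wall-cap bullets above can be sketched as a small closed loop. This is a minimal illustration, not the actual replay driver: `closed_loop_replay`, `fake_request`, and `completion_pct` are hypothetical names, the request is a sleep stand-in, and only the idea (at most N turns in flight, fixed think-time, turns left un-run at the deadline count against completion) comes from the text.

```python
import asyncio
import time


async def fake_request(service_s: float) -> bool:
    """Hypothetical stand-in for one inference turn; succeeds after service_s."""
    await asyncio.sleep(service_s)
    return True


async def closed_loop_replay(total_turns, max_inflight, think_s, service_s,
                             max_duration_s=None):
    """Run `max_inflight` workers over a shared turn queue.

    Returns (completed, planned). Turns never started before the wall
    deadline stay in the queue and therefore count as failures.
    """
    pending = asyncio.Queue()
    for turn in range(total_turns):
        pending.put_nowait(turn)
    deadline = None if max_duration_s is None else time.monotonic() + max_duration_s
    completed = 0

    async def worker():
        nonlocal completed
        while not pending.empty():
            if deadline is not None and time.monotonic() >= deadline:
                return  # wall-cap reached: stop issuing new turns
            pending.get_nowait()
            if await fake_request(service_s):
                completed += 1
            await asyncio.sleep(think_s)  # fixed think-time between turns

    await asyncio.gather(*(worker() for _ in range(max_inflight)))
    return completed, total_turns


def completion_pct(completed: int, planned: int) -> float:
    """Completion% over all planned turns, not just the attempted ones."""
    return 100.0 * completed / planned


# Uncapped run (colo-style): every planned turn is attempted.
print(asyncio.run(closed_loop_replay(20, 4, 0.0, 0.001)))
```

Dividing by planned turns rather than attempted turns is what keeps a capped arm's completion% honest: an arm that stalls cannot look better by attempting less. In this sketch a turn already in flight when the deadline passes is still allowed to finish; whether the real driver does the same is not stated in the writeup.
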
---
## 1. Reuse / APC axis — fixed real prefill, vary cached prefix
N=8. Hold the **real new-prefill work per turn constant** (`--delta-len`) and grow the cached
prefix → reuse = prefix/(prefix+delta). Three shapes isolate output vs delta:
| | delta (real prefill/turn) | output | role |
|---|---|---|---|
| **A** | 2048 | 256 | original |
| **C** | 2048 | 128 | A vs C = pure **output** 256→128 |
| **B** | 1024 | 128 | C vs B = pure **delta** 2048→1024 |
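
The reuse definition above, reuse = prefix/(prefix+delta), can be inverted to get the cached-prefix length needed for a target reuse at a fixed delta. A minimal sketch follows; the helper names are hypothetical, and the rounding to whole tokens is an assumption that may differ from what `gen_synthetic_trace.py` does.

```python
def reuse_fraction(prefix_len: int, delta_len: int) -> float:
    """reuse = prefix / (prefix + delta), as defined in the text."""
    return prefix_len / (prefix_len + delta_len)


def prefix_for_reuse(target_reuse: float, delta_len: int) -> int:
    """Solve reuse = prefix / (prefix + delta) for prefix.

    prefix = delta * reuse / (1 - reuse), rounded to whole tokens
    (the rounding rule is an assumption of this sketch).
    """
    if not 0.0 <= target_reuse < 1.0:
        raise ValueError("target_reuse must be in [0, 1)")
    return round(delta_len * target_reuse / (1.0 - target_reuse))


# Shape A/C (delta=2048) at 50% and 80% reuse.
print(prefix_for_reuse(0.5, 2048), prefix_for_reuse(0.8, 2048))
# Agentic corner used for the concurrency sweep: prefix 32256 + delta 512.
print(reuse_fraction(32256, 512))
```

Because delta is held fixed, raising reuse only grows the resident prefix; the new prefill work per turn stays the same, which is the control the first cut lacked.
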
**PD-best advantage** = colo E2E-p90 / best-PD E2E-p90 (>1 ⇒ PD wins):
| reuse% | A d2048/o256 | C d2048/o128 | B d1024/o128 |
|---|---|---|---|
| 20 | 1.34 | 1.41 | — |
| 50 | 1.36 | 1.37 | — |
| 67 | **1.47** | **1.49** | **1.27** |
| 80 | 1.31 | 1.23 | 1.25 |
| 90 | 1.15 | 1.01 | — |
| 95 | 0.87 | 0.89 | 0.89 |
![reuse 3-way](../../figs/mb5_pd_ablation/reuse_compare_ABC.png)
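
The PD-best advantage metric in the table above can be computed as in this short sketch. The function name and the latency numbers are illustrative placeholders, not measured values and not the layout of the `fig1_*.json` files; only the ratio definition comes from the text.

```python
def pd_best_advantage(colo_p90_s: float, pd_p90_by_arm: dict) -> tuple:
    """Return (best PD arm, colo E2E-p90 / best-PD E2E-p90).

    A ratio above 1 means the best PD split beats colocation.
    """
    best_arm = min(pd_p90_by_arm, key=pd_p90_by_arm.get)
    return best_arm, colo_p90_s / pd_p90_by_arm[best_arm]


# Placeholder latencies in seconds, chosen only to exercise the formula.
example_pd = {"6P+2D": 2.0, "4P+4D": 2.5, "2P+6D": 4.0}
arm, advantage = pd_best_advantage(3.0, example_pd)
print(arm, advantage)
```

Taking the best PD arm per point gives PD the most favourable comparison, so a ratio below 1 means no static split tested beat colo at that reuse level.
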
**Findings:**
1. **Output length is ~negligible.** A and C (same delta) track each other across the whole
range — halving output barely moves PD's advantage.
2. **Delta (real prefill/turn) is the dominant shape factor.** B (delta=1024) sits clearly
below A/C at mid reuse (67%: 1.27 vs ~1.48). More real prefill per turn → bigger PD win,
because PD-disagg's benefit is isolating prefill from decode — more prefill to isolate.
3. **Crossover to colo at reuse ~90–95% is robust** across all three shapes: PD always loses
the high-reuse / large-resident-context corner (it must KV-transfer the whole resident
context every turn for a few hundred new tokens; colo keeps it local).
*Caveat:* the clean, uncapped, 100%-completion comparison region is reuse **20–80%** (carries
findings 1–2). At reuse 90/95% the PD arms collapse and C's points are capped-completion, while
A/B are full-drain: comparable in direction, not in exact PD completion%.
Data: `fig1_reuse_fixed.json` (A), `fig1_reuse_d2048_o128.json` (C), `fig1_reuse_d1024_o128.json` (B).
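
As a rough cross-check of finding 3, the crossover point can be estimated from the PD-best advantage table by linear interpolation between the last point above 1 and the first point below it. Linear interpolation is an assumption of this sketch, not part of the original analysis, and shape B has no 90% point, so its estimate spans the wider 80–95% gap.

```python
def crossover_reuse(points):
    """Estimate the reuse% at which the advantage crosses 1.0.

    `points` is a list of (reuse_pct, advantage) sorted by reuse_pct.
    Returns None if the advantage never drops below 1.0.
    """
    for (r0, a0), (r1, a1) in zip(points, points[1:]):
        if a0 >= 1.0 > a1:
            return r0 + (r1 - r0) * (a0 - 1.0) / (a0 - a1)
    return None


# Values copied from the PD-best advantage table (missing cells omitted).
SHAPES = {
    "A": [(20, 1.34), (50, 1.36), (67, 1.47), (80, 1.31), (90, 1.15), (95, 0.87)],
    "C": [(20, 1.41), (50, 1.37), (67, 1.49), (80, 1.23), (90, 1.01), (95, 0.89)],
    "B": [(67, 1.27), (80, 1.25), (95, 0.89)],
}

for shape, points in SHAPES.items():
    print(shape, round(crossover_reuse(points), 1))
```

All three estimates fall between 90% and 95% (roughly 92.7 for A and 90.4 for both C and B), consistent with the stated crossover range. Given the caveat above about capped versus full-drain points at 90/95%, these are direction-of-effect estimates rather than precise thresholds.
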
---
## 2. Concurrency axis — agentic corner, sweep N
in=32768 (prefix 32256 + delta 512, **reuse 0.984**), out=128; closed-loop N ∈ {8,16,32,48,64,96,128};
PD arms capped 600s, colo uncapped.
| N | **colo** completion · E2E-mean · TPS | best PD-arm completion |
|---|---|---|
| 8 | **256/256** · 2.4s · 326 | 6P+2D 256/256 |
| 16 | **512/512** · 3.5s · 462 | 6P+2D 439/512 (86%) |
| 32 | **1024/1024** · 13.3s · 190 | all PD **<27%** |
| 48 | **1536/1536** · 24.9s · 168 | all PD <32% |
| 64 | **2048/2048** · 38.4s · 166 | all PD <31% |
| 96 | **3072/3072** · 60.0s · 171 | PD **27%** |
| 128 | **4096/4096** · 80.8s · 181 | 4P+4D 6%, 2P+6D <1% |
![concurrency](../../figs/mb5_pd_ablation/fig3_concurrency_axis.png)
**Finding:** **colo completes 100% of requests at every concurrency level**: it degrades
*gracefully* (latency rises 2.4s→81s, nothing dropped). **Every static PD split collapses, and
progressively earlier as N rises**: PD is viable only at N≤16; at N≥32 it drops 70–99% of
requests while its prefix-cache hit-rate craters to ~0%. colo's elastic pool absorbs the
time-varying P/D demand; the static partition + per-turn 32k KV-transfer cannot. (Latency
percentiles count successes only, so they *understate* PD's collapse; read them alongside the completion column.)
Data: `fig3_conc32k.json`.
*Caveat:* N=128 6P+2D is missing (one transient vLLM/Mooncake startup flake at the end); does
not change the picture (all PD arms are already collapsed by N=128). The SLO auto-stop in the
driver is a no-op (a stdout-capture bug), so the full grid ran more points, not fewer.
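
The success-only percentile effect noted in the finding can be illustrated with synthetic numbers. In the sketch below, failed turns are imputed at the wall cap purely to show the direction of the bias; that imputation and the latency values are assumptions for illustration, not how the writeup's percentiles were computed.

```python
import math


def percentile(values, q: int):
    """Nearest-rank percentile (q in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def p90_success_only(latencies_s):
    """p90 over completed turns only (what a success-only report shows)."""
    return percentile(latencies_s, 90)


def p90_failures_at_cap(latencies_s, n_failed: int, cap_s: float):
    """p90 when each failed or un-run turn is counted at the wall cap."""
    return percentile(list(latencies_s) + [cap_s] * n_failed, 90)


# Synthetic example: 10 turns succeed, 30 never complete under a 600s cap.
ok = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
print(p90_success_only(ok), p90_failures_at_cap(ok, 30, 600.0))
```

With a 25% completion rate the success-only p90 describes only the lucky quarter of turns, which is why the completion column has to be read next to any PD latency figure.
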
---
## 3. Reproduce
```bash
# on a box with healthy Mooncake/RDMA NICs (dash2), cpfs mounted:
R=/home/admin/cpfs/wjh/agentic-kv-fresh
# reuse axis (three shapes): DELTA/OL pick the shape; tag carries _o${OL}
ssh dash2 "cd $R && DELTA=2048 OL=256 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=2048 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=1024 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
# concurrency axis (capped):
ssh dash2 "cd $R && NLIST='8 16 32 48 64 96 128' CONC_PD_MAXDUR=600 bash microbench/fresh_setup/run_conc.sh"
# render (reads the *.json in this dir):
python microbench/fresh_setup/plot_pd_crossover.py
```
