MB5 PD ablation v2 results: concurrency axis + reuse 3-way writeup

- fig3_conc32k.json + fig3_concurrency_axis.png: agentic-corner concurrency sweep (in=32768, reuse=0.984, out=128), N 8->128, PD capped 600s / colo uncapped. colo completes 100% at every N (graceful, E2E 2.4s->81s); every static PD split collapses, earlier as N rises (viable only N<=16; <27% by N=32).
- analysis/mb5_pd_ablation/README.md: self-contained v2 writeup. Reuse axis 3-way (A=d2048/o256, C=d2048/o128, B=d1024/o128) decomposes shape: output ~negligible, delta (real prefill/turn) dominant; crossover to colo at reuse ~90-95% robust.

Run on dash2 (dash0 NICs faulted for Mooncake).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
107
analysis/mb5_pd_ablation/README.md
Normal file
@@ -0,0 +1,107 @@
# PD-disagg vs colocation — controlled reuse & concurrency axes (v2)

Self-contained results for the **controlled-variable** redo of the MB5 PD-vs-colo
ablation. It supersedes the confounded first cut, which held the input fixed and sliced the
prefix, so "more reuse" was entangled with "less prefill". All arms route through
the proxy at fair **APC parity** (session-routed producers reach the same prefix-cache
hit rate as colo), so wherever PD loses, it loses on *structure*, not because of a broken cache.

- **Config arms:** `colo` = 8×kv_both (8C-proxy, session-affinity); PD = `6P+2D / 4P+4D / 2P+6D`.
- **Driver:** closed-loop N (`REPLAY_MAX_INFLIGHT`) + fixed think-time; `gen_synthetic_trace.py --mode regular`.
- **PD-arm wall-cap:** collapsed PD arms drain pathologically slowly, so PD arms run with a
  wall-deadline (`REPLAY_MAX_DURATION`). Un-run turns count as failures, which keeps completion%
  honest. **colo is uncapped** so the reference is always fully measured.
- **Hardware:** run on **dash2** (8×H20). dash0's RDMA NICs were faulted for Mooncake during
  this work (the transfer engine could not initialize; the fix is an admin reset, and there is
  no sudo access); dash2's NICs are healthy. cpfs/venv/data are shared across the boxes.

---

## 1. Reuse / APC axis — fixed real prefill, vary cached prefix

N=8. Hold the **real new-prefill work per turn constant** (`--delta-len`) and grow the cached
prefix, so that reuse = prefix/(prefix+delta). Three shapes isolate output vs delta:

| shape | delta (real prefill/turn) | output | role |
|---|---|---|---|
| **A** | 2048 | 256 | original |
| **C** | 2048 | 128 | A vs C = pure **output** 256→128 |
| **B** | 1024 | 128 | C vs B = pure **delta** 2048→1024 |
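
The reuse definition above can be inverted to get the cached-prefix length needed for a target reuse at fixed delta. The sketch below is illustrative only: `reuse_ratio` and `prefix_for_reuse` are hypothetical helper names, and rounding to the nearest token is an assumption. This README does not show how `run_reuse_fixed.sh` actually sizes the prefix.

```python
def reuse_ratio(prefix: int, delta: int) -> float:
    """reuse = prefix / (prefix + delta), as defined in this section."""
    return prefix / (prefix + delta)


def prefix_for_reuse(delta: int, reuse: float) -> int:
    """Invert the definition: prefix = delta * reuse / (1 - reuse).

    Rounding to the nearest token is an assumption of this sketch.
    """
    if not 0.0 <= reuse < 1.0:
        raise ValueError("reuse must be in [0, 1)")
    return round(delta * reuse / (1.0 - reuse))


if __name__ == "__main__":
    # Sanity check against the concurrency-axis shape (prefix 32256 + delta 512).
    print(reuse_ratio(32256, 512))

    # Prefix lengths implied by a few target reuse levels for shapes A/C (delta=2048).
    for target in (0.20, 0.50, 0.80, 0.90, 0.95):
        prefix = prefix_for_reuse(2048, target)
        print(f"target={target:.2f} prefix={prefix} actual={reuse_ratio(prefix, 2048):.4f}")
```

Because delta is held fixed, the total input grows with reuse, which is what decouples "more reuse" from "less prefill".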

**PD-best advantage** = colo E2E-p90 / best-PD E2E-p90 (>1 ⇒ PD wins):

| reuse% | A d2048/o256 | C d2048/o128 | B d1024/o128 |
|---|---|---|---|
| 20 | 1.34 | 1.41 | n/a |
| 50 | 1.36 | 1.37 | n/a |
| 67 | **1.47** | **1.49** | **1.27** |
| 80 | 1.31 | 1.23 | 1.25 |
| 90 | 1.15 | 1.01 | n/a |
| 95 | 0.87 | 0.89 | 0.89 |



**Findings:**
1. **Output length is ~negligible.** A and C (same delta) track each other across the whole
   range (within 0.08 at every point in reuse 20–80%): halving output barely moves PD's advantage.
2. **Delta (real prefill/turn) is the dominant shape factor.** B (delta=1024) sits clearly
   below A/C at mid reuse (67%: 1.27 vs ~1.48). More real prefill per turn → bigger PD win,
   because the benefit of PD-disagg is isolating prefill from decode, and there is more prefill to isolate.
3. **Crossover to colo at reuse ~90–95% is robust** across all three shapes: PD always loses
   the high-reuse / large-resident-context corner (it must KV-transfer the whole resident
   context every turn for a comparatively small delta of new tokens; colo keeps it local).
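
As a rough cross-check of finding 3, the crossover can be estimated from the advantage table by locating where each shape's advantage passes through 1.0. This is a sketch under stated assumptions: linear interpolation between adjacent measured points is an assumption, not something the ablation measured, and `crossover_reuse` is a hypothetical helper name. The values are copied from the table; points with no reported value are left out.

```python
from typing import Dict, Optional

# PD-best advantage (colo E2E-p90 / best-PD E2E-p90), keyed by reuse%.
ADVANTAGE: Dict[str, Dict[int, float]] = {
    "A": {20: 1.34, 50: 1.36, 67: 1.47, 80: 1.31, 90: 1.15, 95: 0.87},
    "C": {20: 1.41, 50: 1.37, 67: 1.49, 80: 1.23, 90: 1.01, 95: 0.89},
    "B": {67: 1.27, 80: 1.25, 95: 0.89},
}


def crossover_reuse(points: Dict[int, float]) -> Optional[float]:
    """Return the reuse% where advantage falls through 1.0, or None.

    Assumes the advantage varies linearly between adjacent measured points.
    """
    xs = sorted(points)
    for lo, hi in zip(xs, xs[1:]):
        a_lo, a_hi = points[lo], points[hi]
        if a_lo >= 1.0 > a_hi:
            return lo + (hi - lo) * (a_lo - 1.0) / (a_lo - a_hi)
    return None


if __name__ == "__main__":
    for shape, points in ADVANTAGE.items():
        print(shape, crossover_reuse(points))
```

Under that assumption all three shapes cross between reuse 90% and 95%. B's estimate is coarser than the others because it interpolates across the 80–95% gap.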

*Caveat:* the clean, uncapped, 100%-completion comparison region is reuse **20–80%**, and it
carries findings 1–2. At reuse 90/95% the PD arms collapse; C's points there are capped-completion
while A/B are full-drain, so the shapes are comparable in direction but not in exact PD completion%.

Data: `fig1_reuse_fixed.json` (A), `fig1_reuse_d2048_o128.json` (C), `fig1_reuse_d1024_o128.json` (B).

---

## 2. Concurrency axis — agentic corner, sweep N

in=32768 (prefix 32256 + delta 512, **reuse 0.984**), out=128; closed-loop N ∈ {8,16,32,48,64,96,128};
PD arms capped at 600s, colo uncapped.

| N | **colo** completion | **colo** E2E-mean | **colo** TPS | best PD-arm completion |
|---|---|---|---|---|
| 8 | **256/256** | 2.4s | 326 | 6P+2D 256/256 |
| 16 | **512/512** | 3.5s | 462 | 6P+2D 439/512 (86%) |
| 32 | **1024/1024** | 13.3s | 190 | all PD **<27%** |
| 48 | **1536/1536** | 24.9s | 168 | all PD <32% |
| 64 | **2048/2048** | 38.4s | 166 | all PD <31% |
| 96 | **3072/3072** | 60.0s | 171 | PD **2–7%** |
| 128 | **4096/4096** | 80.8s | 181 | 4P+4D 6%, 2P+6D <1% (6P+2D missing) |



**Finding:** **colo completes 100% of requests at every concurrency level**: it degrades
*gracefully* (E2E-mean rises 2.4s→81s, nothing dropped). **Every static PD split collapses, and
progressively earlier as N rises**: the best PD split completes everything only at N=8 and
reaches 86% at N=16; from N=32 up, every split completes under a third of its requests (7% or
less at N≥96) while its prefix-cache hit-rate craters to ~0%. colo's elastic pool absorbs the
time-varying P/D demand; the static partition + per-turn 32k KV-transfer cannot. (Latency
percentiles count successes only, so they *understate* how badly PD degrades; read them with the
completion column.)
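
The completion accounting and the successes-only caveat can be made concrete with a small sketch. Assumptions are flagged in the comments: the 32 turns per client is inferred from the colo column (256/8 through 4096/128), the latencies are made-up toy numbers, and charging each failed turn at the 600s cap is an illustrative convention of this sketch, not what the driver reports.

```python
TURNS_PER_CLIENT = 32  # inferred from the table: 256/8 = 512/16 = ... = 4096/128


def planned_turns(n_clients: int) -> int:
    """Total turns a run at concurrency N is expected to issue (inferred)."""
    return TURNS_PER_CLIENT * n_clients


def completion_pct(completed: int, planned: int) -> float:
    """Un-run or failed turns stay in the denominator, so a capped arm is not flattered."""
    return 100.0 * completed / planned


def nearest_rank_percentile(values, pct: int):
    """Nearest-rank percentile using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # 6P+2D at N=16 completed 439 turns (from the table above).
    print(round(completion_pct(439, planned_turns(16))))  # 86

    # Toy illustration of survivorship bias (hypothetical numbers).
    ok_latencies = [5.0, 8.0, 12.0]  # the 3 turns that finished
    planned, cap_s = 10, 600.0       # 7 turns never completed before the cap
    penalized = ok_latencies + [cap_s] * (planned - len(ok_latencies))
    print(nearest_rank_percentile(ok_latencies, 90))  # 12.0
    print(nearest_rank_percentile(penalized, 90))     # 600.0
```

In the toy case the successes-only p90 looks healthy while most of the run never finished, which is why the latency columns need to be read next to completion.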

Data: `fig3_conc32k.json`.

*Caveat:* the N=128 6P+2D point is missing (one transient vLLM/Mooncake startup flake at the
end). This does not change the picture, since all PD arms are already collapsed by N=128. The SLO
auto-stop in the driver is a no-op (a stdout-capture bug), so the full grid ran: more points, not fewer.

---

## 3. Reproduce

```bash
# on a box with healthy Mooncake/RDMA NICs (dash2), cpfs mounted:
R=/home/admin/cpfs/wjh/agentic-kv-fresh
# reuse axis (three shapes): DELTA/OL pick the shape; tag carries _o${OL}
ssh dash2 "cd $R && DELTA=2048 OL=256 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=2048 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=1024 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
# concurrency axis (capped):
ssh dash2 "cd $R && NLIST='8 16 32 48 64 96 128' CONC_PD_MAXDUR=600 bash microbench/fresh_setup/run_conc.sh"
# render (reads the *.json in this dir):
python microbench/fresh_setup/plot_pd_crossover.py
```
1
analysis/mb5_pd_ablation/fig3_conc32k.json
Normal file
File diff suppressed because one or more lines are too long
Binary file not shown.
Size before: 158 KiB. Size after: 171 KiB.