MB5 PD ablation v2 results: concurrency axis + reuse 3-way writeup

- fig3_conc32k.json + fig3_concurrency_axis.png: agentic-corner concurrency sweep (in=32768, reuse=0.984, out=128), N 8->128, PD capped 600s / colo uncapped. colo completes 100% at every N (graceful, E2E 2.4s->81s); every static PD split collapses, earlier as N rises (viable only N<=16; <27% by N=32).
- analysis/mb5_pd_ablation/README.md: self-contained v2 writeup. Reuse axis 3-way (A=d2048/o256, C=d2048/o128, B=d1024/o128) decomposes shape: output ~negligible, delta (real prefill/turn) dominant; crossover to colo at reuse ~90-95% robust.

Run on dash2 (dash0 NICs faulted for Mooncake).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 09:35:25 +08:00
parent 3f997fda14
commit 54e1f5266a
3 changed files with 108 additions and 0 deletions
--- a/analysis/mb5_pd_ablation/README.md
+++ b/analysis/mb5_pd_ablation/README.md
@@ -0,0 +1,107 @@
 # PD-disagg vs colocation — controlled reuse & concurrency axes (v2)
 Self-contained results for the **controlled-variable** redo of the MB5 PD-vs-colo
 ablation. It supersedes the confounded first cut, which held input fixed and sliced the
 prefix, so "more reuse" was entangled with "less prefill". All arms route through
 the proxy at fair **APC parity** (session-routed producers reach the same prefix-cache
 hit rate as colo), so where PD loses, it loses on *structure*, not because of a broken cache.
 - **Config arms:** `colo` = 8×kv_both (8C-proxy, session-affinity); PD = `6P+2D / 4P+4D / 2P+6D`.
 - **Driver:** closed-loop N (`REPLAY_MAX_INFLIGHT`) + fixed think-time; `gen_synthetic_trace.py --mode regular`.
 - **PD-arm wall-cap:** collapsed PD arms drain pathologically slowly, so PD arms run with a
  wall-deadline (`REPLAY_MAX_DURATION`; un-run turns counted as failures → honest completion%);
  **colo is uncapped** so the reference is always fully measured.
 - **Hardware:** run on **dash2** (8×H20). dash0's RDMA NICs were faulted for Mooncake during
  this work (could not init the transfer engine; needs an admin reset — no sudo); dash2's NICs
  are healthy. cpfs/venv/data are shared across the boxes.
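The wall-cap accounting described in the list above can be sketched as follows. This is a minimal illustration, not the driver's code: `TurnResult` and `completion_pct` are hypothetical names, and the only behaviour taken from this writeup is that turns never issued before the deadline count as failures, so the denominator is the number of *planned* turns.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnResult:
    """Outcome of one turn that was actually issued (hypothetical record)."""
    ok: bool                        # True only if the turn ran and succeeded
    e2e_s: Optional[float] = None   # end-to-end latency in seconds, if it succeeded


def completion_pct(results: List[TurnResult], planned: int) -> float:
    """Completion% over planned turns, so un-run turns count as failures."""
    if planned <= 0:
        raise ValueError("planned must be positive")
    done = sum(1 for r in results if r.ok)
    return 100.0 * done / planned


# Example with the N=16 6P+2D row of section 2: 439 of 512 planned turns completed.
# The latency value is a placeholder, not measured data.
results = [TurnResult(ok=True, e2e_s=1.0) for _ in range(439)]
print(round(completion_pct(results, planned=512)))  # 86
```

Dividing by the number of issued turns instead would hide exactly the turns a collapsed arm never reached before the cap.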
 ---
 ## 1. Reuse / APC axis — fixed real prefill, vary cached prefix
 N=8. Hold the **real new-prefill work per turn constant** (`--delta-len`) and grow the cached
 prefix → reuse = prefix/(prefix+delta). Three shapes isolate output vs delta:
 | shape | delta (real prefill/turn, tokens) | output (tokens) | role |
 |---|---|---|---|
 | **A** | 2048 | 256 | original |
 | **C** | 2048 | 128 | A vs C = pure **output** 256→128 |
 | **B** | 1024 | 128 | C vs B = pure **delta** 2048→1024 |
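The relation reuse = prefix/(prefix+delta) can be inverted to get the cached-prefix length for a target reuse at fixed delta. The sketch below is illustrative only: `prefix_len_for_reuse` and `reuse_of` are hypothetical helpers, and rounding to the nearest token is an assumption, not necessarily what `gen_synthetic_trace.py` does.

```python
def reuse_of(prefix: int, delta: int) -> float:
    """Reuse fraction as defined in this section: prefix / (prefix + delta)."""
    return prefix / (prefix + delta)


def prefix_len_for_reuse(reuse: float, delta: int) -> int:
    """Cached-prefix length (tokens) that yields the target reuse for a fixed delta.

    Solving reuse = p / (p + delta) for p gives p = reuse * delta / (1 - reuse).
    Rounding to the nearest token is an assumption made for this sketch.
    """
    if not 0.0 <= reuse < 1.0:
        raise ValueError("reuse must be in [0, 1)")
    return round(reuse * delta / (1.0 - reuse))


# Shape A/C (delta=2048) at 50% and 80% reuse:
print(prefix_len_for_reuse(0.5, 2048))   # 2048
print(prefix_len_for_reuse(0.8, 2048))   # 8192

# Consistency check against the concurrency-axis shape in section 2
# (prefix 32256 + delta 512):
print(round(reuse_of(32256, 512), 3))    # 0.984
```

Because delta is held fixed, raising reuse only grows the resident prefix; the new prefill work per turn does not change.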
 **PD-best advantage** = colo E2E-p90 / best-PD E2E-p90 (>1 ⇒ PD wins):
 | reuse% | A d2048/o256 | C d2048/o128 | B d1024/o128 |
 |---|---|---|---|
 | 20 | 1.34 | 1.41 | — |
 | 50 | 1.36 | 1.37 | — |
 | 67 | **1.47** | **1.49** | **1.27** |
 | 80 | 1.31 | 1.23 | 1.25 |
 | 90 | 1.15 | 1.01 | — |
 | 95 | 0.87 | 0.89 | 0.89 |
 ![reuse 3-way](../../figs/mb5_pd_ablation/reuse_compare_ABC.png)
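As a sketch of how each cell in the table above is formed: take the PD arm with the lowest E2E-p90 and divide colo's E2E-p90 by it. Treating "best PD" as "lowest p90" is an assumption here, `pd_best_advantage` is a hypothetical helper, and the latencies in the example are made up for illustration, not measured values.

```python
from typing import Dict, Tuple


def pd_best_advantage(colo_p90_s: float, pd_p90_s: Dict[str, float]) -> Tuple[str, float]:
    """Return (best PD arm, colo p90 / best-PD p90); a ratio > 1 means PD wins."""
    if not pd_p90_s:
        raise ValueError("need at least one PD arm")
    best_arm = min(pd_p90_s, key=pd_p90_s.get)   # assumption: best = lowest p90
    return best_arm, colo_p90_s / pd_p90_s[best_arm]


# Hypothetical latencies in seconds, for illustration only.
arm, adv = pd_best_advantage(3.0, {"6P+2D": 2.5, "4P+4D": 2.0, "2P+6D": 4.0})
print(arm, adv)  # 4P+4D 1.5
```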
 **Findings:**
 1. **Output length is ~negligible.** A and C (same delta) track each other across the
    range (within 0.08 at reuse 20–80%): halving output barely moves PD's advantage.
 2. **Delta (real prefill/turn) is the dominant shape factor.** B (delta=1024) sits clearly
    below A/C at mid reuse (67%: 1.27 vs ~1.48), though the three shapes converge by 80%
    (1.23–1.31). More real prefill per turn → bigger PD win, because PD-disagg's benefit is
    isolating prefill from decode, and more prefill per turn means more to isolate.
 3. **Crossover to colo at reuse ~90–95% is robust** across all three shapes: PD always loses
   the high-reuse / large-resident-context corner (it must KV-transfer the whole resident
   context every turn for a few hundred new tokens; colo keeps it local).
 *Caveat:* the clean comparison region (uncapped, 100% completion) is reuse **20–80%**, which
 carries findings 1–2. At reuse 90/95% the PD arms collapse; C's points there are
 capped-completion while A/B are full-drain, so the three are comparable in direction but not
 in exact PD completion%.
 Data: `fig1_reuse_fixed.json` (A), `fig1_reuse_d2048_o128.json` (C), `fig1_reuse_d1024_o128.json` (B).
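One rough way to place the crossover from the table is to linearly interpolate between the adjacent measured points where the PD-best advantage passes through 1.0. This is a sketch under that linearity assumption, not the analysis used for the figure; `crossover_reuse` is a hypothetical helper, and the points are copied from the table above (B has no 90% point, so its estimate spans 80–95% and is coarser).

```python
from typing import Optional, Sequence, Tuple


def crossover_reuse(points: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Interpolated reuse% at which advantage falls through 1.0, or None.

    `points` are (reuse%, advantage) pairs sorted by reuse%.
    """
    for (r0, a0), (r1, a1) in zip(points, points[1:]):
        if a0 >= 1.0 > a1:
            # Linear interpolation between the two bracketing measurements.
            return r0 + (r1 - r0) * (a0 - 1.0) / (a0 - a1)
    return None


# PD-best advantage values from the table in this section.
SHAPES = {
    "A": [(20, 1.34), (50, 1.36), (67, 1.47), (80, 1.31), (90, 1.15), (95, 0.87)],
    "C": [(20, 1.41), (50, 1.37), (67, 1.49), (80, 1.23), (90, 1.01), (95, 0.89)],
    "B": [(67, 1.27), (80, 1.25), (95, 0.89)],
}

for name, pts in SHAPES.items():
    print(name, round(crossover_reuse(pts), 1))
```

All three estimates fall between 90% and 95% reuse, consistent with finding 3; since the 90/95% points sit in the capped region noted in the caveat, the exact values should be read as indicative only.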
 ---
 ## 2. Concurrency axis — agentic corner, sweep N
 in=32768 (prefix 32256 + delta 512, **reuse 0.984**), out=128; closed-loop N ∈ {8,16,32,48,64,96,128};
 PD arms capped 600s, colo uncapped.
 | N | colo completion | colo E2E-mean (s) | colo TPS | best PD-arm completion |
 |---|---|---|---|---|
 | 8 | **256/256** | 2.4 | 326 | 6P+2D 256/256 |
 | 16 | **512/512** | 3.5 | 462 | 6P+2D 439/512 (86%) |
 | 32 | **1024/1024** | 13.3 | 190 | all PD **<27%** |
 | 48 | **1536/1536** | 24.9 | 168 | all PD <32% |
 | 64 | **2048/2048** | 38.4 | 166 | all PD <31% |
 | 96 | **3072/3072** | 60.0 | 171 | PD **2–7%** |
 | 128 | **4096/4096** | 80.8 | 181 | 4P+4D 6%, 2P+6D <1% |
 ![concurrency](../../figs/mb5_pd_ablation/fig3_concurrency_axis.png)
 **Finding:** **colo completes 100% of requests at every concurrency level**: it degrades
 *gracefully* (E2E-mean rises 2.4s→81s, nothing dropped). **Every static PD split collapses, and
 progressively earlier as N rises**: the best PD split completes everything only at N=8 and is
 already down to 86% at N=16; from N=32 up, PD drops roughly 70–99% of requests while its
 prefix-cache hit-rate craters to ~0%. colo's elastic pool absorbs the time-varying P/D demand;
 the static partition + per-turn 32k KV-transfer cannot. (Latency percentiles count successes
 only, so they *understate* PD's degradation; read them with the completion column.)
 Data: `fig3_conc32k.json`.
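To see why success-only percentiles flatter a collapsed arm, the sketch below compares a p90 over successes with a p90 where every missing turn is counted at the wall cap (a lower bound on its true latency). This is an illustration, not how the reported numbers were computed: the helper names, the nearest-rank percentile, and the example latencies are all assumptions.

```python
import math
from typing import List, Sequence


def percentile_nearest_rank(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile (q in 1..100) of a non-empty sequence."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def p90_success_only(latencies_s: Sequence[float]) -> float:
    """p90 over completed turns only (what a success-only report shows)."""
    return percentile_nearest_rank(latencies_s, 90)


def p90_failures_at_cap(latencies_s: Sequence[float], planned: int, cap_s: float) -> float:
    """p90 over all planned turns, counting each missing turn as cap_s."""
    missing = planned - len(latencies_s)
    padded: List[float] = list(latencies_s) + [cap_s] * missing
    return percentile_nearest_rank(padded, 90)


# Hypothetical arm: 10 planned turns, only 3 finished before a 600 s cap.
done = [5.0, 6.0, 7.0]
print(p90_success_only(done))                             # 7.0
print(p90_failures_at_cap(done, planned=10, cap_s=600))   # 600
```

The success-only figure looks healthy precisely because the slow and dropped turns are absent from it, which is why the completion column has to be read alongside any latency percentile.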
 *Caveat:* N=128 6P+2D is missing (one transient vLLM/Mooncake startup flake at the end); does
 not change the picture (all PD arms are already collapsed by N=128). The SLO auto-stop in the
 driver is a no-op (a stdout-capture bug), so the full grid ran — more points, not fewer.
 ---
 ## 3. Reproduce
 ```bash
 # on a box with healthy Mooncake/RDMA NICs (dash2), cpfs mounted:
 R=/home/admin/cpfs/wjh/agentic-kv-fresh
 # reuse axis (three shapes): DELTA/OL pick the shape; tag carries _o${OL}
 ssh dash2 "cd $R && DELTA=2048 OL=256 bash microbench/fresh_setup/run_reuse_fixed.sh"
 ssh dash2 "cd $R && DELTA=2048 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
 ssh dash2 "cd $R && DELTA=1024 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
 # concurrency axis (capped):
 ssh dash2 "cd $R && NLIST='8 16 32 48 64 96 128' CONC_PD_MAXDUR=600 bash microbench/fresh_setup/run_conc.sh"
 # render (reads the *.json in this dir):
 python microbench/fresh_setup/plot_pd_crossover.py
 ```
--- a/analysis/mb5_pd_ablation/fig3_conc32k.json
+++ b/analysis/mb5_pd_ablation/fig3_conc32k.json
--- a/figs/mb5_pd_ablation/fig3_concurrency_axis.png
+++ b/figs/mb5_pd_ablation/fig3_concurrency_axis.png