agentic-kvc/analysis/mb5_pd_ablation/README.md
Gahow Wang 54e1f5266a MB5 PD ablation v2 results: concurrency axis + reuse 3-way writeup
- fig3_conc32k.json + fig3_concurrency_axis.png: agentic-corner concurrency
  sweep (in=32768, reuse=0.984, out=128), N 8->128, PD capped 600s / colo
  uncapped. colo completes 100% at every N (graceful, E2E 2.4s->81s); every
  static PD split collapses, earlier as N rises (viable only N<=16; <27% by N=32).
- analysis/mb5_pd_ablation/README.md: self-contained v2 writeup. Reuse axis
  3-way (A=d2048/o256, C=d2048/o128, B=d1024/o128) decomposes shape: output
  ~negligible, delta (real prefill/turn) dominant; crossover to colo at reuse
  ~90-95% robust. Run on dash2 (dash0 NICs faulted for Mooncake).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 09:35:25 +08:00


# PD-disagg vs colocation — controlled reuse & concurrency axes (v2)
Self-contained results for the **controlled-variable** redo of the MB5 PD-vs-colo
ablation. Supersedes the confounded first cut, which held the input length fixed and sliced
the prefix, so "more reuse" was entangled with "less prefill". All arms route through
the proxy at fair **APC parity** (session-routed producers reach the same prefix-cache
hit rate as colo), so PD loses on *structure*, not because of a broken cache.
- **Config arms:** `colo` = 8×kv_both (8C-proxy, session-affinity); PD = `6P+2D / 4P+4D / 2P+6D`.
- **Driver:** closed-loop N (`REPLAY_MAX_INFLIGHT`) + fixed think-time; `gen_synthetic_trace.py --mode regular`.
- **PD-arm wall-cap:** collapsed PD arms drain pathologically slowly, so PD arms run with a
wall-deadline (`REPLAY_MAX_DURATION`; un-run turns counted as failures → honest completion%);
**colo is uncapped** so the reference is always fully measured.
- **Hardware:** run on **dash2** (8×H20). dash0's RDMA NICs were faulted for Mooncake during
this work (could not init the transfer engine; needs an admin reset — no sudo); dash2's NICs
are healthy. cpfs/venv/data are shared across the boxes.
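The closed-loop driver and the PD wall-cap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual replay driver: `run_closed_loop`, `send_turn`, and the argument names are hypothetical, with `max_inflight` standing in for `REPLAY_MAX_INFLIGHT` and `max_duration_s` for `REPLAY_MAX_DURATION`. The point it shows is the accounting: turns that never start before the deadline still count in the denominator, so completion% stays honest.

```python
import asyncio
import time


async def run_closed_loop(sessions, send_turn, max_inflight, think_time_s,
                          max_duration_s=None):
    """Replay sessions closed-loop; return (completed_turns, total_turns).

    sessions:       list of sessions, each a list of turn payloads (hypothetical shape)
    send_turn:      async callable that issues one turn and awaits its response
    max_duration_s: optional wall deadline; None means uncapped (the colo case)
    """
    total = sum(len(turns) for turns in sessions)
    completed = 0
    deadline = None if max_duration_s is None else time.monotonic() + max_duration_s
    gate = asyncio.Semaphore(max_inflight)  # at most N sessions active at once

    async def run_session(turns):
        nonlocal completed
        async with gate:
            for turn in turns:
                # Past the deadline: leave the remaining turns un-run.
                # They stay in `total`, so they are counted as failures.
                if deadline is not None and time.monotonic() >= deadline:
                    return
                await send_turn(turn)
                completed += 1
                await asyncio.sleep(think_time_s)  # fixed think-time between turns

    await asyncio.gather(*(run_session(turns) for turns in sessions))
    return completed, total
```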
---
## 1. Reuse / APC axis — fixed real prefill, vary cached prefix
N=8. Hold the **real new-prefill work per turn constant** (`--delta-len`) and grow the cached
prefix → reuse = prefix/(prefix+delta). Three shapes isolate output vs delta:
| | delta (real prefill/turn) | output | role |
|---|---|---|---|
| **A** | 2048 | 256 | original |
| **C** | 2048 | 128 | A vs C = pure **output** 256→128 |
| **B** | 1024 | 128 | C vs B = pure **delta** 2048→1024 |
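To make the reuse definition concrete, the sketch below inverts reuse = prefix/(prefix+delta) to get the cached-prefix length needed for a target reuse at a fixed delta. The function names are illustrative, and rounding to the nearest token is an assumption; the trace generator may pick prefix lengths differently.

```python
def reuse_fraction(prefix_len: int, delta_len: int) -> float:
    """reuse = prefix / (prefix + delta), as defined above."""
    return prefix_len / (prefix_len + delta_len)


def prefix_for_reuse(delta_len: int, reuse: float) -> int:
    """Cached-prefix length that yields `reuse` when delta is held fixed.

    Solving reuse = p / (p + d) for p gives p = d * reuse / (1 - reuse).
    Rounding to the nearest token is an assumption of this sketch.
    """
    if not 0.0 <= reuse < 1.0:
        raise ValueError("reuse must be in [0, 1)")
    return round(delta_len * reuse / (1.0 - reuse))


if __name__ == "__main__":
    for reuse in (0.20, 0.50, 0.80, 0.95):
        prefix = prefix_for_reuse(2048, reuse)
        print(f"reuse={reuse:.2f}  delta=2048  prefix={prefix}  input={prefix + 2048}")
```

With delta held at 2048, raising reuse from 50% to 95% grows the resident prefix from 2048 to 38912 tokens while the new-prefill work per turn stays the same, which is the variable this axis isolates.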
**PD-best advantage** = colo E2E-p90 / best-PD E2E-p90 (>1 ⇒ PD wins):
| reuse% | A d2048/o256 | C d2048/o128 | B d1024/o128 |
|---|---|---|---|
| 20 | 1.34 | 1.41 | — |
| 50 | 1.36 | 1.37 | — |
| 67 | **1.47** | **1.49** | **1.27** |
| 80 | 1.31 | 1.23 | 1.25 |
| 90 | 1.15 | 1.01 | — |
| 95 | 0.87 | 0.89 | 0.89 |
![reuse 3-way](../../figs/mb5_pd_ablation/reuse_compare_ABC.png)
**Findings:**
1. **Output length is ~negligible.** A and C (same delta) track each other across the whole
range — halving output barely moves PD's advantage.
2. **Delta (real prefill/turn) is the dominant shape factor.** B (delta=1024) sits clearly
below A/C at mid reuse (67%: 1.27 vs ~1.48). More real prefill per turn → bigger PD win,
because PD-disagg's benefit is isolating prefill from decode — more prefill to isolate.
3. **Crossover to colo at reuse ~90-95% is robust** across all three shapes: PD always loses
the high-reuse / large-resident-context corner (it must KV-transfer the whole resident
context every turn for a few hundred new tokens; colo keeps it local).
*Caveat:* the clean, uncapped, 100%-completion comparison region is reuse **20-80%** (carries
findings 1-2). At reuse 90/95% the PD arms collapse and C's points are capped-completion, while
A/B are full-drain, so the shapes are comparable in direction, not in exact PD completion%.
Data: `fig1_reuse_fixed.json` (A), `fig1_reuse_d2048_o128.json` (C), `fig1_reuse_d1024_o128.json` (B).
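As a rough check on finding 3, the sketch below linearly interpolates the PD-best advantage values from the table to estimate where each shape crosses 1.0. Linear interpolation between measured points is an assumption of this sketch, not part of the analysis; B has no point at reuse 90%, so its bracket is 80-95%, and the caveat about capped points at 90/95% applies. Under those assumptions all three estimates land between roughly 90% and 93% reuse.

```python
# PD-best advantage (colo E2E-p90 / best-PD E2E-p90) copied from the table above.
# Reuse levels with no measurement are simply omitted.
ADVANTAGE = {
    "A d2048/o256": [(20, 1.34), (50, 1.36), (67, 1.47), (80, 1.31), (90, 1.15), (95, 0.87)],
    "C d2048/o128": [(20, 1.41), (50, 1.37), (67, 1.49), (80, 1.23), (90, 1.01), (95, 0.89)],
    "B d1024/o128": [(67, 1.27), (80, 1.25), (95, 0.89)],
}


def crossover_reuse(points):
    """Reuse% where advantage falls through 1.0, by linear interpolation.

    Returns None if no adjacent pair of points brackets 1.0.
    """
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 > 1.0 >= y1:
            return x0 + (y0 - 1.0) / (y0 - y1) * (x1 - x0)
    return None


crossovers = {shape: crossover_reuse(pts) for shape, pts in ADVANTAGE.items()}

if __name__ == "__main__":
    for shape, reuse in crossovers.items():
        print(f"{shape}: advantage crosses 1.0 near reuse {reuse:.1f}%")
```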
---
## 2. Concurrency axis — agentic corner, sweep N
in=32768 (prefix 32256 + delta 512, **reuse 0.984**), out=128; closed-loop N ∈ {8,16,32,48,64,96,128};
PD arms capped 600s, colo uncapped.
| N | **colo** completion · E2E-mean · TPS | best PD-arm completion |
|---|---|---|
| 8 | **256/256** · 2.4s · 326 | 6P+2D 256/256 |
| 16 | **512/512** · 3.5s · 462 | 6P+2D 439/512 (86%) |
| 32 | **1024/1024** · 13.3s · 190 | all PD **<27%** |
| 48 | **1536/1536** · 24.9s · 168 | all PD <32% |
| 64 | **2048/2048** · 38.4s · 166 | all PD <31% |
| 96 | **3072/3072** · 60.0s · 171 | PD **27%** |
| 128 | **4096/4096** · 80.8s · 181 | 4P+4D 6%, 2P+6D <1% |
![concurrency](../../figs/mb5_pd_ablation/fig3_concurrency_axis.png)
**Finding:** **colo completes 100% of requests at every concurrency level**; it degrades
*gracefully* (latency rises 2.4s→81s, nothing dropped). **Every static PD split collapses, and
progressively earlier as N rises**: PD is viable only at N ≤ 16; by N ≥ 32 it drops 70-99% of
requests while its prefix-cache hit-rate craters to ~0%. colo's elastic pool absorbs the
time-varying P/D demand; the static partition + per-turn 32k KV-transfer cannot. (Latency
percentiles count successes only, so they *understate* PD's collapse; read them with the completion column.)
Data: `fig3_conc32k.json`.
*Caveat:* N=128 6P+2D is missing (one transient vLLM/Mooncake startup flake at the end); does
not change the picture (all PD arms are already collapsed by N=128). The SLO auto-stop in the
driver is a no-op (a stdout-capture bug), so the full grid ran more points, not fewer.
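The completion column can be re-derived from the raw counts in the table. The sketch below uses only numbers copied from the table; the helper name and the `requests_per_slot` label are illustrative. It also shows that every row issues 32 requests per unit of N, which is an observation from the counts rather than a documented driver setting.

```python
# (N, completed, total) for colo, copied from the concurrency table above.
COLO = [
    (8, 256, 256), (16, 512, 512), (32, 1024, 1024), (48, 1536, 1536),
    (64, 2048, 2048), (96, 3072, 3072), (128, 4096, 4096),
]


def completion_pct(completed: int, total: int) -> float:
    """Completion% with un-run and failed requests both counted against `total`."""
    return 100.0 * completed / total


# total / N is the same for every row of the table.
requests_per_slot = {total // n for n, _, total in COLO}

colo_completion = [completion_pct(done, total) for _, done, total in COLO]
pd_6p2d_n16 = completion_pct(439, 512)  # best PD arm at N=16, from the table

if __name__ == "__main__":
    print(f"requests per unit of N: {sorted(requests_per_slot)}")
    print(f"colo completion at every N: {colo_completion}")
    print(f"6P+2D at N=16: {pd_6p2d_n16:.1f}% completed, {100 - pd_6p2d_n16:.1f}% dropped")
```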
---
## 3. Reproduce
```bash
# on a box with healthy Mooncake/RDMA NICs (dash2), cpfs mounted:
R=/home/admin/cpfs/wjh/agentic-kv-fresh
# reuse axis (three shapes): DELTA/OL pick the shape; tag carries _o${OL}
ssh dash2 "cd $R && DELTA=2048 OL=256 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=2048 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=1024 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
# concurrency axis (capped):
ssh dash2 "cd $R && NLIST='8 16 32 48 64 96 128' CONC_PD_MAXDUR=600 bash microbench/fresh_setup/run_conc.sh"
# render (reads the *.json in this dir):
python microbench/fresh_setup/plot_pd_crossover.py
```