- fig3_conc32k.json + fig3_concurrency_axis.png: agentic-corner concurrency sweep (in=32768, reuse=0.984, out=128), N 8->128, PD capped 600s / colo uncapped. colo completes 100% at every N (graceful, E2E 2.4s->81s); every static PD split collapses, earlier as N rises (viable only N<=16; <27% by N=32).
- analysis/mb5_pd_ablation/README.md: self-contained v2 writeup. Reuse axis 3-way (A=d2048/o256, C=d2048/o128, B=d1024/o128) decomposes shape: output ~negligible, delta (real prefill/turn) dominant; crossover to colo at reuse ~90-95% robust.

Run on dash2 (dash0 NICs faulted for Mooncake).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PD-disagg vs colocation — controlled reuse & concurrency axes (v2)
Self-contained results for the controlled-variable redo of the MB5 PD-vs-colo ablation. Supersedes the confounded first cut, which held input fixed and sliced the prefix, so "more reuse" was entangled with "less prefill". All arms route through the proxy at fair APC parity (session-routed producers reach the same prefix-cache hit rate as colo), so PD loses on structure, not on a broken cache.
- Config arms: colo = 8×kv_both (8C-proxy, session-affinity); PD = 6P+2D / 4P+4D / 2P+6D.
- Driver: closed-loop N (`REPLAY_MAX_INFLIGHT`) + fixed think-time; `gen_synthetic_trace.py --mode regular`.
- PD-arm wall-cap: collapsed PD arms drain pathologically slowly, so PD arms run with a wall-deadline (`REPLAY_MAX_DURATION`; un-run turns counted as failures → honest completion%); colo is uncapped so the reference is always fully measured.
- Hardware: run on dash2 (8×H20). dash0's RDMA NICs were faulted for Mooncake during this work (could not init the transfer engine; needs an admin reset, no sudo); dash2's NICs are healthy. cpfs/venv/data are shared across the boxes.
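The wall-cap accounting above can be illustrated with a minimal sketch. It is not the driver's code: `capped_completion` and `completion_pct` are hypothetical names, the model is one closed-loop session with a constant per-turn latency, and it assumes a turn still in flight at the deadline counts as un-run.

```python
def capped_completion(planned_turns, turn_latency_s, think_time_s, max_duration_s):
    """Return (completed, failed) for one closed-loop session under a wall cap.

    Hypothetical model: each turn takes a constant turn_latency_s, followed by a
    fixed think time. A turn that would finish after the deadline is treated as
    un-run, and every un-run turn is counted as a failure.
    """
    completed = 0
    clock = 0.0
    for _ in range(planned_turns):
        clock += turn_latency_s          # turn finishes at this time
        if clock > max_duration_s:       # deadline hit: stop issuing turns
            break
        completed += 1
        clock += think_time_s            # fixed think time before the next turn
    return completed, planned_turns - completed


def completion_pct(completed, planned):
    """Completion% against PLANNED turns, so un-run turns lower the score."""
    return 100.0 * completed / planned


if __name__ == "__main__":
    done, failed = capped_completion(32, 30.0, 1.0, 600.0)
    print(done, failed, round(completion_pct(done, 32), 1))
```

Dividing by planned turns rather than attempted turns is what keeps the number honest: an arm that stalls and never issues most of its turns cannot score well by only reporting the few it finished.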
1. Reuse / APC axis — fixed real prefill, vary cached prefix
N=8. Hold the real new-prefill work per turn constant (`--delta-len`) and grow the cached prefix → reuse = prefix/(prefix+delta). Three shapes isolate output vs delta:
| shape | delta (real prefill/turn) | output | role |
|---|---|---|---|
| A | 2048 | 256 | original |
| C | 2048 | 128 | A vs C = pure output 256→128 |
| B | 1024 | 128 | C vs B = pure delta 2048→1024 |
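As a worked illustration of reuse = prefix/(prefix+delta), the sketch below inverts the formula to get the cached-prefix length needed for a target reuse% at a fixed delta. The function names are hypothetical, and the prefix lengths it yields are arithmetic consequences of the formula, not values read from the run scripts.

```python
def reuse_fraction(prefix_len, delta_len):
    """reuse = prefix / (prefix + delta), as defined above."""
    return prefix_len / (prefix_len + delta_len)


def prefix_len_for_reuse(delta_len, reuse_pct):
    """Cached-prefix length (tokens) that gives reuse_pct at a fixed delta.

    Solving reuse = p / (p + d) for p gives p = d * reuse / (1 - reuse).
    reuse_pct is an integer percentage in [0, 100).
    """
    if not 0 <= reuse_pct < 100:
        raise ValueError("reuse_pct must be in [0, 100)")
    return round(delta_len * reuse_pct / (100 - reuse_pct))


if __name__ == "__main__":
    for pct in (20, 50, 80, 90, 95):
        print(pct, prefix_len_for_reuse(2048, pct))
    # The concurrency-axis shape from section 2: prefix 32256 + delta 512.
    print(reuse_fraction(32256, 512))
```

Because delta stays fixed, the total input grows sharply toward high reuse: at delta=2048 the prefix goes from 512 tokens at 20% to 38912 tokens at 95%.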
PD-best advantage = colo E2E-p90 / best-PD E2E-p90 (>1 ⇒ PD wins):
| reuse% | A d2048/o256 | C d2048/o128 | B d1024/o128 |
|---|---|---|---|
| 20 | 1.34 | 1.41 | — |
| 50 | 1.36 | 1.37 | — |
| 67 | 1.47 | 1.49 | 1.27 |
| 80 | 1.31 | 1.23 | 1.25 |
| 90 | 1.15 | 1.01 | — |
| 95 | 0.87 | 0.89 | 0.89 |
Findings:
- Output length is ~negligible. A and C (same delta) track each other across the whole range — halving output barely moves PD's advantage.
- Delta (real prefill/turn) is the dominant shape factor. B (delta=1024) sits clearly below A/C at 67% reuse (1.27 vs ~1.48), although at 80% it falls between them (1.25 vs 1.31 and 1.23). More real prefill per turn → bigger PD win, because PD-disagg's benefit is isolating prefill from decode, and more prefill per turn means more to isolate.
- Crossover to colo at reuse ~90–95% is robust across all three shapes: PD always loses the high-reuse / large-resident-context corner (it must KV-transfer the whole resident context every turn for a few hundred new tokens; colo keeps it local).
Caveat: the clean, uncapped, 100%-completion comparison region is reuse 20–80%, which carries the first two findings. At reuse 90/95% the PD arms collapse; C's points are capped-completion while A/B are full-drain, so the shapes are comparable in direction but not in exact PD completion%.
Data: fig1_reuse_fixed.json (A), fig1_reuse_d2048_o128.json (C), fig1_reuse_d1024_o128.json (B).
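To make the ~90–95% crossover concrete, the sketch below takes the PD-best advantage values from the table above and linearly interpolates where each shape crosses 1.0. Linear interpolation between measured points is an assumption made here for illustration (the true curve between points is unmeasured), and `crossover_reuse` is a hypothetical helper, not part of the plotting script.

```python
# PD-best advantage by reuse%, copied from the table above (missing cells omitted).
ADVANTAGE = {
    "A": {20: 1.34, 50: 1.36, 67: 1.47, 80: 1.31, 90: 1.15, 95: 0.87},
    "C": {20: 1.41, 50: 1.37, 67: 1.49, 80: 1.23, 90: 1.01, 95: 0.89},
    "B": {67: 1.27, 80: 1.25, 95: 0.89},
}


def crossover_reuse(points):
    """Reuse% where the advantage first drops below 1.0, or None if it never does.

    Assumes the advantage is linear between adjacent measured points.
    """
    xs = sorted(points)
    for lo, hi in zip(xs, xs[1:]):
        a, b = points[lo], points[hi]
        if a >= 1.0 > b:
            return lo + (hi - lo) * (a - 1.0) / (a - b)
    return None


if __name__ == "__main__":
    for shape, points in ADVANTAGE.items():
        print(shape, round(crossover_reuse(points), 1))
```

Under that assumption the interpolated crossovers land at roughly 92.7% for A and 90.4% for both C and B. B's estimate spans the wider 80→95 gap, so it is the least constrained of the three.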
2. Concurrency axis — agentic corner, sweep N
in=32768 (prefix 32256 + delta 512, reuse 0.984), out=128; closed-loop N ∈ {8,16,32,48,64,96,128}; PD arms capped 600s, colo uncapped.
| N | colo completion · E2E-mean · TPS | best PD-arm completion |
|---|---|---|
| 8 | 256/256 · 2.4s · 326 | 6P+2D 256/256 |
| 16 | 512/512 · 3.5s · 462 | 6P+2D 439/512 (86%) |
| 32 | 1024/1024 · 13.3s · 190 | all PD <27% |
| 48 | 1536/1536 · 24.9s · 168 | all PD <32% |
| 64 | 2048/2048 · 38.4s · 166 | all PD <31% |
| 96 | 3072/3072 · 60.0s · 171 | PD 2–7% |
| 128 | 4096/4096 · 80.8s · 181 | 4P+4D 6%, 2P+6D <1% |
Finding: colo completes 100% of requests at every concurrency level: it degrades gracefully (E2E-mean rises 2.4s→80.8s, nothing dropped). Every static PD split collapses, and progressively earlier as N rises: the best PD arm completes 100% only at N=8 and 86% at N=16; by N≥32 PD drops roughly 70–99% of requests while its prefix-cache hit-rate craters to ~0%. colo's elastic pool absorbs the time-varying P/D demand; the static partition + per-turn 32k KV-transfer cannot. (Latency percentiles count successes only, so they understate PD's degradation; read them with the completion column.)
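The note about success-only percentiles can be illustrated with a small sketch. The latencies below are made up, the nearest-rank percentile is an assumption (the driver's percentile method is not stated here), and treating each failure as lasting the full wall cap is just one possible way to put failures back into the distribution.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct*n/100)
    return ordered[rank - 1]


def p90_views(success_latencies_s, n_failed, cap_s):
    """Return (success-only p90, p90 with each failure censored at cap_s)."""
    success_only = nearest_rank_percentile(success_latencies_s, 90)
    censored = nearest_rank_percentile(
        list(success_latencies_s) + [cap_s] * n_failed, 90
    )
    return success_only, censored


if __name__ == "__main__":
    # Hypothetical collapsed arm: 4 fast successes, 6 requests that never finished.
    print(p90_views([2.0, 3.0, 4.0, 5.0], n_failed=6, cap_s=600.0))
```

In this made-up case the success-only p90 is 5.0s, while the censored view is pinned at the 600s cap, which is why a collapsed arm can show a flattering latency unless it is read alongside completion%.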
Data: fig3_conc32k.json.
Caveat: N=128 6P+2D is missing (one transient vLLM/Mooncake startup flake at the end); does not change the picture (all PD arms are already collapsed by N=128). The SLO auto-stop in the driver is a no-op (a stdout-capture bug), so the full grid ran — more points, not fewer.
3. Reproduce
# on a box with healthy Mooncake/RDMA NICs (dash2), cpfs mounted:
R=/home/admin/cpfs/wjh/agentic-kv-fresh
# reuse axis (three shapes): DELTA/OL pick the shape; tag carries _o${OL}
ssh dash2 "cd $R && DELTA=2048 OL=256 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=2048 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=1024 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
# concurrency axis (capped):
ssh dash2 "cd $R && NLIST='8 16 32 48 64 96 128' CONC_PD_MAXDUR=600 bash microbench/fresh_setup/run_conc.sh"
# render (reads the *.json in this dir):
python microbench/fresh_setup/plot_pd_crossover.py

