Gahow Wang 54e1f5266a MB5 PD ablation v2 results: concurrency axis + reuse 3-way writeup

- fig3_conc32k.json + fig3_concurrency_axis.png: agentic-corner concurrency
  sweep (in=32768, reuse=0.984, out=128), N 8->128, PD capped 600s / colo
  uncapped. colo completes 100% at every N (graceful, E2E 2.4s->81s); every
  static PD split collapses, earlier as N rises (viable only N<=16; <27% by N=32).
- analysis/mb5_pd_ablation/README.md: self-contained v2 writeup. Reuse axis
  3-way (A=d2048/o256, C=d2048/o128, B=d1024/o128) decomposes shape: output
  ~negligible, delta (real prefill/turn) dominant; crossover to colo at reuse
  ~90-95% robust. Run on dash2 (dash0 NICs faulted for Mooncake).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-01 09:35:25 +08:00

PD-disagg vs colocation — controlled reuse & concurrency axes (v2)

Self-contained results for the controlled-variable redo of the MB5 PD-vs-colo ablation. This supersedes the confounded first cut, which held the input length fixed and sliced the prefix, so that "more reuse" was entangled with "less prefill". All arms route through the proxy at fair APC parity (session-routed producers reach the same prefix-cache hit rate as colo), so PD loses on structure, not because of a broken cache.

Config arms: colo = 8×kv_both (8C-proxy, session-affinity); PD = 6P+2D / 4P+4D / 2P+6D.
Driver: closed-loop N (REPLAY_MAX_INFLIGHT) + fixed think-time; gen_synthetic_trace.py --mode regular.
PD-arm wall-cap: collapsed PD arms drain pathologically slowly, so PD arms run with a wall-deadline (REPLAY_MAX_DURATION; un-run turns counted as failures → honest completion%); colo is uncapped so the reference is always fully measured.
Hardware: run on dash2 (8×H20). dash0's RDMA NICs were faulted for Mooncake during this work (could not init the transfer engine; needs an admin reset — no sudo); dash2's NICs are healthy. cpfs/venv/data are shared across the boxes.
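
As a rough illustration of the driver semantics described above (closed-loop N, fixed think-time, wall-deadline with un-run turns counted as failures), here is a minimal sketch. It is not the actual replay driver: `run_closed_loop`, `send_turn` and the session layout are hypothetical names, and mapping `n_inflight` / `max_duration_s` onto REPLAY_MAX_INFLIGHT / REPLAY_MAX_DURATION is an assumption.

```python
import asyncio
import time


async def run_closed_loop(sessions, n_inflight, think_s, max_duration_s, send_turn):
    """Replay multi-turn sessions with at most n_inflight requests in flight.

    sessions: list of sessions, each a list of turns (opaque objects).
    send_turn: hypothetical async callable, returns True on success.
    Returns (completed, planned); turns never issued count against completion.
    """
    planned = sum(len(s) for s in sessions)
    completed = 0
    deadline = None if max_duration_s is None else time.monotonic() + max_duration_s

    queue = asyncio.Queue()
    for session in sessions:
        queue.put_nowait(session)

    async def worker():
        nonlocal completed
        while True:
            try:
                session = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            for turn in session:
                # Wall-cap: stop issuing new turns once the deadline has passed.
                if deadline is not None and time.monotonic() >= deadline:
                    return
                if await send_turn(turn):
                    completed += 1
                # Closed loop: the next turn is issued only after the reply
                # arrives and the fixed think-time elapses.
                await asyncio.sleep(think_s)

    await asyncio.gather(*(worker() for _ in range(n_inflight)))
    return completed, planned


async def _fake_send(turn):
    # Stand-in for a real request; always succeeds immediately.
    await asyncio.sleep(0)
    return True


if __name__ == "__main__":
    demo = [[f"s{i}-t{j}" for j in range(4)] for i in range(8)]
    done, total = asyncio.run(
        run_closed_loop(demo, n_inflight=8, think_s=0.0,
                        max_duration_s=None, send_turn=_fake_send)
    )
    print(f"completion: {done}/{total}")
```

Because `planned` is fixed up front, a capped arm that runs out of wall time reports completed/planned rather than completed/attempted, which is the "honest completion%" convention used for the PD arms.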

1. Reuse / APC axis — fixed real prefill, vary cached prefix

N=8. Hold the real new-prefill work per turn constant (--delta-len) and grow the cached prefix → reuse = prefix/(prefix+delta). Three shapes isolate output vs delta:

shape	delta (real prefill/turn)	output	role
A	2048	256	original
C	2048	128	A vs C = pure output 256→128
B	1024	128	C vs B = pure delta 2048→1024
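
For concreteness, the reuse definition above can be inverted to get the cached-prefix length needed for a target reuse at a fixed delta. This is an illustrative sketch only: the function names are hypothetical and the rounding is an assumption, not necessarily what gen_synthetic_trace.py does.

```python
def reuse_fraction(prefix_len: int, delta_len: int) -> float:
    """reuse = prefix / (prefix + delta), as defined in this section."""
    return prefix_len / (prefix_len + delta_len)


def prefix_len_for_reuse(delta_len: int, reuse: float) -> int:
    """Invert the definition: prefix = delta * reuse / (1 - reuse).

    Rounding to the nearest token is an assumption made for this sketch.
    """
    if not 0.0 <= reuse < 1.0:
        raise ValueError("reuse must be in [0, 1)")
    return round(delta_len * reuse / (1.0 - reuse))


if __name__ == "__main__":
    # Shapes A/C use delta=2048, shape B uses delta=1024.
    for delta in (2048, 1024):
        for reuse in (0.20, 0.50, 0.80, 0.90, 0.95):
            prefix = prefix_len_for_reuse(delta, reuse)
            print(delta, reuse, prefix, round(reuse_fraction(prefix, delta), 3))
```

The same definition gives the concurrency-axis shape in section 2: prefix 32256 + delta 512 is 32256 / 32768 = 0.984375, reported there as reuse 0.984.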

PD-best advantage = colo E2E-p90 / best-PD E2E-p90 (>1 ⇒ PD wins):

reuse%	A d2048/o256	C d2048/o128	B d1024/o128
20	1.34	1.41	—
50	1.36	1.37	—
67	1.47	1.49	1.27
80	1.31	1.23	1.25
90	1.15	1.01	—
95	0.87	0.89	0.89

Findings:

1. Output length is ~negligible. A and C (same delta) track each other across the whole range: halving output barely moves PD's advantage.
2. Delta (real prefill/turn) is the dominant shape factor. B (delta=1024) sits clearly below A/C at mid reuse (67%: 1.27 vs ~1.48). More real prefill per turn → bigger PD win, because PD-disagg's benefit is isolating prefill from decode, and a larger delta means more prefill to isolate.
3. Crossover to colo at reuse ~90–95% is robust across all three shapes: PD always loses the high-reuse / large-resident-context corner (it must KV-transfer the whole resident context every turn for a few hundred new tokens; colo keeps it local).
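
One way to sanity-check the crossover claim is to interpolate each column of the table above to the point where the advantage falls to 1.0. The sketch below does this with linear interpolation between measured points, which is an assumption of this example (the real curve between points is unmeasured); `crossover_reuse` is a hypothetical helper, and the numbers are copied from the table with missing cells omitted.

```python
# PD-best advantage by reuse %, copied from the table above.
ADVANTAGE = {
    "A d2048/o256": {20: 1.34, 50: 1.36, 67: 1.47, 80: 1.31, 90: 1.15, 95: 0.87},
    "C d2048/o128": {20: 1.41, 50: 1.37, 67: 1.49, 80: 1.23, 90: 1.01, 95: 0.89},
    "B d1024/o128": {67: 1.27, 80: 1.25, 95: 0.89},
}


def crossover_reuse(points, threshold=1.0):
    """Return the reuse % where advantage first drops below threshold.

    Linearly interpolates between the two neighbouring measured points;
    returns None if the series never crosses the threshold.
    """
    xs = sorted(points)
    for lo, hi in zip(xs, xs[1:]):
        a_lo, a_hi = points[lo], points[hi]
        if a_lo >= threshold > a_hi:
            frac = (a_lo - threshold) / (a_lo - a_hi)
            return lo + frac * (hi - lo)
    return None


if __name__ == "__main__":
    for shape, series in ADVANTAGE.items():
        print(shape, round(crossover_reuse(series), 1))
```

All three interpolated crossovers land between 90% and 95% reuse. B's estimate is the weakest of the three, since it interpolates across the 80→95 gap with no 90% point.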

Caveat: the clean comparison region (uncapped, 100% completion) is reuse 20–80%, which carries findings 1–2. At reuse 90/95% the PD arms collapse; C's points were run with the wall-cap (capped completion) while A/B drained fully, so the three shapes are comparable in direction but not in exact PD completion%.

Data: fig1_reuse_fixed.json (A), fig1_reuse_d2048_o128.json (C), fig1_reuse_d1024_o128.json (B).

2. Concurrency axis — agentic corner, sweep N

in=32768 (prefix 32256 + delta 512, reuse 0.984), out=128; closed-loop N ∈ {8,16,32,48,64,96,128}; PD arms capped 600s, colo uncapped.

N	colo completion · E2E-mean · TPS	best PD-arm completion
8	256/256 · 2.4s · 326	6P+2D 256/256
16	512/512 · 3.5s · 462	6P+2D 439/512 (86%)
32	1024/1024 · 13.3s · 190	all PD <27%
48	1536/1536 · 24.9s · 168	all PD <32%
64	2048/2048 · 38.4s · 166	all PD <31%
96	3072/3072 · 60.0s · 171	PD 2–7%
128	4096/4096 · 80.8s · 181	4P+4D 6%, 2P+6D <1%

Finding: colo completes 100% of requests at every concurrency level: it degrades gracefully (E2E-mean rises 2.4s→80.8s, nothing dropped). Every static PD split collapses, and progressively earlier as N rises: PD is viable only at N≤16 (100% at N=8, 86% for the best arm at N=16); by N≥32 it drops roughly 70–99% of requests while its prefix-cache hit-rate craters to ~0%. colo's elastic pool absorbs the time-varying P/D demand; the static partition + per-turn 32k KV-transfer cannot. (Latency percentiles count successes only, so they understate PD's degradation; read them alongside the completion column.)
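
To see why success-only percentiles flatter a collapsed arm, the sketch below compares a p90 over successes with a p90 where every failed or un-run request is charged the wall-cap. The second metric is not something this writeup computes; it is an assumption introduced purely for illustration, as are the helper names and the toy latencies. Only the 439/512 count comes from the table above.

```python
import math


def completion_pct(succeeded: int, planned: int) -> float:
    """Completion over all planned requests, not just attempted ones."""
    return 100.0 * succeeded / planned


def percentile(values, q: float) -> float:
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def censored_percentile(success_latencies, planned: int, cap_s: float, q: float) -> float:
    """Percentile where each missing request is charged cap_s seconds."""
    missing = planned - len(success_latencies)
    return percentile(list(success_latencies) + [cap_s] * missing, q)


if __name__ == "__main__":
    # Best PD arm at N=16 in the table: 439 of 512 completed.
    print(round(completion_pct(439, 512)))

    # Toy latencies (hypothetical): 2 successes out of 10 planned, 600 s cap.
    toy = [1.0, 2.0]
    print(percentile(toy, 90), censored_percentile(toy, 10, 600.0, 90))
```

In the toy case the success-only p90 is 2.0 s while the cap-charged p90 is 600 s, which is the gap the completion column is there to expose.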

Data: fig3_conc32k.json.

Caveat: N=128 6P+2D is missing (one transient vLLM/Mooncake startup flake at the end); does not change the picture (all PD arms are already collapsed by N=128). The SLO auto-stop in the driver is a no-op (a stdout-capture bug), so the full grid ran — more points, not fewer.

3. Reproduce

# on a box with healthy Mooncake/RDMA NICs (dash2), cpfs mounted:
R=/home/admin/cpfs/wjh/agentic-kv-fresh
# reuse axis (three shapes): DELTA/OL pick the shape; tag carries _o${OL}
ssh dash2 "cd $R && DELTA=2048 OL=256 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=2048 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
ssh dash2 "cd $R && DELTA=1024 OL=128 bash microbench/fresh_setup/run_reuse_fixed.sh"
# concurrency axis (capped):
ssh dash2 "cd $R && NLIST='8 16 32 48 64 96 128' CONC_PD_MAXDUR=600 bash microbench/fresh_setup/run_conc.sh"
# render (reads the *.json in this dir):
python microbench/fresh_setup/plot_pd_crossover.py
