The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot + pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)", implicitly treating the benefit as per-request and bounded by that request's own decode. The correct accounting is per-prefill-event across all stalled streams:

    benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during) ≈ D × T_prefill

which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

    prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
    2k tok       | 0.14 s    | 8 ms       | 1.1 s       | 0.7 %
    33k tok      | 4.5 s     | 320 ms     | 36 s        | 0.9 %
    125k tok     | 57 s      | 1.9 s      | 456 s       | 0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250×, the opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency.
Changes (all touched files explicitly listed; no `git add -u`):

- figs/pd_cost_vs_benefit.png: DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py: drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png: regenerated, no misleading band annotation
- analysis/mb1/README.md: Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong
- analysis/mb2/README.md: Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6: §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim; they already (correctly) frame §3.2 around the D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MB2 — Mooncake KV Transfer Cost (vanilla vLLM 0.18.1)
Persistent record of the per-stage KV transfer microbench used in §3.2 of the EAR paper. Re-runs append a dated section at the bottom; the Summary block at the top is what gets cited in the paper.
Summary (latest)
| Path | Steady-state BW | Agentic-tail p99 transfer (11.5 GiB KV) |
|---|---|---|
| intra-node (dash1 GPU 0↔1) | ~9.7 GB/s (96 MiB – 3 GiB) | p50 1.9 s · min 1.5 s · max 10 s |
| inter-node (dash1 GPU0 → dash2 GPU0, 200 Gbps RoCE) | ~10.0 GB/s (essentially identical) | p50 1.7 s · min 1.3 s · max 9.2 s |
Cross-cutting finding (2026-05-27): Mooncake transfer cost is
topology-independent on this hardware. Intra-node and inter-node curves
are statistically indistinguishable (see figs/mb2_transfer_time_compare.png,
figs/mb2_transfer_bw_compare.png). Mechanism: Mooncake's
batch_transfer_sync_write always goes through the RDMA NIC, including
the intra-node case (RDMA loopback). The 200 Gbps NIC, not NVLink, is
the bottleneck. Implication for §3.2: PD-disaggregation does not
get cheaper by co-locating P and D on the same node, because the
~9.7 GB/s ceiling applies regardless. Topology changes cannot reduce
the transfer cost.
What MB2 actually measures: the per-request charge that
PD-disagg pays for every routed request — T_transfer ≈ KV_size / 9.7 GB/s. For agentic this is 8 ms (192 MiB / trace lower) – 1.9 s
(11.5 GiB / p99).
⚠ Correction (2026-05-27): an earlier version of this README
framed §3.2 as "transfer cost (1.5–10 s) >> decode duration (50–200 ms),
so PD-disagg loses on cost-vs-benefit." That accounting was wrong:
PD-disagg's phase-isolation benefit is per-prefill-event and equals
D × T_prefill (aggregate across stalled decode streams), not the
single-request decode duration. With trace-mean T_prefill = 4.5 s and
D = 8, the benefit is ~36 s — far larger than the ~0.32 s transfer
cost. PD-disagg's phase-isolation axis is a win, not a loss.
The actual reason static PD-disagg fails in agentic is D-side KV
capacity (figs/f4b_pdsep_kv_wall.png), not a cost-vs-benefit
imbalance. See RESULTS_SUMMARY.md section 4 for the corrected
framing. MB2 still serves as the source of the per-request transfer
cost number used in that analysis.
Setup
| Component | Value |
|---|---|
| Host | dash1 (ds-6348bee4-1-...-rwkv2), 8× NVIDIA H20 96 GiB, driver 570.133.20 |
| Venv | /home/admin/cpfs/wjh/agentic-kv-fresh/.venv (shared via cpfs from any dash host) |
| vLLM | 0.18.1 official wheel |
| mooncake-transfer-engine | 0.3.11.post1 (pip install mooncake-transfer-engine) |
| Model | /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Per-token KV | 98304 B |
| kv_role | kv_both on both instances (see Known limitations re kv_producer/kv_consumer) |
| Per-instance config | --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 200000 --enable-prefix-caching |
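The per-token KV figure in the table above is what ties token counts to the transfer sizes used throughout this README. A minimal sketch of that conversion, assuming KV size scales linearly with prompt length (the function name `kv_size_mib` is illustrative, not part of the repo):

```python
# Per-token KV footprint taken from the Setup table (bytes per token).
PER_TOKEN_KV_BYTES = 98304
MIB = 1024 ** 2


def kv_size_mib(input_tokens: int) -> float:
    """Return the KV cache size in MiB for a prompt of `input_tokens` tokens.

    Assumes KV size is exactly linear in the number of tokens.
    """
    return input_tokens * PER_TOKEN_KV_BYTES / MIB


if __name__ == "__main__":
    for tokens in (512, 2048, 32768, 131072):
        print(f"{tokens:>7} tokens -> {kv_size_mib(tokens):>6.0f} MiB")
```

Under that assumption, 512 tokens map to 48 MiB and 131072 tokens to 12288 MiB, which is how the "KV (MiB)" column in the results tables lines up with "input_tokens".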
Method
3-step black-box bench:
- do_remote_decode to A (producer) with a client-generated transfer_id. max_tokens=1; A computes prefill and parks the KV for later pull.
- do_remote_prefill to B (consumer) with the same transfer_id plus remote_engine_id (from A's /query on bootstrap port) and remote_bootstrap_addr (http://127.0.0.1:8998). This step triggers the actual KV transfer; it is the measured step.
- Plain completion on B (--skip-verify off): expect cached_tokens ≈ prompt_len, confirming the KV landed on B.
Per-stage breakdown is obtained by instrumenting the vLLM-shipped
MooncakeConnector (NOT the mooncake-package's mooncake_connector_v1,
which vLLM 0.18.1 does not load) at two sites:
- _send_blocks (P-side, line 980): emits send_blocks event with total_bytes, duration_s, t_start_unix. The duration_s is the wall-time of a single batch_transfer_sync_write call; this is what we call pure_transfer.
- receive_kv_from_single_worker (D-side, line 1139, async): emits receive_kv_enter at function start and receive_kv_finish on FINISH-status response. The wall-time between them is rx_total (= ZMQ round-trip + setup + pure_transfer + ack).
Pairing across A's and B's logs is by time window: each B
(enter, finish) pair is matched to the A send_blocks whose
t_start_unix falls in [rx_t_start, rx_t_end]. With single-request
benchmarks this is unambiguous.
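The time-window pairing can be sketched as follows. This is not the code in analyze_mb2.py: the dict layout is hypothetical, and only the field names rx_t_start, rx_t_end, t_start_unix and duration_s are taken from the description above. Treating overhead as rx_total minus pure_transfer is likewise an assumption.

```python
from typing import Optional


def pair_events(rx_windows: list[dict], sends: list[dict]) -> list[tuple[dict, Optional[dict]]]:
    """Match each B-side (enter, finish) window to the A-side send inside it."""
    pairs = []
    for rx in rx_windows:
        matches = [
            s for s in sends
            if rx["rx_t_start"] <= s["t_start_unix"] <= rx["rx_t_end"]
        ]
        # Serial single-request runs should give exactly one match; anything
        # else is reported as unpaired (None) rather than guessed.
        pairs.append((rx, matches[0] if len(matches) == 1 else None))
    return pairs


def overhead_ms(rx: dict, send: dict) -> float:
    """rx_total minus pure_transfer, in milliseconds (assumed definition)."""
    rx_total_s = rx["rx_t_end"] - rx["rx_t_start"]
    return (rx_total_s - send["duration_s"]) * 1e3


if __name__ == "__main__":
    # Illustrative timestamps only, not measured data.
    rx_windows = [{"rx_t_start": 100.000, "rx_t_end": 100.025}]
    sends = [{"t_start_unix": 100.002, "duration_s": 0.020}]
    for rx, send in pair_events(rx_windows, sends):
        if send is not None:
            print(f"overhead = {overhead_ms(rx, send):.1f} ms")
```

Returning None for zero or multiple matches keeps the sketch honest about the single-in-flight assumption: with concurrent transfers a window could contain several sends, and a transfer_id-based join would be needed instead.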
Scripts:
- microbench/fresh_setup/start_vllm_pair.sh: bring up pair + apply/revert patch
- microbench/fresh_setup/instrument_mooncake.py: apply/revert MB2 patches
- microbench/fresh_setup/mb2_kv_transfer.py: client (3-step bench loop)
- microbench/fresh_setup/analyze_mb2.py: pair A/B events into per-size table
- microbench/fresh_setup/plot_mb2.py: log-log time + bandwidth curves
Results — intra-node (2026-05-27, dash1 GPU 0+1, kv_both)
Raw events: A_intra_kvboth.jsonl, B_intra_kvboth.jsonl.
Joined + aggregated: intra_kvboth_breakdown.json.
Figures: figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png.
| input_tokens | KV (MiB) | n | pure_ms p50 | pure_ms max | rx_total_ms | overhead_ms | BW p50 (GB/s) | BW max (GB/s) |
|---|---|---|---|---|---|---|---|---|
| 512 | 48 | 5 | 5.3 | 5.6 | 12.2 | 3.3 | 9.40 | 9.53 |
| 1024 | 96 | 5 | 10.4 | 10.5 | 11.9 | 1.5 | 9.68 | 9.72 |
| 2048 | 192 | 5 | 20.6 | 21.0 | 22.5 | 1.8 | 9.75 | 9.78 |
| 4096 | 384 | 5 | 41.5 | 41.7 | 43.5 | 2.0 | 9.71 | 9.72 |
| 8192 | 768 | 5 | 83.7 | 84.4 | 86.2 | 2.2 | 9.62 | 9.69 |
| 16384 | 1536 | 5 | 167.1 | 167.7 | 170.2 | 2.7 | 9.64 | 9.67 |
| 32768 | 3072 | 5 | 320.9 | 322.1 | 425.2 | 20.5 | 10.04 | 10.09 |
| 65536 | 6144 | 5 | 1895.1 | 2375.2 | 1586.1 | 69.6 | 3.40 | 9.68 |
| 131072 | 12288 | 5 | 2835.1 | 8923.6 | 4362.5 | 91.4 | 4.54 | 9.67 |
Three regimes in the data:
- <= 3 GiB — linear in size, bandwidth ≈ 9.7 GB/s steady.
- 6 GiB ± a bit — onset of variance: max bandwidth still 9.7 GB/s, but p50 collapses to ~3.4 GB/s. Some runs achieve full speed; others take 2–3 × longer.
- 12 GiB — wide spread (min 1.5 s, max 10 s for the same 11.5 GiB transfer). This is the agentic-p99 size region.
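The bandwidth columns follow from total_bytes and duration_s. A small sketch of that conversion, assuming decimal GB/s (1 GB = 10^9 bytes), which appears consistent with the rows in the linear regime; the helper name `bandwidth_gb_s` is illustrative:

```python
MIB = 1024 ** 2


def bandwidth_gb_s(total_bytes: int, duration_s: float) -> float:
    """Effective bandwidth of one transfer in decimal GB/s (assumed unit)."""
    return total_bytes / duration_s / 1e9


if __name__ == "__main__":
    # (KV size in MiB, pure_ms p50) taken from the intra-node table above.
    for kv_mib, pure_ms in ((768, 83.7), (1536, 167.1), (3072, 320.9)):
        bw = bandwidth_gb_s(kv_mib * MIB, pure_ms / 1e3)
        print(f"{kv_mib:>5} MiB in {pure_ms:>6.1f} ms -> {bw:.2f} GB/s")
```

Note that the p50 of per-sample bandwidth need not equal size divided by the p50 time exactly, so small differences from the "BW p50" column are expected, especially in the 6+ GiB regime where the samples spread widely.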
The bandwidth ceiling of ~10 GB/s is well below H20's NVLink p2p
(claimed ~900 GB/s in IB). At the time of this run the likely
explanation was PCIe staging through host memory rather than NVLink
direct; the later inter-node run (see Run log) points instead to RDMA
loopback through the local NIC. To confirm we would need
nvidia-smi topo -m and mooncake_transfer_engine_topology_dump
analysis; not done yet.
Known limitations of this measurement
- kv_both, not strict PD-disagg. vLLM 0.18.1 with kv_role=kv_consumer raises AttributeError: 'MooncakeConnectorWorker' object has no attribute 'bootstrap_server' (the attribute is only assigned inside if not self.is_kv_consumer). The transfer mechanics are identical (same batch_transfer_sync_write), so the cost measurement is comparable. The role gate only affects which request types each instance accepts. The §5.2 strict PD-disagg baseline will need either a fix for that bug or a role-aware proxy in front of the pair.
- Single in-flight request. All measurements here are serial. Real PD-disagg will have many concurrent transfers; bandwidth contention is not characterized.
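- Intra-node only for this run. The inter-node RDMA path was expected to be slower; the later inter-node run (see Run log) measured it and found essentially the same numbers.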
- Sanity preamble events. The raw logs include 6 events from earlier sanity runs in addition to the 45-event sweep. analyze_mb2.py treats them as additional samples (same sizes); the per-size aggregates use all of them.
Implications for §3.2 PD-disagg argument
For each PD-disagg-routed request, transfer wall-time is:
T_transfer(KV_size) ≈ KV_size / 9.7 GB/s for KV_size ≤ 3 GiB
≈ 0.3 – 10 s for KV_size in [3, 12] GiB
This is the per-request transfer charge of PD-disagg. It's a real cost, but in the context of phase-isolation accounting it is small compared to the benefit:
| Prefill | T_prefill (MB1) | T_transfer (MB2) | Phase-isolation benefit at D=8 = D × T_prefill |
|---|---|---|---|
| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s |
| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s |
| 125k tok (~p99) | 57 s | 1.9 s | 456 s |
On the phase-isolation axis alone, PD-disagg recovers two orders of
magnitude more decode time than it pays in transfer. It is NOT this
axis that defeats static PD-disagg in agentic — see colleague's
4P+4D experiment (TTFT p50 62×, success rate 99.5% → 52%) which is
driven by D-side KV-pool overflow on long-context requests
(figs/f4b_pdsep_kv_wall.png), not by transfer latency.
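The accounting in the table above can be reproduced with a few lines. This sketch uses the approximation benefit ≈ D × T_prefill from this README and the three (T_prefill, T_transfer) rows as given; the function names are illustrative.

```python
def phase_isolation_benefit_s(t_prefill_s: float, d_streams: int = 8) -> float:
    """Aggregate decode stall avoided per prefill event: D x T_prefill."""
    return d_streams * t_prefill_s


def cost_over_benefit(t_transfer_s: float, t_prefill_s: float, d_streams: int = 8) -> float:
    """Transfer cost as a fraction of the phase-isolation benefit."""
    return t_transfer_s / phase_isolation_benefit_s(t_prefill_s, d_streams)


if __name__ == "__main__":
    # (label, T_prefill in s from MB1, T_transfer in s from MB2)
    rows = [
        ("2k tok", 0.14, 0.008),
        ("33k tok", 4.5, 0.320),
        ("125k tok", 57.0, 1.9),
    ]
    for label, t_prefill, t_transfer in rows:
        benefit = phase_isolation_benefit_s(t_prefill)
        ratio = cost_over_benefit(t_transfer, t_prefill)
        print(f"{label:>9}: benefit {benefit:6.1f} s, cost/benefit {100 * ratio:.1f} %")
```

The ratio stays below 1 % for all three rows at D = 8. Because the benefit scales with D, a lightly loaded decode instance (small D) narrows the gap, so D is the parameter to vary when stress-testing this argument.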
What MB2 contributes to the paper is therefore:
- The per-request transfer cost number (used as the cost input to the cost-benefit accounting above).
- The empirical observation that Mooncake's transfer cost is topology-independent — intra-node and inter-node both go through the RDMA NIC and hit the same 9.7 GB/s ceiling. PD-disagg's transfer cost does not get cheaper by co-locating P and D.
The dominant §3.2 failure mode of static PD-disagg in agentic is capacity, not transfer cost. MB3 / MB4 / MB5 will quantify the remaining axes (D-pool occupancy, cache reuse degradation under PD routing, static-partition mismatch).
Open questions / next runs
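- Inter-node RDMA: dash1 ↔ dash2. Done in the 2026-05-27 inter-node run (see Run log). The expected lower bandwidth (~5–15 GB/s) did not materialize (~10 GB/s, same as intra-node), and the 6 GiB-onset variance did not move.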
- Bandwidth ceiling investigation: is the 9.7 GB/s ceiling PCIe (so the connector is not using NVLink direct) or some internal limit? If PCIe, can it be lifted with NVLink-direct mooncake config?
- Variance at 6+ GiB: investigate. Maybe related to chunking inside batch_transfer_sync_write, or GPU memory pressure when KV approaches HBM ceiling.
- Concurrent transfers: measure aggregate bandwidth when N simultaneous transfers happen. PD-disagg in practice does this.
- Strict kv_producer/kv_consumer: patch the bootstrap_server bug or use a proxy; verify transfer time is unchanged.
Reproduction
# On dash machine with cpfs mount + ssh access:
bash microbench/fresh_setup/install.sh # once (idempotent)
bash microbench/fresh_setup/deploy.sh dash1 # push scripts to cpfs
# bring up pair (intra-node)
ssh dash1 'GPU_A=0 GPU_B=1 bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh start'
# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb2_kv_transfer.py \
--sizes 512,1024,2048,4096,8192,16384,32768,65536,131072 \
--repeats 5 --label intra-kvboth \
--out /home/admin/cpfs/wjh/agentic-kv-fresh/mb2_results/intra_kvboth.json'
# pull logs
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/A/.efc_*_mb2_transfer_pid*.jsonl \
analysis/mb2/A_intra_kvboth.jsonl
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/B/.efc_*_mb2_transfer_pid*.jsonl \
analysis/mb2/B_intra_kvboth.jsonl
# analyze
.venv/bin/python microbench/fresh_setup/analyze_mb2.py \
--a-log analysis/mb2/A_intra_kvboth.jsonl \
--b-log analysis/mb2/B_intra_kvboth.jsonl \
--out analysis/mb2/intra_kvboth_breakdown.json
.venv/bin/python microbench/fresh_setup/plot_mb2.py \
--breakdown analysis/mb2/intra_kvboth_breakdown.json \
--out-time figs/mb2_transfer_time_intra.png \
--out-bw figs/mb2_transfer_bw_intra.png
# tear down
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh stop'
Run log
2026-05-27 — intra-node, kv_both, dash1 GPU 0+1
Sweep: 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072 tokens
× 5 repeats. Sanity preamble of 512, 2048, 8192 × 2 included in the
raw logs (counted as additional samples for those sizes).
Result table above. 9.7 GB/s steady-state up to 3 GiB, variance opens at 6 GiB, p99 agentic-tail transfer 1.5 – 10 s.
Committed as de164e5.
2026-05-27 — inter-node, kv_both, dash1 GPU 0 → dash2 GPU 0
Same sweep config. 200 Gbps RoCE between hosts (RTT ~0.2 ms ping).
Producer A on dash1 GPU 0, consumer B on dash2 GPU 0.
remote_bootstrap_addr=http://172.27.123.142:8998 (dash1's internal IP).
Raw events: A_inter_kvboth.jsonl (45 send_blocks + 6 sanity).
B's receive_kv events are missing for this run: the
MB2_LOG_DIR env var did not propagate from the start script through
vLLM's EngineCore subprocess on dash2. Running
cat /proc/$ENGINE_PID/environ shows no MB2_LOG_DIR on dash2 but does
show it on dash1. Bookmarked for future investigation; likely a
spawn-vs-fork difference in vLLM's multiproc executor across hosts.
Pure-transfer numbers below come from A's send_blocks alone; the full
rx_total breakdown is not available for this run.
Per-size pure-transfer (analyzed by analyze_mb2_send_only.py):
| input_tokens | KV (MiB) | n | pure_ms p50 | min | max | BW p50 (GB/s) | BW max |
|---|---|---|---|---|---|---|---|
| 512 | 48 | 5 | 5.2 | 5.1 | 65.8 | 9.76 | 9.81 |
| 1024 | 96 | 5 | 10.2 | 10.1 | 10.4 | 9.91 | 10.00 |
| 2048 | 192 | 5 | 20.0 | 20.0 | 20.5 | 10.06 | 10.07 |
| 4096 | 384 | 5 | 40.1 | 40.1 | 40.5 | 10.04 | 10.05 |
| 8192 | 768 | 5 | 80.9 | 80.7 | 82.5 | 9.96 | 9.98 |
| 16384 | 1536 | 5 | 161.8 | 161.7 | 164.8 | 9.96 | 9.96 |
| 32768 | 3072 | 5 | 309.6 | 307.7 | 526.9 | 10.40 | 10.47 |
| 65536 | 6144 | 5 | 1733.6 | 653.5 | 1921.2 | 3.72 | 9.86 |
| 131072 | 12288 | 5 | 2818.4 | 1283.0 | 9158.6 | 4.57 | 10.04 |
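A send-only aggregation of this kind can be sketched as below. This is not the code in analyze_mb2_send_only.py: the JSONL record layout, in particular the "event" key, is an assumption. Only the send_blocks event name and the total_bytes and duration_s fields come from the Method section.

```python
import json
import statistics
from collections import defaultdict


def summarize_send_blocks(lines):
    """Group send_blocks records by transfer size and aggregate durations."""
    durations_by_size = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # "event" is a hypothetical key name for the record type.
        if rec.get("event") != "send_blocks":
            continue
        durations_by_size[rec["total_bytes"]].append(rec["duration_s"])

    summary = {}
    for size, durs in sorted(durations_by_size.items()):
        summary[size] = {
            "n": len(durs),
            "pure_ms_p50": statistics.median(durs) * 1e3,
            "pure_ms_min": min(durs) * 1e3,
            "pure_ms_max": max(durs) * 1e3,
            # Fastest sample gives the best-case bandwidth (decimal GB/s).
            "bw_max_gb_s": size / min(durs) / 1e9,
        }
    return summary


if __name__ == "__main__":
    # Illustrative records only, not the measured log.
    sample = [
        '{"event": "send_blocks", "total_bytes": 201326592, "duration_s": 0.0200}',
        '{"event": "send_blocks", "total_bytes": 201326592, "duration_s": 0.0205}',
        '{"event": "send_blocks", "total_bytes": 201326592, "duration_s": 0.0201}',
    ]
    for size, stats in summarize_send_blocks(sample).items():
        print(size, stats)
```

Because only A's log is needed, this path still works when the D-side events are missing, at the cost of losing the rx_total and overhead columns.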
Side-by-side comparison with the 2026-05-27 intra-node run:
| Size | intra p50 ms | inter p50 ms | gap | intra GB/s | inter GB/s |
|---|---|---|---|---|---|
| 512 | 5.3 | 5.2 | −2% | 9.40 | 9.76 |
| 1024 | 10.4 | 10.2 | −2% | 9.68 | 9.91 |
| 2048 | 20.6 | 20.0 | −3% | 9.75 | 10.06 |
| 4096 | 41.5 | 40.1 | −3% | 9.71 | 10.04 |
| 8192 | 83.7 | 80.9 | −3% | 9.62 | 9.96 |
| 16384 | 167.1 | 161.8 | −3% | 9.64 | 9.96 |
| 32768 | 320.9 | 309.6 | −3% | 10.04 | 10.40 |
| 65536 | 1895.1 | 1733.6 | −9% | 3.40 | 3.72 |
| 131072 | 2835.1 | 2818.4 | −1% | 4.54 | 4.57 |
The two paths produce essentially the same numbers: Mooncake intra-node is not using NVLink; it goes through RDMA loopback on the local NIC and gets the same ~10 GB/s ceiling as cross-node RDMA. The 6+ GiB variance regime is also identical between paths.
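The "gap" column in the side-by-side table appears to be the relative difference of the inter-node p50 against the intra-node p50. A short sketch under that assumption (the helper name `gap_pct` is illustrative):

```python
def gap_pct(intra_ms: float, inter_ms: float) -> float:
    """Relative change of inter-node p50 vs intra-node p50, in percent."""
    return 100.0 * (inter_ms - intra_ms) / intra_ms


if __name__ == "__main__":
    # (input_tokens, intra p50 ms, inter p50 ms) from the comparison table.
    rows = [(512, 5.3, 5.2), (65536, 1895.1, 1733.6), (131072, 2835.1, 2818.4)]
    for tokens, intra, inter in rows:
        print(f"{tokens:>7}: {gap_pct(intra, inter):+.1f} %")
```

Negative values mean the inter-node path was marginally faster in that row; with n = 5 per size these differences are small enough to be read as noise rather than a topology effect.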
Figures: figs/mb2_transfer_time_inter.png, figs/mb2_transfer_bw_inter.png,
figs/mb2_transfer_time_compare.png (overlay), figs/mb2_transfer_bw_compare.png.
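This collapses the §3.2 transfer-cost narrative to a single number: PD-disagg across this cluster is limited to ~9.7–10 GB/s of transfer bandwidth no matter how you place P and D (within-node or across-node). For p99 agentic KV (11.5 GiB), that's 1.3–10 s of transfer; for 6 GiB it's 0.7–2 s. Per the ⚠ Correction in the Summary, this cost has to be weighed against the per-prefill-event benefit D × T_prefill, not against the 50–200 ms decode duration of a single request, so the layout-independent transfer cost alone does not make PD-disagg a loss.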