Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE.
remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep config as the 2026-05-27 intra-node run.

Per-size pure_transfer (p50) lines up within about 1–4% of the intra-node numbers at every size except 64k tok, where the 9% gap falls inside the bimodal spread:

    size       intra p50   inter p50
    512 tok    5.3 ms      5.2 ms
    2048 tok   20.6 ms     20.0 ms
    8192 tok   83.7 ms     80.9 ms
    32k tok    320.9 ms    309.6 ms
    64k tok    1895 ms     1734 ms    (bimodal in both)
    128k tok   2835 ms     2818 ms    (bimodal in both)

=> Mooncake's batch_transfer_sync_write **does not use NVLink** for intra-node peers. Both paths show the same ceiling, which is consistent with both going through the 200 Gbps RDMA NIC; the NIC path, not the GPU interconnect, is the bottleneck. The ~9.7 GB/s steady-state ceiling and the 6+ GiB variance regime appear in both topologies.

Operational implication for §3.2: PD-disaggregation does not get cheaper by co-locating P and D on the same node. Every routed request pays the same ~10 GB/s ceiling for KV transfer, no matter where it lands. Topology cannot buy back any of the transfer cost.

Caveat: B's receive_kv events did not log on dash2. The `MB2_LOG_DIR` env var did not propagate through vLLM's EngineCore subprocess on the consumer host (cat /proc/$ENGINE_PID/environ shows no such var on dash2, while the producer on dash1 has it). For this run the pure_transfer numbers come from A's send_blocks alone; the full rx_total breakdown is not available, but pure_transfer is the dominant term.
Adds:
- analyze_mb2_send_only.py: analyzer that works from A's send_blocks alone when B's receive_kv events are absent
- plot_mb2_compare.py: overlay intra vs inter on the same axes
- plot_mb2.py: tolerate the `rows`-less send-only schema
- figs/mb2_transfer_{time,bw}_inter.png: inter-node single-curve
- figs/mb2_transfer_{time,bw}_compare.png: intra vs inter overlay
- analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json, inter_kvboth_breakdown.json
- analysis/mb2/README.md: Summary block updated to reference both paths; dated 2026-05-27 run-log entry appended with the full table and the topology-independence framing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MB2 — Mooncake KV Transfer Cost (vanilla vLLM 0.18.1)
Persistent record of the per-stage KV transfer microbench used in §3.2 of the EAR paper. Re-runs append a dated section at the bottom; the Summary block at the top is what gets cited in the paper.
Summary (latest)
| Path | Steady-state BW | Agentic-tail p99 transfer (11.5 GiB KV) |
|---|---|---|
| intra-node (dash1 GPU 0↔1) | ~9.7 GB/s (96 MiB – 3 GiB) | p50 2.8 s · min 1.5 s · max 10 s |
| inter-node (dash1 GPU0 → dash2 GPU0, 200 Gbps RoCE) | ~10.0 GB/s (essentially identical) | p50 2.8 s · min 1.3 s · max 9.2 s |
Cross-cutting finding (2026-05-27): Mooncake transfer cost is
topology-independent on this hardware. Intra-node and inter-node p50
curves agree to within about 1–4% at every size except 64k tokens (9%,
inside the bimodal spread), and both show the same 6+ GiB variance
regime (see figs/mb2_transfer_time_compare.png,
figs/mb2_transfer_bw_compare.png). Likely mechanism: Mooncake's
batch_transfer_sync_write goes through the RDMA NIC even in the
intra-node case (RDMA loopback), so the NIC path, not NVLink, sets the
ceiling. This is inferred from the matching curves; it has not been
confirmed with a topology dump. Implication for §3.2: PD-disaggregation
does not get cheaper by co-locating P and D on the same node; the
~9.7 GB/s ceiling applies regardless. Topology cannot buy back any of
the transfer cost.
Headline for the paper §3.2: at the agentic tail, pure KV transfer takes 1.5 – 10 s. A median agentic decode is 50 – 200 ms of tool-call output. So PD-disaggregation adds 8 – 100 × decode-time of transfer on top of every routed request. Phase isolation (the thing PD-disagg trades transfer cost for) can only win back at most one decode duration — for agentic that's negligible. The arithmetic is one-sided.
Setup
| Component | Value |
|---|---|
| Host | dash1 (ds-6348bee4-1-...-rwkv2), 8× NVIDIA H20 96 GiB, driver 570.133.20 |
| Venv | /home/admin/cpfs/wjh/agentic-kv-fresh/.venv (shared via cpfs from any dash host) |
| vLLM | 0.18.1 official wheel |
| mooncake-transfer-engine | 0.3.11.post1 (pip install mooncake-transfer-engine) |
| Model | /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Per-token KV | 98304 B |
| kv_role | kv_both on both instances (see Known limitations re kv_producer/kv_consumer) |
| Per-instance config | --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 200000 --enable-prefix-caching |
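The KV sizes quoted throughout this README follow directly from the per-token figure in the table above. A minimal sketch of that arithmetic (the function name `kv_mib` is illustrative, not taken from the bench scripts):

```python
# Per-token KV footprint from the Setup table.
PER_TOKEN_KV_BYTES = 98304
MIB = 1024 ** 2

# The nine sweep sizes used in the Reproduction section.
SWEEP_TOKENS = [512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]


def kv_mib(input_tokens: int) -> float:
    """KV cache size in MiB for a prompt of `input_tokens` tokens."""
    return input_tokens * PER_TOKEN_KV_BYTES / MIB


if __name__ == "__main__":
    for tokens in SWEEP_TOKENS:
        print(f"{tokens:>7} tok -> {kv_mib(tokens):>8.0f} MiB")
```

This gives the `KV (MiB)` column of the result tables (48 MiB at 512 tokens up to 12288 MiB at 131072 tokens).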
Method
3-step black-box bench:
1. `do_remote_decode` to A (producer) with a client-generated `transfer_id`. `max_tokens=1`; A computes prefill and parks the KV for later pull.
2. `do_remote_prefill` to B (consumer) with the same `transfer_id` plus `remote_engine_id` (from A's `/query` on bootstrap port) and `remote_bootstrap_addr` (`http://127.0.0.1:8998`). This step triggers the actual KV transfer; it is the measured step.
3. Plain `completion` on B (`--skip-verify` off): expect `cached_tokens ≈ prompt_len`, confirming the KV landed on B.
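The three steps can be sketched as three request payloads that share one `transfer_id`. This is a hypothetical illustration, not the code in `mb2_kv_transfer.py`: the wrapper key `kv_transfer_params`, the function name, and the `max_tokens` values for steps 2 and 3 are assumptions; only the field names `do_remote_decode`, `do_remote_prefill`, `transfer_id`, `remote_engine_id` and `remote_bootstrap_addr` come from the description above.

```python
import uuid


def build_bench_requests(prompt: str, remote_engine_id: str,
                         remote_bootstrap_addr: str = "http://127.0.0.1:8998"):
    """Build the three request bodies for one bench iteration (no I/O)."""
    transfer_id = uuid.uuid4().hex  # client-generated, shared by steps 1 and 2

    # Step 1: A (producer) prefills and parks the KV for a later pull.
    step1 = {
        "prompt": prompt,
        "max_tokens": 1,
        "kv_transfer_params": {  # assumed wrapper key
            "do_remote_decode": True,
            "transfer_id": transfer_id,
        },
    }
    # Step 2: B (consumer) pulls the KV; this is the measured step.
    step2 = {
        "prompt": prompt,
        "max_tokens": 1,  # assumption
        "kv_transfer_params": {
            "do_remote_prefill": True,
            "transfer_id": transfer_id,
            "remote_engine_id": remote_engine_id,
            "remote_bootstrap_addr": remote_bootstrap_addr,
        },
    }
    # Step 3: plain completion on B; cached_tokens should be close to prompt_len.
    step3 = {"prompt": prompt, "max_tokens": 1}  # max_tokens is an assumption
    return step1, step2, step3
```

Keeping the payload construction separate from the HTTP calls makes it easy to time only the step-2 request.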
Per-stage breakdown is obtained by instrumenting the vLLM-shipped
MooncakeConnector (NOT the mooncake-package's mooncake_connector_v1,
which vLLM 0.18.1 does not load) at two sites:
- `_send_blocks` (P-side, line 980): emits `send_blocks` event with `total_bytes`, `duration_s`, `t_start_unix`. The `duration_s` is the wall-time of a single `batch_transfer_sync_write` call; this is what we call `pure_transfer`.
- `receive_kv_from_single_worker` (D-side, line 1139, async): emits `receive_kv_enter` at function start and `receive_kv_finish` on FINISH-status response. The wall-time between them is `rx_total` (= ZMQ round-trip + setup + pure_transfer + ack).
Pairing across A's and B's logs is by time window: each B
(enter, finish) pair is matched to the A send_blocks whose
t_start_unix falls in [rx_t_start, rx_t_end]. With single-request
benchmarks this is unambiguous.
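The time-window pairing can be sketched as follows. This is an illustrative version, not the code in `analyze_mb2.py`: the dict layout of the B-side windows (`rx_t_start`, `rx_t_end`) and the output field names are assumptions; `t_start_unix`, `duration_s` and `total_bytes` are the `send_blocks` fields described above.

```python
def pair_events(send_blocks, rx_windows):
    """Match each B-side (enter, finish) window to the one A-side send_blocks
    event whose t_start_unix falls inside it.

    send_blocks: list of dicts with t_start_unix, duration_s, total_bytes
    rx_windows:  list of dicts with rx_t_start, rx_t_end (unix seconds)
    """
    rows = []
    for rx in rx_windows:
        matches = [
            s for s in send_blocks
            if rx["rx_t_start"] <= s["t_start_unix"] <= rx["rx_t_end"]
        ]
        if len(matches) != 1:
            # Zero or several candidates: skip rather than guess.
            continue
        send = matches[0]
        rx_total_s = rx["rx_t_end"] - rx["rx_t_start"]
        rows.append({
            "total_bytes": send["total_bytes"],
            "pure_ms": send["duration_s"] * 1e3,
            "rx_total_ms": rx_total_s * 1e3,
            # Everything in rx_total that is not the transfer call itself.
            "overhead_ms": (rx_total_s - send["duration_s"]) * 1e3,
        })
    return rows
```

Skipping windows with zero or multiple matches keeps the result honest if the bench is ever run with concurrent requests, where the window rule stops being unambiguous.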
Scripts:
- `microbench/fresh_setup/start_vllm_pair.sh`: bring up pair + apply/revert patch
- `microbench/fresh_setup/instrument_mooncake.py`: apply/revert MB2 patches
- `microbench/fresh_setup/mb2_kv_transfer.py`: client (3-step bench loop)
- `microbench/fresh_setup/analyze_mb2.py`: pair A/B events into per-size table
- `microbench/fresh_setup/plot_mb2.py`: log-log time + bandwidth curves
Results — intra-node (2026-05-27, dash1 GPU 0+1, kv_both)
Raw events: A_intra_kvboth.jsonl, B_intra_kvboth.jsonl.
Joined + aggregated: intra_kvboth_breakdown.json.
Figures: figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png.
| input_tokens | KV (MiB) | n | pure_ms p50 | pure_ms max | rx_total_ms | overhead_ms | BW p50 (GB/s) | BW max (GB/s) |
|---|---|---|---|---|---|---|---|---|
| 512 | 48 | 5 | 5.3 | 5.6 | 12.2 | 3.3 | 9.40 | 9.53 |
| 1024 | 96 | 5 | 10.4 | 10.5 | 11.9 | 1.5 | 9.68 | 9.72 |
| 2048 | 192 | 5 | 20.6 | 21.0 | 22.5 | 1.8 | 9.75 | 9.78 |
| 4096 | 384 | 5 | 41.5 | 41.7 | 43.5 | 2.0 | 9.71 | 9.72 |
| 8192 | 768 | 5 | 83.7 | 84.4 | 86.2 | 2.2 | 9.62 | 9.69 |
| 16384 | 1536 | 5 | 167.1 | 167.7 | 170.2 | 2.7 | 9.64 | 9.67 |
| 32768 | 3072 | 5 | 320.9 | 322.1 | 425.2 | 20.5 | 10.04 | 10.09 |
| 65536 | 6144 | 5 | 1895.1 | 2375.2 | 1586.1 | 69.6 | 3.40 | 9.68 |
| 131072 | 12288 | 5 | 2835.1 | 8923.6 | 4362.5 | 91.4 | 4.54 | 9.67 |
Three regimes in the data:
- <= 3 GiB — linear in size, bandwidth ≈ 9.7 GB/s steady.
- Around 6 GiB: onset of variance. Max bandwidth is still 9.7 GB/s, but p50 collapses to ~3.4 GB/s. Some runs achieve full speed; others take 2–3 × longer.
- 12 GiB — wide spread (min 1.5 s, max 10 s for the same 11.5 GiB transfer). This is the agentic-p99 size region.
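The regimes above can be read straight off the table by converting each row to bandwidth and comparing it with the steady-state ceiling. A small sketch using three rows from the intra-node table (helper names are illustrative; GB means 10^9 bytes, which matches the table's BW column):

```python
MIB = 1024 ** 2
CEILING_GB_PER_S = 9.7  # steady-state bandwidth observed up to 3 GiB

# (KV MiB, pure_ms p50) taken from the intra-node table above.
ROWS = [(3072, 320.9), (6144, 1895.1), (12288, 2835.1)]


def bandwidth_gb_per_s(kv_mib: float, pure_ms: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return kv_mib * MIB / (pure_ms / 1e3) / 1e9


def slowdown_vs_ceiling(kv_mib: float, pure_ms: float,
                        ceiling: float = CEILING_GB_PER_S) -> float:
    """How many times longer the transfer took than it would at the ceiling."""
    ideal_ms = kv_mib * MIB / (ceiling * 1e9) * 1e3
    return pure_ms / ideal_ms


if __name__ == "__main__":
    for kv_mib, pure_ms in ROWS:
        bw = bandwidth_gb_per_s(kv_mib, pure_ms)
        slow = slowdown_vs_ceiling(kv_mib, pure_ms)
        print(f"{kv_mib:>6} MiB  {bw:5.2f} GB/s  {slow:4.2f}x ceiling time")
```

A slowdown near 1 means the row sits in the linear regime; the 6 GiB row comes out close to 3, matching the "2–3 × longer" observation.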
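The bandwidth ceiling of ~10 GB/s is well below H20's NVLink p2p
(claimed ~900 GB/s), so the transfer is evidently not NVLink-direct.
The initial hypothesis was PCIe staging through host memory; the
2026-05-27 inter-node entry in the Run log instead points to RDMA
loopback through the local NIC. Neither is confirmed directly: that
would need nvidia-smi topo -m and mooncake_transfer_engine_topology_dump
analysis, which has not been done yet.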
Known limitations of this measurement
- kv_both, not strict PD-disagg. vLLM 0.18.1 with `kv_role=kv_consumer` raises `AttributeError: 'MooncakeConnectorWorker' object has no attribute 'bootstrap_server'` (the attribute is only assigned inside `if not self.is_kv_consumer`). The transfer mechanics are identical (same `batch_transfer_sync_write`), so the cost measurement is comparable. The role gate only affects which request types each instance accepts. The §5.2 strict PD-disagg baseline will need either a fix for that bug or a role-aware proxy in front of the pair.
- Single in-flight request. All measurements here are serial. Real PD-disagg will have many concurrent transfers; bandwidth contention is not characterized.
- Intra-node only in this section. The inter-node RDMA path was measured afterwards (see the 2026-05-27 inter-node entry in the Run log) and came out within a few percent of intra-node, not slower.
- Sanity preamble events. The raw logs include 6 events from earlier sanity runs in addition to the 45-event sweep. `analyze_mb2.py` treats them as additional samples (same sizes); the per-size aggregates use all of them.
Implications for §3.2 PD-disagg cost argument
For each PD-disagg-routed request, transfer wall-time is:
T_transfer(KV_size) = pure_transfer(KV_size) + rx_overhead
≈ KV_size / 9.7 GB/s    for KV_size <= 3 GiB (rx_overhead is a few ms)
≈ 0.3 – 10 s            for KV_size in [3, 12] GiB
Agentic decode wall-time is typically 50 – 200 ms (tool-call output of a few tens of tokens at ~50 tok/s). So the transfer/decode ratio under intra-node best-case Mooncake is:
| KV size | T_transfer (measured p50) | typical decode | T_transfer / T_decode |
|---|---|---|---|
| 192 MiB (2k tok) | 20 ms | 100 ms | 0.2× |
| 768 MiB (8k tok) | 84 ms | 100 ms | 0.8× |
| 3 GiB (33k tok ≈ trace mean) | 321 ms | 100 ms | 3.2× |
| 6 GiB (~p90) | 1900 ms | 100 ms | 19× |
| 12 GiB (~p99) | 2800 ms | 100 ms | 28× (median) – 100× (p99 variance) |
PD-disagg's promised payoff is eliminating prefill–decode interference on the decode instance. The maximum benefit it can buy is bounded above by the decode duration itself (you cannot recover more time than the decode existed). For agentic that's 50 – 200 ms. The cost is the table column above — 0.3 – 10 s of transfer per routed request.
Cost exceeds benefit by roughly 3× to 100× from the trace-mean KV size (3 GiB) upward. Below ~3 GiB the ratio is small (≤1×); above 3 GiB the ratio explodes; above 6 GiB even individual draws can take 10 s for a single transfer.
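The ratio column in the table above is plain division of the measured p50 transfer time by a representative decode time. A short sketch that reproduces it (the 100 ms decode figure is the table's "typical decode" value; the names are illustrative):

```python
TYPICAL_DECODE_MS = 100.0  # "typical decode" column in the table above

# (label, measured p50 transfer time in ms) from the table above.
TRANSFER_P50_MS = [
    ("192 MiB (2k tok)", 20.0),
    ("768 MiB (8k tok)", 84.0),
    ("3 GiB (33k tok)", 321.0),
    ("6 GiB (~p90)", 1900.0),
    ("12 GiB (~p99)", 2800.0),
]


def transfer_to_decode_ratio(transfer_ms: float,
                             decode_ms: float = TYPICAL_DECODE_MS) -> float:
    """Transfer time expressed in units of one decode duration."""
    return transfer_ms / decode_ms


if __name__ == "__main__":
    for label, ms in TRANSFER_P50_MS:
        print(f"{label:<20} {transfer_to_decode_ratio(ms):5.1f}x")
    # Worst observed draw for the 12 GiB case: about 10 s.
    print(f"{'12 GiB worst case':<20} {transfer_to_decode_ratio(10_000):5.1f}x")
```

Swapping `decode_ms` for 50 or 200 gives the two ends of the agentic decode range quoted in the text.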
This data alone is not the whole §3.2 argument — we still need to
account for D-side KV capacity (f4b, separate axis), cache reuse loss,
and static-partition mismatch (MB3 / MB4 / MB5). But it nails one
of the two key cost axes with measured numbers from vanilla mooncake,
not the dash0 patched build.
Open questions / next runs
- Inter-node RDMA: dash1 ↔ dash2. Done on 2026-05-27 (see Run log). The expectation was lower bandwidth (~5–15 GB/s); the measured ceiling matched intra-node at ~10 GB/s, and the 6 GiB-onset variance did not move.
- Bandwidth ceiling investigation: is the 9.7 GB/s ceiling PCIe (so the connector is not using NVLink direct) or some internal limit? If PCIe, can it be lifted with NVLink-direct mooncake config?
- Variance at 6+ GiB: investigate. Maybe related to chunking inside `batch_transfer_sync_write`, or GPU memory pressure when KV approaches HBM ceiling.
- Concurrent transfers: measure aggregate bandwidth when N simultaneous transfers happen. PD-disagg in practice does this.
- Strict kv_producer/kv_consumer: patch the bootstrap_server bug or use a proxy; verify transfer time is unchanged.
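For the bandwidth-ceiling question in the list above, one useful reference point is the NIC's nominal line rate. A back-of-the-envelope sketch, assuming the 200 Gbps figure is the raw link rate and ignoring protocol overhead (both are simplifications, and the names are illustrative):

```python
NIC_LINE_RATE_GBIT_PER_S = 200.0   # RoCE link speed quoted in this README
MEASURED_CEILING_GB_PER_S = 9.7    # steady-state ceiling from the intra-node table


def line_rate_gb_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8.0


def fraction_of_line_rate(measured_gb_per_s: float, gbit_per_s: float) -> float:
    """Measured bandwidth as a fraction of the nominal line rate."""
    return measured_gb_per_s / line_rate_gb_per_s(gbit_per_s)


if __name__ == "__main__":
    nominal = line_rate_gb_per_s(NIC_LINE_RATE_GBIT_PER_S)
    frac = fraction_of_line_rate(MEASURED_CEILING_GB_PER_S, NIC_LINE_RATE_GBIT_PER_S)
    print(f"nominal {nominal:.1f} GB/s, measured {MEASURED_CEILING_GB_PER_S} GB/s "
          f"({frac:.0%} of line rate)")
```

The measured ceiling is under half of the nominal 25 GB/s, so the limit appears to sit somewhere in the RDMA path rather than at the wire rate itself; this data does not say where.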
Reproduction
# On dash machine with cpfs mount + ssh access:
bash microbench/fresh_setup/install.sh # once (idempotent)
bash microbench/fresh_setup/deploy.sh dash1 # push scripts to cpfs
# bring up pair (intra-node)
ssh dash1 'GPU_A=0 GPU_B=1 bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh start'
# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb2_kv_transfer.py \
--sizes 512,1024,2048,4096,8192,16384,32768,65536,131072 \
--repeats 5 --label intra-kvboth \
--out /home/admin/cpfs/wjh/agentic-kv-fresh/mb2_results/intra_kvboth.json'
# pull logs
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/A/.efc_*_mb2_transfer_pid*.jsonl \
analysis/mb2/A_intra_kvboth.jsonl
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/B/.efc_*_mb2_transfer_pid*.jsonl \
analysis/mb2/B_intra_kvboth.jsonl
# analyze
.venv/bin/python microbench/fresh_setup/analyze_mb2.py \
--a-log analysis/mb2/A_intra_kvboth.jsonl \
--b-log analysis/mb2/B_intra_kvboth.jsonl \
--out analysis/mb2/intra_kvboth_breakdown.json
.venv/bin/python microbench/fresh_setup/plot_mb2.py \
--breakdown analysis/mb2/intra_kvboth_breakdown.json \
--out-time figs/mb2_transfer_time_intra.png \
--out-bw figs/mb2_transfer_bw_intra.png
# tear down
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh stop'
Run log
2026-05-27 — intra-node, kv_both, dash1 GPU 0+1
Sweep: 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072 tokens
× 5 repeats. Sanity preamble of 512, 2048, 8192 × 2 included in the
raw logs (counted as additional samples for those sizes).
Result table above. 9.7 GB/s steady-state up to 3 GiB, variance opens at 6 GiB, p99 agentic-tail transfer 1.5 – 10 s.
Committed as de164e5.
2026-05-27 — inter-node, kv_both, dash1 GPU 0 → dash2 GPU 0
Same sweep config. 200 Gbps RoCE between hosts (RTT ~0.2 ms ping).
Producer A on dash1 GPU 0, consumer B on dash2 GPU 0.
remote_bootstrap_addr=http://172.27.123.142:8998 (dash1's internal IP).
Raw events: A_inter_kvboth.jsonl (45 send_blocks + 6 sanity).
B's receive_kv events are missing for this run. The
MB2_LOG_DIR env var did not propagate from the start script through
vLLM's EngineCore subprocess on dash2:
cat /proc/$ENGINE_PID/environ shows no MB2_LOG_DIR on dash2 but
does contain it on dash1. Bookmarked for future investigation; a likely
cause is a spawn-vs-fork difference in vLLM's multiproc executor across
hosts. Pure-transfer numbers below come from A's send_blocks alone; the
full rx_total breakdown is not available for this run.
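The `/proc/$ENGINE_PID/environ` check can be scripted so it runs before a sweep starts. A minimal sketch (Linux only; the helper names are hypothetical and not part of the bench scripts). The file holds NUL-separated `KEY=VALUE` entries and reflects the environment the process was started with:

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict[str, str]:
    """Parse the NUL-separated KEY=VALUE blob from /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def process_has_env(pid: int, name: str) -> bool:
    """True if process `pid` was started with environment variable `name`."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return name in parse_environ(raw)


if __name__ == "__main__":
    import os
    # Example: inspect this process itself; substitute the EngineCore PID.
    print(process_has_env(os.getpid(), "MB2_LOG_DIR"))
```

Running this against the EngineCore PID on both hosts right after startup would catch a missing `MB2_LOG_DIR` before any B-side events are lost.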
Per-size pure-transfer (analyzed by analyze_mb2_send_only.py):
| input_tokens | KV (MiB) | n | pure_ms p50 | min | max | BW p50 (GB/s) | BW max |
|---|---|---|---|---|---|---|---|
| 512 | 48 | 5 | 5.2 | 5.1 | 65.8 | 9.76 | 9.81 |
| 1024 | 96 | 5 | 10.2 | 10.1 | 10.4 | 9.91 | 10.00 |
| 2048 | 192 | 5 | 20.0 | 20.0 | 20.5 | 10.06 | 10.07 |
| 4096 | 384 | 5 | 40.1 | 40.1 | 40.5 | 10.04 | 10.05 |
| 8192 | 768 | 5 | 80.9 | 80.7 | 82.5 | 9.96 | 9.98 |
| 16384 | 1536 | 5 | 161.8 | 161.7 | 164.8 | 9.96 | 9.96 |
| 32768 | 3072 | 5 | 309.6 | 307.7 | 526.9 | 10.40 | 10.47 |
| 65536 | 6144 | 5 | 1733.6 | 653.5 | 1921.2 | 3.72 | 9.86 |
| 131072 | 12288 | 5 | 2818.4 | 1283.0 | 9158.6 | 4.57 | 10.04 |
Side-by-side comparison with the 2026-05-27 intra-node run:
| Size | intra p50 ms | inter p50 ms | gap | intra GB/s | inter GB/s |
|---|---|---|---|---|---|
| 512 | 5.3 | 5.2 | −2% | 9.40 | 9.76 |
| 1024 | 10.4 | 10.2 | −2% | 9.68 | 9.91 |
| 2048 | 20.6 | 20.0 | −3% | 9.75 | 10.06 |
| 4096 | 41.5 | 40.1 | −3% | 9.71 | 10.04 |
| 8192 | 83.7 | 80.9 | −3% | 9.62 | 9.96 |
| 16384 | 167.1 | 161.8 | −3% | 9.64 | 9.96 |
| 32768 | 320.9 | 309.6 | −4% | 10.04 | 10.40 |
| 65536 | 1895.1 | 1733.6 | −9% | 3.40 | 3.72 |
| 131072 | 2835.1 | 2818.4 | −1% | 4.54 | 4.57 |
The two paths produce essentially the same numbers. This is consistent with Mooncake intra-node not using NVLink and instead going through RDMA loopback on the local NIC, which gives it the same ~10 GB/s ceiling as cross-node RDMA. The 6+ GiB variance regime also appears in both paths.
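The gap column in the comparison table is the relative difference of the inter-node p50 against the intra-node p50. A short sketch of that calculation over the table's values (names are illustrative):

```python
# (input_tokens, intra p50 ms, inter p50 ms) from the side-by-side table.
P50_MS = [
    (512, 5.3, 5.2),
    (1024, 10.4, 10.2),
    (2048, 20.6, 20.0),
    (4096, 41.5, 40.1),
    (8192, 83.7, 80.9),
    (16384, 167.1, 161.8),
    (32768, 320.9, 309.6),
    (65536, 1895.1, 1733.6),
    (131072, 2835.1, 2818.4),
]


def gap_pct(intra_ms: float, inter_ms: float) -> float:
    """Inter-node p50 relative to intra-node p50, in percent (negative = faster)."""
    return (inter_ms - intra_ms) / intra_ms * 100.0


if __name__ == "__main__":
    for tokens, intra, inter in P50_MS:
        print(f"{tokens:>7} tok  {gap_pct(intra, inter):+5.1f}%")
```

With five repeats per size these gaps are small next to the run-to-run spread in the 6+ GiB rows, so they are best read as "no measurable difference" rather than as inter-node being faster.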
Figures: figs/mb2_transfer_time_inter.png, figs/mb2_transfer_bw_inter.png,
figs/mb2_transfer_time_compare.png (overlay), figs/mb2_transfer_bw_compare.png.
This collapses the §3.2 narrative to a single number: KV transfer for PD-disagg on this cluster is capped at ~9.7–10 GB/s no matter how P and D are placed (within-node or across-node). For p99 agentic KV (11.5 GiB), that is 1.3–10 s of transfer; for 6 GiB it is 0.7–2.4 s. Decode is 50–200 ms. So PD-disagg's cost dominates regardless of layout.