Gahow Wang 90127c3389 MB2 inter-node: dash1↔dash2 transfer cost is identical to intra-node

Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE.
remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep
config as the 2026-05-27 intra-node run.

Per-size pure_transfer (p50) lines up within 1–3% of the intra-node
numbers at every size except 64k tok (−9%, inside the bimodal regime):

  size      intra p50   inter p50
   512 tok    5.3 ms      5.2 ms
  2048 tok   20.6         20.0
  8192 tok   83.7         80.9
 32k  tok  320.9        309.6
 64k  tok 1895          1734       (bimodal in both)
128k  tok 2835          2818       (bimodal in both)

=> Mooncake's batch_transfer_sync_write **does not use NVLink** for
intra-node peers; both paths go through the 200 Gbps RDMA NIC, with
the 200 Gbps NIC (not the GPU interconnect) being the bottleneck. The
~9.7 GB/s steady-state ceiling and the 6+ GiB variance regime are
identical across topologies.

Operational implication for §3.2: PD-disaggregation does not get
cheaper by co-locating P and D on the same node — every routed request
pays the same ~10 GB/s ceiling for KV transfer, no matter where it
lands. Halving the transfer cost cannot be bought back by topology.

Caveat: B's receive_kv events did not log on dash2 — `MB2_LOG_DIR`
env var did not propagate through vLLM's EngineCore subprocess on
the consumer host (cat /proc/$ENGINE_PID/environ is empty on dash2
for that var, but the producer host on dash1 worked). For this run
pure_transfer numbers are from A's send_blocks alone; full rx_total
breakdown is not available, but pure_transfer is the dominant term.

Adds:
- analyze_mb2_send_only.py — analyzer that works from A's send_blocks
  alone when B's receive_kv events are absent
- plot_mb2_compare.py — overlay intra vs inter on the same axes
- plot_mb2.py — tolerate the `rows`-less send-only schema
- figs/mb2_transfer_{time,bw}_inter.png — inter-node single-curve
- figs/mb2_transfer_{time,bw}_compare.png — intra vs inter overlay
- analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json,
  inter_kvboth_breakdown.json
- analysis/mb2/README.md — Summary block updated to reference both
  paths, dated 2026-05-27 run-log entry appended with the full table
  and the topology-independence framing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 20:56:08 +08:00


MB2 — Mooncake KV Transfer Cost (vanilla vLLM 0.18.1)

Persistent record of the per-stage KV transfer microbench used in §3.2 of the EAR paper. Re-runs append a dated section at the bottom; the Summary block at the top is what gets cited in the paper.

Summary (latest)

Path	Steady-state BW	Agentic-tail p99 transfer (11.5 GiB KV)
intra-node (dash1 GPU 0↔1)	~9.7 GB/s (96 MiB – 3 GiB)	p50 2.8 s · min 1.5 s · max 10 s
inter-node (dash1 GPU0 → dash2 GPU0, 200 Gbps RoCE)	~10.0 GB/s (essentially identical)	p50 2.8 s · min 1.3 s · max 9.2 s

Cross-cutting finding (2026-05-27): Mooncake transfer cost is topology-independent on this hardware. Intra-node and inter-node p50 transfer times agree within about 3% at every size except 64k tokens (9%, inside the bimodal regime); see figs/mb2_transfer_time_compare.png and figs/mb2_transfer_bw_compare.png. Mechanism: Mooncake's batch_transfer_sync_write always goes through the RDMA NIC, including the intra-node case (RDMA loopback). The 200 Gbps NIC path, not NVLink, is the bottleneck. Implication for §3.2: PD-disaggregation does not get cheaper by co-locating P and D on the same node; the ~9.7 GB/s ceiling applies regardless, so placement cannot buy back any of the transfer cost.

Headline for the paper §3.2: at the agentic tail, pure KV transfer takes 1.5 – 10 s. A median agentic decode is 50 – 200 ms of tool-call output. So PD-disaggregation adds roughly 8× (1.5 s against a 200 ms decode) to 100× (10 s against a 100 ms decode) of decode time in transfer on top of every routed tail-size request. Phase isolation (the thing PD-disagg trades transfer cost for) can win back at most one decode duration; for agentic that's negligible. The arithmetic is one-sided.

Setup

Component	Value
Host	`dash1` (`ds-6348bee4-1-...-rwkv2`), 8× NVIDIA H20 96 GiB, driver 570.133.20
Venv	`/home/admin/cpfs/wjh/agentic-kv-fresh/.venv` (shared via cpfs from any dash host)
vLLM	0.18.1 official wheel
mooncake-transfer-engine	0.3.11.post1 (`pip install mooncake-transfer-engine`)
Model	`/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`
Per-token KV	98304 B
kv_role	`kv_both` on both instances (see Known limitations re kv_producer/kv_consumer)
Per-instance config	`--tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 200000 --enable-prefix-caching`
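
The per-token KV figure in the table is consistent with a simple layout calculation. The sketch below assumes the model's published shape (48 layers, 4 KV heads, head dimension 128) and a 2-byte KV dtype; none of these values are stated in this README, so check them against the model's config.json before relying on them.

```python
# Assumed model shape (NOT stated in this README; verify against config.json).
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
DTYPE_BYTES = 2  # bf16/fp16 KV cache

def kv_bytes_per_token() -> int:
    # K and V each hold NUM_KV_HEADS * HEAD_DIM values per layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES

def kv_mib(num_tokens: int, per_token: int = 98304) -> float:
    # KV footprint in MiB for a prompt of num_tokens tokens.
    return num_tokens * per_token / 2**20

print(kv_bytes_per_token())          # 98304
print(kv_mib(512), kv_mib(131072))   # 48.0 12288.0
```

With 98304 B per token, the sweep sizes of 512 to 131072 tokens map to the 48 MiB to 12288 MiB column in the result tables.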

Method

3-step black-box bench:

1. do_remote_decode to A (producer) with a client-generated transfer_id. max_tokens=1; A computes prefill and parks the KV for later pull.
2. do_remote_prefill to B (consumer) with the same transfer_id plus remote_engine_id (from A's /query on bootstrap port) and remote_bootstrap_addr (http://127.0.0.1:8998). This step triggers the actual KV transfer; it is the measured step.
3. Plain completion on B (--skip-verify off): expect cached_tokens ≈ prompt_len, confirming the KV landed on B.
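
The three requests can be sketched as payload builders. The field names do_remote_decode, do_remote_prefill, transfer_id, remote_engine_id and remote_bootstrap_addr come from the steps above; nesting them under a kv_transfer_params key of a completions request, the max_tokens value on steps 2 and 3, and the helper name build_requests are assumptions for illustration. mb2_kv_transfer.py may do this differently.

```python
import uuid

def build_requests(prompt, remote_engine_id,
                   bootstrap_addr="http://127.0.0.1:8998"):
    # One client-generated id ties the producer and consumer requests together.
    transfer_id = uuid.uuid4().hex
    # Step 1, sent to A: prefill only, KV is parked for a later pull.
    to_a = {
        "prompt": prompt,
        "max_tokens": 1,
        "kv_transfer_params": {          # assumed nesting
            "do_remote_decode": True,
            "transfer_id": transfer_id,
        },
    }
    # Step 2, sent to B: triggers the KV transfer (the measured step).
    to_b = {
        "prompt": prompt,
        "max_tokens": 1,                 # assumed
        "kv_transfer_params": {
            "do_remote_prefill": True,
            "transfer_id": transfer_id,
            "remote_engine_id": remote_engine_id,
            "remote_bootstrap_addr": bootstrap_addr,
        },
    }
    # Step 3, sent to B: plain completion used to check cached_tokens.
    verify = {"prompt": prompt, "max_tokens": 1}
    return to_a, to_b, verify

to_a, to_b, verify = build_requests("hello", remote_engine_id="engine-A")
print(to_a["kv_transfer_params"]["transfer_id"]
      == to_b["kv_transfer_params"]["transfer_id"])
```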

Per-stage breakdown is obtained by instrumenting the vLLM-shipped MooncakeConnector (NOT the mooncake-package's mooncake_connector_v1, which vLLM 0.18.1 does not load) at two sites:

_send_blocks (P-side, line 980): emits send_blocks event with total_bytes, duration_s, t_start_unix. The duration_s is the wall-time of a single batch_transfer_sync_write call — this is what we call pure_transfer.
receive_kv_from_single_worker (D-side, line 1139, async): emits receive_kv_enter at function start and receive_kv_finish on FINISH-status response. The wall-time between them is rx_total (= ZMQ round-trip + setup + pure_transfer + ack).
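
A minimal sketch of this kind of instrumentation, assuming one JSON object per line and an "event" key for the event name (both are assumptions; the real patch lives in instrument_mooncake.py). The helper timed_event is hypothetical, and time.sleep stands in for the batch_transfer_sync_write call.

```python
import json
import os
import tempfile
import time
from contextlib import contextmanager

@contextmanager
def timed_event(name: str, log_path: str, **fields):
    """Append one JSONL record with wall-clock start and elapsed time."""
    t_start_unix = time.time()       # absolute time, used later for pairing
    t0 = time.perf_counter()         # monotonic clock for the duration
    try:
        yield
    finally:
        record = {"event": name, "t_start_unix": t_start_unix,
                  "duration_s": time.perf_counter() - t0, **fields}
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

# Demo: the sleep is a placeholder for the transfer call being timed.
log_path = os.path.join(tempfile.mkdtemp(), "mb2_demo.jsonl")
with timed_event("send_blocks", log_path, total_bytes=48 * 2**20):
    time.sleep(0.01)

with open(log_path) as f:
    events = [json.loads(line) for line in f]
print(events[0]["event"], events[0]["total_bytes"])
```

Recording the start with time.time() and the duration with time.perf_counter() keeps the timestamp comparable across hosts while the duration stays immune to wall-clock adjustments.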

Pairing across A's and B's logs is by time window: each B (enter, finish) pair is matched to the A send_blocks whose t_start_unix falls in [rx_t_start, rx_t_end]. With single-request benchmarks this is unambiguous.
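
The time-window pairing can be sketched as below. This is an illustration, not the code in analyze_mb2.py: the function name, the tuple representation of B's windows, and the definition of overhead as rx_total minus pure_transfer are assumptions. The demo values are made up.

```python
def pair_events(send_events, rx_windows):
    """Match each B-side (rx_t_start, rx_t_end) window to the single A-side
    send_blocks event whose t_start_unix falls inside it."""
    pairs = []
    for rx_start, rx_end in rx_windows:
        hits = [s for s in send_events
                if rx_start <= s["t_start_unix"] <= rx_end]
        if len(hits) != 1:
            # With serial, single-request runs exactly one send should match.
            raise ValueError(f"expected 1 send in window, found {len(hits)}")
        send = hits[0]
        rx_total = rx_end - rx_start
        pairs.append({
            "total_bytes": send["total_bytes"],
            "pure_s": send["duration_s"],
            "rx_total_s": rx_total,
            "overhead_s": rx_total - send["duration_s"],  # assumed definition
        })
    return pairs

# Illustrative (made-up) events for one 192 MiB transfer.
sends = [{"t_start_unix": 100.001, "duration_s": 0.0206,
          "total_bytes": 192 * 2**20}]
windows = [(100.0, 100.0225)]
pairs = pair_events(sends, windows)
print(round(pairs[0]["overhead_s"] * 1e3, 2), "ms overhead")
```

Raising on zero or multiple matches makes the "unambiguous with single-request benchmarks" assumption explicit, so a concurrent run fails loudly instead of pairing the wrong events.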

Scripts:

microbench/fresh_setup/start_vllm_pair.sh — bring up pair + apply/revert patch
microbench/fresh_setup/instrument_mooncake.py — apply/revert MB2 patches
microbench/fresh_setup/mb2_kv_transfer.py — client (3-step bench loop)
microbench/fresh_setup/analyze_mb2.py — pair A/B events into per-size table
microbench/fresh_setup/plot_mb2.py — log-log time + bandwidth curves

Results — intra-node (2026-05-27, dash1 GPU 0+1, kv_both)

Raw events: A_intra_kvboth.jsonl, B_intra_kvboth.jsonl. Joined + aggregated: intra_kvboth_breakdown.json. Figures: figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png.

input_tokens	KV (MiB)	n	pure_ms p50	pure_ms max	rx_total_ms	overhead_ms	BW p50 (GB/s)	BW max (GB/s)
512	48	5	5.3	5.6	12.2	3.3	9.40	9.53
1024	96	5	10.4	10.5	11.9	1.5	9.68	9.72
2048	192	5	20.6	21.0	22.5	1.8	9.75	9.78
4096	384	5	41.5	41.7	43.5	2.0	9.71	9.72
8192	768	5	83.7	84.4	86.2	2.2	9.62	9.69
16384	1536	5	167.1	167.7	170.2	2.7	9.64	9.67
32768	3072	5	320.9	322.1	425.2	20.5	10.04	10.09
65536	6144	5	1895.1	2375.2	1586.1	69.6	3.40	9.68
131072	12288	5	2835.1	8923.6	4362.5	91.4	4.54	9.67

Three regimes in the data:

<= 3 GiB — linear in size, bandwidth ≈ 9.7 GB/s steady.
Around 6 GiB: onset of variance. Max bandwidth is still 9.7 GB/s, but p50 collapses to ~3.4 GB/s; some runs achieve full speed, others take about 3× as long.
12 GiB — wide spread (min 1.5 s, max 10 s for the same 11.5 GiB transfer). This is the agentic-p99 size region.
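
The bandwidth columns follow from the KV size and the transfer time, with GB meaning 10^9 bytes. A short check against three rows of the table above (the helper name is illustrative):

```python
MIB = 2**20

def bandwidth_gb_s(kv_mib: float, duration_ms: float) -> float:
    # Decimal gigabytes per second: bytes / seconds / 1e9.
    return kv_mib * MIB / (duration_ms / 1e3) / 1e9

# (KV MiB, pure_ms p50) taken from the intra-node table.
rows = [(3072, 320.9), (6144, 1895.1), (12288, 2835.1)]
for kv, ms in rows:
    print(kv, round(bandwidth_gb_s(kv, ms), 2))
# 3072 10.04
# 6144 3.4
# 12288 4.54
```

The gap between BW p50 and BW max in the last two rows is what marks the variance regime: the fastest run still reaches the ~9.7 GB/s ceiling while the median run does not.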

The bandwidth ceiling of ~10 GB/s is well below H20's NVLink p2p bandwidth (claimed ~900 GB/s), so the transfer is not going over NVLink directly. The original hypothesis was PCIe staging through host memory; the inter-node run in the Run log instead points to RDMA loopback through the local NIC. Confirming the exact path would need nvidia-smi topo -m and mooncake_transfer_engine_topology_dump analysis; not done yet.

Known limitations of this measurement

kv_both, not strict PD-disagg. vLLM 0.18.1 with kv_role=kv_consumer raises AttributeError: 'MooncakeConnectorWorker' object has no attribute 'bootstrap_server' (the attribute is only assigned inside if not self.is_kv_consumer). The transfer mechanics are identical — same batch_transfer_sync_write — so the cost measurement is comparable. The role gate only affects which request types each instance accepts. §5.2 strict PD-disagg baseline will need either to fix that bug or front the pair with a role-aware proxy.
Single in-flight request. All measurements here are serial. Real PD-disagg will have many concurrent transfers; bandwidth contention is not characterized.
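Inter-node measured on the send side only. The dash1 → dash2 run (see Run log) matches the intra-node numbers, but B's receive_kv events were not logged, so rx_total and overhead are available for the intra-node path only.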
Sanity preamble events. The raw logs include 6 events from earlier sanity runs in addition to the 45-event sweep. analyze_mb2.py treats them as additional samples (same sizes); the per-size aggregates use all of them.

Implications for §3.2 PD-disagg cost argument

For each PD-disagg-routed request, transfer wall-time is:

T_transfer(KV_size) = pure_transfer(KV_size) + rx_overhead
                    ≈ KV_size / 9.7 GB/s   for KV_size <= 3 GiB
                    ≈ 0.3 – 10 s            for KV_size in [3, 12] GiB

Agentic decode wall-time is typically 50 – 200 ms (tool-call output of a few tens of tokens at ~50 tok/s). So the transfer/decode ratio under intra-node best-case Mooncake is:

KV size	T_transfer (measured p50)	typical decode	T_transfer / T_decode
192 MiB (2k tok)	20 ms	100 ms	0.2×
768 MiB (8k tok)	84 ms	100 ms	0.8×
3 GiB (33k tok ≈ trace mean)	321 ms	100 ms	3.2×
6 GiB (~p90)	1900 ms	100 ms	19×
12 GiB (~p99)	2800 ms	100 ms	28× (median) – 100× (slowest runs)
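
The ratio column is plain division by the assumed 100 ms decode. A short reproduction, using the rounded transfer times from the table (the variable names are illustrative):

```python
DECODE_MS = 100  # the "typical decode" value used in the table

# Rounded p50 transfer times (ms) from the table above.
transfer_ms = {
    "192 MiB": 20,
    "768 MiB": 84,
    "3 GiB": 321,
    "6 GiB": 1900,
    "12 GiB": 2800,
}

ratios = {size: ms / DECODE_MS for size, ms in transfer_ms.items()}
for size, ratio in ratios.items():
    print(f"{size:>8}: {ratio:.1f}x")

# Slowest observed 12 GiB transfers are on the order of 10 s.
worst_case_ratio = 10_000 / DECODE_MS
print(worst_case_ratio)  # 100.0
```

The break-even point, where transfer time equals one decode, falls between the 768 MiB and 3 GiB rows.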

PD-disagg's promised payoff is eliminating prefill–decode interference on the decode instance. The maximum benefit it can buy is bounded above by the decode duration itself (you cannot recover more time than the decode took). For agentic that's 50 – 200 ms. The cost is the T_transfer column above: 20 ms at 2k tokens, rising to 0.3 – 10 s of transfer per routed request from 3 GiB up.

Cost exceeds benefit by roughly 3× at the trace mean (3 GiB) and by up to 100× at the tail. Up to 768 MiB the ratio is small (≤1×); from 3 GiB up it grows quickly; above 6 GiB a single transfer can take up to 10 s.

This data alone is not the whole §3.2 argument — we still need to account for D-side KV capacity (f4b, separate axis), cache reuse loss, and static-partition mismatch (MB3 / MB4 / MB5). But it nails one of the two key cost axes with measured numbers from vanilla mooncake, not the dash0 patched build.

Open questions / next runs

Inter-node RDMA: dash1 ↔ dash2. Measured 2026-05-27 (see Run log): ~10 GB/s, and the 6 GiB-onset variance did not move. Still open: re-run with B-side logging fixed to get the rx_total breakdown.
Bandwidth ceiling investigation: is the 9.7 GB/s ceiling PCIe (so the connector is not using NVLink direct) or some internal limit? If PCIe, can it be lifted with NVLink-direct mooncake config?
Variance at 6+ GiB: investigate. Maybe related to chunking inside batch_transfer_sync_write, or GPU memory pressure when KV approaches HBM ceiling.
Concurrent transfers: measure aggregate bandwidth when N simultaneous transfers happen. PD-disagg in practice does this.
Strict kv_producer/kv_consumer: patch the bootstrap_server bug or use a proxy; verify transfer time is unchanged.

Reproduction

# On dash machine with cpfs mount + ssh access:
bash microbench/fresh_setup/install.sh        # once (idempotent)
bash microbench/fresh_setup/deploy.sh dash1    # push scripts to cpfs

# bring up pair (intra-node)
ssh dash1 'GPU_A=0 GPU_B=1 bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh start'

# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
  python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb2_kv_transfer.py \
    --sizes 512,1024,2048,4096,8192,16384,32768,65536,131072 \
    --repeats 5 --label intra-kvboth \
    --out /home/admin/cpfs/wjh/agentic-kv-fresh/mb2_results/intra_kvboth.json'

# pull logs
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/A/.efc_*_mb2_transfer_pid*.jsonl \
    analysis/mb2/A_intra_kvboth.jsonl
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/B/.efc_*_mb2_transfer_pid*.jsonl \
    analysis/mb2/B_intra_kvboth.jsonl

# analyze
.venv/bin/python microbench/fresh_setup/analyze_mb2.py \
  --a-log analysis/mb2/A_intra_kvboth.jsonl \
  --b-log analysis/mb2/B_intra_kvboth.jsonl \
  --out analysis/mb2/intra_kvboth_breakdown.json

.venv/bin/python microbench/fresh_setup/plot_mb2.py \
  --breakdown analysis/mb2/intra_kvboth_breakdown.json \
  --out-time figs/mb2_transfer_time_intra.png \
  --out-bw figs/mb2_transfer_bw_intra.png

# tear down
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh stop'

Run log

2026-05-27 — intra-node, kv_both, dash1 GPU 0+1

Sweep: 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072 tokens × 5 repeats. Sanity preamble of 512, 2048, 8192 × 2 included in the raw logs (counted as additional samples for those sizes).

Result table above. 9.7 GB/s steady-state up to 3 GiB, variance opens at 6 GiB, p99 agentic-tail transfer 1.5 – 10 s.

Committed as de164e5.

2026-05-27 — inter-node, kv_both, dash1 GPU 0 → dash2 GPU 0

Same sweep config. 200 Gbps RoCE between hosts (RTT ~0.2 ms ping). Producer A on dash1 GPU 0, consumer B on dash2 GPU 0. remote_bootstrap_addr=http://172.27.123.142:8998 (dash1's internal IP).

Raw events: A_inter_kvboth.jsonl (45 send_blocks + 6 sanity). B's receive_kv events are missing for this run: the MB2_LOG_DIR env var did not propagate from the start script through vLLM's EngineCore subprocess on dash2. cat /proc/$ENGINE_PID/environ shows no MB2_LOG_DIR on dash2 but does show it on dash1. This is bookmarked for future investigation; a likely cause is a spawn-vs-fork difference in vLLM's multiproc executor across hosts. Pure-transfer numbers below come from A's send_blocks alone; the full rx_total breakdown is not available for this run.
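
A pre-flight check along these lines could catch the missing variable before a sweep starts. This is a sketch with hypothetical helper names; it relies only on the Linux /proc/<pid>/environ format (NUL-separated KEY=VALUE entries), which reflects the environment the process was started with.

```python
def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE bytes of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env

def read_proc_env(pid: int) -> dict:
    # Linux only; needs permission to read the target process's environ.
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read())

# Demo on a literal sample rather than a live process.
sample = b"PATH=/usr/bin\0MB2_LOG_DIR=/tmp/mb2\0"
print(parse_environ(sample).get("MB2_LOG_DIR"))  # /tmp/mb2
```

Calling read_proc_env on the EngineCore PID of each host and asserting that MB2_LOG_DIR is present would turn the silent logging gap into an immediate failure.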

Per-size pure-transfer (analyzed by analyze_mb2_send_only.py):

input_tokens	KV (MiB)	n	pure_ms p50	min	max	BW p50 (GB/s)	BW max
512	48	5	5.2	5.1	65.8	9.76	9.81
1024	96	5	10.2	10.1	10.4	9.91	10.00
2048	192	5	20.0	20.0	20.5	10.06	10.07
4096	384	5	40.1	40.1	40.5	10.04	10.05
8192	768	5	80.9	80.7	82.5	9.96	9.98
16384	1536	5	161.8	161.7	164.8	9.96	9.96
32768	3072	5	309.6	307.7	526.9	10.40	10.47
65536	6144	5	1733.6	653.5	1921.2	3.72	9.86
131072	12288	5	2818.4	1283.0	9158.6	4.57	10.04
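
The send-only aggregation behind this table can be sketched as follows. It is an illustration of the approach, not the contents of analyze_mb2_send_only.py; the function name and output keys are assumptions, and the demo events are made up.

```python
import statistics
from collections import defaultdict

def aggregate_send_only(events):
    """Group send_blocks events by total_bytes and summarize durations."""
    by_size = defaultdict(list)
    for e in events:
        by_size[e["total_bytes"]].append(e["duration_s"])
    table = []
    for size in sorted(by_size):
        durations = by_size[size]
        table.append({
            "kv_mib": size / 2**20,
            "n": len(durations),
            "pure_ms_p50": statistics.median(durations) * 1e3,
            "pure_ms_min": min(durations) * 1e3,
            "pure_ms_max": max(durations) * 1e3,
            # The fastest run gives the peak bandwidth, in decimal GB/s.
            "bw_max_gb_s": size / min(durations) / 1e9,
        })
    return table

# Made-up events for a single size, for illustration only.
demo = [{"total_bytes": 192 * 2**20, "duration_s": d}
        for d in (0.0200, 0.0201, 0.0205)]
summary = aggregate_send_only(demo)
print(summary[0]["n"], round(summary[0]["pure_ms_p50"], 1))
```

Grouping by total_bytes avoids any dependence on B-side events, at the cost of losing the rx_total and overhead columns.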

Side-by-side comparison with the 2026-05-27 intra-node run:

Size	intra p50 ms	inter p50 ms	gap	intra GB/s	inter GB/s
512	5.3	5.2	−2%	9.40	9.76
1024	10.4	10.2	−2%	9.68	9.91
2048	20.6	20.0	−3%	9.75	10.06
4096	41.5	40.1	−3%	9.71	10.04
8192	83.7	80.9	−3%	9.62	9.96
16384	167.1	161.8	−3%	9.64	9.96
32768	320.9	309.6	−3%	10.04	10.40
65536	1895.1	1733.6	−9%	3.40	3.72
131072	2835.1	2818.4	−1%	4.54	4.57
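
The gap column is the relative difference of the inter-node p50 against the intra-node p50. A short recomputation from the table values, at one decimal place (variable names are illustrative):

```python
# p50 transfer times in ms, copied from the comparison table.
intra = {512: 5.3, 1024: 10.4, 2048: 20.6, 4096: 41.5, 8192: 83.7,
         16384: 167.1, 32768: 320.9, 65536: 1895.1, 131072: 2835.1}
inter = {512: 5.2, 1024: 10.2, 2048: 20.0, 4096: 40.1, 8192: 80.9,
         16384: 161.8, 32768: 309.6, 65536: 1733.6, 131072: 2818.4}

# Negative gap means the inter-node run was faster.
gaps = {n: 100 * (inter[n] - intra[n]) / intra[n] for n in intra}
for n, g in gaps.items():
    print(f"{n:>6} tok: {g:+.1f}%")

largest = min(gaps, key=gaps.get)
print(largest)  # 65536
```

Every gap is negative, so the inter-node run was marginally faster at each size; the only gap well beyond 3% is at 65536 tokens, inside the variance regime.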

The two paths produce essentially the same numbers: Mooncake intra-node is not using NVLink; it goes through RDMA loopback on the local NIC and hits the same ~10 GB/s ceiling as cross-node RDMA. The 6+ GiB variance regime appears on both paths.

Figures: figs/mb2_transfer_time_inter.png, figs/mb2_transfer_bw_inter.png, figs/mb2_transfer_time_compare.png (overlay), figs/mb2_transfer_bw_compare.png.

This collapses the §3.2 narrative to a single number: KV transfer across this cluster is capped at ~9.7–10 GB/s no matter how P and D are placed (within-node or across-node). For p99 agentic KV (11.5 GiB), that's 1.3–10 s of transfer; for 6 GiB it's 0.7–2.4 s. Decode is 50–200 ms. So PD-disagg's cost dominates regardless of layout.
