# MB2 — Mooncake KV Transfer Cost (vanilla vLLM 0.18.1) Persistent record of the per-stage KV transfer microbench used in §3.2 of the EAR paper. Re-runs append a dated section at the bottom; the **Summary** block at the top is what gets cited in the paper. --- ## Summary (latest) | Path | Steady-state BW | Agentic-tail p99 transfer (11.5 GiB KV) | |---|---|---| | **intra-node** (dash1 GPU 0↔1) | **~9.7 GB/s** (96 MiB – 3 GiB) | p50 **1.9 s** · min **1.5 s** · max **10 s** | | **inter-node** (dash1 GPU0 → dash2 GPU0, 200 Gbps RoCE) | **~10.0 GB/s** (essentially identical) | p50 **1.7 s** · min **1.3 s** · max **9.2 s** | **Cross-cutting finding** (2026-05-27): **Mooncake transfer cost is topology-independent** on this hardware. Intra-node and inter-node curves are statistically indistinguishable (see `figs/mb2_transfer_time_compare.png`, `figs/mb2_transfer_bw_compare.png`). Mechanism: Mooncake's `batch_transfer_sync_write` always goes through the RDMA NIC, including the intra-node case (RDMA loopback). The 200 Gbps NIC, not NVLink, is the bottleneck. **Implication for §3.2**: PD-disaggregation does not get cheaper by co-locating P and D on the same node — the ~9.7 GB/s ceiling applies regardless. Halving the transfer cost cannot be bought back by topology. **Headline for the paper §3.2**: at the agentic tail, **pure KV transfer takes 1.5 – 10 s**. A median agentic decode is **50 – 200 ms** of tool-call output. So **PD-disaggregation adds 8 – 100 × decode-time of transfer on top of every routed request**. Phase isolation (the thing PD-disagg trades transfer cost for) can only win back at most one decode duration — for agentic that's negligible. The arithmetic is one-sided. --- ## Setup | Component | Value | |---|---| | Host | `dash1` (`ds-6348bee4-1-...-rwkv2`), 8× NVIDIA H20 96 GiB, driver 570.133.20 | | Venv | `/home/admin/cpfs/wjh/agentic-kv-fresh/.venv` (shared via cpfs from any dash host) | | vLLM | 0.18.1 official wheel | | mooncake-transfer-engine | 0.3.11.post1 (`pip install mooncake-transfer-engine`) | | Model | `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct` | | Per-token KV | 98304 B | | kv_role | `kv_both` on both instances (see *Known limitations* re kv_producer/kv_consumer) | | Per-instance config | `--tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 200000 --enable-prefix-caching` | ## Method 3-step black-box bench: 1. `do_remote_decode` to A (producer) with a client-generated `transfer_id`. `max_tokens=1`; A computes prefill and parks the KV for later pull. 2. `do_remote_prefill` to B (consumer) with the same `transfer_id` plus `remote_engine_id` (from A's `/query` on bootstrap port) and `remote_bootstrap_addr` (`http://127.0.0.1:8998`). **This step triggers the actual KV transfer; it is the measured step.** 3. Plain `completion` on B (`--skip-verify` off): expect `cached_tokens ≈ prompt_len`, confirming the KV landed on B. Per-stage breakdown is obtained by instrumenting the vLLM-shipped `MooncakeConnector` (NOT the mooncake-package's `mooncake_connector_v1`, which vLLM 0.18.1 does not load) at two sites: - **`_send_blocks`** (P-side, line 980): emits `send_blocks` event with `total_bytes`, `duration_s`, `t_start_unix`. The `duration_s` is the wall-time of a single `batch_transfer_sync_write` call — **this is what we call `pure_transfer`**. - **`receive_kv_from_single_worker`** (D-side, line 1139, async): emits `receive_kv_enter` at function start and `receive_kv_finish` on FINISH-status response. The wall-time between them is **`rx_total`** (= ZMQ round-trip + setup + pure_transfer + ack). Pairing across A's and B's logs is by **time window**: each B (enter, finish) pair is matched to the A send_blocks whose `t_start_unix` falls in `[rx_t_start, rx_t_end]`. With single-request benchmarks this is unambiguous. Scripts: - `microbench/fresh_setup/start_vllm_pair.sh` — bring up pair + apply/revert patch - `microbench/fresh_setup/instrument_mooncake.py` — apply/revert MB2 patches - `microbench/fresh_setup/mb2_kv_transfer.py` — client (3-step bench loop) - `microbench/fresh_setup/analyze_mb2.py` — pair A/B events into per-size table - `microbench/fresh_setup/plot_mb2.py` — log-log time + bandwidth curves ## Results — intra-node (2026-05-27, dash1 GPU 0+1, kv_both) Raw events: `A_intra_kvboth.jsonl`, `B_intra_kvboth.jsonl`. Joined + aggregated: `intra_kvboth_breakdown.json`. Figures: `figs/mb2_transfer_time_intra.png`, `figs/mb2_transfer_bw_intra.png`. | input_tokens | KV (MiB) | n | pure_ms p50 | pure_ms max | rx_total_ms | overhead_ms | BW p50 (GB/s) | BW max (GB/s) | |---:|---:|---:|---:|---:|---:|---:|---:|---:| | 512 | 48 | 5 | 5.3 | 5.6 | 12.2 | 3.3 | 9.40 | 9.53 | | 1024 | 96 | 5 | 10.4 | 10.5 | 11.9 | 1.5 | 9.68 | 9.72 | | 2048 | 192 | 5 | 20.6 | 21.0 | 22.5 | 1.8 | 9.75 | 9.78 | | 4096 | 384 | 5 | 41.5 | 41.7 | 43.5 | 2.0 | 9.71 | 9.72 | | 8192 | 768 | 5 | 83.7 | 84.4 | 86.2 | 2.2 | 9.62 | 9.69 | | 16384 | 1536 | 5 | 167.1 | 167.7 | 170.2 | 2.7 | 9.64 | 9.67 | | 32768 | 3072 | 5 | 320.9 | 322.1 | 425.2 | 20.5 | 10.04 | 10.09 | | 65536 | 6144 | 5 | **1895.1** | 2375.2 | 1586.1 | 69.6 | 3.40 | 9.68 | | 131072 | 12288 | 5 | **2835.1** | 8923.6 | 4362.5 | 91.4 | 4.54 | 9.67 | **Three regimes** in the data: 1. **<= 3 GiB** — linear in size, bandwidth ≈ **9.7 GB/s steady**. 2. **6 GiB ± a bit** — onset of variance: max bandwidth still 9.7 GB/s, but p50 collapses to ~3.4 GB/s. Some runs achieve full speed; others take 2–3 × longer. 3. **12 GiB** — wide spread (min 1.5 s, max 10 s for the same 11.5 GiB transfer). This is the agentic-p99 size region. The bandwidth ceiling of ~10 GB/s is well below H20's NVLink p2p (claimed ~900 GB/s in IB) — likely the transfer is **PCIe-staged through host memory** rather than NVLink direct. To confirm we would need `nvidia-smi topo -m` and `mooncake_transfer_engine_topology_dump` analysis; not done yet. ## Known limitations of this measurement - **kv_both, not strict PD-disagg.** vLLM 0.18.1 with `kv_role=kv_consumer` raises `AttributeError: 'MooncakeConnectorWorker' object has no attribute 'bootstrap_server'` (the attribute is only assigned inside `if not self.is_kv_consumer`). The transfer mechanics are identical — same `batch_transfer_sync_write` — so the cost measurement is comparable. The role gate only affects which request types each instance *accepts*. §5.2 strict PD-disagg baseline will need either to fix that bug or front the pair with a role-aware proxy. - **Single in-flight request.** All measurements here are serial. Real PD-disagg will have many concurrent transfers; bandwidth contention is not characterized. - **Intra-node only.** Inter-node RDMA path will be slower; not yet measured. - **Sanity preamble events.** The raw logs include 6 events from earlier sanity runs in addition to the 45-event sweep. `analyze_mb2.py` treats them as additional samples (same sizes); the per-size aggregates use all of them. ## Implications for §3.2 PD-disagg cost argument For each PD-disagg-routed request, transfer wall-time is: ``` T_transfer(KV_size) = max( pure_transfer(KV_size), rx_overhead ) ≈ KV_size / 9.7 GB/s for KV_size <= 3 GiB ≈ 0.3 – 10 s for KV_size in [3, 12] GiB ``` Agentic decode wall-time is typically 50 – 200 ms (tool-call output of a few tens of tokens at ~50 tok/s). So the **transfer/decode ratio** under intra-node best-case Mooncake is: | KV size | T_transfer @9.7 GB/s | typical decode | T_transfer / T_decode | |---|---:|---:|---:| | 192 MiB (2k tok) | 20 ms | 100 ms | 0.2× | | 768 MiB (8k tok) | 84 ms | 100 ms | 0.8× | | 3 GiB (33k tok ≈ trace mean) | 321 ms | 100 ms | **3.2×** | | 6 GiB (~p90) | 1900 ms | 100 ms | **19×** | | 12 GiB (~p99) | 2800 ms | 100 ms | **28×** (median) – **100×** (p99 variance) | PD-disagg's promised payoff is *eliminating prefill–decode interference on the decode instance*. The maximum benefit it can buy is bounded above by the **decode duration itself** (you cannot recover more time than the decode existed). For agentic that's 50 – 200 ms. The cost is the table column above — 0.3 – 10 s of transfer per routed request. **Cost > Benefit by 5× to 100× across the agentic distribution.** Below ~3 GiB the ratio is small (≤1×); above 3 GiB the ratio explodes; above 6 GiB even individual draws can take 10 s for a single transfer. This data alone is not the whole §3.2 argument — we still need to account for D-side KV capacity (`f4b`, separate axis), cache reuse loss, and static-partition mismatch (MB3 / MB4 / MB5). But it nails one of the two key cost axes with measured numbers from vanilla mooncake, not the dash0 patched build. ## Open questions / next runs - **Inter-node RDMA**: dash1 ↔ dash2. Expected lower bandwidth (~5–15 GB/s); want to see if the 6 GiB-onset variance moves. - **Bandwidth ceiling investigation**: is the 9.7 GB/s ceiling PCIe (so the connector is not using NVLink direct) or some internal limit? If PCIe, can it be lifted with NVLink-direct mooncake config? - **Variance at 6+ GiB**: investigate. Maybe related to chunking inside `batch_transfer_sync_write`, or GPU memory pressure when KV approaches HBM ceiling. - **Concurrent transfers**: measure aggregate bandwidth when N simultaneous transfers happen. PD-disagg in practice does this. - **Strict kv_producer/kv_consumer**: patch the bootstrap_server bug or use a proxy; verify transfer time is unchanged. ## Reproduction ```bash # On dash machine with cpfs mount + ssh access: bash microbench/fresh_setup/install.sh # once (idempotent) bash microbench/fresh_setup/deploy.sh dash1 # push scripts to cpfs # bring up pair (intra-node) ssh dash1 'GPU_A=0 GPU_B=1 bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh start' # sweep ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \ python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb2_kv_transfer.py \ --sizes 512,1024,2048,4096,8192,16384,32768,65536,131072 \ --repeats 5 --label intra-kvboth \ --out /home/admin/cpfs/wjh/agentic-kv-fresh/mb2_results/intra_kvboth.json' # pull logs scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/A/.efc_*_mb2_transfer_pid*.jsonl \ analysis/mb2/A_intra_kvboth.jsonl scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/B/.efc_*_mb2_transfer_pid*.jsonl \ analysis/mb2/B_intra_kvboth.jsonl # analyze .venv/bin/python microbench/fresh_setup/analyze_mb2.py \ --a-log analysis/mb2/A_intra_kvboth.jsonl \ --b-log analysis/mb2/B_intra_kvboth.jsonl \ --out analysis/mb2/intra_kvboth_breakdown.json .venv/bin/python microbench/fresh_setup/plot_mb2.py \ --breakdown analysis/mb2/intra_kvboth_breakdown.json \ --out-time figs/mb2_transfer_time_intra.png \ --out-bw figs/mb2_transfer_bw_intra.png # tear down ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh stop' ``` ## Run log ### 2026-05-27 — intra-node, kv_both, dash1 GPU 0+1 Sweep: `512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072` tokens × 5 repeats. Sanity preamble of `512, 2048, 8192` × 2 included in the raw logs (counted as additional samples for those sizes). Result table above. **9.7 GB/s steady-state up to 3 GiB**, variance opens at 6 GiB, p99 agentic-tail transfer 1.5 – 10 s. Committed as `de164e5`. ### 2026-05-27 — inter-node, kv_both, dash1 GPU 0 → dash2 GPU 0 Same sweep config. 200 Gbps RoCE between hosts (RTT ~0.2 ms ping). Producer A on dash1 GPU 0, consumer B on dash2 GPU 0. remote_bootstrap_addr=`http://172.27.123.142:8998` (dash1's internal IP). Raw events: `A_inter_kvboth.jsonl` (45 send_blocks + 6 sanity). B's receive_kv events are **missing** for this run — the `MB2_LOG_DIR` env var did not propagate from the start-script through vLLM's EngineCore subprocess on dash2 (visible via `cat /proc/$ENGINE_PID/environ` shows empty for dash2 but contains MB2_LOG_DIR for dash1 — bookmark for future investigation, likely spawn-vs-fork difference in vLLM's multiproc executor across hosts). Pure-transfer numbers below come from A's send_blocks alone; full rx_total breakdown not available for this run. Per-size pure-transfer (analyzed by `analyze_mb2_send_only.py`): | input_tokens | KV (MiB) | n | pure_ms p50 | min | max | BW p50 (GB/s) | BW max | |---:|---:|---:|---:|---:|---:|---:|---:| | 512 | 48 | 5 | 5.2 | 5.1 | 65.8 | 9.76 | 9.81 | | 1024 | 96 | 5 | 10.2 | 10.1 | 10.4 | 9.91 | 10.00 | | 2048 | 192 | 5 | 20.0 | 20.0 | 20.5 | 10.06 | 10.07 | | 4096 | 384 | 5 | 40.1 | 40.1 | 40.5 | 10.04 | 10.05 | | 8192 | 768 | 5 | 80.9 | 80.7 | 82.5 | 9.96 | 9.98 | | 16384 | 1536 | 5 | 161.8 | 161.7 | 164.8 | 9.96 | 9.96 | | 32768 | 3072 | 5 | 309.6 | 307.7 | 526.9 | 10.40 | 10.47 | | 65536 | 6144 | 5 | 1733.6 | 653.5 | 1921.2 | 3.72 | 9.86 | | 131072 | 12288 | 5 | 2818.4 | 1283.0 | 9158.6 | 4.57 | 10.04 | Side-by-side comparison with the 2026-05-27 intra-node run: | Size | intra p50 ms | inter p50 ms | gap | intra GB/s | inter GB/s | |---|---:|---:|---:|---:|---:| | 512 | 5.3 | 5.2 | −2% | 9.40 | 9.76 | | 1024 | 10.4 | 10.2 | −2% | 9.68 | 9.91 | | 2048 | 20.6 | 20.0 | −3% | 9.75 | 10.06 | | 4096 | 41.5 | 40.1 | −3% | 9.71 | 10.04 | | 8192 | 83.7 | 80.9 | −3% | 9.62 | 9.96 | | 16384 | 167.1 | 161.8 | −3% | 9.64 | 9.96 | | 32768 | 320.9 | 309.6 | −3% | 10.04 | 10.40 | | 65536 | 1895.1 | 1733.6 | −9% | 3.40 | 3.72 | |131072 | 2835.1 | 2818.4 | −1% | 4.54 | 4.57 | The two paths produce essentially the same numbers — **mooncake intra- node is not using NVLink**, it's going through RDMA-loopback on the local NIC and gets the same ~10 GB/s ceiling as cross-node RDMA. The 6+ GiB variance regime is also identical between paths. Figures: `figs/mb2_transfer_time_inter.png`, `figs/mb2_transfer_bw_inter.png`, `figs/mb2_transfer_time_compare.png` (overlay), `figs/mb2_transfer_bw_compare.png`. This collapses the §3.2 narrative to a single number: **PD-disagg across this cluster costs ~9.7–10 GB/s of transfer bandwidth no matter how you place P and D** (within-node or across-node). For p99 agentic KV (11.5 GiB), that's 1.3–10 s of transfer; for 6 GiB it's 0.7–2 s. Decode is 50–200 ms. So PD-disagg's cost dominates regardless of layout.