# MB2 — Mooncake KV Transfer Cost (vanilla vLLM 0.18.1)

Persistent record of the per-stage KV transfer microbench used in §3.2 of
the EAR paper. Re-runs append a dated section at the bottom; the
**Summary** block at the top is what gets cited in the paper.

---

## Summary (latest)

| Path | Steady-state BW | Agentic-tail p99 transfer (11.5 GiB KV) |
|---|---|---|
| **intra-node** (dash1 GPU 0↔1) | **~9.7 GB/s** (96 MiB – 3 GiB) | p50 **1.9 s** · min **1.5 s** · max **10 s** |
| **inter-node** (dash1 GPU0 → dash2 GPU0, 200 Gbps RoCE) | **~10.0 GB/s** (essentially identical) | p50 **1.7 s** · min **1.3 s** · max **9.2 s** |

**Cross-cutting finding** (2026-05-27): **Mooncake transfer cost is
topology-independent** on this hardware. Intra-node and inter-node curves
are statistically indistinguishable (see `figs/mb2_transfer_time_compare.png`,
`figs/mb2_transfer_bw_compare.png`). Mechanism: Mooncake's
`batch_transfer_sync_write` always goes through the RDMA NIC, including
the intra-node case (RDMA loopback). The 200 Gbps NIC, not NVLink, is
the bottleneck. **Implication for §3.2**: PD-disaggregation does not
get cheaper by co-locating P and D on the same node — the ~9.7 GB/s
ceiling applies regardless. Halving the transfer cost cannot be bought
back by topology.

**Headline for the paper §3.2**: at the agentic tail, **pure KV transfer
takes 1.5 – 10 s**. A median agentic decode is **50 – 200 ms** of tool-call
output. So **PD-disaggregation adds 8 – 100 × decode-time of transfer on
top of every routed request**. Phase isolation (the thing PD-disagg
trades transfer cost for) can only win back at most one decode duration
— for agentic that's negligible. The arithmetic is one-sided.

---

## Setup

| Component | Value |
|---|---|
| Host | `dash1` (`ds-6348bee4-1-...-rwkv2`), 8× NVIDIA H20 96 GiB, driver 570.133.20 |
| Venv  | `/home/admin/cpfs/wjh/agentic-kv-fresh/.venv` (shared via cpfs from any dash host) |
| vLLM | 0.18.1 official wheel |
| mooncake-transfer-engine | 0.3.11.post1 (`pip install mooncake-transfer-engine`) |
| Model | `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct` |
| Per-token KV | 98304 B |
| kv_role | `kv_both` on both instances (see *Known limitations* re kv_producer/kv_consumer) |
| Per-instance config | `--tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 200000 --enable-prefix-caching` |

## Method

3-step black-box bench:

1. `do_remote_decode` to A (producer) with a client-generated `transfer_id`.
   `max_tokens=1`; A computes prefill and parks the KV for later pull.
2. `do_remote_prefill` to B (consumer) with the same `transfer_id` plus
   `remote_engine_id` (from A's `/query` on bootstrap port) and
   `remote_bootstrap_addr` (`http://127.0.0.1:8998`). **This step
   triggers the actual KV transfer; it is the measured step.**
3. Plain `completion` on B (`--skip-verify` off): expect `cached_tokens
   ≈ prompt_len`, confirming the KV landed on B.

Per-stage breakdown is obtained by instrumenting the vLLM-shipped
`MooncakeConnector` (NOT the mooncake-package's `mooncake_connector_v1`,
which vLLM 0.18.1 does not load) at two sites:

- **`_send_blocks`** (P-side, line 980): emits `send_blocks` event with
  `total_bytes`, `duration_s`, `t_start_unix`. The `duration_s` is the
  wall-time of a single `batch_transfer_sync_write` call — **this is
  what we call `pure_transfer`**.
- **`receive_kv_from_single_worker`** (D-side, line 1139, async):
  emits `receive_kv_enter` at function start and `receive_kv_finish`
  on FINISH-status response. The wall-time between them is
  **`rx_total`** (= ZMQ round-trip + setup + pure_transfer + ack).

Pairing across A's and B's logs is by **time window**: each B
(enter, finish) pair is matched to the A send_blocks whose
`t_start_unix` falls in `[rx_t_start, rx_t_end]`. With single-request
benchmarks this is unambiguous.

Scripts:
- `microbench/fresh_setup/start_vllm_pair.sh` — bring up pair + apply/revert patch
- `microbench/fresh_setup/instrument_mooncake.py` — apply/revert MB2 patches
- `microbench/fresh_setup/mb2_kv_transfer.py` — client (3-step bench loop)
- `microbench/fresh_setup/analyze_mb2.py` — pair A/B events into per-size table
- `microbench/fresh_setup/plot_mb2.py` — log-log time + bandwidth curves

## Results — intra-node (2026-05-27, dash1 GPU 0+1, kv_both)

Raw events: `A_intra_kvboth.jsonl`, `B_intra_kvboth.jsonl`.
Joined + aggregated: `intra_kvboth_breakdown.json`.
Figures: `figs/mb2_transfer_time_intra.png`, `figs/mb2_transfer_bw_intra.png`.

| input_tokens | KV (MiB) | n | pure_ms p50 | pure_ms max | rx_total_ms | overhead_ms | BW p50 (GB/s) | BW max (GB/s) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
|    512 |    48 | 5 |   5.3 |   5.6 |  12.2 |  3.3 | 9.40 |  9.53 |
|   1024 |    96 | 5 |  10.4 |  10.5 |  11.9 |  1.5 | 9.68 |  9.72 |
|   2048 |   192 | 5 |  20.6 |  21.0 |  22.5 |  1.8 | 9.75 |  9.78 |
|   4096 |   384 | 5 |  41.5 |  41.7 |  43.5 |  2.0 | 9.71 |  9.72 |
|   8192 |   768 | 5 |  83.7 |  84.4 |  86.2 |  2.2 | 9.62 |  9.69 |
|  16384 |  1536 | 5 | 167.1 | 167.7 | 170.2 |  2.7 | 9.64 |  9.67 |
|  32768 |  3072 | 5 | 320.9 | 322.1 | 425.2 | 20.5 | 10.04 | 10.09 |
|  65536 |  6144 | 5 | **1895.1** | 2375.2 | 1586.1 | 69.6 | 3.40 | 9.68 |
| 131072 | 12288 | 5 | **2835.1** | 8923.6 | 4362.5 | 91.4 | 4.54 | 9.67 |

**Three regimes** in the data:

1. **<= 3 GiB** — linear in size, bandwidth ≈ **9.7 GB/s steady**.
2. **6 GiB ± a bit** — onset of variance: max bandwidth still 9.7 GB/s,
   but p50 collapses to ~3.4 GB/s. Some runs achieve full speed; others
   take 2–3 × longer.
3. **12 GiB** — wide spread (min 1.5 s, max 10 s for the same 11.5 GiB
   transfer). This is the agentic-p99 size region.

The bandwidth ceiling of ~10 GB/s is well below H20's NVLink p2p
(claimed ~900 GB/s in IB) — likely the transfer is **PCIe-staged
through host memory** rather than NVLink direct. To confirm we would
need `nvidia-smi topo -m` and `mooncake_transfer_engine_topology_dump`
analysis; not done yet.

## Known limitations of this measurement

- **kv_both, not strict PD-disagg.** vLLM 0.18.1 with
  `kv_role=kv_consumer` raises `AttributeError: 'MooncakeConnectorWorker'
  object has no attribute 'bootstrap_server'` (the attribute is only
  assigned inside `if not self.is_kv_consumer`). The transfer mechanics
  are identical — same `batch_transfer_sync_write` — so the cost
  measurement is comparable. The role gate only affects which request
  types each instance *accepts*. §5.2 strict PD-disagg baseline will
  need either to fix that bug or front the pair with a role-aware proxy.
- **Single in-flight request.** All measurements here are serial.
  Real PD-disagg will have many concurrent transfers; bandwidth
  contention is not characterized.
- **Intra-node only.** Inter-node RDMA path will be slower; not yet
  measured.
- **Sanity preamble events.** The raw logs include 6 events from
  earlier sanity runs in addition to the 45-event sweep. `analyze_mb2.py`
  treats them as additional samples (same sizes); the per-size
  aggregates use all of them.

## Implications for §3.2 PD-disagg cost argument

For each PD-disagg-routed request, transfer wall-time is:

```
T_transfer(KV_size) = max(  pure_transfer(KV_size),  rx_overhead  )
                    ≈ KV_size / 9.7 GB/s   for KV_size <= 3 GiB
                    ≈ 0.3 – 10 s            for KV_size in [3, 12] GiB
```

Agentic decode wall-time is typically 50 – 200 ms (tool-call output of
a few tens of tokens at ~50 tok/s). So the **transfer/decode ratio**
under intra-node best-case Mooncake is:

| KV size | T_transfer @9.7 GB/s | typical decode | T_transfer / T_decode |
|---|---:|---:|---:|
| 192 MiB (2k tok)   |   20 ms | 100 ms | 0.2× |
| 768 MiB (8k tok)   |   84 ms | 100 ms | 0.8× |
| 3 GiB (33k tok ≈ trace mean) |  321 ms | 100 ms | **3.2×** |
| 6 GiB (~p90)       | 1900 ms | 100 ms | **19×**  |
| 12 GiB (~p99)      | 2800 ms | 100 ms | **28×** (median) – **100×** (p99 variance) |

PD-disagg's promised payoff is *eliminating prefill–decode interference
on the decode instance*. The maximum benefit it can buy is bounded
above by the **decode duration itself** (you cannot recover more time
than the decode existed). For agentic that's 50 – 200 ms. The cost is
the table column above — 0.3 – 10 s of transfer per routed request.

**Cost > Benefit by 5× to 100× across the agentic distribution.** Below
~3 GiB the ratio is small (≤1×); above 3 GiB the ratio explodes; above
6 GiB even individual draws can take 10 s for a single transfer.

This data alone is not the whole §3.2 argument — we still need to
account for D-side KV capacity (`f4b`, separate axis), cache reuse loss,
and static-partition mismatch (MB3 / MB4 / MB5). But it nails one
of the two key cost axes with measured numbers from vanilla mooncake,
not the dash0 patched build.

## Open questions / next runs

- **Inter-node RDMA**: dash1 ↔ dash2. Expected lower bandwidth (~5–15
  GB/s); want to see if the 6 GiB-onset variance moves.
- **Bandwidth ceiling investigation**: is the 9.7 GB/s ceiling PCIe (so
  the connector is not using NVLink direct) or some internal limit? If
  PCIe, can it be lifted with NVLink-direct mooncake config?
- **Variance at 6+ GiB**: investigate. Maybe related to chunking
  inside `batch_transfer_sync_write`, or GPU memory pressure when KV
  approaches HBM ceiling.
- **Concurrent transfers**: measure aggregate bandwidth when N
  simultaneous transfers happen. PD-disagg in practice does this.
- **Strict kv_producer/kv_consumer**: patch the bootstrap_server bug
  or use a proxy; verify transfer time is unchanged.

## Reproduction

```bash
# On dash machine with cpfs mount + ssh access:
bash microbench/fresh_setup/install.sh        # once (idempotent)
bash microbench/fresh_setup/deploy.sh dash1    # push scripts to cpfs

# bring up pair (intra-node)
ssh dash1 'GPU_A=0 GPU_B=1 bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh start'

# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
  python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb2_kv_transfer.py \
    --sizes 512,1024,2048,4096,8192,16384,32768,65536,131072 \
    --repeats 5 --label intra-kvboth \
    --out /home/admin/cpfs/wjh/agentic-kv-fresh/mb2_results/intra_kvboth.json'

# pull logs
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/A/.efc_*_mb2_transfer_pid*.jsonl \
    analysis/mb2/A_intra_kvboth.jsonl
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/B/.efc_*_mb2_transfer_pid*.jsonl \
    analysis/mb2/B_intra_kvboth.jsonl

# analyze
.venv/bin/python microbench/fresh_setup/analyze_mb2.py \
  --a-log analysis/mb2/A_intra_kvboth.jsonl \
  --b-log analysis/mb2/B_intra_kvboth.jsonl \
  --out analysis/mb2/intra_kvboth_breakdown.json

.venv/bin/python microbench/fresh_setup/plot_mb2.py \
  --breakdown analysis/mb2/intra_kvboth_breakdown.json \
  --out-time figs/mb2_transfer_time_intra.png \
  --out-bw figs/mb2_transfer_bw_intra.png

# tear down
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh stop'
```

## Run log

### 2026-05-27 — intra-node, kv_both, dash1 GPU 0+1

Sweep: `512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072` tokens
× 5 repeats. Sanity preamble of `512, 2048, 8192` × 2 included in the
raw logs (counted as additional samples for those sizes).

Result table above. **9.7 GB/s steady-state up to 3 GiB**, variance
opens at 6 GiB, p99 agentic-tail transfer 1.5 – 10 s.

Committed as `de164e5`.

### 2026-05-27 — inter-node, kv_both, dash1 GPU 0 → dash2 GPU 0

Same sweep config. 200 Gbps RoCE between hosts (RTT ~0.2 ms ping).
Producer A on dash1 GPU 0, consumer B on dash2 GPU 0.
remote_bootstrap_addr=`http://172.27.123.142:8998` (dash1's internal IP).

Raw events: `A_inter_kvboth.jsonl` (45 send_blocks + 6 sanity).
B's receive_kv events are **missing** for this run — the
`MB2_LOG_DIR` env var did not propagate from the start-script through
vLLM's EngineCore subprocess on dash2 (visible via
`cat /proc/$ENGINE_PID/environ` shows empty for dash2 but contains
MB2_LOG_DIR for dash1 — bookmark for future investigation, likely
spawn-vs-fork difference in vLLM's multiproc executor across hosts).
Pure-transfer numbers below come from A's send_blocks alone; full
rx_total breakdown not available for this run.

Per-size pure-transfer (analyzed by `analyze_mb2_send_only.py`):

| input_tokens | KV (MiB) | n | pure_ms p50 | min | max | BW p50 (GB/s) | BW max |
|---:|---:|---:|---:|---:|---:|---:|---:|
|    512 |    48 | 5 |   5.2 |   5.1 |  65.8 |  9.76 |  9.81 |
|   1024 |    96 | 5 |  10.2 |  10.1 |  10.4 |  9.91 | 10.00 |
|   2048 |   192 | 5 |  20.0 |  20.0 |  20.5 | 10.06 | 10.07 |
|   4096 |   384 | 5 |  40.1 |  40.1 |  40.5 | 10.04 | 10.05 |
|   8192 |   768 | 5 |  80.9 |  80.7 |  82.5 |  9.96 |  9.98 |
|  16384 |  1536 | 5 | 161.8 | 161.7 | 164.8 |  9.96 |  9.96 |
|  32768 |  3072 | 5 | 309.6 | 307.7 | 526.9 | 10.40 | 10.47 |
|  65536 |  6144 | 5 | 1733.6 | 653.5 | 1921.2 | 3.72 | 9.86 |
| 131072 | 12288 | 5 | 2818.4 | 1283.0 | 9158.6 | 4.57 | 10.04 |

Side-by-side comparison with the 2026-05-27 intra-node run:

| Size | intra p50 ms | inter p50 ms | gap | intra GB/s | inter GB/s |
|---|---:|---:|---:|---:|---:|
|   512 |    5.3 |    5.2 | −2% | 9.40 |  9.76 |
|  1024 |   10.4 |   10.2 | −2% | 9.68 |  9.91 |
|  2048 |   20.6 |   20.0 | −3% | 9.75 | 10.06 |
|  4096 |   41.5 |   40.1 | −3% | 9.71 | 10.04 |
|  8192 |   83.7 |   80.9 | −3% | 9.62 |  9.96 |
| 16384 |  167.1 |  161.8 | −3% | 9.64 |  9.96 |
| 32768 |  320.9 |  309.6 | −3% | 10.04 | 10.40 |
| 65536 | 1895.1 | 1733.6 | −9% | 3.40 |  3.72 |
|131072 | 2835.1 | 2818.4 | −1% | 4.54 |  4.57 |

The two paths produce essentially the same numbers — **mooncake intra-
node is not using NVLink**, it's going through RDMA-loopback on the
local NIC and gets the same ~10 GB/s ceiling as cross-node RDMA. The
6+ GiB variance regime is also identical between paths.

Figures: `figs/mb2_transfer_time_inter.png`, `figs/mb2_transfer_bw_inter.png`,
`figs/mb2_transfer_time_compare.png` (overlay), `figs/mb2_transfer_bw_compare.png`.

This collapses the §3.2 narrative to a single number: **PD-disagg
across this cluster costs ~9.7–10 GB/s of transfer bandwidth no matter
how you place P and D** (within-node or across-node). For p99 agentic
KV (11.5 GiB), that's 1.3–10 s of transfer; for 6 GiB it's 0.7–2 s.
Decode is 50–200 ms. So PD-disagg's cost dominates regardless of layout.