agentic-kvc/analysis/mb2
Gahow Wang da39ab6804 Correct PD-disagg cost/benefit framing across repo
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 22:04:49 +08:00

MB2 — Mooncake KV Transfer Cost (vanilla vLLM 0.18.1)

Persistent record of the per-stage KV transfer microbench used in §3.2 of the EAR paper. Re-runs append a dated section at the bottom; the Summary block at the top is what gets cited in the paper.


Summary (latest)

| Path | Steady-state BW | Agentic-tail p99 transfer (11.5 GiB KV) |
|---|---|---|
| intra-node (dash1 GPU 0↔1) | ~9.7 GB/s (96 MiB – 3 GiB) | p50 1.9 s · min 1.5 s · max 10 s |
| inter-node (dash1 GPU0 → dash2 GPU0, 200 Gbps RoCE) | ~10.0 GB/s (essentially identical) | p50 1.7 s · min 1.3 s · max 9.2 s |

Cross-cutting finding (2026-05-27): Mooncake transfer cost is topology-independent on this hardware. The intra-node and inter-node p50 transfer times differ by at most 9% at every measured size (see figs/mb2_transfer_time_compare.png, figs/mb2_transfer_bw_compare.png). Mechanism: Mooncake's batch_transfer_sync_write always goes through the RDMA NIC, including the intra-node case (RDMA loopback). The 200 Gbps NIC, not NVLink, is the bottleneck. Implication for §3.2: PD-disaggregation does not get cheaper by co-locating P and D on the same node; the ~9.7 GB/s ceiling applies regardless, so the transfer cost cannot be reduced by changing topology.

What MB2 actually measures: the per-request charge that PD-disagg pays for every routed request, T_transfer ≈ KV_size / 9.7 GB/s. For agentic this ranges from 8 ms (192 MiB / trace lower) to 1.9 s (11.5 GiB / p99).

⚠ Correction (2026-05-27): an earlier version of this README framed §3.2 as "transfer cost (1.5–10 s) >> decode duration (50–200 ms), so PD-disagg loses on cost-vs-benefit." That accounting was wrong: PD-disagg's phase-isolation benefit is per-prefill-event and equals D × T_prefill (aggregate across stalled decode streams), not the single-request decode duration. With trace-mean T_prefill = 4.5 s and D = 8, the benefit is ~36 s, far larger than the ~0.32 s transfer cost. PD-disagg's phase-isolation axis is a win, not a loss.

The actual reason static PD-disagg fails in agentic is D-side KV capacity (figs/f4b_pdsep_kv_wall.png), not a cost-vs-benefit imbalance. See RESULTS_SUMMARY.md section 4 for the corrected framing. MB2 still serves as the source of the per-request transfer cost number used in that analysis.


Setup

| Component | Value |
|---|---|
| Host | dash1 (ds-6348bee4-1-...-rwkv2), 8× NVIDIA H20 96 GiB, driver 570.133.20 |
| Venv | /home/admin/cpfs/wjh/agentic-kv-fresh/.venv (shared via cpfs from any dash host) |
| vLLM | 0.18.1 official wheel |
| mooncake-transfer-engine | 0.3.11.post1 (pip install mooncake-transfer-engine) |
| Model | /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct |
| Per-token KV | 98304 B |
| kv_role | kv_both on both instances (see Known limitations re kv_producer/kv_consumer) |
| Per-instance config | --tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 200000 --enable-prefix-caching |
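The per-token KV figure is what converts the token counts in the sweep into the KV sizes shown in the results tables. A minimal sketch of that conversion follows. The layer/head breakdown in `per_token_kv_bytes` (48 layers, 4 KV heads, head dim 128, 2-byte elements) is an assumption about the model config, not something stated in this README; it is included only because it happens to reproduce 98304 B.

```python
BYTES_PER_TOKEN = 98304  # "Per-token KV" from the Setup table


def per_token_kv_bytes(layers=48, kv_heads=4, head_dim=128, dtype_bytes=2):
    """K and V for one token. The default values are ASSUMED, not from this README."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_mib(num_tokens, bytes_per_token=BYTES_PER_TOKEN):
    """KV cache size in MiB for a prompt of num_tokens tokens."""
    return num_tokens * bytes_per_token / 2**20


if __name__ == "__main__":
    for tokens in (512, 2048, 32768, 131072):
        print(f"{tokens:>7} tokens -> {kv_mib(tokens):8.0f} MiB")
```

With these inputs, 512 tokens maps to 48 MiB and 131072 tokens to 12288 MiB, which matches the KV (MiB) column of the sweep.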

Method

3-step black-box bench:

  1. do_remote_decode to A (producer) with a client-generated transfer_id. max_tokens=1; A computes prefill and parks the KV for later pull.
  2. do_remote_prefill to B (consumer) with the same transfer_id plus remote_engine_id (from A's /query on bootstrap port) and remote_bootstrap_addr (http://127.0.0.1:8998). This step triggers the actual KV transfer; it is the measured step.
  3. Plain completion on B (--skip-verify off): expect cached_tokens ≈ prompt_len, confirming the KV landed on B.
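The three steps above can be sketched as request payloads. This is an illustration only: the field names do_remote_decode, do_remote_prefill, transfer_id, remote_engine_id and remote_bootstrap_addr come from the steps above, but the `kv_transfer_params` wrapper key, the use of `max_tokens=1` in steps 2 and 3, and the helper name `build_bench_requests` are assumptions, and the actual client is mb2_kv_transfer.py. The sketch only builds dictionaries and sends nothing.

```python
import uuid


def build_bench_requests(model, prompt, remote_engine_id, remote_bootstrap_addr):
    """Build the three request bodies of one bench iteration (hypothetical shape)."""
    transfer_id = str(uuid.uuid4())  # client-generated, shared by steps 1 and 2
    base = {"model": model, "prompt": prompt, "max_tokens": 1}

    # Step 1: sent to A (producer); A prefills and parks the KV.
    step1 = {**base, "kv_transfer_params": {
        "do_remote_decode": True,
        "transfer_id": transfer_id,
    }}

    # Step 2: sent to B (consumer); this triggers the measured KV transfer.
    step2 = {**base, "kv_transfer_params": {
        "do_remote_prefill": True,
        "transfer_id": transfer_id,
        "remote_engine_id": remote_engine_id,
        "remote_bootstrap_addr": remote_bootstrap_addr,
    }}

    # Step 3: plain completion on B; cached_tokens should be close to prompt_len.
    step3 = dict(base)
    return step1, step2, step3


if __name__ == "__main__":
    reqs = build_bench_requests("my-model", "hello", "engine-A",
                                "http://127.0.0.1:8998")
    for i, body in enumerate(reqs, start=1):
        print(i, body)
```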

Per-stage breakdown is obtained by instrumenting the vLLM-shipped MooncakeConnector (NOT the mooncake-package's mooncake_connector_v1, which vLLM 0.18.1 does not load) at two sites:

  • _send_blocks (P-side, line 980): emits send_blocks event with total_bytes, duration_s, t_start_unix. The duration_s is the wall-time of a single batch_transfer_sync_write call — this is what we call pure_transfer.
  • receive_kv_from_single_worker (D-side, line 1139, async): emits receive_kv_enter at function start and receive_kv_finish on FINISH-status response. The wall-time between them is rx_total (= ZMQ round-trip + setup + pure_transfer + ack).

Pairing across A's and B's logs is by time window: each B (enter, finish) pair is matched to the A send_blocks whose t_start_unix falls in [rx_t_start, rx_t_end]. With single-request benchmarks this is unambiguous.
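The time-window pairing can be sketched as below. This is not the code in analyze_mb2.py; the function name `pair_events`, the dictionary layout, and the definition of overhead as rx_total minus pure_transfer are assumptions made for illustration.

```python
def pair_events(send_events, rx_windows):
    """Match each B-side (enter, finish) window to the single A-side send_blocks
    event whose t_start_unix falls inside it.

    send_events: list of dicts with t_start_unix, duration_s, total_bytes
    rx_windows:  list of (rx_t_start, rx_t_end) tuples, in unix seconds
    """
    pairs = []
    for rx_start, rx_end in rx_windows:
        matches = [s for s in send_events
                   if rx_start <= s["t_start_unix"] <= rx_end]
        if len(matches) != 1:
            # Only unambiguous with one in-flight request at a time.
            raise ValueError(
                f"expected exactly one send_blocks in [{rx_start}, {rx_end}], "
                f"got {len(matches)}")
        send = matches[0]
        rx_total_s = rx_end - rx_start
        pairs.append({
            "total_bytes": send["total_bytes"],
            "pure_s": send["duration_s"],
            "rx_total_s": rx_total_s,
            "overhead_s": rx_total_s - send["duration_s"],  # assumed definition
        })
    return pairs


if __name__ == "__main__":
    sends = [{"t_start_unix": 100.002, "duration_s": 0.0104,
              "total_bytes": 96 * 2**20}]
    print(pair_events(sends, [(100.000, 100.0119)]))
```

Matching on wall-clock windows assumes both logs share a clock, which holds when A and B run on the same host; overlapping windows from concurrent requests would raise the ValueError above rather than silently mispair.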

Scripts:

  • microbench/fresh_setup/start_vllm_pair.sh — bring up pair + apply/revert patch
  • microbench/fresh_setup/instrument_mooncake.py — apply/revert MB2 patches
  • microbench/fresh_setup/mb2_kv_transfer.py — client (3-step bench loop)
  • microbench/fresh_setup/analyze_mb2.py — pair A/B events into per-size table
  • microbench/fresh_setup/plot_mb2.py — log-log time + bandwidth curves

Results — intra-node (2026-05-27, dash1 GPU 0+1, kv_both)

Raw events: A_intra_kvboth.jsonl, B_intra_kvboth.jsonl. Joined + aggregated: intra_kvboth_breakdown.json. Figures: figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png.

| input_tokens | KV (MiB) | n | pure_ms p50 | pure_ms max | rx_total_ms | overhead_ms | BW p50 (GB/s) | BW max (GB/s) |
|---|---|---|---|---|---|---|---|---|
| 512 | 48 | 5 | 5.3 | 5.6 | 12.2 | 3.3 | 9.40 | 9.53 |
| 1024 | 96 | 5 | 10.4 | 10.5 | 11.9 | 1.5 | 9.68 | 9.72 |
| 2048 | 192 | 5 | 20.6 | 21.0 | 22.5 | 1.8 | 9.75 | 9.78 |
| 4096 | 384 | 5 | 41.5 | 41.7 | 43.5 | 2.0 | 9.71 | 9.72 |
| 8192 | 768 | 5 | 83.7 | 84.4 | 86.2 | 2.2 | 9.62 | 9.69 |
| 16384 | 1536 | 5 | 167.1 | 167.7 | 170.2 | 2.7 | 9.64 | 9.67 |
| 32768 | 3072 | 5 | 320.9 | 322.1 | 425.2 | 20.5 | 10.04 | 10.09 |
| 65536 | 6144 | 5 | 1895.1 | 2375.2 | 1586.1 | 69.6 | 3.40 | 9.68 |
| 131072 | 12288 | 5 | 2835.1 | 8923.6 | 4362.5 | 91.4 | 4.54 | 9.67 |

Three regimes in the data:

  1. ≤ 3 GiB: linear in size, bandwidth ≈ 9.7 GB/s steady.
  2. Around 6 GiB: onset of variance. Max bandwidth is still 9.7 GB/s, but p50 collapses to ~3.4 GB/s. Some runs achieve full speed; others take 2–3× longer.
  3. 12 GiB: wide spread (min 1.5 s, max 10 s for the same 11.5 GiB transfer). This is the agentic-p99 size region.
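The gap between "BW max" and "BW p50" in the variance regime is easier to see with a small aggregation sketch. The helper name `summarize` is hypothetical and the sample durations in the example are made up for illustration; they are not measured values from this run. Bandwidth uses decimal GB (1e9 bytes), which is the convention consistent with the table (96 MiB in 10.4 ms is about 9.68 GB/s).

```python
from statistics import median


def summarize(total_bytes, pure_ms_samples):
    """Per-size aggregates from a list of pure-transfer durations in ms."""
    bw_gbps = [total_bytes / (ms / 1e3) / 1e9 for ms in pure_ms_samples]
    return {
        "n": len(pure_ms_samples),
        "pure_ms_p50": median(pure_ms_samples),
        "pure_ms_max": max(pure_ms_samples),
        "bw_p50": median(bw_gbps),
        "bw_max": max(bw_gbps),
    }


if __name__ == "__main__":
    # ILLUSTRATIVE samples for a 6144 MiB transfer: two fast runs, three slow.
    samples_ms = [660.0, 700.0, 1900.0, 2100.0, 2400.0]
    print(summarize(6144 * 2**20, samples_ms))
```

A few fast runs are enough to keep bw_max near the steady-state ceiling while the median drops well below it, which is the pattern regimes 2 and 3 describe.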

The bandwidth ceiling of ~10 GB/s is well below H20's NVLink p2p (claimed ~900 GB/s in IB), so the transfer is clearly not NVLink-direct. The initial hypothesis was PCIe staging through host memory; the later inter-node comparison (see Run log) points instead to RDMA loopback through the local NIC. Confirming the exact path would need nvidia-smi topo -m and mooncake_transfer_engine_topology_dump analysis; not done yet.

Known limitations of this measurement

  • kv_both, not strict PD-disagg. vLLM 0.18.1 with kv_role=kv_consumer raises AttributeError: 'MooncakeConnectorWorker' object has no attribute 'bootstrap_server' (the attribute is only assigned inside if not self.is_kv_consumer). The transfer mechanics are identical — same batch_transfer_sync_write — so the cost measurement is comparable. The role gate only affects which request types each instance accepts. §5.2 strict PD-disagg baseline will need either to fix that bug or front the pair with a role-aware proxy.
  • Single in-flight request. All measurements here are serial. Real PD-disagg will have many concurrent transfers; bandwidth contention is not characterized.
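  • Intra-node only in this section. The inter-node RDMA path was measured separately (see the 2026-05-27 inter-node entry in the Run log) and came out essentially the same, not slower.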
  • Sanity preamble events. The raw logs include 6 events from earlier sanity runs in addition to the 45-event sweep. analyze_mb2.py treats them as additional samples (same sizes); the per-size aggregates use all of them.

Implications for §3.2 PD-disagg argument

For each PD-disagg-routed request, transfer wall-time is:

T_transfer(KV_size) ≈ KV_size / 9.7 GB/s   for KV_size ≤ 3 GiB
                    ≈ 0.3 – 10 s           for KV_size in [3, 12] GiB

This is the per-request transfer charge of PD-disagg. It's a real cost, but in the context of phase-isolation accounting it is small compared to the benefit:

| Prefill | T_prefill (MB1) | T_transfer (MB2) | Phase-isolation benefit at D=8 (= D × T_prefill) |
|---|---|---|---|
| 2k tok (trace lower) | 0.14 s | 8 ms | 1.1 s |
| 33k tok (trace mean) | 4.5 s | 320 ms | 36 s |
| 125k tok (~p99) | 57 s | 1.9 s | 456 s |

On the phase-isolation axis alone, PD-disagg recovers two orders of magnitude more decode time than it pays in transfer. This axis is NOT what defeats static PD-disagg on agentic workloads: see colleague's 4P+4D experiment (TTFT p50 62× worse, success rate 99.5% → 52%), which is driven by D-side KV-pool overflow on long-context requests (figs/f4b_pdsep_kv_wall.png), not by transfer latency.
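The accounting behind these numbers can be reproduced with a few lines. This sketch takes T_prefill and T_transfer as given inputs (the values quoted in this section) and uses the approximation benefit ≈ D × T_prefill; the names `ROWS` and `phase_isolation_benefit` are illustrative only.

```python
D = 8  # number of concurrent decode streams stalled by one prefill

# (label, T_prefill in s, T_transfer in s), values as quoted in this section
ROWS = [
    ("2k tok (trace lower)", 0.14, 0.008),
    ("33k tok (trace mean)", 4.5, 0.320),
    ("125k tok (~p99)", 57.0, 1.9),
]


def phase_isolation_benefit(t_prefill_s, d=D):
    """Aggregate decode stall avoided per prefill event: D x T_prefill."""
    return d * t_prefill_s


if __name__ == "__main__":
    for label, t_prefill, t_transfer in ROWS:
        benefit = phase_isolation_benefit(t_prefill)
        ratio = t_transfer / benefit
        print(f"{label:<22} benefit {benefit:6.1f} s   cost/benefit {ratio:.1%}")
```

For all three rows the transfer cost stays below 1% of the benefit, which is why this axis favors PD-disagg.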

What MB2 contributes to the paper is therefore:

  • The per-request transfer cost number (used as the cost input to the cost-benefit accounting above).
  • The empirical observation that Mooncake's transfer cost is topology-independent — intra-node and inter-node both go through the RDMA NIC and hit the same 9.7 GB/s ceiling. PD-disagg's transfer cost does not get cheaper by co-locating P and D.

The dominant §3.2 failure mode of static PD-disagg in agentic is capacity, not transfer cost. MB3 / MB4 / MB5 will quantify the remaining axes (D-pool occupancy, cache reuse degradation under PD routing, static-partition mismatch).

Open questions / next runs

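  • Inter-node RDMA: dash1 ↔ dash2. Expected lower bandwidth (~5–15 GB/s); want to see if the 6 GiB-onset variance moves. (Run on 2026-05-27, see Run log: bandwidth matched intra-node and the variance onset did not move.)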
  • Bandwidth ceiling investigation: is the 9.7 GB/s ceiling PCIe (so the connector is not using NVLink direct) or some internal limit? If PCIe, can it be lifted with NVLink-direct mooncake config?
  • Variance at 6+ GiB: investigate. Maybe related to chunking inside batch_transfer_sync_write, or GPU memory pressure when KV approaches HBM ceiling.
  • Concurrent transfers: measure aggregate bandwidth when N simultaneous transfers happen. PD-disagg in practice does this.
  • Strict kv_producer/kv_consumer: patch the bootstrap_server bug or use a proxy; verify transfer time is unchanged.

Reproduction

# On dash machine with cpfs mount + ssh access:
bash microbench/fresh_setup/install.sh        # once (idempotent)
bash microbench/fresh_setup/deploy.sh dash1    # push scripts to cpfs

# bring up pair (intra-node)
ssh dash1 'GPU_A=0 GPU_B=1 bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh start'

# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
  python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb2_kv_transfer.py \
    --sizes 512,1024,2048,4096,8192,16384,32768,65536,131072 \
    --repeats 5 --label intra-kvboth \
    --out /home/admin/cpfs/wjh/agentic-kv-fresh/mb2_results/intra_kvboth.json'

# pull logs
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/A/.efc_*_mb2_transfer_pid*.jsonl \
    analysis/mb2/A_intra_kvboth.jsonl
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/B/.efc_*_mb2_transfer_pid*.jsonl \
    analysis/mb2/B_intra_kvboth.jsonl

# analyze
.venv/bin/python microbench/fresh_setup/analyze_mb2.py \
  --a-log analysis/mb2/A_intra_kvboth.jsonl \
  --b-log analysis/mb2/B_intra_kvboth.jsonl \
  --out analysis/mb2/intra_kvboth_breakdown.json

.venv/bin/python microbench/fresh_setup/plot_mb2.py \
  --breakdown analysis/mb2/intra_kvboth_breakdown.json \
  --out-time figs/mb2_transfer_time_intra.png \
  --out-bw figs/mb2_transfer_bw_intra.png

# tear down
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh stop'

Run log

2026-05-27 — intra-node, kv_both, dash1 GPU 0+1

Sweep: 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072 tokens × 5 repeats. Sanity preamble of 512, 2048, 8192 × 2 included in the raw logs (counted as additional samples for those sizes).

Result table above. 9.7 GB/s steady-state up to 3 GiB, variance opens at 6 GiB, p99 agentic-tail transfer 1.5–10 s.

Committed as de164e5.

2026-05-27 — inter-node, kv_both, dash1 GPU 0 → dash2 GPU 0

Same sweep config. 200 Gbps RoCE between hosts (RTT ~0.2 ms ping). Producer A on dash1 GPU 0, consumer B on dash2 GPU 0. remote_bootstrap_addr=http://172.27.123.142:8998 (dash1's internal IP).

Raw events: A_inter_kvboth.jsonl (45 send_blocks + 6 sanity). B's receive_kv events are missing for this run: the MB2_LOG_DIR env var did not propagate from the start-script through vLLM's EngineCore subprocess on dash2. Evidence: cat /proc/$ENGINE_PID/environ is empty on dash2 but contains MB2_LOG_DIR on dash1. Bookmarked for future investigation; likely a spawn-vs-fork difference in vLLM's multiproc executor across hosts. Pure-transfer numbers below come from A's send_blocks alone; the full rx_total breakdown is not available for this run.
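A quick way to repeat the environment check before a sweep is to parse /proc/PID/environ directly. The sketch below is a generic Linux check, not part of the MB2 scripts; the helper names are hypothetical. Note that this file reflects the environment the process was started with, so it shows what the subprocess inherited at exec time.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE entries of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def process_has_env(pid: int, name: str) -> bool:
    """True if the process with this pid was started with env var `name` set."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return name in parse_environ(raw)


if __name__ == "__main__":
    import os
    # Check our own process as a demonstration.
    print(process_has_env(os.getpid(), "MB2_LOG_DIR"))
```

Running `process_has_env` against the EngineCore pid on each host before starting the sweep would catch a missing MB2_LOG_DIR early instead of after the run.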

Per-size pure-transfer (analyzed by analyze_mb2_send_only.py):

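| input_tokens | KV (MiB) | n | pure_ms p50 | pure_ms min | pure_ms max | BW p50 (GB/s) | BW max (GB/s) |
|---|---|---|---|---|---|---|---|
| 512 | 48 | 5 | 5.2 | 5.1 | 65.8 | 9.76 | 9.81 |
| 1024 | 96 | 5 | 10.2 | 10.1 | 10.4 | 9.91 | 10.00 |
| 2048 | 192 | 5 | 20.0 | 20.0 | 20.5 | 10.06 | 10.07 |
| 4096 | 384 | 5 | 40.1 | 40.1 | 40.5 | 10.04 | 10.05 |
| 8192 | 768 | 5 | 80.9 | 80.7 | 82.5 | 9.96 | 9.98 |
| 16384 | 1536 | 5 | 161.8 | 161.7 | 164.8 | 9.96 | 9.96 |
| 32768 | 3072 | 5 | 309.6 | 307.7 | 526.9 | 10.40 | 10.47 |
| 65536 | 6144 | 5 | 1733.6 | 653.5 | 1921.2 | 3.72 | 9.86 |
| 131072 | 12288 | 5 | 2818.4 | 1283.0 | 9158.6 | 4.57 | 10.04 |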

Side-by-side comparison with the 2026-05-27 intra-node run:

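| Size (tokens) | intra p50 ms | inter p50 ms | gap | intra GB/s | inter GB/s |
|---|---|---|---|---|---|
| 512 | 5.3 | 5.2 | 2% | 9.40 | 9.76 |
| 1024 | 10.4 | 10.2 | 2% | 9.68 | 9.91 |
| 2048 | 20.6 | 20.0 | 3% | 9.75 | 10.06 |
| 4096 | 41.5 | 40.1 | 3% | 9.71 | 10.04 |
| 8192 | 83.7 | 80.9 | 3% | 9.62 | 9.96 |
| 16384 | 167.1 | 161.8 | 3% | 9.64 | 9.96 |
| 32768 | 320.9 | 309.6 | 3% | 10.04 | 10.40 |
| 65536 | 1895.1 | 1733.6 | 9% | 3.40 | 3.72 |
| 131072 | 2835.1 | 2818.4 | 1% | 4.54 | 4.57 |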

The two paths produce essentially the same numbers: Mooncake intra-node is not using NVLink; it goes through RDMA loopback on the local NIC and gets the same ~10 GB/s ceiling as cross-node RDMA. The 6+ GiB variance regime is also identical between paths.

Figures: figs/mb2_transfer_time_inter.png, figs/mb2_transfer_bw_inter.png, figs/mb2_transfer_time_compare.png (overlay), figs/mb2_transfer_bw_compare.png.

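This collapses the transfer-cost side of the §3.2 narrative to a single number: PD-disagg across this cluster is limited to ~9.7–10 GB/s of transfer bandwidth no matter how you place P and D (within-node or across-node). For p99 agentic KV (11.5 GiB), that's 1.3–10 s of transfer; for 6 GiB it's 0.7–2 s. Per the Correction note in the Summary, this cost has to be weighed against the per-prefill-event benefit D × T_prefill, not against the 50–200 ms decode duration of a single request, so the transfer cost is the same in every layout and is not what defeats static PD-disagg.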