
Gahow Wang da39ab6804 Correct PD-disagg cost/benefit framing across repo

The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.

What was wrong:
  I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
  of the new request (~50–200 ms)" — implicitly treating the benefit as
  per-request and bounded by that request's own decode. The correct
  accounting is per-prefill-event across all stalled streams:

      benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
                          ≈ D × T_prefill

  which follows from the chunked-prefill math (each of L/N chunks slows
  D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).

Plug MB1 + MB2 numbers in:

  prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
   2k tok      | 0.14 s    |     8 ms   |   1.1 s     |    0.7 %
  33k tok      | 4.5  s    |  320 ms    |  36   s     |    0.9 %
 125k tok      | 57   s    |  1.9 s     | 456   s     |    0.4 %

On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.

The actual dominant reason static PD-disagg fails in agentic is the
D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.

Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
  function; keep mb1_interference.png and update its title to note
  per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
  no more "max benefit = decode duration" claim); §3.2 implications
  section replaced with the corrected per-prefill-event table; explicit
  ⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
  rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
  capacity argument (the real failure mode), MB1/MB2 demoted from
  "kill-shot for PD-disagg" to "supporting context inputs to a
  cost-benefit table that actually favors PD-disagg on this axis";
  §6 paper-claims list reordered to remove the wrong "PD-disagg loses
  on cost-vs-benefit" claim and replace with the corrected ones

PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 22:04:49 +08:00


MB2 — Mooncake KV Transfer Cost (vanilla vLLM 0.18.1)

Persistent record of the per-stage KV transfer microbench used in §3.2 of the EAR paper. Re-runs append a dated section at the bottom; the Summary block at the top is what gets cited in the paper.

Summary (latest)

Path	Steady-state BW	Agentic-tail p99 transfer (11.5 GiB KV)
intra-node (dash1 GPU 0↔1)	~9.7 GB/s (96 MiB – 3 GiB)	p50 1.9 s · min 1.5 s · max 10 s
inter-node (dash1 GPU0 → dash2 GPU0, 200 Gbps RoCE)	~10.0 GB/s (essentially identical)	p50 1.7 s · min 1.3 s · max 9.2 s

Cross-cutting finding (2026-05-27): Mooncake transfer cost is topology-independent on this hardware. Intra-node and inter-node p50 transfer times agree to within about 4% up to 3 GiB and to within 9% at 6–12 GiB, with the same variance onset at 6 GiB (see figs/mb2_transfer_time_compare.png, figs/mb2_transfer_bw_compare.png). Mechanism: Mooncake's batch_transfer_sync_write always goes through the RDMA NIC, including the intra-node case (RDMA loopback). The 200 Gbps NIC path, not NVLink, is the bottleneck. Implication for §3.2: co-locating P and D on the same node does not make PD-disaggregation cheaper; the ~9.7 GB/s ceiling applies regardless of placement.

What MB2 actually measures: the per-request charge that PD-disagg pays for every routed request, T_transfer ≈ KV_size / 9.7 GB/s in the steady-state regime. For agentic this ranges from ~21 ms (192 MiB / trace lower, matching the 2048-token row of the results table) to 1.9 s (11.5 GiB / p99).

⚠ Correction (2026-05-27): an earlier version of this README framed §3.2 as "transfer cost (1.5–10 s) >> decode duration (50–200 ms), so PD-disagg loses on cost-vs-benefit." That accounting was wrong: PD-disagg's phase-isolation benefit is per-prefill-event and equals D × T_prefill (aggregate across stalled decode streams), not the single-request decode duration. With trace-mean T_prefill = 4.5 s and D = 8, the benefit is ~36 s — far larger than the ~0.32 s transfer cost. PD-disagg's phase-isolation axis is a win, not a loss.

The actual reason static PD-disagg fails in agentic is D-side KV capacity (figs/f4b_pdsep_kv_wall.png), not a cost-vs-benefit imbalance. See RESULTS_SUMMARY.md section 4 for the corrected framing. MB2 still serves as the source of the per-request transfer cost number used in that analysis.

Setup

Component	Value
Host	`dash1` (`ds-6348bee4-1-...-rwkv2`), 8× NVIDIA H20 96 GiB, driver 570.133.20
Venv	`/home/admin/cpfs/wjh/agentic-kv-fresh/.venv` (shared via cpfs from any dash host)
vLLM	0.18.1 official wheel
mooncake-transfer-engine	0.3.11.post1 (`pip install mooncake-transfer-engine`)
Model	`/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`
Per-token KV	98304 B
kv_role	`kv_both` on both instances (see Known limitations re kv_producer/kv_consumer)
Per-instance config	`--tensor-parallel-size 1 --gpu-memory-utilization 0.9 --max-model-len 200000 --enable-prefix-caching`
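As a quick cross-check of the sizes used in this README, the sketch below converts a prompt length into KV bytes using the per-token figure from the Setup table, then estimates transfer time at the ~9.7 GB/s steady-state bandwidth. The function names are illustrative and not part of the repo. Treating the bandwidth as decimal GB/s is an assumption, chosen because it matches the BW columns in the results tables. The estimate is only meaningful in the steady-state regime (KV ≤ 3 GiB).

```python
PER_TOKEN_KV_BYTES = 98304       # from the Setup table (96 KiB per token)
STEADY_BW_BYTES_PER_S = 9.7e9    # ~9.7 GB/s, assumed to be decimal GB


def kv_bytes(num_tokens: int) -> int:
    """KV cache size in bytes for a prompt of num_tokens tokens."""
    return num_tokens * PER_TOKEN_KV_BYTES


def est_transfer_s(num_tokens: int, bw: float = STEADY_BW_BYTES_PER_S) -> float:
    """Estimated transfer time in seconds, steady-state regime only."""
    return kv_bytes(num_tokens) / bw


if __name__ == "__main__":
    for tokens in (2048, 32768):
        mib = kv_bytes(tokens) / 2**20
        ms = est_transfer_s(tokens) * 1e3
        print(f"{tokens} tok -> {mib:.0f} MiB, ~{ms:.1f} ms")
```

For 2048 and 32768 tokens this gives 192 MiB and 3072 MiB, the same KV sizes listed in the results tables.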

Method

3-step black-box bench:

1. do_remote_decode to A (producer) with a client-generated transfer_id. max_tokens=1; A computes prefill and parks the KV for later pull.
2. do_remote_prefill to B (consumer) with the same transfer_id plus remote_engine_id (from A's /query on bootstrap port) and remote_bootstrap_addr (http://127.0.0.1:8998). This step triggers the actual KV transfer; it is the measured step.
3. Plain completion on B (--skip-verify off): expect cached_tokens ≈ prompt_len, confirming the KV landed on B.

Per-stage breakdown is obtained by instrumenting the vLLM-shipped MooncakeConnector (NOT the mooncake-package's mooncake_connector_v1, which vLLM 0.18.1 does not load) at two sites:

_send_blocks (P-side, line 980): emits send_blocks event with total_bytes, duration_s, t_start_unix. The duration_s is the wall-time of a single batch_transfer_sync_write call — this is what we call pure_transfer.
receive_kv_from_single_worker (D-side, line 1139, async): emits receive_kv_enter at function start and receive_kv_finish on FINISH-status response. The wall-time between them is rx_total (= ZMQ round-trip + setup + pure_transfer + ack).

Pairing across A's and B's logs is by time window: each B (enter, finish) pair is matched to the A send_blocks whose t_start_unix falls in [rx_t_start, rx_t_end]. With single-request benchmarks this is unambiguous.
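The time-window pairing can be sketched as follows. This is a minimal illustration, not the code in analyze_mb2.py. It assumes each A-side send_blocks event is a dict with the fields named above (total_bytes, duration_s, t_start_unix) and that B-side (enter, finish) pairs have already been reduced to (rx_t_start, rx_t_end) tuples. Computing overhead as rx_total minus pure_transfer is also an assumption.

```python
def pair_events(send_events, rx_windows):
    """Match each B-side (rx_t_start, rx_t_end) window to the first A-side
    send_blocks event whose t_start_unix falls inside that window."""
    pairs = []
    for rx_start, rx_end in rx_windows:
        match = None
        for ev in send_events:
            if rx_start <= ev["t_start_unix"] <= rx_end:
                match = ev
                break
        if match is None:
            continue  # no send_blocks observed inside this receive window
        rx_total = rx_end - rx_start
        pairs.append({
            "total_bytes": match["total_bytes"],
            "pure_transfer_s": match["duration_s"],
            "rx_total_s": rx_total,
            # assumed definition: everything in rx_total that is not the write call
            "overhead_s": rx_total - match["duration_s"],
        })
    return pairs


if __name__ == "__main__":
    sends = [{"t_start_unix": 100.002, "duration_s": 0.010, "total_bytes": 1024}]
    print(pair_events(sends, [(100.0, 100.012)]))
```

Taking the first match is only safe because the benchmark keeps a single request in flight; with concurrent transfers the windows would overlap and a stronger key (such as the transfer_id) would be needed.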

Scripts:

microbench/fresh_setup/start_vllm_pair.sh — bring up pair + apply/revert patch
microbench/fresh_setup/instrument_mooncake.py — apply/revert MB2 patches
microbench/fresh_setup/mb2_kv_transfer.py — client (3-step bench loop)
microbench/fresh_setup/analyze_mb2.py — pair A/B events into per-size table
microbench/fresh_setup/plot_mb2.py — log-log time + bandwidth curves

Results — intra-node (2026-05-27, dash1 GPU 0+1, kv_both)

Raw events: A_intra_kvboth.jsonl, B_intra_kvboth.jsonl. Joined + aggregated: intra_kvboth_breakdown.json. Figures: figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png.

input_tokens	KV (MiB)	n	pure_ms p50	pure_ms max	rx_total_ms	overhead_ms	BW p50 (GB/s)	BW max (GB/s)
512	48	5	5.3	5.6	12.2	3.3	9.40	9.53
1024	96	5	10.4	10.5	11.9	1.5	9.68	9.72
2048	192	5	20.6	21.0	22.5	1.8	9.75	9.78
4096	384	5	41.5	41.7	43.5	2.0	9.71	9.72
8192	768	5	83.7	84.4	86.2	2.2	9.62	9.69
16384	1536	5	167.1	167.7	170.2	2.7	9.64	9.67
32768	3072	5	320.9	322.1	425.2	20.5	10.04	10.09
65536	6144	5	1895.1	2375.2	1586.1	69.6	3.40	9.68
131072	12288	5	2835.1	8923.6	4362.5	91.4	4.54	9.67

Three regimes in the data:

≤ 3 GiB: linear in size, bandwidth steady at ≈ 9.7 GB/s.
~6 GiB: onset of variance. Max bandwidth is still 9.7 GB/s, but p50 collapses to ~3.4 GB/s. Some runs achieve full speed; others take 2–3× longer.
12 GiB: wide spread (min 1.5 s, max 10 s across repeats of the same transfer). This is the agentic-p99 size region (11.5 GiB).
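The regime boundaries are easy to see by recomputing effective bandwidth from the table's KV size and pure_ms p50 columns. The sketch below does this for the three largest rows. It assumes decimal GB/s, and it computes the bandwidth of the median time, which is not necessarily identical to the median of per-run bandwidths reported in the BW p50 column.

```python
# (KV MiB, pure_ms p50) taken from the intra-node results table
ROWS = [(3072, 320.9), (6144, 1895.1), (12288, 2835.1)]


def bw_gb_per_s(kv_mib: float, pure_ms: float) -> float:
    """Effective bandwidth in decimal GB/s for one transfer."""
    return kv_mib * 2**20 / (pure_ms / 1e3) / 1e9


if __name__ == "__main__":
    for kv_mib, pure_ms in ROWS:
        print(f"{kv_mib:>6} MiB  {pure_ms:>8.1f} ms  {bw_gb_per_s(kv_mib, pure_ms):.2f} GB/s")
```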

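The bandwidth ceiling of ~10 GB/s is well below H20's NVLink p2p bandwidth (claimed ~900 GB/s), so the intra-node transfer is not using NVLink direct. The inter-node run in the Run log hits the same ceiling, which points to RDMA loopback through the local NIC rather than a PCIe-staged copy through host memory. Confirming the exact path would need nvidia-smi topo -m and mooncake_transfer_engine_topology_dump analysis; not done yet.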

Known limitations of this measurement

kv_both, not strict PD-disagg. vLLM 0.18.1 with kv_role=kv_consumer raises AttributeError: 'MooncakeConnectorWorker' object has no attribute 'bootstrap_server' (the attribute is only assigned inside if not self.is_kv_consumer). The transfer mechanics are identical — same batch_transfer_sync_write — so the cost measurement is comparable. The role gate only affects which request types each instance accepts. §5.2 strict PD-disagg baseline will need either to fix that bug or front the pair with a role-aware proxy.
Single in-flight request. All measurements here are serial. Real PD-disagg will have many concurrent transfers; bandwidth contention is not characterized.
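Inter-node breakdown is send-side only. The inter-node run (see Run log) has pure-transfer numbers from A's send_blocks events but no B-side receive_kv events, so rx_total and overhead are not available for that path.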
Sanity preamble events. The raw logs include 6 events from earlier sanity runs in addition to the 45-event sweep. analyze_mb2.py treats them as additional samples (same sizes); the per-size aggregates use all of them.

Implications for §3.2 PD-disagg argument

For each PD-disagg-routed request, transfer wall-time is:

T_transfer(KV_size) ≈ KV_size / 9.7 GB/s   for KV_size ≤ 3 GiB
                    ≈ 0.3 – 10 s            for KV_size in [3, 12] GiB

This is the per-request transfer charge of PD-disagg. It's a real cost, but in the context of phase-isolation accounting it is small compared to the benefit:

Prefill	T_prefill (MB1)	T_transfer (MB2)	Phase-isolation benefit at D=8 = D × T_prefill
2k tok (trace lower)	0.14 s	~21 ms	1.1 s
33k tok (trace mean)	4.5 s	320 ms	36 s
125k tok (~p99)	57 s	1.9 s	456 s

On the phase-isolation axis alone, PD-disagg recovers two orders of magnitude more decode time than it pays in transfer. This is NOT the axis that defeats static PD-disagg in agentic: colleague's 4P+4D experiment (TTFT p50 62× worse, success rate 99.5% → 52%) is driven by D-side KV-pool overflow on long-context requests (figs/f4b_pdsep_kv_wall.png), not by transfer latency.
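The accounting behind the table is a single multiplication, sketched below for the trace-mean and ~p99 rows. The function names are illustrative. The benefit uses the D × T_prefill approximation from the Correction note, so it ignores the (1 − TPOT_baseline/TPOT_during) factor and is an upper bound on recovered decode time.

```python
def phase_isolation_benefit_s(d_streams: int, t_prefill_s: float) -> float:
    """Aggregate decode stall avoided per prefill event: D x T_prefill."""
    return d_streams * t_prefill_s


def cost_over_benefit(t_transfer_s: float, d_streams: int, t_prefill_s: float) -> float:
    """Transfer cost as a fraction of the phase-isolation benefit."""
    return t_transfer_s / phase_isolation_benefit_s(d_streams, t_prefill_s)


# (label, T_prefill in s from MB1, T_transfer in s from MB2)
ROWS = [("33k tok (trace mean)", 4.5, 0.320), ("125k tok (~p99)", 57.0, 1.9)]

if __name__ == "__main__":
    D = 8
    for label, t_prefill, t_transfer in ROWS:
        benefit = phase_isolation_benefit_s(D, t_prefill)
        ratio = cost_over_benefit(t_transfer, D, t_prefill)
        print(f"{label}: benefit {benefit:.0f} s, cost/benefit {ratio:.1%}")
```

Both rows come out below 1%, which is the "two orders of magnitude" statement above.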

What MB2 contributes to the paper is therefore:

The per-request transfer cost number (used as the cost input to the cost-benefit accounting above).
The empirical observation that Mooncake's transfer cost is topology-independent — intra-node and inter-node both go through the RDMA NIC and hit the same 9.7 GB/s ceiling. PD-disagg's transfer cost does not get cheaper by co-locating P and D.

The dominant §3.2 failure mode of static PD-disagg in agentic is capacity, not transfer cost. MB3 / MB4 / MB5 will quantify the remaining axes (D-pool occupancy, cache reuse degradation under PD routing, static-partition mismatch).

Open questions / next runs

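Inter-node RDMA (dash1 ↔ dash2): measured on 2026-05-27 (see Run log). Bandwidth and the 6 GiB variance onset match the intra-node run. Remaining work: capture B-side receive_kv events on dash2 so the rx_total breakdown is available.
Bandwidth ceiling investigation: 200 Gbps is 25 GB/s of line rate, so why does the RDMA path top out at ~9.7 GB/s? Is it a PCIe or staging limit on the way to the NIC, or some internal Mooncake limit? Can an NVLink-direct mooncake config lift it for the intra-node case?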
Variance at 6+ GiB: investigate. Maybe related to chunking inside batch_transfer_sync_write, or GPU memory pressure when KV approaches HBM ceiling.
Concurrent transfers: measure aggregate bandwidth when N simultaneous transfers happen. PD-disagg in practice does this.
Strict kv_producer/kv_consumer: patch the bootstrap_server bug or use a proxy; verify transfer time is unchanged.

Reproduction

# On dash machine with cpfs mount + ssh access:
bash microbench/fresh_setup/install.sh        # once (idempotent)
bash microbench/fresh_setup/deploy.sh dash1    # push scripts to cpfs

# bring up pair (intra-node)
ssh dash1 'GPU_A=0 GPU_B=1 bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh start'

# sweep
ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
  python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb2_kv_transfer.py \
    --sizes 512,1024,2048,4096,8192,16384,32768,65536,131072 \
    --repeats 5 --label intra-kvboth \
    --out /home/admin/cpfs/wjh/agentic-kv-fresh/mb2_results/intra_kvboth.json'

# pull logs
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/A/.efc_*_mb2_transfer_pid*.jsonl \
    analysis/mb2/A_intra_kvboth.jsonl
scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb2_transfer_logs/B/.efc_*_mb2_transfer_pid*.jsonl \
    analysis/mb2/B_intra_kvboth.jsonl

# analyze
.venv/bin/python microbench/fresh_setup/analyze_mb2.py \
  --a-log analysis/mb2/A_intra_kvboth.jsonl \
  --b-log analysis/mb2/B_intra_kvboth.jsonl \
  --out analysis/mb2/intra_kvboth_breakdown.json

.venv/bin/python microbench/fresh_setup/plot_mb2.py \
  --breakdown analysis/mb2/intra_kvboth_breakdown.json \
  --out-time figs/mb2_transfer_time_intra.png \
  --out-bw figs/mb2_transfer_bw_intra.png

# tear down
ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/start_vllm_pair.sh stop'

Run log

2026-05-27 — intra-node, kv_both, dash1 GPU 0+1

Sweep: 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072 tokens × 5 repeats. Sanity preamble of 512, 2048, 8192 × 2 included in the raw logs (counted as additional samples for those sizes).

Result table above. 9.7 GB/s steady-state up to 3 GiB, variance opens at 6 GiB, p99 agentic-tail transfer 1.5 – 10 s.

Committed as de164e5.

2026-05-27 — inter-node, kv_both, dash1 GPU 0 → dash2 GPU 0

Same sweep config. 200 Gbps RoCE between hosts (RTT ~0.2 ms ping). Producer A on dash1 GPU 0, consumer B on dash2 GPU 0. remote_bootstrap_addr=http://172.27.123.142:8998 (dash1's internal IP).

Raw events: A_inter_kvboth.jsonl (45 send_blocks + 6 sanity). B's receive_kv events are missing for this run: the MB2_LOG_DIR env var did not propagate from the start script through vLLM's EngineCore subprocess on dash2. The output of cat /proc/$ENGINE_PID/environ is empty on dash2 but contains MB2_LOG_DIR on dash1. Bookmarked for future investigation; a likely cause is a spawn-vs-fork difference in vLLM's multiproc executor across hosts. Pure-transfer numbers below come from A's send_blocks alone; the full rx_total breakdown is not available for this run.
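A small helper can make the environment check repeatable before a sweep starts. This is a sketch, not a script in the repo: it reads /proc/<pid>/environ on Linux, which holds NUL-separated KEY=VALUE entries as they were when the process was exec'd, and it needs permission to read that file (same user or root). The function names are illustrative.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE format of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def env_var_of_pid(pid: int, name: str):
    """Return the value of `name` in the process's initial environment, or None."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return parse_environ(raw).get(name)


if __name__ == "__main__":
    import sys
    # usage: python check_env.py <ENGINE_PID>
    print(env_var_of_pid(int(sys.argv[1]), "MB2_LOG_DIR"))
```

A result of None for the EngineCore PID on a host would reproduce the dash2 symptom described above.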

Per-size pure-transfer (analyzed by analyze_mb2_send_only.py):

input_tokens	KV (MiB)	n	pure_ms p50	min	max	BW p50 (GB/s)	BW max
512	48	5	5.2	5.1	65.8	9.76	9.81
1024	96	5	10.2	10.1	10.4	9.91	10.00
2048	192	5	20.0	20.0	20.5	10.06	10.07
4096	384	5	40.1	40.1	40.5	10.04	10.05
8192	768	5	80.9	80.7	82.5	9.96	9.98
16384	1536	5	161.8	161.7	164.8	9.96	9.96
32768	3072	5	309.6	307.7	526.9	10.40	10.47
65536	6144	5	1733.6	653.5	1921.2	3.72	9.86
131072	12288	5	2818.4	1283.0	9158.6	4.57	10.04

Side-by-side comparison with the 2026-05-27 intra-node run:

Size	intra p50 ms	inter p50 ms	gap	intra GB/s	inter GB/s
512	5.3	5.2	−2%	9.40	9.76
1024	10.4	10.2	−2%	9.68	9.91
2048	20.6	20.0	−3%	9.75	10.06
4096	41.5	40.1	−3%	9.71	10.04
8192	83.7	80.9	−3%	9.62	9.96
16384	167.1	161.8	−3%	9.64	9.96
32768	320.9	309.6	−3%	10.04	10.40
65536	1895.1	1733.6	−9%	3.40	3.72
131072	2835.1	2818.4	−1%	4.54	4.57

The two paths produce essentially the same numbers: Mooncake's intra-node path is not using NVLink; it goes through RDMA loopback on the local NIC and hits the same ~10 GB/s ceiling as cross-node RDMA. The 6+ GiB variance regime is also identical between paths.
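The gap column in the comparison table appears to be the relative difference of the inter-node p50 against the intra-node p50; that reading is an assumption, and the sketch below applies it to three rows. Because the published p50 values are rounded, a recomputed gap can differ from the table by a percentage point on some rows.

```python
def gap_pct(intra_ms: float, inter_ms: float) -> float:
    """Relative difference of inter-node vs intra-node p50, in percent."""
    return (inter_ms - intra_ms) / intra_ms * 100


if __name__ == "__main__":
    # (input_tokens, intra p50 ms, inter p50 ms) from the comparison table
    for tokens, intra, inter in [(512, 5.3, 5.2), (65536, 1895.1, 1733.6), (131072, 2835.1, 2818.4)]:
        print(f"{tokens:>6} tok  gap {gap_pct(intra, inter):+.1f}%")
```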

Figures: figs/mb2_transfer_time_inter.png, figs/mb2_transfer_bw_inter.png, figs/mb2_transfer_time_compare.png (overlay), figs/mb2_transfer_bw_compare.png.

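This collapses the transfer-cost side of §3.2 to a single number: PD-disagg on this cluster is limited to ~9.7–10 GB/s of transfer bandwidth no matter how P and D are placed (within-node or across-node). For p99 agentic KV (11.5 GiB), that's 1.3–10 s of transfer; for 6 GiB it's 0.7–2 s. Per the Correction note in the Summary, this cost should be compared against the per-prefill benefit D × T_prefill (for example ~36 s at trace-mean T_prefill = 4.5 s and D = 8), not against the 50–200 ms decode duration of a single request. Transfer cost is therefore not what defeats static PD-disagg, regardless of layout; D-side KV capacity is.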
