
Gahow Wang 1262c9c22e Migration transfer-cost study: KV transfer is slow on busy GPUs

MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at
~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into
~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s)
+ ~45% control-plane GIL starvation during long prefills. Reproduced on a
fresh upstream venv (byte-identical transfer path) -> upstream/hardware
inherent, not our patch. Layerwise is the wrong lever; the tax is structural
on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6,
instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 11:53:01 +08:00


Why KV-transfer is slow during migration under real load

Question. EAR's unified+A+B routing beats migration (v3) on agentic workloads. We wanted to know whether layerwise KV transfer would shrink migration's overhead enough to make it viable. Investigating that led to a sharper question: in a real (loaded) cluster, when we migrate, the KV transfer is already slow — the effective bandwidth is far below the ~10 GB/s wire rate. Why?

This doc answers that with instrumented measurements.

TL;DR. Migration fires precisely when instances are busy (that's the trigger). But on a busy instance, KV transfer runs at ~3 GB/s instead of ~10 GB/s, because:

1. The RDMA write itself slows ~2× under compute load: GPU-direct RDMA (batch_transfer_sync_write) contends with the running attention/MLP kernels for HBM and PCIe bandwidth (idle 7.6 GB/s → busy 4.0 GB/s).
2. The connector's Python control plane gets GIL-starved: mooncake's ZMQ handshake + transfer orchestration run on asyncio threads inside the engine process; when the engine's main thread is doing a long forward pass (e.g. a 100k-token prefill), those threads stall for seconds.

Both are inherent to upstream vLLM 0.18.1 + mooncake: they reproduce in a fresh venv, and the transfer path is byte-identical to upstream, so our patches did not cause this. Both also get worse, not better, with layerwise transfer. So the bandwidth gap is not a layerwise problem; it is a transfer-on-a-busy-GPU problem.

1. Evidence chain

Three independent measurements, all on dash0 (8×H100, Qwen3-Coder-30B-A3B, TP=1), Mooncake kv_both.

1a. Instrumented v3 trace replay — where does migration time go?

Run outputs/b3_v3_fullbreak_20260528_0338/. Instruments: instrument_dst_migration.py (dst scheduler lifecycle) + instrument_mooncake.py (connector internals: send_blocks RDMA, receive_kv window, ready_wait).

25 migrations fired over the trace. Dst-side migration overhead (T_kv_pull = scheduler marks WAITING_FOR_REMOTE_KVS → finished_recving):

component	share	what it is
RDMA-actual (`batch_transfer_sync_write`)	55% (55.2 s)	the real RDMA write
dst control-plane gap	45% (45.4 s)	scheduler↔receiver_loop dispatch + completion propagation
`ready_wait` (src KV not committed)	0%	25/25 already committed — ruled out

Pure RDMA aggregate rate: 2.03 GB/s (vs MB2 idle 9.7 GB/s).
RDMA rate collapses with transfer size: <3 GiB → 4–9.5 GB/s, >5 GiB → 0.9–2.6 GB/s.
The control-plane gap is bimodal: median 0.04 s, but a handful of requests stall ~10 s. Those are small-KV transfers (0.18 s of actual RDMA) whose T_kv_pull is 8–11 s — i.e. the dst's receiver_loop thread was GIL-starved for ~10 s while the engine did a big forward pass.
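The decomposition above treats the control-plane gap as the residual of T_kv_pull once RDMA time and ready_wait are subtracted. A minimal sketch of that arithmetic follows, assuming per-migration records with three timings; the `Migration` class, its field names, and the sample values are hypothetical and are not the output format of analyze_dst_migration.py or analyze_transfer_decomp.py.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Migration:
    # Hypothetical record layout; field names are illustrative only.
    t_kv_pull_s: float         # WAITING_FOR_REMOTE_KVS -> finished_recving
    rdma_s: float              # time spent inside batch_transfer_sync_write
    ready_wait_s: float = 0.0  # time waiting for the src KV to be committed


def decompose(migrations):
    """Split total T_kv_pull into RDMA, ready_wait, and a residual control-plane gap."""
    total = sum(m.t_kv_pull_s for m in migrations)
    rdma = sum(m.rdma_s for m in migrations)
    ready = sum(m.ready_wait_s for m in migrations)
    # Whatever is not RDMA or ready_wait is attributed to the control plane.
    gaps = [m.t_kv_pull_s - m.rdma_s - m.ready_wait_s for m in migrations]
    return {
        "rdma_share": rdma / total,
        "ready_wait_share": ready / total,
        "control_plane_share": sum(gaps) / total,
        "median_gap_s": median(gaps),
        "max_gap_s": max(gaps),
    }


# Made-up records (not from the trace): two healthy transfers and one
# small transfer that stalls for about 10 s.
sample = [
    Migration(t_kv_pull_s=1.04, rdma_s=1.00),
    Migration(t_kv_pull_s=2.04, rdma_s=2.00),
    Migration(t_kv_pull_s=10.18, rdma_s=0.18),
]
result = decompose(sample)
print(result)
```

Reporting the median gap next to the maximum is what exposes a bimodal distribution: a single stalled request can dominate the control-plane share even when the typical gap is tens of milliseconds.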

Earlier (pre-instrumentation) we wrongly attributed ~90% of migration overhead to "dst scheduler queueing" by estimating transfer at clean wire speed. With real instrumentation, dst scheduler admission is ~0 (T_admission_post_kv = 0.003 s); the time is the transfer phase (RDMA + connector control plane), both degraded by instance busy-ness.

1b. MB6 controlled microbench — does busy-ness cause it?

microbench/fresh_setup/mb6_transfer_under_load.py + run_mb6.sh: 2 instances, transfer a fixed-size KV (prefill on A → migrate to B) while holding N background decode streams on both. Sweep N.

Effective transfer bandwidth (65k-token KV ≈ 6 GiB), main venv:

background streams (N)	65k transfer time	effective bandwidth
0 (idle)	747 ms	8.76 GB/s
8 (4/instance)	2423 ms	4.53 GB/s
24 (12/instance)	2015 ms	3.33 GB/s

Effective bandwidth degrades monotonically with load. The busy level (3.3 GB/s) matches the v3 trace's 3.3 GB/s median, because agentic instances run ~10+ concurrent requests, i.e. the bg=24 regime.
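For reference, a small sketch of how the effective-bandwidth figures relate to each other. It assumes effective bandwidth means transferred bytes divided by end-to-end wall time, in decimal GB/s; the helper name is illustrative and not taken from mb6_transfer_under_load.py. The bandwidth values are copied from the table above.

```python
GIB = 1024 ** 3


def effective_bandwidth_gbps(kv_bytes: int, seconds: float) -> float:
    """End-to-end bandwidth in decimal GB/s (bytes moved / wall time)."""
    return kv_bytes / seconds / 1e9


# Effective bandwidths from the table above, keyed by background streams.
measured = {0: 8.76, 8: 4.53, 24: 3.33}

# Slowdown of each load level relative to the idle run.
slowdown = {bg: measured[0] / bw for bg, bw in measured.items()}
for bg, factor in slowdown.items():
    print(f"bg={bg:>2}: {measured[bg]:.2f} GB/s, {factor:.2f}x the idle transfer time")
```

At bg=24 the ratio works out to roughly 2.6×, which is the gap between the idle benchmark and the loaded regime that the rest of this section decomposes.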

Decomposing the 65k transfer into RDMA-actual vs control-plane:

bg	RDMA rate	control-plane share
0 (idle)	7.56 GB/s	13%
8	4.07 GB/s	11%
24 (busy)	3.97 GB/s	15%

In the clean microbench the RDMA write itself is the dominant degrading term (7.6 → 4.0 GB/s). The ~10 s control-plane stalls seen in the trace (1a) don't reproduce here because steady decode forward passes are short; they require the long (100k-token) prefills that the real trace has.

1c. Fresh-venv comparison — is it our patch?

Same MB6 sweep on agentic-kv-fresh/.venv (clean upstream-style 0.18.1):

bg	65k eff (fresh)	65k eff (main/patched)
0	8.73 GB/s	8.76 GB/s
8	4.52 GB/s	4.53 GB/s
24	3.27 GB/s	3.33 GB/s

Identical within noise. Plus a static check: the v3 transfer path (send_kv_to_decode, _send_blocks/batch_transfer_sync_write, _build_transfer_params) is byte-identical to pristine upstream 0.18.1 (commit 445e491); receive_kv_from_single_worker differs only by a 4-line error branch. Our mooncake commits (a7df84b direct-read, ea51497 partial-prefill, e3a1d70 read→push) only touch a separate direct_read path that v3 does not use (v3 requests carry no direct_read flag → normal push path).

→ The slowdown is upstream/hardware-inherent, not introduced by us.
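A static check like the one described above can be sketched as follows. This is an illustration of the idea, not the script that was actually used: it hashes the exact source text of named functions in two versions of a module and reports which ones differ. The function names and bodies in the toy inputs are placeholders, not mooncake connector code.

```python
import ast
import hashlib


def function_digests(source: str, names: set[str]) -> dict[str, str]:
    """Map each requested function name to the SHA-256 of its exact source text.

    If the same name is defined more than once, the last one visited wins.
    """
    tree = ast.parse(source)
    digests = {}
    for node in ast.walk(tree):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name in names:
            segment = ast.get_source_segment(source, node)
            digests[node.name] = hashlib.sha256(segment.encode()).hexdigest()
    return digests


def differing(upstream_src: str, patched_src: str, names: set[str]) -> list[str]:
    """Names whose source text differs (or is missing) between the two versions."""
    a = function_digests(upstream_src, names)
    b = function_digests(patched_src, names)
    return sorted(n for n in names if a.get(n) != b.get(n))


# Toy inputs: one function is untouched, the other gains an error branch.
upstream = (
    "def send_path(x):\n    return x + 1\n\n"
    "def recv_path(x):\n    return x\n"
)
patched = (
    "def send_path(x):\n    return x + 1\n\n"
    "def recv_path(x):\n    if x is None:\n        raise ValueError\n    return x\n"
)
changed = differing(upstream, patched, {"send_path", "recv_path"})
print(changed)  # ['recv_path']
```

Hashing source segments rather than whole files keeps unrelated edits elsewhere in the module from masking whether the transfer path itself changed.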

2. Root cause

Migration in agentic serving transfers KV between instances that are concurrently busy with compute — by construction, since v3 migrates away from a busy host. On a busy instance:

1. HBM/PCIe contention (the dominant, irreducible part). Mooncake's transfer is GPU-direct RDMA: the NIC DMAs KV straight out of / into GPU HBM. While the GPU runs attention+MLP kernels, those kernels saturate HBM bandwidth, so the NIC's RDMA gets a smaller slice. Effective transfer bandwidth roughly halves (7.6 → 4.0 GB/s at our load), and degrades further for large multi-segment transfers.
2. Control-plane GIL starvation (secondary, bursty). The connector runs its ZMQ handshake + send_kv_to_decode/receive_kv orchestration on asyncio threads (sender_loop/receiver_loop) inside the engine process. A long forward pass (100k-token prefill) holds the GIL for seconds, stalling those threads → multi-second dispatch gaps even when the actual transfer is 0.2 s.
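One way to observe the second cause from inside a process is a heartbeat thread that records the longest gap between its own wake-ups. The sketch below is a generic illustration, not instrument_mooncake.py; `HeartbeatMonitor` and `busy_python_work` are hypothetical names. A pure-Python busy loop gives up the GIL at the interpreter's switch interval, so it only produces small gaps here; a long call that never releases the GIL would show up as one large gap.

```python
import threading
import time


class HeartbeatMonitor:
    """Background thread that wakes every `interval_s` and tracks the longest gap."""

    def __init__(self, interval_s: float = 0.005):
        self.interval_s = interval_s
        self.max_gap_s = 0.0
        self.beats = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        last = time.perf_counter()
        while not self._stop.is_set():
            time.sleep(self.interval_s)
            now = time.perf_counter()
            # A gap well above interval_s means this thread could not get scheduled.
            self.max_gap_s = max(self.max_gap_s, now - last)
            self.beats += 1
            last = now

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()


def busy_python_work(duration_s: float) -> int:
    """Pure-Python loop; holds the GIL except at interpreter switch points."""
    deadline = time.perf_counter() + duration_s
    n = 0
    while time.perf_counter() < deadline:
        n += 1
    return n


with HeartbeatMonitor() as hb:
    busy_python_work(0.2)
print(f"beats={hb.beats}, longest gap={hb.max_gap_s * 1e3:.1f} ms")
```

Running such a monitor alongside the connector threads would be one way to check whether a multi-second T_kv_pull coincides with a multi-second scheduling gap, which is the signature described above.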

MB2 measured 9.7 GB/s precisely because both endpoints were idle. The real-workload gap is the difference between "idle benchmark" and "transfer while the GPU is doing the day job."

3. Implication: layerwise is the wrong lever; migration's tax is largely irreducible

lever	effect on the gap
Model-level layerwise transfer (push each layer's KV during prefill)	Worse. Prefill is the most HBM-intensive phase, so per-layer transfers contend harder for HBM (Cause 1); and they multiply the control-plane round-trips (Cause 2).
Control-plane fix (move mooncake orchestration off the GIL-contended threads / separate process)	Addresses only the bursty ~10 s stalls (~15% in the clean case, up to ~45% of the trace tail). Does not touch the HBM-contention half.
Reduce bytes (cache-aware target so less KV moves)	Helps linearly; v3 Mechanism B already does some. Orthogonal.
Migrate to/from idle instances	Would restore ~10 GB/s — but defeats the purpose (we migrate because the host is busy).
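The levers in the table can be compared with a deliberately simple cost model: transfer time is bytes divided by the RDMA rate, plus any control-plane stall. This is a back-of-the-envelope sketch, not a fitted model; the function name is hypothetical, the rates (4.0 and 7.6 GB/s) and the ~10 s stall are the figures quoted in this document, and the stall applies only to the handful of requests that hit a long prefill.

```python
GIB = 1024 ** 3


def migration_transfer_s(kv_bytes: int, rdma_gbps: float, control_plane_s: float = 0.0) -> float:
    """Rough transfer time: RDMA term (bytes / rate) plus a control-plane stall term."""
    return kv_bytes / (rdma_gbps * 1e9) + control_plane_s


kv = 6 * GIB  # roughly the 65k-token KV used in MB6

# Busy instance that also hits a long-prefill stall (the trace tail).
baseline = migration_transfer_s(kv, rdma_gbps=4.0, control_plane_s=10.0)
# Control-plane fix: the stall term disappears, the contended RDMA term stays.
no_stall = migration_transfer_s(kv, rdma_gbps=4.0)
# Reduce bytes: only the RDMA term scales down.
half_bytes = migration_transfer_s(kv // 2, rdma_gbps=4.0, control_plane_s=10.0)
# Idle endpoints: the rate recovers, but this is not the migration scenario.
idle = migration_transfer_s(kv, rdma_gbps=7.6)

for name, t in [("baseline", baseline), ("no_stall", no_stall),
                ("half_bytes", half_bytes), ("idle", idle)]:
    print(f"{name:>10}: {t:.2f} s")
```

Under these assumptions, removing the stall helps the tail requests most, while every transfer on a busy instance still pays the contended RDMA term.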

The dominant cost (RDMA contending with compute for HBM on busy instances) is a hardware reality, not a software bug we can patch away, and not something layerwise improves. This reinforces UNIFIED_ABLATION.md: the unified no-migration path (A+B'+RaceFix) remains the right default; migration's transfer tax is structural on a loaded agentic cluster.

4. Repro / artifacts

Instrumented v3 breakdown: outputs/b3_v3_fullbreak_20260528_0338/unified_v3/ (transfer_decomp.txt, dst_migration_breakdown.{csv,png}, transfer_rootcause.png)
MB6 main: outputs/mb6_agentic-kv_20260528_0552/mb6_result.json
MB6 fresh: outputs/mb6_fresh_20260528_0559/mb6_result.json
Instruments: microbench/fresh_setup/instrument_dst_migration.py, microbench/fresh_setup/instrument_mooncake.py
Microbench: microbench/fresh_setup/mb6_transfer_under_load.py + run_mb6.sh (VENV=… bash run_mb6.sh)
Analyzers: analyze_dst_migration.py, analyze_transfer_decomp.py

All instruments apply/revert cleanly via --apply/--revert; both venvs were restored after the runs.
