agentic-kvc/microbench/connector_tax/cache_sweep/MIGRATION_TRANSFER_COST.md
Gahow Wang 1262c9c22e Migration transfer-cost study: KV transfer is slow on busy GPUs
MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at
~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into
~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s)
+ ~45% control-plane GIL starvation during long prefills. Reproduced on a
fresh upstream venv (byte-identical transfer path) -> upstream/hardware
inherent, not our patch. Layerwise is the wrong lever; the tax is structural
on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6,
instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:53:01 +08:00


Why KV-transfer is slow during migration under real load

Question. EAR's unified+A+B routing beats migration (v3) on agentic workloads. We wanted to know whether layerwise KV transfer would shrink migration's overhead enough to make it viable. Investigating that led to a sharper question: in a real (loaded) cluster, the KV transfer during migration is already slow, with effective bandwidth far below the ~10 GB/s wire rate. Why?

This doc answers that with instrumented measurements.

TL;DR. Migration fires precisely when instances are busy (that's the trigger). But on a busy instance, KV transfer runs at ~3 GB/s instead of ~10 GB/s, because:

  1. The RDMA write itself slows ~2× under compute load — GPU-direct RDMA (batch_transfer_sync_write) contends with the running attention/MLP kernels for HBM and PCIe bandwidth. (idle 7.6 GB/s → busy 4.0 GB/s)
  2. The connector's Python control plane gets GIL-starved — mooncake's ZMQ handshake + transfer orchestration run on asyncio threads inside the engine process; when the engine's main thread is doing a long forward pass (e.g. a 100k-token prefill), those threads stall for seconds.

Both are inherent to upstream vLLM 0.18.1 + mooncake: they reproduce on a fresh venv, and the transfer path is byte-identical to upstream, so our patches did not cause this. Both also get worse, not better, with layerwise transfer. So the bandwidth gap is not a layerwise problem; it is a transfer-on-a-busy-GPU problem.


1. Evidence chain

Three independent measurements, all on dash0 (8×H100, Qwen3-Coder-30B-A3B, TP=1), Mooncake kv_both.

1a. Instrumented v3 trace replay — where does migration time go?

Run: outputs/b3_v3_fullbreak_20260528_0338/. Instruments: instrument_dst_migration.py (dst scheduler lifecycle) + instrument_mooncake.py (connector internals: send_blocks RDMA, receive_kv window, ready_wait).

25 migrations fired over the trace. Dst-side migration overhead (T_kv_pull = scheduler marks WAITING_FOR_REMOTE_KVS → finished_recving):

| component | share | what it is |
|---|---|---|
| RDMA-actual (batch_transfer_sync_write) | 55% (55.2 s) | the real RDMA write |
| dst control-plane gap | 45% (45.4 s) | scheduler↔receiver_loop dispatch + completion propagation |
| ready_wait (src KV not committed) | 0% | 25/25 already committed; ruled out |

  • Pure RDMA aggregate rate: 2.03 GB/s (vs MB2 idle 9.7 GB/s).
  • RDMA rate collapses with transfer size: <3 GiB → 4–9.5 GB/s, >5 GiB → 0.9–2.6 GB/s.

  • The control-plane gap is bimodal: median 0.04 s, but a handful of requests stall ~10 s. Those are small-KV transfers (0.18 s of actual RDMA) whose T_kv_pull is 8–11 s, i.e. the dst's receiver_loop thread was GIL-starved for ~10 s while the engine did a big forward pass.
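The shares in the table follow directly from the two measured totals. A minimal sketch of that arithmetic, assuming the "aggregate rate" means total bytes divided by total RDMA-actual time (that definition, and the helper names, are assumptions, not taken from the analyzers):

```python
def decompose(rdma_s: float, control_gap_s: float) -> dict:
    """Split total dst-side overhead into RDMA-actual and control-plane shares."""
    total = rdma_s + control_gap_s
    return {
        "total_s": total,
        "rdma_share": rdma_s / total,
        "control_share": control_gap_s / total,
    }


def implied_total_gb(aggregate_rate_gb_s: float, rdma_s: float) -> float:
    """Bytes moved, ASSUMING aggregate rate = total bytes / total RDMA time."""
    return aggregate_rate_gb_s * rdma_s


# Totals from the table above: 55.2 s RDMA-actual, 45.4 s control-plane gap.
d = decompose(55.2, 45.4)
print(f"RDMA-actual {d['rdma_share']:.0%}, control-plane {d['control_share']:.0%}")

# Under the stated assumption, 2.03 GB/s over 55.2 s is roughly 112 GB
# across the 25 migrations.
total_gb = implied_total_gb(2.03, 55.2)
print(f"implied bytes moved: {total_gb:.0f} GB, {total_gb / 25:.1f} GB per migration")
```

The per-migration average is only as good as the assumed definition of the aggregate rate; the per-request numbers in transfer_decomp.txt are the authoritative source.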

Earlier (pre-instrumentation) we wrongly attributed ~90% of migration overhead to "dst scheduler queueing" by estimating transfer at clean wire speed. With real instrumentation, dst scheduler admission is ~0 (T_admission_post_kv = 0.003 s); the time is the transfer phase (RDMA + connector control plane), both degraded by instance busy-ness.

1b. MB6 controlled microbench — does busy-ness cause it?

microbench/fresh_setup/mb6_transfer_under_load.py + run_mb6.sh: 2 instances, transfer a fixed-size KV (prefill on A → migrate to B) while holding N background decode streams on both. Sweep N.

Effective transfer bandwidth (65k-token KV ≈ 6 GiB), main venv:

| background load (decode streams) | 65k transfer time | eff bandwidth |
|---|---|---|
| 0 (idle) | 747 ms | 8.76 GB/s |
| 8 (4/instance) | 2423 ms | 4.53 GB/s |
| 24 (12/instance) | 2015 ms | 3.33 GB/s |

Effective bandwidth degrades monotonically with load. The busy level (3.3 GB/s) matches the v3 trace's 3.3 GB/s median, because agentic instances run ~10+ concurrent requests, i.e. the bg=24 regime.
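For orientation, the "65k-token KV ≈ 6 GiB" figure can be sanity-checked from the model shape. The sketch below assumes 48 layers, 4 KV heads, head dimension 128, a 2-byte KV dtype, and "65k" meaning 65,536 tokens; none of these values are stated in this document, so treat them as assumptions to verify against the model config. It also derives slowdown factors from the bandwidth column above.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """K and V each hold num_kv_heads * head_dim elements per layer per token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# ASSUMED model shape (not from this document); TP=1, so no sharding factor.
ASSUMED_SHAPE = dict(num_layers=48, num_kv_heads=4, head_dim=128, dtype_bytes=2)

per_token = kv_bytes_per_token(**ASSUMED_SHAPE)   # bytes of KV per token
total_gib = per_token * 65_536 / 2**30            # assumes 65k = 65,536 tokens
print(f"{per_token // 1024} KiB/token, {total_gib:.1f} GiB for 65,536 tokens")

# Slowdown relative to idle, from the measured effective bandwidths (GB/s).
eff_gb_s = {0: 8.76, 8: 4.53, 24: 3.33}
slowdown = {bg: eff_gb_s[0] / bw for bg, bw in eff_gb_s.items()}
for bg, factor in slowdown.items():
    print(f"bg={bg}: {factor:.2f}x slower than idle")
```

Under those assumptions the KV comes to 96 KiB per token and 6 GiB for the transfer, and the busy regime is about 2.6x slower than idle.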

Decomposing the 65k transfer into RDMA-actual vs control-plane:

| bg | RDMA rate | control-plane share |
|---|---|---|
| 0 (idle) | 7.56 GB/s | 13% |
| 8 | 4.07 GB/s | 11% |
| 24 (busy) | 3.97 GB/s | 15% |

In the clean microbench the RDMA write itself is the dominant degrading term (7.6 → 4.0 GB/s). The ~10 s control-plane stalls seen in the trace (1a) don't reproduce here because steady decode forward passes are short; they require the long (100k-token) prefills that the real trace has.

1c. Fresh-venv comparison — is it our patch?

Same MB6 sweep on agentic-kv-fresh/.venv (clean upstream-style 0.18.1):

| bg | 65k eff (fresh) | 65k eff (main/patched) |
|---|---|---|
| 0 | 8.73 GB/s | 8.76 GB/s |
| 8 | 4.52 GB/s | 4.53 GB/s |
| 24 | 3.27 GB/s | 3.33 GB/s |

Identical within noise. Plus a static check: the v3 transfer path (send_kv_to_decode, _send_blocks/batch_transfer_sync_write, _build_transfer_params) is byte-identical to pristine upstream 0.18.1 (commit 445e491); receive_kv_from_single_worker differs only by a 4-line error branch. Our mooncake commits (a7df84b direct-read, ea51497 partial-prefill, e3a1d70 read→push) only touch a separate direct_read path that v3 does not use (v3 requests carry no direct_read flag → normal push path).
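A static check of this kind can be approximated by hashing the source text of each named function in both trees. The sketch below is one way to do it, not the procedure actually used for this study; the two toy source strings and the function names `send_path` and `recv_path` are hypothetical stand-ins for the real connector files.

```python
import ast
import hashlib


def function_digests(source: str, names: set[str]) -> dict[str, str]:
    """SHA-256 of the exact source text of each named function or method.

    If several classes define the same name, the last one visited wins.
    """
    digests = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name in names:
            segment = ast.get_source_segment(source, node)
            digests[node.name] = hashlib.sha256(segment.encode("utf-8")).hexdigest()
    return digests


def differing_functions(upstream_src: str, patched_src: str, names: set[str]) -> list[str]:
    """Names whose source text differs, or that exist in only one tree."""
    up = function_digests(upstream_src, names)
    ours = function_digests(patched_src, names)
    return sorted(n for n in names if up.get(n) != ours.get(n))


# Hypothetical stand-ins for the upstream and patched connector sources.
UPSTREAM = "def send_path(x):\n    return x + 1\n\ndef recv_path(x):\n    return x\n"
PATCHED = (
    "def send_path(x):\n    return x + 1\n\n"
    "def recv_path(x):\n    if x is None:\n        raise ValueError('missing')\n    return x\n"
)

changed = differing_functions(UPSTREAM, PATCHED, {"send_path", "recv_path"})
print(changed)  # only recv_path differs in this toy example
```

Comparing exact source text is stricter than comparing behavior: a whitespace or comment change counts as a difference, which is the conservative direction for a "byte-identical" claim.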

The slowdown is upstream/hardware-inherent, not introduced by us.


2. Root cause

Migration in agentic serving transfers KV between instances that are concurrently busy with compute — by construction, since v3 migrates away from a busy host. On a busy instance:

  • HBM/PCIe contention (the dominant, irreducible part). Mooncake's transfer is GPU-direct RDMA: the NIC DMAs KV straight out of / into GPU HBM. While the GPU runs attention+MLP kernels, those kernels saturate HBM bandwidth, so the NIC's RDMA gets a smaller slice. Effective transfer bandwidth roughly halves (7.6 → 4.0 GB/s at our load), and degrades further for large multi-segment transfers.
  • Control-plane GIL starvation (secondary, bursty). The connector runs its ZMQ handshake + send_kv_to_decode/receive_kv orchestration on asyncio threads (sender_loop/receiver_loop) inside the engine process. A long forward pass (100k-token prefill) holds the GIL for seconds, stalling those threads → multi-second dispatch gaps even when the actual transfer is 0.2 s.
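The starvation mechanism in the second bullet can be illustrated without vLLM or mooncake. In the sketch below, a heartbeat thread stands in for receiver_loop and a single long C-level call (`sum(range(n))`) stands in for a forward pass that does not release the GIL. Both stand-ins are assumptions for illustration; the behavior applies to standard (non-free-threaded) CPython builds, and the actual gap depends on the machine.

```python
import threading
import time


def measure_heartbeat_gaps(busy_fn, interval_s: float = 0.005) -> list[float]:
    """Run busy_fn on the calling thread; return gaps between heartbeats
    recorded by a background thread that tries to wake every interval_s."""
    gaps: list[float] = []
    stop = threading.Event()

    def heartbeat() -> None:
        last = time.perf_counter()
        while not stop.is_set():
            time.sleep(interval_s)          # releases the GIL while sleeping
            now = time.perf_counter()       # needs the GIL again to get here
            gaps.append(now - last)
            last = now

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    time.sleep(0.05)   # let a few normal heartbeats through first
    busy_fn()          # stand-in for a long GIL-holding forward pass
    time.sleep(0.05)
    stop.set()
    worker.join()
    return gaps


# sum(range(n)) is one C-level call, so the interpreter gets no chance
# to hand the GIL to the heartbeat thread until it returns.
gaps = measure_heartbeat_gaps(lambda: sum(range(5_000_000)))
print(f"max heartbeat gap: {max(gaps) * 1e3:.1f} ms over {len(gaps)} beats")
```

On a typical CPython build the largest gap tracks the duration of the busy call rather than the 5 ms interval, which is the same shape as a 0.18 s transfer showing a ~10 s T_kv_pull.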

MB2 measured 9.7 GB/s precisely because both endpoints were idle. The real-workload gap is the difference between "idle benchmark" and "transfer while the GPU is doing the day job."


3. Implication: layerwise is the wrong lever; migration's tax is largely irreducible

| lever | effect on the gap |
|---|---|
| Model-level layerwise transfer (push each layer's KV during prefill) | Worse. Prefill is the most HBM-intensive phase, so per-layer transfers contend harder for HBM (Cause 1); and they multiply the control-plane round-trips (Cause 2). |
| Control-plane fix (move mooncake orchestration off the GIL-contended threads / separate process) | Addresses only the bursty ~10 s stalls (~15% in the clean case, up to ~45% of the trace tail). Does not touch the HBM-contention half. |
| Reduce bytes (cache-aware target so less KV moves) | Helps linearly; v3 Mechanism B already does some. Orthogonal. |
| Migrate to/from idle instances | Would restore ~10 GB/s, but defeats the purpose (we migrate because the host is busy). |

The dominant cost (RDMA contending with compute for HBM on busy instances) is a hardware reality, not a software bug we can patch away, and not something layerwise improves. This reinforces UNIFIED_ABLATION.md: the unified no-migration path (A+B'+RaceFix) remains the right default; migration's transfer tax is structural on a loaded agentic cluster.
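A toy cost model makes two of the levers concrete: bytes enter linearly, and per-layer pushes multiply the control-plane term. This is a sketch under loud assumptions, not a fit to the measurements: it charges a fixed cost per control-plane round trip (0.04 s, borrowed from the median gap in 1a), assumes 48 layers, holds bandwidth at the busy 4.0 GB/s, and counts total transfer work without modelling any overlap with compute.

```python
def transfer_time_s(kv_bytes: float, bandwidth_gb_s: float,
                    round_trips: int = 1, per_round_trip_s: float = 0.0) -> float:
    """Toy model: wire time plus a fixed cost per control-plane round trip."""
    return kv_bytes / (bandwidth_gb_s * 1e9) + round_trips * per_round_trip_s


KV_BYTES = 6 * 2**30        # the ~6 GiB 65k-token transfer from MB6
BUSY_BW_GB_S = 4.0          # busy RDMA rate from the decomposition table
ROUND_TRIP_S = 0.04         # ASSUMED per-dispatch cost (median gap in 1a)
NUM_LAYERS = 48             # ASSUMED layer count

whole = transfer_time_s(KV_BYTES, BUSY_BW_GB_S, 1, ROUND_TRIP_S)
layerwise = transfer_time_s(KV_BYTES, BUSY_BW_GB_S, NUM_LAYERS, ROUND_TRIP_S)
half_bytes = transfer_time_s(KV_BYTES / 2, BUSY_BW_GB_S, 1, ROUND_TRIP_S)

print(f"single push:      {whole:.2f} s")
print(f"layerwise push:   {layerwise:.2f} s")
print(f"half the bytes:   {half_bytes:.2f} s")
```

Holding bandwidth fixed is generous to layerwise, since the table argues per-layer pushes during prefill would see more HBM contention, not the same amount.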


4. Repro / artifacts

  • Instrumented v3 breakdown: outputs/b3_v3_fullbreak_20260528_0338/unified_v3/ (transfer_decomp.txt, dst_migration_breakdown.{csv,png}, transfer_rootcause.png)
  • MB6 main: outputs/mb6_agentic-kv_20260528_0552/mb6_result.json
  • MB6 fresh: outputs/mb6_fresh_20260528_0559/mb6_result.json
  • Instruments: microbench/fresh_setup/instrument_dst_migration.py, microbench/fresh_setup/instrument_mooncake.py
  • Microbench: microbench/fresh_setup/mb6_transfer_under_load.py + run_mb6.sh (VENV=… bash run_mb6.sh)
  • Analyzers: analyze_dst_migration.py, analyze_transfer_decomp.py

All instruments apply/revert cleanly via --apply/--revert; both venvs were restored after the runs.