MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at ~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into ~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s) + ~45% control-plane GIL starvation during long prefills. Reproduced on a fresh upstream venv (byte-identical transfer path) -> upstream/hardware inherent, not our patch. Layerwise is the wrong lever; the tax is structural on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6, instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Why KV-transfer is slow during migration under real load
Question. EAR's unified+A+B routing beats migration (v3) on agentic workloads. We wanted to know whether layerwise KV transfer would shrink migration's overhead enough to make it viable. Investigating that led to a sharper question: in a real (loaded) cluster, when we migrate, the KV transfer is already slow — the effective bandwidth is far below the ~10 GB/s wire rate. Why?
This doc answers that with instrumented measurements.
TL;DR. Migration fires precisely when instances are busy (that's the trigger). But on a busy instance, KV transfer runs at ~3 GB/s instead of ~10 GB/s, because:
- The RDMA write itself slows ~2× under compute load: GPU-direct RDMA (`batch_transfer_sync_write`) contends with the running attention/MLP kernels for HBM and PCIe bandwidth (idle 7.6 GB/s → busy 4.0 GB/s).
- The connector's Python control plane gets GIL-starved: mooncake's ZMQ handshake + transfer orchestration run on asyncio threads inside the engine process; when the engine's main thread is doing a long forward pass (e.g. a 100k-token prefill), those threads stall for seconds.
Both are inherent to upstream vLLM 0.18.1 + mooncake (reproduced on a clean fresh venv; the transfer path is byte-identical to upstream — our patches did not cause this), and both get worse, not better, with layerwise transfer. So the bandwidth gap is not a layerwise problem; it is a transfer-on-a-busy-GPU problem.
1. Evidence chain
Three independent measurements, all on dash0 (8×H100, Qwen3-Coder-30B-A3B,
TP=1), Mooncake kv_both.
1a. Instrumented v3 trace replay — where does migration time go?
Run outputs/b3_v3_fullbreak_20260528_0338/. Instruments:
instrument_dst_migration.py (dst scheduler lifecycle) +
instrument_mooncake.py (connector internals: send_blocks RDMA,
receive_kv window, ready_wait).
25 migrations fired over the trace. Dst-side migration overhead
(T_kv_pull = scheduler marks WAITING_FOR_REMOTE_KVS → finished_recving):
| component | share | what it is |
|---|---|---|
| RDMA-actual (`batch_transfer_sync_write`) | 55% (55.2 s) | the real RDMA write |
| dst control-plane gap | 45% (45.4 s) | scheduler↔receiver_loop dispatch + completion propagation |
| `ready_wait` (src KV not committed) | 0% | 25/25 already committed; ruled out |
- Pure RDMA aggregate rate: 2.03 GB/s (vs MB2 idle 9.7 GB/s).
- RDMA rate collapses with transfer size: <3 GiB → 4–9.5 GB/s,
5 GiB → 0.9–2.6 GB/s.
- The control-plane gap is bimodal: median 0.04 s, but a handful of requests stall ~10 s. Those are small-KV transfers (0.18 s of actual RDMA) whose `T_kv_pull` is 8–11 s, i.e. the dst's `receiver_loop` thread was GIL-starved for ~10 s while the engine did a big forward pass.
Earlier (pre-instrumentation) we wrongly attributed ~90% of migration overhead to "dst scheduler queueing" by estimating transfer at clean wire speed. With real instrumentation, dst scheduler admission is ~0 (`T_admission_post_kv` = 0.003 s); the time is the transfer phase (RDMA + connector control plane), both degraded by instance busy-ness.
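The breakdown above can be sketched as a small residual calculation. This is a minimal illustration, not the actual `analyze_transfer_decomp.py`: the `Migration` record, its field names, and the 1 s stall threshold are hypothetical, and it assumes the control-plane gap is whatever remains of `T_kv_pull` after subtracting RDMA time and `ready_wait`.

```python
from dataclasses import dataclass


@dataclass
class Migration:
    # Hypothetical per-migration record; field names are illustrative only.
    t_kv_pull_s: float         # WAITING_FOR_REMOTE_KVS -> finished_recving
    rdma_s: float              # time spent inside the RDMA write
    ready_wait_s: float = 0.0  # time waiting for src KV to be committed


def control_plane_gap(m: Migration) -> float:
    """Residual of T_kv_pull not explained by RDMA or ready_wait (assumed definition)."""
    return max(0.0, m.t_kv_pull_s - m.rdma_s - m.ready_wait_s)


def shares(migrations: list[Migration]) -> dict[str, float]:
    """Fraction of total dst-side overhead attributed to each component."""
    rdma = sum(m.rdma_s for m in migrations)
    ready = sum(m.ready_wait_s for m in migrations)
    gap = sum(control_plane_gap(m) for m in migrations)
    total = rdma + ready + gap
    if total == 0:
        return {"rdma": 0.0, "control_plane": 0.0, "ready_wait": 0.0}
    return {"rdma": rdma / total, "control_plane": gap / total, "ready_wait": ready / total}


def stalled(migrations: list[Migration], gap_threshold_s: float = 1.0) -> list[Migration]:
    """Migrations whose control-plane gap exceeds the (hypothetical) stall threshold."""
    return [m for m in migrations if control_plane_gap(m) > gap_threshold_s]


if __name__ == "__main__":
    # Made-up values shaped like the cases described above:
    # one large transfer dominated by RDMA, one small transfer that stalls.
    sample = [Migration(t_kv_pull_s=3.0, rdma_s=2.9), Migration(t_kv_pull_s=10.0, rdma_s=0.18)]
    print(shares(sample))
    print(len(stalled(sample)))
```

Separating the per-request gap from the aggregate share matters here because the gap is bimodal: a few stalled requests can dominate the aggregate while the median stays near zero.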
1b. MB6 controlled microbench — does busy-ness cause it?
microbench/fresh_setup/mb6_transfer_under_load.py + run_mb6.sh: 2
instances, transfer a fixed-size KV (prefill on A → migrate to B) while
holding N background decode streams on both. Sweep N.
Effective transfer bandwidth (65k-token KV ≈ 6 GiB), main venv:
| background load | 65k transfer | eff bandwidth |
|---|---|---|
| 0 (idle) | 747 ms | 8.76 GB/s |
| 8 (4/instance) | 1423 ms | 4.53 GB/s |
| 24 (12/instance) | 2015 ms | 3.33 GB/s |
Monotonic degradation with load. The busy level (3.3 GB/s) matches the v3 trace's 3.3 GB/s median exactly — because agentic instances run ~10+ concurrent requests, i.e. the bg=24 regime.
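The effective-bandwidth column is bytes moved divided by end-to-end transfer time. A minimal sketch of that arithmetic follows, assuming decimal GB/s and taking the nominal 6 GiB as the payload. The exact byte count is not given here, so the result only approximates the table's idle row. The function names are illustrative.

```python
GIB = 2**30  # bytes per GiB (binary)
GB = 10**9   # bytes per GB (decimal); assumed unit behind the GB/s figures


def effective_bandwidth_gbs(num_bytes: int, elapsed_ms: float) -> float:
    """End-to-end bandwidth in GB/s; the elapsed time includes control-plane time."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return num_bytes / GB / (elapsed_ms / 1000.0)


def slowdown(idle_gbs: float, busy_gbs: float) -> float:
    """How many times slower the busy case is than the idle case."""
    return idle_gbs / busy_gbs


if __name__ == "__main__":
    # Idle row: nominal 6 GiB moved in 747 ms.
    print(f"{effective_bandwidth_gbs(6 * GIB, 747):.2f} GB/s")  # 8.62 GB/s
    # Idle vs. busiest row of the table.
    print(f"{slowdown(8.76, 3.33):.2f}x")  # 2.63x
```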
Decomposing the 65k transfer into RDMA-actual vs control-plane:
| bg | RDMA rate | control-plane share |
|---|---|---|
| 0 (idle) | 7.56 GB/s | 13% |
| 8 | 4.07 GB/s | 11% |
| 24 (busy) | 3.97 GB/s | 15% |
In the clean microbench the RDMA write itself is the dominant degrading term (7.6 → 4.0 GB/s). The ~10 s control-plane stalls seen in the trace (1a) don't reproduce here because steady decode forward passes are short; they require the long (100k-token) prefills that the real trace has.
1c. Fresh-venv comparison — is it our patch?
Same MB6 sweep on agentic-kv-fresh/.venv (clean upstream-style 0.18.1):
| bg | 65k eff (fresh) | 65k eff (main/patched) |
|---|---|---|
| 0 | 8.73 GB/s | 8.76 GB/s |
| 8 | 4.52 GB/s | 4.53 GB/s |
| 24 | 3.27 GB/s | 3.33 GB/s |
Identical within noise. Plus a static check: the v3 transfer path
(send_kv_to_decode, _send_blocks/batch_transfer_sync_write,
_build_transfer_params) is byte-identical to pristine upstream 0.18.1
(commit 445e491); receive_kv_from_single_worker differs only by a 4-line
error branch. Our mooncake commits (a7df84b direct-read,
ea51497 partial-prefill, e3a1d70 read→push) only touch a separate
direct_read path that v3 does not use (v3 requests carry no
direct_read flag → normal push path).
→ The slowdown is upstream/hardware-inherent, not introduced by us.
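One way to run a static check of this kind is to hash the source text of each named function in both trees and compare the digests. The sketch below is an assumption about how such a check could be done, not a record of how it was done here; `function_digests` and `differing_functions` are hypothetical helpers, and the function names you pass in are up to you.

```python
import ast
import hashlib


def function_digests(source: str, names: set[str]) -> dict[str, str]:
    """SHA-256 of the exact source text of each named function or method."""
    digests = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name in names:
            segment = ast.get_source_segment(source, node)  # excludes decorators
            if segment is not None:
                digests[node.name] = hashlib.sha256(segment.encode()).hexdigest()
    return digests


def differing_functions(src_a: str, src_b: str, names: set[str]) -> list[str]:
    """Names whose source differs between the two files, or that are missing from either."""
    a = function_digests(src_a, names)
    b = function_digests(src_b, names)
    return sorted(n for n in names if a.get(n) is None or a.get(n) != b.get(n))


if __name__ == "__main__":
    upstream = "def send(x):\n    return x\n\ndef recv(x):\n    return x\n"
    patched = "def send(x):\n    return x\n\ndef recv(x):\n    if x is None:\n        raise ValueError\n    return x\n"
    print(differing_functions(upstream, patched, {"send", "recv"}))  # ['recv']
```

Comparing source text rather than the AST makes the check sensitive to whitespace and comments inside a function, which matches the stricter "byte-identical" claim. If two methods in different classes share a name, the later one wins, so a real check would want to key on class plus method name.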
2. Root cause
Migration in agentic serving transfers KV between instances that are concurrently busy with compute — by construction, since v3 migrates away from a busy host. On a busy instance:
- HBM/PCIe contention (the dominant, irreducible part). Mooncake's transfer is GPU-direct RDMA: the NIC DMAs KV straight out of / into GPU HBM. While the GPU runs attention+MLP kernels, those kernels saturate HBM bandwidth, so the NIC's RDMA gets a smaller slice. Effective transfer bandwidth roughly halves (7.6 → 4.0 GB/s at our load), and degrades further for large multi-segment transfers.
- Control-plane GIL starvation (secondary, bursty). The connector runs its ZMQ handshake + `send_kv_to_decode`/`receive_kv` orchestration on asyncio threads (`sender_loop`/`receiver_loop`) inside the engine process. A long forward pass (100k-token prefill) holds the GIL for seconds, stalling those threads → multi-second dispatch gaps even when the actual transfer is 0.2 s.
MB2 measured 9.7 GB/s precisely because both endpoints were idle. The real-workload gap is the difference between "idle benchmark" and "transfer while the GPU is doing the day job."
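The GIL-starvation mechanism can be reproduced in miniature without vLLM or mooncake. In the sketch below, a helper thread that wants to wake every millisecond stands in for `receiver_loop`, and a long C-level builtin call on the main thread stands in for a forward pass that does not release the GIL. This is only an analogy, assuming a standard (GIL) CPython build; `max_tick_gap` is a hypothetical helper, and absolute timings vary by machine.

```python
import threading
import time


def max_tick_gap(busy_fn, tick_s: float = 0.001) -> float:
    """Run busy_fn on the calling thread while a helper thread ticks every tick_s.

    Returns the largest observed gap (seconds) between consecutive ticks.
    """
    stop = threading.Event()
    started = threading.Event()
    gaps = []

    def ticker():
        last = time.perf_counter()
        started.set()
        while not stop.is_set():
            time.sleep(tick_s)  # releases the GIL, then must reacquire it to continue
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    thread = threading.Thread(target=ticker, daemon=True)
    thread.start()
    started.wait()
    busy_fn()
    stop.set()
    thread.join()
    return max(gaps, default=0.0)


if __name__ == "__main__":
    # Main thread sleeping: the GIL is free, so ticks stay close to tick_s.
    idle_gap = max_tick_gap(lambda: time.sleep(0.5))
    # One long builtin call: the interpreter gets no chance to hand over the GIL
    # until it returns, so the helper thread's gap grows to roughly the call's duration.
    busy_gap = max_tick_gap(lambda: sum(range(30_000_000)))
    print(f"idle max gap: {idle_gap * 1e3:.1f} ms, busy max gap: {busy_gap * 1e3:.1f} ms")
```

The helper thread does almost no work, yet its latency tracks the longest GIL-holding call on the main thread. That is the same shape as a 0.2 s transfer taking ~10 s end to end.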
3. Implication: layerwise is the wrong lever; migration's tax is largely irreducible
| lever | effect on the gap |
|---|---|
| Model-level layerwise transfer (push each layer's KV during prefill) | Worse. Prefill is the most HBM-intensive phase, so per-layer transfers contend harder for HBM (Cause 1); and they multiply the control-plane round-trips (Cause 2). |
| Control-plane fix (move mooncake orchestration off the GIL-contended threads / separate process) | Addresses only the bursty ~10 s stalls (~15% in the clean case, up to ~45% of the trace tail). Does not touch the HBM-contention half. |
| Reduce bytes (cache-aware target so less KV moves) | Helps linearly; v3 Mechanism B already does some. Orthogonal. |
| Migrate to/from idle instances | Would restore ~10 GB/s — but defeats the purpose (we migrate because the host is busy). |
The dominant cost (RDMA contending with compute for HBM on busy instances) is a hardware reality, not a software bug we can patch away, and not something layerwise improves. This reinforces UNIFIED_ABLATION.md: the unified no-migration path (A+B'+RaceFix) remains the right default; migration's transfer tax is structural on a loaded agentic cluster.
4. Repro / artifacts
- Instrumented v3 breakdown: `outputs/b3_v3_fullbreak_20260528_0338/unified_v3/` (`transfer_decomp.txt`, `dst_migration_breakdown.{csv,png}`, `transfer_rootcause.png`)
- MB6 main: `outputs/mb6_agentic-kv_20260528_0552/mb6_result.json`
- MB6 fresh: `outputs/mb6_fresh_20260528_0559/mb6_result.json`
- Instruments: `microbench/fresh_setup/instrument_dst_migration.py`, `microbench/fresh_setup/instrument_mooncake.py`
- Microbench: `microbench/fresh_setup/mb6_transfer_under_load.py` + `run_mb6.sh` (`VENV=… bash run_mb6.sh`)
- Analyzers: `analyze_dst_migration.py`, `analyze_transfer_decomp.py`
All instruments apply/revert cleanly via --apply/--revert; both venvs
were restored after the runs.