1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s, scaling with prefill hidden behind transfer. Trace-level TTFT p90 -6% / p99 -5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms layer-wise removes the transfer half of migration overhead but not the control-plane/queue residual. DESIGN.md updated with results. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9.9 KiB
Layer-wise KV transfer on Mooncake — exploration
Goal: make vLLM's MooncakeConnector push KV per-layer during prefill
(write mode) instead of the current post-hoc full-request transfer, then
microbench correctness + whether it hides the transfer behind prefill compute
(the thing MoRIIO's write mode does on AMD; no NVIDIA connector ships it).
Everything here is isolated in worktree worktree-mooncake-layerwise. The
dash0 venv connector is backed up at mooncake_connector.py.ORIG_BACKUP;
revert = copy the backup back. Opt-in via env MOONCAKE_LAYERWISE=1, so with
the env unset the connector behaves exactly as upstream.
Baseline flow (post-hoc, what we have)
- Proxy: prefill on src (
do_remote_decode, max_tokens=1) → await done → decode on dst (do_remote_prefill) which pulls. - dst
start_load_kv→receive_kvsends ZMQMooncakeXferMetadata(its block addrs) to src bootstrap. - src
send_kv_to_decode: waitssend_meta.ready(set atrequest_finished, i.e. after full prefill) →_build_transfer_params(all layers) →_send_blocks(one bigbatch_transfer_sync_write) → FINISH response.
Measured: this full transfer is on the critical path, runs at ~3 GB/s under load (vs ~10 GB/s idle), dominating migration TTFT.
Layer-wise flow (write mode, this exploration)
Key idea: keep all RDMA + completion on the sender_loop thread (clean), but
issue one batch_transfer_sync_write per layer, each fired as soon as that
layer's KV is computed — so writes overlap the remaining prefill compute.
Signaling: save_kv_layer(layer_name, ...) (called by vLLM's attention hook
after each layer's forward, on the main worker thread) records "layer L
computed" and wakes the sender_loop. send_kv_to_decode loops L=0..N-1,
waits until L is computed, writes layer L's blocks, then sends FINISH.
Edits to mooncake_connector.py (all gated by _lw_enabled)
- Worker
__init__:_lw_enabled(env), layer-name→position map,_lw_computed: dict[transfer_id,int],_lw_active: set[transfer_id], wake event, lock. register_kv_caches: build_lw_layer_pos[layer_name](0..N-1) and_lw_addr_idx[pos]= indices intokv_caches_base_addr(×2 ifsplit_k_and_v).- Scheduler
update_state_after_alloc(do_remote_decodebranch): in layer-wise mode captureblocks.get_block_ids()[0]and store non-empty in_reqs_need_sendso the worker learns local block_ids + setsreadybefore prefill finishes. - Worker
note_layer_computed(layer_name)(new) called fromMooncakeConnector.save_kv_layer: bump_lw_computed[tid]for active producers,call_soon_threadsafe(wake.set). - Worker
send_kv_to_decode: in layer-wise mode, mark transfer active, loop layers: await_lw_computed[tid] >= L,_send_blocksfor layer L only (subset of_build_transfer_params), then send FINISH. - Worker
_build_layer_transfer_params(new): like_build_transfer_paramsbut only the addr indices for one layer position.
Microbench requirements
- Disable chunked prefill (
--max-num-batched-tokens≥ prompt) so prefill is a single forward andsave_kv_layerfires once per layer in order. - Dispatch the dst (
do_remote_prefill) request first/concurrently so the ZMQ handshake reaches src during prefill. - Correctness: dst follow-up
cached_tokens == prompt_len(KV landed), identical to baseline. - Perf: src prefill wall-clock (does layer-wise slow it?) and dst TTFT (does transfer leave the critical path?), swept over KV size, vs baseline.
Status
- worktree + connector backup + design
- modified connector (LAYERWISE.py, +193/-4 lines, env-gated)
- correctness microbench (mb7_layerwise.py) + launcher (run_mb7.sh)
- correctness run on dash0 — PASS (KV lands; cached == prompt)
- perf run + verdict — POSITIVE (transfer hidden behind prefill)
Results (2-instance, idle, chunked-prefill off, Qwen3-30B-A3B, 48 layers)
Metric: overhead = total − prefill_only = the transfer cost left on the
critical path (TTFT). Baseline = post-hoc full pull (sequential).
| KV size | baseline overhead | layerwise overhead | reduction |
|---|---|---|---|
| 8192 (0.75 GiB) | 123 ms | 58 ms | 2.1× |
| 16384 (1.5 GiB) | 202 ms | 58 ms | 3.5× |
| 32768 (3.0 GiB) | 529 ms | 57 ms | 9.3× |
Key signatures:
- Layerwise overhead is ~constant (~58 ms) regardless of KV size, while baseline grows O(KV size). The 58 ms is handshake + last-layer tail + 1 decode; the bulk transfer is hidden behind prefill compute.
- Prefill did NOT slow down: layerwise
t_A(575/1495/4440 ms) ==prefill_only(574/1492/4440 ms). The concurrent RDMA was "free" on idle GPUs — no measurable HBM contention with prefill compute here. - Producer logs confirm the transfer itself took 0.39/0.55/4.37 s (grows with size) yet ran inside the prefill window, so it left the critical path.
- Correctness PASS: B's follow-up cached == prompt for all sizes; the 48-layer / 96-base-addr (split K&V) per-layer addressing is correct.
Caveats (why this is a proof-of-concept, not a verdict for production)
- Idle instances only. Real migration happens between busy instances. Under load both prefill and transfer slow; transfer (even at ~3 GB/s) is still < prefill for big contexts so it should still hide, but receive-side (B) and HBM contention during prefill are untested here. NEXT: rerun with background load on both A and B.
- Chunked prefill disabled. The monotonic layer counter assumes one forward, layers in order. Production uses chunked prefill (multi-step), which needs per-(chunk,layer) tracking — not implemented.
- Single concurrent producer transfer. Global counter; real migration is concurrent. Would need per-transfer state.
- Microbench dispatch. mb7 fires B then A with a 50 ms head start to get
the handshake to A before its forward. The real proxy path
(
_handle_combined_pd_sep_v2) dispatches sequentially and would need the write-mode (concurrent) restructure.
Results under LOAD (bg=16 background decode streams, 8 per instance)
Critical-path transfer overhead (ms), total − prefill_only:
| KV size | idle base | idle LW | load base | load LW |
|---|---|---|---|---|
| 8k | 123 | 58 | 158 | 94 |
| 16k | 202 | 58 | 239 | 83 |
| 32k | 529 | 57 | 749 | 95 |
The overlap survives load: layerwise overhead stays ~constant (~90 ms)
under load while baseline grows to 749 ms at 32k (7.9× reduction). Prefill did
not slow (load LW t_A == load prefill_only); the transfer (0.56/1.46/4.37 s,
producer logs) ran inside the prefill window even with 16 concurrent decodes.
Correctness PASS under load.
FULL 1200-req v3 TRACE re-profile (chunk-safe + concurrent + write-mode)
Hardened connector (per-step incremental shipping, per-transfer state) +
write-mode proxy (concurrent prefill/decode dispatch). Two passes of
w600_r0.0015_st30.jsonl under unified_v3, differing only in transfer mode.
Correctness: layer-wise 1213/1214 success (1 connection-error on the 128k
req, not KV corruption); byte-level KV correctness validated on mb7
(chunked + 3-way concurrent, cached==prompt); producer logs confirm
incremental shipping (e.g. shipped 7872/7872 blocks).
Migration sets differ between runs (write-mode timing shifts which requests trigger migration; only 4 migrated in both), but are distributionally comparable (median new_local/input ≈ 0.42 vs 0.46). Matched migrations all improved, scaling with the transfer hidden behind prefill:
| request | input | new_local | base TTFT | LW TTFT | Δ |
|---|---|---|---|---|---|
| 1268630 | 102k | 97k | 41.20 | 33.96 | −7.23s |
| 1334223 | 37k | 14k | 6.04 | 3.23 | −2.81s |
| 1279412 | 40k | 8k | 5.50 | 2.92 | −2.58s |
| 1271459 | 8.9k | 8.9k | 37.01 | 36.98 | −0.03s (queue-bound) |
Trace-level TTFT (different sets, directional): overall p90 9.79→9.16 (−6%), p99 44.89→42.85 (−5%). Modest because (a) migrations are only 25/1214 ≈ 2% of requests, and (b) several migrations are queue/contention-bound, not transfer-bound — layer-wise removes the transfer component but not the control-plane/queue residual (the ~45% from the b3_v3_fullbreak profile).
Verdict on the trace re-profile: layer-wise does exactly what the profile predicted — it removes the transfer half of migration overhead (matched migrations −2.6 to −7.2s, biggest where there's the most prefill to hide behind), but the trace-level gain is small because migrations are rare and partly queue-bound. It does NOT, on its own, flip migration to a clear win over unified for this agentic workload.
Verdict (microbench)
The mechanism works and the benefit holds under load: layer-wise push turns migration's KV-transfer cost from O(KV size) on the critical path into a near-constant ~90 ms tail, by overlapping it with prefill compute — what MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake.
BUT this is single-transfer, non-chunked. Running the actual 1200-req trace correctly needs two more pieces this PoC does NOT have:
- Chunk-safe tracking — long agentic prompts force chunked prefill;
save_kv_layerthen fires per-chunk and the monotonic counter would ship uncomputed blocks. Needs slot-mapping-aware per-(request,chunk) tracking. - Concurrent-transfer safety — the global counter assumes one producer at a time; the trace migrates from busy instances running other forwards.
Also: even with those fixed, layer-wise only removes the transfer half of
the measured migration overhead. The b3_v3_fullbreak profile showed dst-side
T_kv_pull = ~55% RDMA + ~45% control-plane GIL-dispatch stalls; layer-wise
hides the RDMA half but the control-plane half is orthogonal. So a trace
re-profile would show roughly the transfer half collapse, not the whole thing.