
Gahow Wang 63387f614d Full v3 trace re-profile with layer-wise: matched migrations improve

1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s,
scaling with prefill hidden behind transfer. Trace-level TTFT p90 -6% / p99
-5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms
layer-wise removes the transfer half of migration overhead but not the
control-plane/queue residual. DESIGN.md updated with results.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 19:16:37 +08:00


Layer-wise KV transfer on Mooncake — exploration

Goal: make vLLM's MooncakeConnector push KV per layer during prefill (write mode) instead of the current post-hoc full-request transfer, then microbenchmark (a) correctness and (b) whether the transfer hides behind prefill compute. This is what MoRIIO's write mode does on AMD; no NVIDIA connector ships it.

Everything here is isolated in worktree worktree-mooncake-layerwise. The dash0 venv connector is backed up at mooncake_connector.py.ORIG_BACKUP; revert = copy the backup back. Opt-in via env MOONCAKE_LAYERWISE=1, so with the env unset the connector behaves exactly as upstream.

Baseline flow (post-hoc, what we have)

Proxy: prefill on src (do_remote_decode, max_tokens=1) → await done → decode on dst (do_remote_prefill) which pulls.
dst start_load_kv→receive_kv sends ZMQ MooncakeXferMetadata (its block addrs) to src bootstrap.
src send_kv_to_decode: waits send_meta.ready (set at request_finished, i.e. after full prefill) → _build_transfer_params (all layers) → _send_blocks (one big batch_transfer_sync_write) → FINISH response.

Measured: this full transfer is on the critical path, runs at ~3 GB/s under load (vs ~10 GB/s idle), dominating migration TTFT.

Layer-wise flow (write mode, this exploration)

Key idea: keep all RDMA + completion on the sender_loop thread (clean), but issue one batch_transfer_sync_write per layer, each fired as soon as that layer's KV is computed — so writes overlap the remaining prefill compute.
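A minimal timing model of the overlap, not taken from the connector: it assumes uniform per-layer compute and transfer times and a single serialized link, and the function name and parameters are illustrative only. It returns how much transfer time remains after prefill ends, which is the quantity the microbench later calls overhead.

```python
def critical_path_overhead(n_layers, compute_per_layer, xfer_per_layer, layerwise):
    """Transfer time left on the critical path after prefill ends.

    All times share one unit. Assumes uniform layers and one serialized link.
    """
    prefill_end = n_layers * compute_per_layer
    if not layerwise:
        # Post-hoc: the whole transfer starts only after prefill finishes.
        return n_layers * xfer_per_layer

    link_free = 0.0
    for layer in range(n_layers):
        # Layer's KV exists once its forward pass is done.
        ready = (layer + 1) * compute_per_layer
        # The write starts when both the data and the link are available.
        link_free = max(link_free, ready) + xfer_per_layer
    return link_free - prefill_end


if __name__ == "__main__":
    print(critical_path_overhead(48, 10.0, 2.0, layerwise=False))
    print(critical_path_overhead(48, 10.0, 2.0, layerwise=True))
```

In this model, when the per-layer transfer is faster than the per-layer compute, only the last layer's write stays on the critical path, regardless of total KV size. When the transfer is slower than compute, the backlog grows and the overlap only hides part of it.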

Signaling: save_kv_layer(layer_name, ...) (called by vLLM's attention hook after each layer's forward, on the main worker thread) records "layer L computed" and wakes the sender_loop. send_kv_to_decode loops L=0..N-1, waits until L is computed, writes layer L's blocks, then sends FINISH.
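The signaling pattern can be sketched in isolation. This is a simplified stand-in, not the connector code: `LayerProgress`, `send_layerwise`, and `fake_prefill` are hypothetical names, a list append replaces the per-layer RDMA write, and a plain thread plays the role of the main worker thread.

```python
import asyncio
import threading
import time


class LayerProgress:
    """Counts computed layers for one transfer (hypothetical helper)."""

    def __init__(self, loop):
        self._loop = loop
        self._lock = threading.Lock()
        self._computed = 0
        self._wake = asyncio.Event()

    def note_layer_computed(self):
        # Called from the worker thread after a layer's forward pass.
        with self._lock:
            self._computed += 1
        # asyncio.Event is not thread-safe, so set() is handed to the loop.
        self._loop.call_soon_threadsafe(self._wake.set)

    async def wait_for_layer(self, layer):
        # Returns once layers 0..layer have all been computed.
        while True:
            with self._lock:
                if self._computed > layer:
                    return
            await self._wake.wait()
            self._wake.clear()


async def send_layerwise(progress, num_layers, sent):
    for layer in range(num_layers):
        await progress.wait_for_layer(layer)
        sent.append(layer)  # stand-in for the per-layer write
    sent.append("FINISH")


def fake_prefill(progress, num_layers):
    for _ in range(num_layers):
        time.sleep(0.001)  # pretend to compute one layer
        progress.note_layer_computed()


async def run_demo(num_layers=4):
    progress = LayerProgress(asyncio.get_running_loop())
    sent = []
    worker = threading.Thread(target=fake_prefill, args=(progress, num_layers))
    worker.start()
    await send_layerwise(progress, num_layers, sent)
    worker.join()
    return sent


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The counter is re-checked after every wake-up, so a wake-up that arrives between the check and the `wait()` is not lost: the counter is bumped before the wake is scheduled.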

Edits to `mooncake_connector.py` (all gated by `_lw_enabled`)

Worker __init__: _lw_enabled (env), layer-name→position map, _lw_computed: dict[transfer_id,int], _lw_active: set[transfer_id], wake event, lock.
register_kv_caches: build _lw_layer_pos[layer_name] (0..N-1) and _lw_addr_idx[pos] = indices into kv_caches_base_addr (×2 if split_k_and_v).
Scheduler update_state_after_alloc (do_remote_decode branch): in layer-wise mode capture blocks.get_block_ids()[0] and store non-empty in _reqs_need_send so the worker learns local block_ids + sets ready before prefill finishes.
Worker note_layer_computed(layer_name) (new) called from MooncakeConnector.save_kv_layer: bump _lw_computed[tid] for active producers, call_soon_threadsafe(wake.set).
Worker send_kv_to_decode: in layer-wise mode, mark transfer active, loop layers: await _lw_computed[tid] >= L, _send_blocks for layer L only (subset of _build_transfer_params), then send FINISH.
Worker _build_layer_transfer_params (new): like _build_transfer_params but only the addr indices for one layer position.
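The per-layer addressing in the `register_kv_caches` item can be illustrated with a small sketch. It assumes the flat base-address list is ordered layer by layer, with K and V adjacent when they are split; that ordering and the function name `build_layer_addr_index` are assumptions for illustration, not confirmed details of the connector.

```python
def build_layer_addr_index(layer_names, split_k_and_v):
    """Map layer names to positions, and positions to base-address indices.

    Assumes base addresses are stored layer by layer, with two entries
    (K then V) per layer when K and V are split, otherwise one.
    """
    layer_pos = {name: pos for pos, name in enumerate(layer_names)}
    per_layer = 2 if split_k_and_v else 1
    addr_idx = {
        pos: list(range(pos * per_layer, (pos + 1) * per_layer))
        for pos in layer_pos.values()
    }
    return layer_pos, addr_idx


if __name__ == "__main__":
    names = [f"layers.{i}.attn" for i in range(48)]  # illustrative names
    layer_pos, addr_idx = build_layer_addr_index(names, split_k_and_v=True)
    print(len(layer_pos), sum(len(v) for v in addr_idx.values()))
```

A per-layer transfer then builds its parameters from only `addr_idx[pos]` instead of the full address list.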

Microbench requirements

Disable chunked prefill (--max-num-batched-tokens ≥ prompt) so prefill is a single forward and save_kv_layer fires once per layer in order.
Dispatch the dst (do_remote_prefill) request first/concurrently so the ZMQ handshake reaches src during prefill.
Correctness: dst follow-up cached_tokens == prompt_len (KV landed), identical to baseline.
Perf: src prefill wall-clock (does layer-wise slow it?) and dst TTFT (does transfer leave the critical path?), swept over KV size, vs baseline.

Status

worktree + connector backup + design
modified connector (LAYERWISE.py, +193/-4 lines, env-gated)
correctness microbench (mb7_layerwise.py) + launcher (run_mb7.sh)
correctness run on dash0 — PASS (KV lands; cached == prompt)
perf run + verdict — POSITIVE (transfer hidden behind prefill)

Results (2-instance, idle, chunked-prefill off, Qwen3-30B-A3B, 48 layers)

Metric: overhead = total − prefill_only = the transfer cost left on the critical path (TTFT). Baseline = post-hoc full pull (sequential).

| Prompt tokens (KV size) | baseline overhead | layerwise overhead | reduction |
|---|---|---|---|
| 8192 (0.75 GiB) | 123 ms | 58 ms | 2.1× |
| 16384 (1.5 GiB) | 202 ms | 58 ms | 3.5× |
| 32768 (3.0 GiB) | 529 ms | 57 ms | 9.3× |
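The reduction column is the ratio of the two overheads. A short check using the measured values from the table (the helper name `reduction` is just for illustration):

```python
def reduction(baseline_ms, layerwise_ms):
    """Ratio of baseline to layer-wise overhead, rounded to one decimal."""
    return round(baseline_ms / layerwise_ms, 1)


# Prompt tokens -> (baseline overhead ms, layer-wise overhead ms), idle run.
idle = {8192: (123, 58), 16384: (202, 58), 32768: (529, 57)}

if __name__ == "__main__":
    for tokens, (base, lw) in idle.items():
        print(f"{tokens:>6} tokens: {base - lw:>3} ms removed, {reduction(base, lw)}x")
```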

Key signatures:

Layerwise overhead is ~constant (~58 ms) regardless of KV size, while baseline grows O(KV size). The 58 ms is handshake + last-layer tail + 1 decode; the bulk transfer is hidden behind prefill compute.
Prefill did NOT slow down: layerwise t_A, the prefill wall-clock on the src instance A (575/1495/4440 ms), matches prefill_only (574/1492/4440 ms) to within 3 ms. The concurrent RDMA was "free" on idle GPUs: no measurable HBM contention with prefill compute here.
Producer logs confirm the transfer itself took 0.39/0.55/4.37 s (grows with size) yet ran inside the prefill window, so it left the critical path.
Correctness PASS: the follow-up request on the dst instance B reports cached == prompt for all sizes; the 48-layer / 96-base-addr (split K&V) per-layer addressing is correct.
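The GiB figures in the results table can be reproduced from the token counts. In this sketch only the 48 layers come from this document; `kv_heads=4`, `head_dim=128`, and a 2-byte dtype are assumptions, chosen because they reproduce the table's sizes exactly.

```python
GIB = 1024 ** 3


def kv_bytes(tokens, layers=48, kv_heads=4, head_dim=128, dtype_bytes=2):
    """KV cache bytes for `tokens` tokens.

    kv_heads, head_dim, and dtype_bytes are assumed values, not stated
    in this document. The factor 2 covers the separate K and V tensors.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


if __name__ == "__main__":
    for tokens in (8192, 16384, 32768):
        print(tokens, kv_bytes(tokens) / GIB, "GiB")
```

Under these assumptions each token costs 96 KiB of KV, or 2 KiB per layer per K or V tensor, which is the unit a per-layer write moves.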

Caveats (why this is a proof-of-concept, not a verdict for production)

Idle instances only. Real migration happens between busy instances. Under load both prefill and transfer slow; transfer (even at ~3 GB/s) is still < prefill for big contexts so it should still hide, but receive-side (B) and HBM contention during prefill are untested here. NEXT: rerun with background load on both A and B.
Chunked prefill disabled. The monotonic layer counter assumes one forward, layers in order. Production uses chunked prefill (multi-step), which needs per-(chunk,layer) tracking — not implemented.
One producer transfer at a time. The layer counter is global, but real migrations run concurrently, so it would need per-transfer state.
Microbench dispatch. mb7 fires B then A with a 50 ms head start to get the handshake to A before its forward. The real proxy path (_handle_combined_pd_sep_v2) dispatches sequentially and would need the write-mode (concurrent) restructure.

Results under LOAD (bg=16 background decode streams, 8 per instance)

Critical-path transfer overhead (ms), total − prefill_only:

| Prompt tokens | idle base | idle LW | load base | load LW |
|---|---|---|---|---|
| 8k | 123 | 58 | 158 | 94 |
| 16k | 202 | 58 | 239 | 83 |
| 32k | 529 | 57 | 749 | 95 |

The overlap survives load: layerwise overhead stays ~constant (~90 ms) under load while baseline grows to 749 ms at 32k (7.9× reduction). Prefill did not slow (load LW t_A == load prefill_only); the transfer (0.56/1.46/4.37 s, producer logs) ran inside the prefill window even with 16 concurrent decodes. Correctness PASS under load.

FULL 1200-req v3 TRACE re-profile (chunk-safe + concurrent + write-mode)

Hardened connector (per-step incremental shipping, per-transfer state) + write-mode proxy (concurrent prefill/decode dispatch). Two passes of w600_r0.0015_st30.jsonl under unified_v3, differing only in transfer mode.

Correctness: layer-wise 1213/1214 success (1 connection-error on the 128k req, not KV corruption); byte-level KV correctness validated on mb7 (chunked + 3-way concurrent, cached==prompt); producer logs confirm incremental shipping (e.g. shipped 7872/7872 blocks).

Migration sets differ between runs (write-mode timing shifts which requests trigger migration; only 4 migrated in both), but are distributionally comparable (median new_local/input ≈ 0.42 vs 0.46). Matched migrations all improved, scaling with the transfer hidden behind prefill:

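| request | input (tokens) | new_local (tokens) | base TTFT (s) | LW TTFT (s) | Δ |
|---|---|---|---|---|---|
| 1268630 | 102k | 97k | 41.20 | 33.96 | −7.23s |
| 1334223 | 37k | 14k | 6.04 | 3.23 | −2.81s |
| 1279412 | 40k | 8k | 5.50 | 2.92 | −2.58s |
| 1271459 | 8.9k | 8.9k | 37.01 | 36.98 | −0.03s (queue-bound) |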

Trace-level TTFT (different sets, directional): overall p90 9.79→9.16 (−6%), p99 44.89→42.85 (−5%). Modest because (a) migrations are only 25/1214 ≈ 2% of requests, and (b) several migrations are queue/contention-bound, not transfer-bound — layer-wise removes the transfer component but not the control-plane/queue residual (the ~45% from the b3_v3_fullbreak profile).
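The quoted percentages follow directly from the reported values. A quick check (the helper name `pct_change` is illustrative):

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    print(f"p90: {pct_change(9.79, 9.16):.1f}%")
    print(f"p99: {pct_change(44.89, 42.85):.1f}%")
    print(f"migrated share: {100.0 * 25 / 1214:.1f}%")
```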

Verdict on the trace re-profile: layer-wise does exactly what the profile predicted — it removes the transfer half of migration overhead (matched migrations −2.6 to −7.2s, biggest where there's the most prefill to hide behind), but the trace-level gain is small because migrations are rare and partly queue-bound. It does NOT, on its own, flip migration to a clear win over unified for this agentic workload.

Verdict (microbench)

The mechanism works and the benefit holds under load: layer-wise push turns migration's KV-transfer cost from O(KV size) on the critical path into a near-constant ~90 ms tail, by overlapping it with prefill compute — what MoRIIO's write mode does on AMD, now demonstrated on NVIDIA/Mooncake.

BUT this is single-transfer, non-chunked. Running the actual 1200-req trace correctly needs two more pieces this PoC does NOT have:

Chunk-safe tracking — long agentic prompts force chunked prefill; save_kv_layer then fires per-chunk and the monotonic counter would ship uncomputed blocks. Needs slot-mapping-aware per-(request,chunk) tracking.
Concurrent-transfer safety — the global counter assumes one producer at a time; the trace migrates from busy instances running other forwards.
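One possible shape for chunk-safe, per-transfer tracking is sketched below. It is not the hardened connector's implementation: `ChunkedLayerTracker` and its fields are hypothetical, it tracks logical block indices only (mapping to physical block ids is left out), and it assumes each chunk extends a contiguous prefix of the prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChunkedLayerTracker:
    """Per-transfer state: which blocks of each layer are safe to ship."""

    num_layers: int
    block_size: int
    prompt_len: int
    computed: List[int] = field(init=False)  # tokens computed, per layer
    shipped: List[int] = field(init=False)   # blocks already shipped, per layer

    def __post_init__(self) -> None:
        self.computed = [0] * self.num_layers
        self.shipped = [0] * self.num_layers

    def note_chunk_layer(self, layer: int, chunk_end: int) -> List[int]:
        """Record that `layer` holds KV for tokens [0, chunk_end).

        Returns the logical block indices that just became shippable.
        """
        self.computed[layer] = max(self.computed[layer], chunk_end)
        done = min(self.computed[layer], self.prompt_len)
        if done == self.prompt_len:
            # Prompt complete: ceil-divide so the partial tail block ships too.
            ready_upto = -(-self.prompt_len // self.block_size)
        else:
            # Mid-prompt: only fully computed blocks are safe.
            ready_upto = done // self.block_size
        ready = list(range(self.shipped[layer], ready_upto))
        self.shipped[layer] = max(self.shipped[layer], ready_upto)
        return ready


if __name__ == "__main__":
    tracker = ChunkedLayerTracker(num_layers=2, block_size=16, prompt_len=40)
    print(tracker.note_chunk_layer(0, 24))  # first chunk, layer 0
    print(tracker.note_chunk_layer(0, 40))  # second chunk, layer 0
```

Keeping one tracker per transfer id, rather than one global counter, also addresses the concurrent-transfer item: forwards for other requests never advance this transfer's state.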

Also: even with those fixed, layer-wise only removes the transfer half of the measured migration overhead. The b3_v3_fullbreak profile showed dst-side T_kv_pull = ~55% RDMA + ~45% control-plane GIL-dispatch stalls; layer-wise hides the RDMA half but the control-plane half is orthogonal. So a trace re-profile would show roughly the transfer half collapse, not the whole thing.
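As a rough bound, assuming the RDMA share is hidden completely and the control-plane share is unchanged (both assumptions, not measurements), the 55/45 split implies:

$$
\frac{T_{\text{kv\_pull}}^{\text{LW}}}{T_{\text{kv\_pull}}^{\text{base}}} \approx 1 - f_{\text{RDMA}} = 1 - 0.55 = 0.45,
\qquad
\text{speedup}_{\max} \approx \frac{1}{0.45} \approx 2.2\times
$$

So even a perfect overlap caps the improvement of the dst-side pull stage at roughly 2.2×.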
