1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s,
scaling with prefill hidden behind transfer. Trace-level TTFT p90 -6% / p99
-5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms
layer-wise removes the transfer half of migration overhead but not the
control-plane/queue residual. DESIGN.md updated with results.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mb7 with background decode load (8/instance). Critical-path transfer overhead
stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at
32k), prefill not slowed, KV correct. Confirms the overlap holds on busy
instances. DESIGN.md updated with idle-vs-load table + the two blockers
(chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements per-layer KV push during prefill (write mode) on vLLM's
MooncakeConnector, env-gated by MOONCAKE_LAYERWISE=1. 2-instance microbench
(mb7) shows correctness (KV lands, cached==prompt) and that the transfer is
hidden behind prefill compute: critical-path overhead drops from O(KV size)
(123/202/529ms for 8k/16k/32k) to a flat ~58ms (2-9x), with no prefill
slowdown, on idle instances. Caveats: idle-only, chunked-prefill disabled,
single concurrent transfer — see DESIGN.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>