vLLM Mooncake patch:
- get_num_new_matched_tokens: add a remote_num_tokens parameter for
  partial remote prefill (pull N tokens from the remote, compute the
  rest locally)
- update_state_after_alloc: allocate receive blocks only for the
  external (remotely pulled) portion
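
Illustrative sketch of the token split behind partial remote prefill. This
is not the patched vLLM code: split_partial_prefill and all of its
parameters are hypothetical names, and it assumes remote_num_tokens is the
length of the prompt prefix whose KV is available remotely and that
num_computed_tokens is block-aligned.

```python
def split_partial_prefill(prompt_len, num_computed_tokens,
                          remote_num_tokens, block_size):
    """Return (external_tokens, local_tokens, receive_blocks).

    Hypothetical helper, not a vLLM API. Assumes the remote holds KV for
    the first remote_num_tokens prompt tokens and that
    num_computed_tokens (local prefix-cache hit) is block-aligned.
    """
    # Tokens that still need KV after the local prefix-cache hit.
    remaining = max(prompt_len - num_computed_tokens, 0)
    # Pull only the part of the remote prefix that is not already local,
    # and never more than what is still missing.
    external = min(max(remote_num_tokens - num_computed_tokens, 0), remaining)
    # Everything else is prefilled locally.
    local = remaining - external
    # Receive blocks cover the external portion only (ceiling division).
    receive_blocks = -(-external // block_size)
    return external, local, receive_blocks
```

For a 1000-token prompt with 768 tokens cached remotely, no local hit, and
16-token blocks, this pulls 768 tokens into 48 receive blocks and leaves
232 tokens for local prefill.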
Proxy _handle_heavy_offload rewrite:
- Step 1: C_s exports ONLY its cached blocks (truncated prompt, no compute)
- Step 2: D pulls the cached blocks, prefills the new tokens locally, then
  decodes
- C_s's blocks are freed automatically by Mooncake delay_free once D
  confirms receipt
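
Illustrative sketch of the handoff order, as a toy in-memory model. It is
not the proxy code: Instance, its fields, and this standalone
handle_heavy_offload function are hypothetical stand-ins, with C_s modeled
as the source and D as the destination, and the delayed free modeled as a
simple pending set.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Toy stand-in for a serving instance (hypothetical, not a vLLM type)."""
    name: str
    kv_tokens: dict = field(default_factory=dict)   # session -> tokens with KV
    pending_free: set = field(default_factory=set)  # sessions awaiting free
    computed_tokens: int = 0                        # prefill work done here


def handle_heavy_offload(src, dst, session, prompt_len):
    """Migrate a session from src to dst; return (pulled, prefilled)."""
    # Step 1: the source exports only what it already has cached. The
    # prompt is truncated to that prefix, so the source computes nothing.
    pulled = min(src.kv_tokens.get(session, 0), prompt_len)
    src.pending_free.add(session)  # blocks stay alive until receipt

    # Step 2: the destination pulls the cached prefix, then prefills the
    # remaining new tokens locally before decoding.
    dst.kv_tokens[session] = pulled
    prefilled = prompt_len - pulled
    dst.computed_tokens += prefilled
    dst.kv_tokens[session] = prompt_len

    # Receipt confirmed: the delayed free releases the source's blocks.
    src.pending_free.discard(session)
    src.kv_tokens.pop(session, None)
    return pulled, prefilled


c_s = Instance("C_s", kv_tokens={"s1": 768})
d = Instance("D")
result = handle_heavy_offload(c_s, d, "s1", prompt_len=1000)
```

In this model the source ends with computed_tokens == 0 and no cache entry
for the session, while the destination holds KV for the full prompt.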
This enables true session migration: C_s releases its cache and D takes
over. C_s's GPU is freed immediately (no compute), unlike the old
approach, where C_s had to run a full prefill (1-15s of GPU occupancy).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>