agentic-kvc

Go to file

Gahow Wang ea5149726c Partial remote prefill: C_s exports cache, D computes new tokens locally

vLLM Mooncake patch:
- get_num_new_matched_tokens: support remote_num_tokens parameter for
  partial remote prefill (pull N tokens from remote, compute rest locally)
- update_state_after_alloc: only allocate receive blocks for external portion

Proxy _handle_heavy_offload rewrite:
- Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute)
- Step 2: D pulls cached blocks + does local prefill for new tokens + decodes
- C_s's blocks auto-freed by Mooncake delay_free after D confirms receipt

This enables true session migration: C_s releases cache, D takes over.
C_s's GPU is freed immediately (no compute), vs old approach where C_s
had to do full prefill (1-15s GPU occupancy).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-23 20:04:13 +08:00

analysis

Add comprehensive research findings document

2026-05-23 07:16:31 +08:00

experiments

Add elastic PS evaluation plan for production-realistic trace

2026-05-23 15:56:05 +08:00

patches

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

replayer

Fix replay methodology: trace-driven dispatch, no artificial limits