agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	0500350849	Fix hash mismatch: token-based lookup instead of cross-instance hash matching. Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32)) when PYTHONHASHSEED is not set. All block hashes are chained from NONE_HASH, so D's hashes never match C's hashes. Fix: C's bootstrap server now accepts token_ids and does the prefix cache lookup locally using C's own hash function and block pool. No cross-instance hash matching needed. New flow: D sends prompt token_ids → C computes hashes on C's side → C looks up in C's own BlockPool → returns block_ids. Also: module-level _shared_block_pool for scheduler→bootstrap bridge, prompt_token_ids passed through PullReqMeta, test script added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:14:33 +08:00
Gahow Wang	a1f30e5fce	Add hash_table_sync logging + gap analysis. Root cause of 0 cache hits on offloaded requests identified: hash table sync IS working (scheduler→metadata→worker→bootstrap), but D's query_blocks returns no matches → hash format mismatch between D's request.block_hashes and C's synced hashes. The gap: offloaded TTFT (12.4s) ≈ co-located TTFT (12.0s) because D does FULL cold prefill (cache_hit=0), not partial prefill with RDMA-read cached blocks. Next: debug hash format mismatch between D and C. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 00:38:14 +08:00
Gahow Wang	29b901b145	Fix scheduler assertion crash on partial remote prefill finished_recving. The assertion `assert RequestStatus.is_finished(req.status)` at scheduler.py:2109 fires when a partial-remote-prefill request receives `finished_recving` while in RUNNING state (local prefill already started before RDMA read completed). This was the root cause of the 67% error rate: EngineCore crashed with a "fatal error" assertion, killing the vLLM instance. Fix: replace the assertion with a debug log for non-WAITING, non-finished requests. The kv_both no-offload baseline confirmed 0 errors, proving the crash was from our scheduler patch, not kv_both instability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 23:33:26 +08:00
Gahow Wang	23788f7cd5	Fix: import field from dataclasses for PullReqMeta. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:29:24 +08:00
Gahow Wang	a7df84bd3b	Direct RDMA read: D reads cached KV from C's GPU without C's scheduler. Complete implementation of direct RDMA read for KV cache migration. vLLM Mooncake connector (mooncake_connector.py): PullReqMeta adds direct_read flag + block_hashes; MooncakeConnectorMetadata adds hash_table_updates/removals for scheduler→worker block hash sync; MooncakeConnectorScheduler gets set_block_pool() to access BlockPool, build_connector_meta() computes hash table deltas each step, and update_state_after_alloc() captures request block hashes for direct_read; MooncakeConnectorWorker gets _start_direct_read() + _direct_read_single(), which implement the D-side RDMA read via batch_transfer_sync_read, with HTTP query/unpin to C's bootstrap server. Bootstrap server (mooncake_utils.py): POST /query_blocks looks up block hashes and returns block_ids + GPU layout; POST /unpin_blocks releases pin tracking; set_worker_kv_info() registers GPU addresses at init; update_hash_table() receives scheduler deltas each step. Scheduler (scheduler.py): one-line hookup that passes block_pool to the connector after KVCacheManager init. Proxy (cache_aware_proxy.py): _handle_direct_read_offload sends the request ONLY to D with direct_read=True + remote_bootstrap_addr. No request to C at all; C's scheduler is completely uninvolved (0 GPU time on C). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:02:13 +08:00
Gahow Wang	ea5149726c	Partial remote prefill: C_s exports cache, D computes new tokens locally. vLLM Mooncake patch: get_num_new_matched_tokens supports a remote_num_tokens parameter for partial remote prefill (pull N tokens from remote, compute rest locally); update_state_after_alloc only allocates receive blocks for the external portion. Proxy _handle_heavy_offload rewrite: Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute); Step 2: D pulls cached blocks + does local prefill for new tokens + decodes; C_s's blocks are auto-freed by Mooncake delay_free after D confirms receipt. This enables true session migration: C_s releases cache, D takes over. C_s's GPU is freed immediately (no compute), vs the old approach where C_s had to do full prefill (1-15s GPU occupancy). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 20:04:13 +08:00
Gahow Wang	3bc37cc6d5	PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix. Experiments run: Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise); PS V1 (cold prefill): REJECTED, PS always slower than cached C; PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s), PS bottleneck; V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal; H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD TTFT=11.5s due to RDMA (HEAVY_COLO improved 10.5% from better balance); H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s). Key findings: Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead; 92% of HEAVY are turn-1 cold → offloading cold requests always loses; GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT); RDMA transfer negates the routing benefit for offloaded requests. Code changes: bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark); cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing; mooncake_connector.py: elif→if fix (allow dual prefill+decode flags). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 02:14:37 +08:00
Gahow Wang	445e491123	Add vLLM v0.18.1 source tree with KV transfer abort fix. third_party/vllm/ is now tracked in git for direct patch management. Based on the vLLM v0.18.1 release with one patch applied: vllm/v1/core/sched/scheduler.py: replace fatal assert with graceful skip when a KV transfer callback arrives for an already-aborted request during PD disaggregated serving. Future vLLM modifications should be made directly in third_party/vllm/ and committed normally. The patches/ directory is kept as documentation of what changed from upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:30:38 +08:00
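
The hash mismatch described in commits a1f30e5fce and 0500350849 can be reproduced with a toy model. The sketch below is not vLLM's hashing code: `ToyInstance`, `chain_block_hashes`, `BLOCK_SIZE`, and the use of SHA-256 are all hypothetical stand-ins chosen for illustration. It only assumes what the commit message states: each instance chains its block hashes from a per-process random seed, so hashes computed on D cannot match entries in C's pool, while a lookup keyed on token IDs and hashed on C's side can.

```python
import hashlib
import os

BLOCK_SIZE = 4  # tokens per block; illustrative value, not vLLM's default


def chain_block_hashes(token_ids, seed):
    """Hash each full block of tokens, chaining from a per-instance seed."""
    hashes = []
    parent = seed
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = parent + b"".join(t.to_bytes(4, "little") for t in block)
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class ToyInstance:
    """Hypothetical stand-in for one serving instance and its block pool."""

    def __init__(self):
        self.seed = os.urandom(32)  # plays the role of a random NONE_HASH
        self.pool = {}              # block hash -> block id
        self._next_block_id = 0

    def cache_prompt(self, token_ids):
        for h in chain_block_hashes(token_ids, self.seed):
            if h not in self.pool:
                self.pool[h] = self._next_block_id
                self._next_block_id += 1

    def lookup_by_hashes(self, hashes):
        """Return block ids for the longest prefix of hashes found in the pool."""
        block_ids = []
        for h in hashes:
            if h not in self.pool:
                break
            block_ids.append(self.pool[h])
        return block_ids

    def lookup_by_tokens(self, token_ids):
        """Token-based lookup: hash with this instance's own seed, then look up."""
        return self.lookup_by_hashes(chain_block_hashes(token_ids, self.seed))


c, d = ToyInstance(), ToyInstance()
prompt = list(range(10))
c.cache_prompt(prompt)

# Hashes computed on D use D's seed, so C finds nothing.
cross_instance_hit = c.lookup_by_hashes(chain_block_hashes(prompt, d.seed))
# Sending token ids and hashing on C's side finds both cached blocks.
token_based_hit = c.lookup_by_tokens(prompt)
```

In this model the cross-instance lookup returns no blocks while the token-based lookup returns the two full blocks cached on C, which mirrors the `cache_hit=0` symptom and its fix.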

8 Commits
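
Commit a7df84bd3b mentions that the scheduler computes hash table deltas each step and that the bootstrap server receives them. A minimal sketch of that idea follows, assuming the table is a plain mapping from block hash to block id. The function names `hash_table_delta` and `apply_delta` and the dictionary representation are hypothetical; the real `build_connector_meta()` and `update_hash_table()` may be structured differently.

```python
def hash_table_delta(previous, current):
    """Return (updates, removals) that turn `previous` into `current`."""
    updates = {h: b for h, b in current.items() if previous.get(h) != b}
    removals = [h for h in previous if h not in current]
    return updates, removals


def apply_delta(table, updates, removals):
    """Apply one step's delta to the receiving side's copy, in place."""
    for h in removals:
        table.pop(h, None)
    table.update(updates)


# Simulate three scheduler steps; only the deltas cross the boundary.
scheduler_view = {}
bootstrap_view = {}
for step in ({"h1": 0, "h2": 1}, {"h2": 1, "h3": 2}, {"h3": 7}):
    updates, removals = hash_table_delta(scheduler_view, step)
    apply_delta(bootstrap_view, updates, removals)
    scheduler_view = dict(step)
```

Sending only the changed entries keeps the per-step metadata small when most cached blocks are unchanged between steps.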