agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	5816aad731	A3: vLLM scheduler patch for step-level JSONL log When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line per scheduler step with t_unix, worker_id, prefill/decode token counts, n_running/n_waiting, preempted ids, and per-request phase labels. No-op when the env var is unset, so production engines are not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to each per-engine launch so step logs end up at engine_${i}.jsonl. Required by Batch 2 (PD-colo interference index) and Batch 5 (same-worker overlap attribution); engine /metrics polling cannot provide per-step granularity. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:11 +08:00
Gahow Wang	e13391eeab	Evict migrated blocks from prefix cache after KV send completes After a session migrates from C to D via offload, C's blocks were freed to the LRU tail (most-recently-used position), making them the last to be evicted. Since the session won't return to C, these blocks are dead weight occupying cache capacity. Now capture block IDs before _free_blocks and call evict_blocks to remove them from the prefix cache hash table, so they can be reused sooner for active sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:56:34 +08:00
Gahow Wang	9cebdb6b9b	Fix multi-turn replay fidelity: track realized output tokens across all components The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements. - replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction - proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate - bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy - mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 14:47:51 +08:00
Gahow Wang	cc4a9c91e7	Fix estimate_hit: reuse _lookup_by_tokens instead of reimplementing hash The standalone hash computation in estimate_hit produced different hashes than the hash_table (synced from scheduler). Root cause unclear (possibly pickle serialization differences or hash chain state). Fix: delegate to _lookup_by_tokens which is proven to work (push_blocks uses it). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 12:41:53 +08:00
Gahow Wang	cdf83493ab	Fix A+C: real cache sync + cached-prefill-on-C architecture A: Add /estimate_hit endpoint to bootstrap server for real-time cache probing. Proxy queries this before committing to PUSH, eliminating 24% zero-match PUSH requests (shadow cache divergence). C: Add _handle_cached_prefill_offload: C (cache source) does fast cached prefill → KV to Mooncake → D pulls and decodes. Replaces broken direct_read PUSH where D waited for RDMA transfer while occupying KV blocks without doing compute. Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:22:38 +08:00
Gahow Wang	e3a1d70cf2	Switch from RDMA READ to bootstrap-triggered PUSH RDMA READ (batch_transfer_sync_read) fails on GPU memory because batch_register_memory only sets IBV_ACCESS_REMOTE_WRITE. New approach: D sends /push_blocks to C's bootstrap with token_ids + D's GPU addresses. C's bootstrap: 1. Looks up matching blocks in synced hash table (640/640 verified) 2. Uses C's TransferEngine.batch_transfer_sync_write to PUSH blocks directly into D's GPU memory 3. Returns match count + push status C's scheduler is still NOT involved (0 GPU compute on C). The push uses C's worker thread + existing RDMA WRITE path (proven reliable). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:47:49 +08:00
Gahow Wang	0bb6a67ed3	Fix: use sha256 (default) not sha256_cbor for block hash computation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:36:05 +08:00
Gahow Wang	08d5e12838	Fix NONE_HASH import: use module ref instead of from-import (value binding bug) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:32:19 +08:00
Gahow Wang	7e91b83d88	Set PYTHONHASHSEED=42 for elastic mode to ensure consistent block hashes Root cause confirmed: NONE_HASH = os.urandom(32) differs between scheduler and bootstrap server even in the same process (init_none_hash called separately by each import path). PYTHONHASHSEED makes it deterministic: NONE_HASH = hash_fn(seed), same across all code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:27:52 +08:00
Gahow Wang	ee2301ae17	Fix: token lookup condition should check hash_table not block_pool Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:21:49 +08:00
Gahow Wang	0c88609caa	Fix: use synced hash table + sha256_cbor for token-based lookup (same process NONE_HASH) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:18:47 +08:00
Gahow Wang	0500350849	Fix hash mismatch: token-based lookup instead of cross-instance hash matching Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32)) when PYTHONHASHSEED is not set. All block hashes are chained from NONE_HASH, so D's hashes never match C's hashes. Fix: C's bootstrap server now accepts token_ids and does the prefix cache lookup locally using C's own hash function and block pool. No cross-instance hash matching needed. New flow: D sends prompt token_ids → C computes hashes on C's side → C looks up in C's own BlockPool → returns block_ids. Also: module-level _shared_block_pool for scheduler→bootstrap bridge, prompt_token_ids passed through PullReqMeta, test script added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:14:33 +08:00
Gahow Wang	a1f30e5fce	Add hash_table_sync logging + gap analysis Root cause of 0 cache hits on offloaded requests identified: - Hash table sync IS working (scheduler→metadata→worker→bootstrap) - But D's query_blocks returns no matches → hash format mismatch between D's request.block_hashes and C's synced hashes The gap: offloaded TTFT (12.4s) ≈ co-located TTFT (12.0s) because D does FULL cold prefill (cache_hit=0), not partial prefill with RDMA-read cached blocks. Next: debug hash format mismatch between D and C. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 00:38:14 +08:00
Gahow Wang	29b901b145	Fix scheduler assertion crash on partial remote prefill finished_recving The assertion `assert RequestStatus.is_finished(req.status)` at scheduler.py:2109 fires when a partial-remote-prefill request receives `finished_recving` while in RUNNING state (local prefill already started before RDMA read completed). This was the root cause of 67% error rate: EngineCore crashed with "fatal error" assertion, killing the vLLM instance. Fix: Replace assertion with debug log for non-WAITING, non-finished requests. kv_both no-offload baseline confirmed 0 errors, proving the crash was from our scheduler patch, not kv_both instability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 23:33:26 +08:00
Gahow Wang	23788f7cd5	Fix: import field from dataclasses for PullReqMeta Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:29:24 +08:00
Gahow Wang	a7df84bd3b	Direct RDMA read: D reads cached KV from C's GPU without C's scheduler Complete implementation of direct RDMA read for KV cache migration: vLLM Mooncake connector (mooncake_connector.py): - PullReqMeta: add direct_read flag + block_hashes - MooncakeConnectorMetadata: add hash_table_updates/removals for scheduler->worker block hash sync - MooncakeConnectorScheduler: set_block_pool() to access BlockPool, build_connector_meta() computes hash table deltas each step, update_state_after_alloc() captures request block hashes for direct_read - MooncakeConnectorWorker: _start_direct_read() + _direct_read_single() implements D-side RDMA read via batch_transfer_sync_read, with HTTP query/unpin to C's bootstrap server Bootstrap server (mooncake_utils.py): - POST /query_blocks: look up block hashes, return block_ids + GPU layout - POST /unpin_blocks: release pin tracking - set_worker_kv_info(): register GPU addresses at init - update_hash_table(): receive scheduler deltas each step Scheduler (scheduler.py): - One-line hookup: pass block_pool to connector after KVCacheManager init Proxy (cache_aware_proxy.py): - _handle_direct_read_offload: sends request ONLY to D with direct_read=True + remote_bootstrap_addr. No request to C at all. - C's scheduler is completely uninvolved (0 GPU time on C) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:02:13 +08:00
Gahow Wang	ea5149726c	Partial remote prefill: C_s exports cache, D computes new tokens locally vLLM Mooncake patch: - get_num_new_matched_tokens: support remote_num_tokens parameter for partial remote prefill (pull N tokens from remote, compute rest locally) - update_state_after_alloc: only allocate receive blocks for external portion Proxy _handle_heavy_offload rewrite: - Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute) - Step 2: D pulls cached blocks + does local prefill for new tokens + decodes - C_s's blocks auto-freed by Mooncake delay_free after D confirms receipt This enables true session migration: C_s releases cache, D takes over. C_s's GPU is freed immediately (no compute), vs old approach where C_s had to do full prefill (1-15s GPU occupancy). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 20:04:13 +08:00
Gahow Wang	3bc37cc6d5	PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix Experiments run: - Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise) - PS V1 (cold prefill): REJECTED — PS always slower than cached C - PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s) — PS bottleneck - V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal - H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD TTFT=11.5s due to RDMA. HEAVY_COLO improved 10.5% from better balance. - H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s) Key findings: - Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead - 92% of HEAVY are turn-1 cold → offloading cold requests always loses - GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT) - RDMA transfer negates the routing benefit for offloaded requests Code changes: - bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark) - cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing - mooncake_connector.py: elif→if fix (allow dual prefill+decode flags) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 02:14:37 +08:00
Gahow Wang	445e491123	Add vLLM v0.18.1 source tree with KV transfer abort fix third_party/vllm/ now tracked in git for direct patch management. Based on vLLM v0.18.1 release with one patch applied: vllm/v1/core/sched/scheduler.py: Replace fatal assert with graceful skip when KV transfer callback arrives for an already-aborted request during PD disaggregated serving. Future vLLM modifications should be made directly in third_party/vllm/ and committed normally. The patches/ directory is kept as documentation of what changed from upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:30:38 +08:00
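
Note on the hash mismatch described in commits 0500350849 and 7e91b83d88: the following is a minimal sketch, not vLLM's actual hashing code, of why block hashes chained from a per-instance random seed can never match across instances, and why a shared seed fixes it. The function name chain_block_hashes, the 4-byte token encoding, the block size of 16, and the way the shared seed is derived are all assumptions made for illustration.

```python
import hashlib
import os


def chain_block_hashes(token_ids, block_size, seed):
    """Hash each full block of tokens, chaining from the previous block's hash.

    Hypothetical helper: `seed` plays the role the commit messages give to
    NONE_HASH, the starting value of the chain.
    """
    hashes = []
    parent = seed
    # Only full blocks are hashed; a trailing partial block is skipped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"".join(t.to_bytes(4, "little") for t in block)
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


tokens = list(range(40))  # 40 tokens -> two full blocks of 16

# Two instances with independent random seeds: same tokens, different hashes.
hashes_c = chain_block_hashes(tokens, 16, os.urandom(32))
hashes_d = chain_block_hashes(tokens, 16, os.urandom(32))
mismatch = all(c != d for c, d in zip(hashes_c, hashes_d))

# A shared deterministic seed (assumed derivation) makes the chains identical.
shared_seed = hashlib.sha256(b"42").digest()
same = (chain_block_hashes(tokens, 16, shared_seed)
        == chain_block_hashes(tokens, 16, shared_seed))
```

Because every hash depends on its parent, a differing seed changes every block hash in the chain, which is consistent with the token-based lookup in 0500350849: sending token_ids and letting C hash them locally avoids comparing hashes across instances at all.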

19 Commits
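
Note on the step-level log described in commit 5816aad731: a minimal sketch, assuming nothing about the actual scheduler patch, of an env-gated JSONL writer that appends one line per scheduler step and does nothing when AGENTIC_STEP_LOG_PATH is unset. The function name log_scheduler_step and any record keys other than t_unix, worker_id, n_running and n_waiting are hypothetical.

```python
import json
import os
import time


def log_scheduler_step(record, env_var="AGENTIC_STEP_LOG_PATH"):
    """Append one JSON line for a scheduler step.

    Returns False without touching the filesystem when the env var is unset,
    so an engine that never sets it pays only for one dict lookup per step.
    """
    path = os.environ.get(env_var)
    if not path:
        return False
    # t_unix is stamped here; the caller supplies the remaining fields.
    line = json.dumps({"t_unix": time.time(), **record})
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return True
```

A call such as log_scheduler_step({"worker_id": 0, "n_running": 3, "n_waiting": 1}) would then add one line to the file named by the variable. Opening the file on every step keeps the sketch short; a real patch would more likely hold the handle open for the lifetime of the scheduler.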