agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	448361cf83	Update design doc: final results + review findings Unified routing (baseline mode) beats LMetric E2E mean/p50/p90. PD-sep offload consistently degrades performance (5-134 offloads tested). Independent review: fair comparison, no reward hacking, needs multi-run significance verification (running 3x paired test). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 03:48:18 +08:00
Gahow Wang	4c583f2f1c	Revert relaxed gate + push_cost fix: 134 offloads destroyed performance PD-sep offload overhead (C queue + prefill + KV transfer + D schedule) far exceeds any load balance benefit. With relaxed gate, cost model triggered 134 offloads → E2E p90 went from 37s to 82s. The proven winning configuration is Unified routing in baseline mode (no Mooncake connector), which beats LMetric on E2E mean/p50/p90 purely through better routing (contention-aware + session affinity). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 03:38:59 +08:00
Gahow Wang	bf4469a150	Fix cost model: accurate push_cost + aligned hard gate 1. push_cost now models both C and D: max(c_cost, d_cost) where c_cost includes C's queue + prefill, d_cost includes D's queue + RDMA overhead. Old formula only had D's contention + RDMA. 2. Hard gate uses num_requests instead of ongoing_tokens, aligning with the contention-based cost model. 3. Fix migration_discount: min(cap, 5) instead of hardcoded min(cap, 3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 01:01:03 +08:00
Gahow Wang	1d2148cf65	Remove second push_new gate that caused downgrade-to-cold-LOCAL After _push_allowed was relaxed, the cost model correctly chose push for high-cache sessions on overloaded instances. But a second gate at execution time (push_new < heavy_threshold) blocked the actual offload, downgrading to LOCAL on the target instance — which had no cache. Worse, session affinity was already updated to the target, so all subsequent turns also hit cold prefill. This was the root cause of relaxed gate's performance regression: affinity broken + push blocked = worst of both worlds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:42:31 +08:00
Gahow Wang	3ae99293fd	Relax _push_allowed: gate on request size, not cache savings The old gate blocked offload when push_new (= input - cache_hit) < 20K, which prevented migration of high-cache sessions — exactly the ones that benefit most. After PD-sep, the target receives full KV via RDMA and has the same cache as the source, so cache_hit is irrelevant to the offload decision. New gate: only check input_length >= heavy_threshold (request must be HEAVY) and max_offload_inflight (concurrency cap). Let the cost model decide whether the contention difference justifies migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:03:28 +08:00
Gahow Wang	cc6e5625bb	Revert Approach B (session migration): overhead exceeds LB benefit Reverts 3 commits: `e991960`, `5772149`, `5b1d360`. 57 migrations triggered but PD-sep overhead (C queue + KV transfer + D cold start) caused HEAVY TTFT p90 to regress from 15.9s to 59.1s. Migration mechanism needs fundamental rework before it can help. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 23:43:47 +08:00
Gahow Wang	5b1d36080a	Fix B2 migration: correct offload call signature (c_inst/d_inst order + cache_hit arg) The session migration path was calling _handle_cached_prefill_offload with swapped c_inst/d_inst and missing cache_hit parameter, causing TypeError on every migration attempt (13 of 41 errors in the test run). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 22:46:46 +08:00
Gahow Wang	5772149d36	Approach B v2: TTFT-based migration trigger Replace num_requests threshold with recent TTFT median as migration trigger. Track per-instance rolling TTFT (last 8 requests) and trigger migration when median > 5s (configurable). Target is the instance with lowest recent TTFT, requiring > 2x improvement to justify migration. This is more responsive than the instantaneous num_requests signal because TTFT directly measures the user-facing impact of contention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 21:54:06 +08:00
Gahow Wang	45b82272c3	Add migration policy design doc with A/B experiment results Approach A (contention-aware cost model): TTFT p90 -52% vs baseline. Approach B (session migration): 0 triggers at 1.5x threshold — needs tuning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 18:24:49 +08:00
Gahow Wang	e9919605af	Approach B: session-level lazy migration trigger When a request arrives for a session on an overloaded instance, force migration if three conditions hold: 1. Instance busy: num_requests > avg * migration_request_factor (1.5x) 2. Session has cache value: cache_ratio > 50% 3. Request is HEAVY (>= heavy_threshold) 4. A meaningfully less-loaded target exists (num_requests gap > 2) This bypasses the cost model for migration decisions — the cost model's cache-inflated costs prevented migration even when instances had 150s queue times with 99% cache hit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:34:06 +08:00
Gahow Wang	e06de5144b	Approach A: contention-aware cost model with migration discount Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:24:27 +08:00
Gahow Wang	e13391eeab	Evict migrated blocks from prefix cache after KV send completes After a session migrates from C to D via offload, C's blocks were freed to the LRU tail (most-recently-used position), making them the last to be evicted. Since the session won't return to C, these blocks are dead weight occupying cache capacity. Now capture block IDs before _free_blocks and call evict_blocks to remove them from the prefix cache hash table, so they can be reused sooner for active sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:56:34 +08:00
Gahow Wang	4b50c5a08d	Fix unified cost model: include decode load in queue + hard overload gate Two bugs caused elastic to concentrate load on cached instances (10x token imbalance vs 2.7x baseline): 1. _instance_cost queue only counted pending_prefill_tokens, missing ongoing_decode_tokens entirely — instances with 50 decoding requests appeared idle to the cost model. 2. Cache hits made overloaded instances look "cheap", creating a positive feedback loop: more sessions → more cache → lower cost → more routing. Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks affinity before the cost model runs, matching linear policy behavior. Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:25:02 +08:00
Gahow Wang	9cebdb6b9b	Fix multi-turn replay fidelity: track realized output tokens across all components The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements. - replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction - proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate - bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy - mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 14:47:51 +08:00
Gahow Wang	cc4a9c91e7	Fix estimate_hit: reuse _lookup_by_tokens instead of reimplementing hash The standalone hash computation in estimate_hit produced different hashes than the hash_table (synced from scheduler). Root cause unclear (possibly pickle serialization differences or hash chain state). Fix: delegate to _lookup_by_tokens which is proven to work (push_blocks uses it). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 12:41:53 +08:00
Gahow Wang	657812f8c4	Add deploy_vllm_patches.sh: sync third_party/vllm patches to site-packages Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from third_party/vllm to the pip-installed vllm's site-packages. C extensions stay from the pip package; only Python files are overridden. Usage: bash scripts/deploy_vllm_patches.sh [HOST] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:59:52 +08:00
Gahow Wang	bf76273778	Add --offload-mode switch for ablation (direct_read vs cached_prefill) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:24:15 +08:00
Gahow Wang	cdf83493ab	Fix A+C: real cache sync + cached-prefill-on-C architecture A: Add /estimate_hit endpoint to bootstrap server for real-time cache probing. Proxy queries this before committing to PUSH, eliminating 24% zero-match PUSH requests (shadow cache divergence). C: Add _handle_cached_prefill_offload: C (cache source) does fast cached prefill → KV to Mooncake → D pulls and decodes. Replaces broken direct_read PUSH where D waited for RDMA transfer while occupying KV blocks without doing compute. Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:22:38 +08:00
Gahow Wang	2b9eae0d54	Report §3.9: Unified routing final results — TTFT -25%, E2E -7% 850/850, 0 errors. Single argmin(latency) with soft affinity. 116 PUSH_MIGRATE (all with cache, avg 25k tokens), 723 LOCAL. TPOT p90 +15% tradeoff from kv_both overhead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 03:15:32 +08:00
Gahow Wang	97f4fe5164	Fix: rename inst->chosen in generate function (NameError crash) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:55:01 +08:00
Gahow Wang	5892739159	Add session affinity as soft preference in unified routing Without affinity, all cached requests route to the same instance (cache source always has lowest prefill cost), causing 149s queue. Fix: if the session's last instance has cost <= 2x the global best, use it (preserves cache locality). Only re-route when the affinity instance is significantly more expensive (overloaded). The 2x threshold is intentionally loose — it's not a hardcoded magic number but a "prefer locality unless clearly worse" heuristic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:37:58 +08:00
Gahow Wang	6b255fad91	Unified routing: single argmin(expected_latency) over all instances Replace two-phase routing (pick_instance → offload gate) with a single cost function evaluated per instance: latency(D) = queue(D) + prefill_time(D) + transfer_cost(D) - If D has local cache: prefill = (input - local_hit) / throughput - If D can receive PUSH from cache source: prefill = (input - push_hit) / throughput + rdma - Otherwise: prefill = input / throughput (cold) Choose argmin(latency). If the winner needs PUSH → trigger migration. Removed: - WARM/MEDIUM/HEAVY classification (no routing purpose) - heavy_threshold, overload_factor, max_offload_inflight, cache_gate_ratio - Interference penalty magic number (0.3) - Separate pick_instance + offload gate stages Only 2 measured parameters remain: - prefill_throughput = 7000 tokens/s (H20 measured) - rdma_overhead_s = 0.1s (RDMA PUSH measured) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:21:34 +08:00
Gahow Wang	1cd0a18e2c	Report §3.8: Document direct KV cache migration architecture + bugs fixed Complete documentation of bootstrap-triggered PUSH implementation: hash table sync, token-based lookup, RDMA WRITE path, cost model, PYTHONHASHSEED requirement, and all 6 bugs fixed during development. Verified: 640/640 blocks pushed, External APC 80%, TTFT 0.367s (vs local cache 0.338s, +0.03s overhead). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:52:38 +08:00
Gahow Wang	8c267ec54e	VERIFIED: Bootstrap-triggered PUSH works end-to-end! Test results: - 640/640 blocks matched and pushed (ret=0) - External prefix cache hit rate: 80.0% on D - Turn 2 TTFT: inst_0 (cached) = 0.338s, inst_1 (RDMA push) = 0.367s - C's scheduler was NOT involved (0 GPU compute on C) The complete direct KV cache migration pipeline is working: D → /push_blocks → C bootstrap matches tokens → C RDMA WRITE → D GPU Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:50:37 +08:00
Gahow Wang	e3a1d70cf2	Switch from RDMA READ to bootstrap-triggered PUSH RDMA READ (batch_transfer_sync_read) fails on GPU memory because batch_register_memory only sets IBV_ACCESS_REMOTE_WRITE. New approach: D sends /push_blocks to C's bootstrap with token_ids + D's GPU addresses. C's bootstrap: 1. Looks up matching blocks in synced hash table (640/640 verified) 2. Uses C's TransferEngine.batch_transfer_sync_write to PUSH blocks directly into D's GPU memory 3. Returns match count + push status C's scheduler is still NOT involved (0 GPU compute on C). The push uses C's worker thread + existing RDMA WRITE path (proven reliable). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:47:49 +08:00
Gahow Wang	6716a3401a	Progress: hash matching FIXED (640/640), RDMA read returns -1 Hash mismatch root cause: sha256_cbor vs sha256 (default) + NONE_HASH from-import value binding. Both fixed. Now 640/640 blocks matched. RDMA read (batch_transfer_sync_read) fails with ret=-1. Likely cause: Mooncake TransferEngine may not support RDMA READ to arbitrary registered memory without explicit permission setup. The PUSH path (batch_transfer_sync_write) works because the sender initiates, but PULL may need additional RDMA MR access flags. Next: investigate Mooncake's RDMA read permission model, or fall back to a two-step approach: D sends query → C responds with blocks via batch_transfer_sync_write (existing PUSH path), but triggered by the bootstrap server instead of the scheduler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:40:52 +08:00
Gahow Wang	0bb6a67ed3	Fix: use sha256 (default) not sha256_cbor for block hash computation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:36:05 +08:00
Gahow Wang	08d5e12838	Fix NONE_HASH import: use module ref instead of from-import (value binding bug) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:32:19 +08:00
Gahow Wang	7e91b83d88	Set PYTHONHASHSEED=42 for elastic mode to ensure consistent block hashes Root cause confirmed: NONE_HASH = os.urandom(32) differs between scheduler and bootstrap server even in the same process (init_none_hash called separately by each import path). PYTHONHASHSEED makes it deterministic: NONE_HASH = hash_fn(seed), same across all code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:27:52 +08:00
Gahow Wang	ee2301ae17	Fix: token lookup condition should check hash_table not block_pool Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:21:49 +08:00
Gahow Wang	0c88609caa	Fix: use synced hash table + sha256_cbor for token-based lookup (same process NONE_HASH) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:18:47 +08:00
Gahow Wang	0500350849	Fix hash mismatch: token-based lookup instead of cross-instance hash matching Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32)) when PYTHONHASHSEED is not set. All block hashes are chained from NONE_HASH, so D's hashes never match C's hashes. Fix: C's bootstrap server now accepts token_ids and does the prefix cache lookup locally using C's own hash function and block pool. No cross-instance hash matching needed. New flow: D sends prompt token_ids → C computes hashes on C's side → C looks up in C's own BlockPool → returns block_ids. Also: module-level _shared_block_pool for scheduler→bootstrap bridge, prompt_token_ids passed through PullReqMeta, test script added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:14:33 +08:00
Gahow Wang	a1f30e5fce	Add hash_table_sync logging + gap analysis Root cause of 0 cache hits on offloaded requests identified: - Hash table sync IS working (scheduler→metadata→worker→bootstrap) - But D's query_blocks returns no matches → hash format mismatch between D's request.block_hashes and C's synced hashes The gap: offloaded TTFT (12.4s) ≈ co-located TTFT (12.0s) because D does FULL cold prefill (cache_hit=0), not partial prefill with RDMA-read cached blocks. Next: debug hash format mismatch between D and C. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 00:38:14 +08:00
Gahow Wang	1cf03c6e79	Cost model: add interference penalty for co-located heavy prefill Old cost model: offload_cost = colocated_cost + RDMA_overhead, so offload was always 0.1s more expensive. Result: only 19/117 HEAVY offloaded. New: colocated_cost includes interference penalty when C_s has decode requests: penalty = prefill_time × min(num_requests, 3) × 0.3. Offload now wins when C_s has 1+ concurrent request. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 23:59:06 +08:00
Gahow Wang	29b901b145	Fix scheduler assertion crash on partial remote prefill finished_recving The assertion `assert RequestStatus.is_finished(req.status)` at scheduler.py:2109 fires when a partial-remote-prefill request receives `finished_recving` while in RUNNING state (local prefill already started before RDMA read completed). This was the root cause of 67% error rate: EngineCore crashed with "fatal error" assertion, killing the vLLM instance. Fix: Replace assertion with debug log for non-WAITING, non-finished requests. kv_both no-offload baseline confirmed 0 errors, proving the crash was from our scheduler patch, not kv_both instability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 23:33:26 +08:00
Gahow Wang	4f93bb5b8a	Report §3.8: Direct RDMA read results — HEAVY TTFT -70%, TPOT p90 -38% D reads C's cached KV blocks via batch_transfer_sync_read, bypassing C's scheduler entirely. 65/318 HEAVY requests offloaded. HEAVY_OFFLOAD TTFT: 3.40s vs HEAVY_COLO 11.21s (-70%) Overall TPOT p90: 0.100 vs baseline 0.162 (-38%) kv_both mode has 67.5% error rate (Mooncake instability), but 276 successful requests show strong performance improvement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 22:56:16 +08:00
Gahow Wang	a7514fc3d5	Fix retry syntax: async generator can't use return, use break+try/finally Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 22:37:32 +08:00
Gahow Wang	daeb95eca0	RDMA overhead 2.0→0.1s (direct read is raw memory, not scheduler flow) + retry on ConnectError to handle kv_both connection instability With RDMA_overhead=0.1s, offload triggers when C_s has just 700 tokens pending (0.1s queue), vs 38k tokens (5.4s) with the old 2.0s estimate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 22:33:10 +08:00
Gahow Wang	5c66f500fc	Fix offload gate: remove cache_gate for direct RDMA read, fix cost model The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%) because they were cold (cache_ratio=0). But with direct RDMA read, D reads C's cached blocks via RDMA regardless of cache ratio — the gate was protecting against the OLD flow (C does prefill + push). Also fixed cost model: offload_cost now reflects direct read reality: OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive) NEW: D_queue + RDMA_read + D_local_prefill(new_tokens) Offload wins when C_s queue > RDMA_overhead (~2s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 22:01:43 +08:00
Gahow Wang	23788f7cd5	Fix: import field from dataclasses for PullReqMeta Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:29:24 +08:00
Gahow Wang	1dea82f2ff	launch_phase1_ps: parameterise project + model paths (B6 followup)	2026-05-23 21:14:15 +08:00
Gahow Wang	52a54e44af	proxy: split session_affinity per mode + vLLM patch self-check (M4, S2) - Replace the global session_affinity dict with two namespace-isolated ones (combined / prefill) so a session_id never indexes the wrong instance list across mode switches. Keep `session_affinity` as a read-only alias to the combined dict for any existing tooling. - Add a startup _verify_vllm_patch() that scans vllm.v1.core.sched.scheduler.Scheduler for the original `assert req_id in self.requests` line. If the patch was not re-applied after a vLLM upgrade we now print a loud warning at lifespan startup instead of dying mid-experiment on a KV-transfer abort race.	2026-05-23 21:12:56 +08:00
Gahow Wang	c843f2e3db	proxy: Settings dataclass + cache-ratio gate + P-pick offload penalty (B4, M2, M3, D5) - Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/ MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/ CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton. __main__ now mutates SETTINGS so CLI overrides survive even when the module is imported as a library (e.g. by tests/) (D5). - Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS. - Add --cache-gate-ratio CLI flag and a real gate before the cost-model branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and fall back to colocated. cache_ratio is no longer a write-only field (B4). - P candidate selection penalises instances already running offloaded HEAVY prefills, so back-to-back HEAVY requests don't pile onto the same P (M2). - bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the proxy. - Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload penalty.	2026-05-23 21:11:17 +08:00
Gahow Wang	0701f84c00	tests: add minimal coverage for percentile + proxy routing (S1) - tests/test_metrics.py asserts the new linear-interp _percentile against hand-computed expected values (single value, two-value interpolation, endpoints, numpy-equivalent linear default, on-integer rank). - tests/test_proxy_pick.py exercises InstanceState LRU eviction and move-to-end on hit, plus session-affinity stickiness, the overload fallback, the active_p_offloads penalty, and lmetric scoring. The proxy is loaded by file path with stub fastapi/uvicorn/httpx modules so the suite runs without the FastAPI server deps installed. - pyproject.toml gets a hatchling wheel target and a [tool.pytest] section so `uv run --extra dev pytest` works out of the box.	2026-05-23 21:07:14 +08:00
Gahow Wang	7c7f8b951a	replayer: wire --max-inflight-sessions cap into replay loop (B2) Trace-driven dispatch is preserved by default (semaphore=None when the flag is not set), but operators can now cap concurrent sessions to reproduce session-admission scenarios from earlier sweeps without artificial time compression.	2026-05-23 21:04:09 +08:00
Gahow Wang	2c7f7fdaae	replayer: restore optional max_inflight_sessions for backwards compat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:02:26 +08:00
Gahow Wang	a7df84bd3b	Direct RDMA read: D reads cached KV from C's GPU without C's scheduler Complete implementation of direct RDMA read for KV cache migration: vLLM Mooncake connector (mooncake_connector.py): - PullReqMeta: add direct_read flag + block_hashes - MooncakeConnectorMetadata: add hash_table_updates/removals for scheduler->worker block hash sync - MooncakeConnectorScheduler: set_block_pool() to access BlockPool, build_connector_meta() computes hash table deltas each step, update_state_after_alloc() captures request block hashes for direct_read - MooncakeConnectorWorker: _start_direct_read() + _direct_read_single() implements D-side RDMA read via batch_transfer_sync_read, with HTTP query/unpin to C's bootstrap server Bootstrap server (mooncake_utils.py): - POST /query_blocks: look up block hashes, return block_ids + GPU layout - POST /unpin_blocks: release pin tracking - set_worker_kv_info(): register GPU addresses at init - update_hash_table(): receive scheduler deltas each step Scheduler (scheduler.py): - One-line hookup: pass block_pool to connector after KVCacheManager init Proxy (cache_aware_proxy.py): - _handle_direct_read_offload: sends request ONLY to D with direct_read=True + remote_bootstrap_addr. No request to C at all. - C's scheduler is completely uninvolved (0 GPU time on C) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:02:13 +08:00
Gahow Wang	020be9f444	proxy: real LRU for cached_blocks + shadow-state reconcile loop (M1, M5) M1: cached_blocks was a plain set with a "trim half via list slicing" eviction. CPython does not guarantee set iteration order, so the trim discarded an arbitrary half of the entries — completely unlike vLLM's LRU and a known contributor to the router's cache_hit estimate diverging from real APC. Replace with an OrderedDict-backed LRU: move_to_end on hits, popitem(last=False) on overflow. Capacity exposed as CACHE_CAPACITY_BLOCKS module constant (200000 by default). M5: streamed responses decrement load counters in their generator's finally block. If a client disconnects before consuming the body the generator is never entered and the decrement is lost, causing ongoing_tokens / num_requests / pending_prefill_tokens to drift negative under load. Add a 60s background reconcile_loop that clamps those counters at zero as a safety net. Started in lifespan, cancelled on shutdown. Does not replace proper vLLM exact-state syncing.	2026-05-23 21:00:35 +08:00
Gahow Wang	0ed1ce200e	metrics: replace round-based percentile with linear interpolation (B5) The previous implementation used round((n-1) * pct), which under Python's banker's rounding returned the upper-middle element on every even-length array (e.g. p50 of [1,2,3,4] returned 3 instead of 2.5). All summary JSONs were biased upward at p50 as a result. Match numpy.percentile's default linear interpolation between the two adjacent sorted values.	2026-05-23 21:00:24 +08:00
Gahow Wang	0958823cdb	REPORT: add §1.1 errata flagging superseded sections (S3) Calls out that §3.1 (old random sampler, time-scale compression, 1 req/GPU cap) and the early elastic v3 warm-vs-fresh runs are no longer current, and that the "--max-inflight-sessions 64+" next-step text refers to a flag that was removed and must be restored per FIXES.md §B2 before those numbers can be reproduced. Points readers at §3.6/§3.7 as authoritative.	2026-05-23 20:58:38 +08:00

... 2 3 4 5 6

253 Commits