Approach A (contention-aware cost model): TTFT p90 -52% vs baseline.
Approach B (session migration): 0 triggers at 1.5x threshold — needs tuning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a request arrives for a session on an overloaded instance, force
migration if all four conditions hold:
1. Instance busy: num_requests > avg * migration_request_factor (1.5x)
2. Session has cache value: cache_ratio > 50%
3. Request is HEAVY (>= heavy_threshold)
4. A meaningfully less-loaded target exists (num_requests gap > 2)
This bypasses the cost model for migration decisions — the cost model's
cache-inflated costs prevented migration even when instances had 150s
queue times with 99% cache hit.
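
A minimal sketch of the four-condition gate, for illustration only. The
Instance class, should_force_migrate, and the min_gap parameter are
hypothetical names, not the proxy's actual API; the thresholds (1.5x, 50%,
gap > 2) are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    num_requests: int


def should_force_migrate(current, instances, cache_ratio, input_tokens,
                         heavy_threshold, migration_request_factor=1.5,
                         min_gap=2):
    """Return the migration target, or None if the session should stay."""
    avg = sum(i.num_requests for i in instances) / len(instances)
    # 1. Instance busy: load must exceed the average by the given factor.
    if current.num_requests <= avg * migration_request_factor:
        return None
    # 2. Session has cache value worth moving.
    if cache_ratio <= 0.5:
        return None
    # 3. Only HEAVY requests justify the migration cost.
    if input_tokens < heavy_threshold:
        return None
    # 4. A meaningfully less-loaded target must exist.
    target = min(instances, key=lambda i: i.num_requests)
    if current.num_requests - target.num_requests <= min_gap:
        return None
    return target
```

Note that no cost is computed anywhere in this path, which is the point of
the bypass: cache hits cannot talk the gate out of migrating.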
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After a session migrates from C to D via offload, C's blocks were freed
to the LRU tail (most-recently-used position), making them the last to
be evicted. Since the session won't return to C, these blocks are dead
weight occupying cache capacity.
Now capture block IDs before _free_blocks and call evict_blocks to
remove them from the prefix cache hash table, so they can be reused
sooner for active sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs caused elastic to concentrate load on cached instances (10x token
imbalance vs 2.7x baseline):
1. _instance_cost queue only counted pending_prefill_tokens, missing
ongoing_decode_tokens entirely — instances with 50 decoding requests
appeared idle to the cost model.
2. Cache hits made overloaded instances look "cheap", creating a positive
feedback loop: more sessions → more cache → lower cost → more routing.
Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
affinity before the cost model runs, matching linear policy behavior.
Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.
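
A rough sketch of the two pieces described above, under stated
assumptions: the function names are hypothetical, summing prefill and
decode tokens is one plausible way to address bug 1 (the message does not
spell out the exact fix), and the overload_factor value is left to the
caller because it is not given here.

```python
def instance_queue_tokens(pending_prefill_tokens, ongoing_decode_tokens):
    """Queue estimate that no longer ignores decoding requests (bug 1)."""
    return pending_prefill_tokens + ongoing_decode_tokens


def is_overloaded(ongoing_tokens, all_ongoing_tokens, overload_factor):
    """Hard gate for bug 2: evaluated before the cost model runs.

    Because the gate looks only at load, a high cache hit rate cannot make
    an overloaded instance look cheap again.
    """
    avg = sum(all_ongoing_tokens) / len(all_ongoing_tokens)
    return ongoing_tokens > avg * overload_factor
```

For example, with ongoing tokens of [9000, 1000, 2000] and an assumed
overload_factor of 2.0, the first instance (9000 > 4000 * 2.0) loses its
affinity claim regardless of how much of the session it has cached.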
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates different output tokens. Subsequent turns had wrong
prefix tokens, causing cache misses and invalid experimental measurements.
- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
track failed_recving_block_ids for error recovery
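
To illustrate the next-turn correction, here is a hedged sketch. The
function name and signature are hypothetical and do not reproduce
_apply_realized_prefix; it assumes a trace turn's prompt is the previous
turn's prompt plus output followed by new tokens.

```python
def next_turn_prompt(trace_prompt, trace_prefix_len, realized_prefix):
    """Swap the trace's prefix for the tokens the model actually saw/produced.

    trace_prompt:     token IDs of the next turn as recorded in the trace
    trace_prefix_len: how many leading tokens came from the previous turn
                      (trace prompt + trace output)
    realized_prefix:  previous prompt + the output the model really generated
    """
    new_suffix = trace_prompt[trace_prefix_len:]
    return list(realized_prefix) + list(new_suffix)


# Turn 1 in the trace: prompt [1, 2, 3], output [4, 5].
# The model actually generated [8, 9], so the realized prefix differs.
turn2 = next_turn_prompt([1, 2, 3, 4, 5, 6, 7], 5, [1, 2, 3, 8, 9])
```

With min_tokens=max_tokens the realized output has the same length as the
trace output, so the corrected prompt keeps the trace's length while its
prefix matches what is actually in the KV cache.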
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The standalone hash computation in estimate_hit produced different hashes
than the hash_table (synced from scheduler). Root cause unclear (possibly
pickle serialization differences or hash chain state). Fix: delegate to
_lookup_by_tokens which is proven to work (push_blocks uses it).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from
third_party/vllm to the pip-installed vllm's site-packages. C extensions
stay from the pip package; only Python files are overridden.
Usage: bash scripts/deploy_vllm_patches.sh [HOST]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A: Add /estimate_hit endpoint to bootstrap server for real-time cache
probing. Proxy queries this before committing to PUSH, eliminating
24% zero-match PUSH requests (shadow cache divergence).
C: Add _handle_cached_prefill_offload: C (cache source) does fast
cached prefill → KV to Mooncake → D pulls and decodes.
Replaces broken direct_read PUSH where D waited for RDMA transfer
while occupying KV blocks without doing compute.
Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without affinity, all cached requests route to the same instance (the
cache source always has the lowest prefill cost), causing a 149s queue.
Fix: if the session's last instance has cost <= 2x the global best,
use it (preserves cache locality). Only re-route when the affinity
instance is significantly more expensive (overloaded).
The 2x threshold is intentionally loose — it's not a hardcoded magic
number but a "prefer locality unless clearly worse" heuristic.
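
The heuristic can be sketched as follows. pick_with_affinity and the
slack parameter are illustrative names, and costs is assumed to be a dict
mapping instance name to its estimated cost.

```python
def pick_with_affinity(costs, last_instance, slack=2.0):
    """Prefer the session's last instance unless it is clearly worse."""
    best = min(costs, key=costs.get)
    if last_instance in costs and costs[last_instance] <= slack * costs[best]:
        return last_instance  # close enough: keep cache locality
    return best  # affinity instance is more than `slack` times the best
```

For example, with costs {"a": 1.0, "b": 1.8} a session last served by "b"
stays on "b"; if "b" rises to 2.5 it is re-routed to "a".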
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test results:
- 640/640 blocks matched and pushed (ret=0)
- External prefix cache hit rate: 80.0% on D
- Turn 2 TTFT: inst_0 (cached) = 0.338s, inst_1 (RDMA push) = 0.367s
- C's scheduler was NOT involved (0 GPU compute on C)
The complete direct KV cache migration pipeline is working:
D → /push_blocks → C bootstrap matches tokens → C RDMA WRITE → D GPU
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RDMA READ (batch_transfer_sync_read) fails on GPU memory because
batch_register_memory only sets IBV_ACCESS_REMOTE_WRITE.
New approach: D sends /push_blocks to C's bootstrap with token_ids
+ D's GPU addresses. C's bootstrap:
1. Looks up matching blocks in synced hash table (640/640 verified)
2. Uses C's TransferEngine.batch_transfer_sync_write to PUSH blocks
directly into D's GPU memory
3. Returns match count + push status
C's scheduler is still NOT involved (0 GPU compute on C).
The push uses C's worker thread + existing RDMA WRITE path (proven reliable).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hash mismatch root cause: sha256_cbor vs sha256 (default) + NONE_HASH
from-import value binding. Both fixed. Now 640/640 blocks matched.
RDMA read (batch_transfer_sync_read) fails with ret=-1.
Likely cause: Mooncake TransferEngine may not support RDMA READ
to arbitrary registered memory without explicit permission setup.
The PUSH path (batch_transfer_sync_write) works because the sender
initiates, but PULL may need additional RDMA MR access flags.
Next: investigate Mooncake's RDMA read permission model, or
fall back to a two-step approach: D sends query → C responds
with blocks via batch_transfer_sync_write (existing PUSH path),
but triggered by the bootstrap server instead of the scheduler.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause confirmed: NONE_HASH = os.urandom(32) differs between
scheduler and bootstrap server even in the same process (init_none_hash
called separately by each import path). PYTHONHASHSEED makes it
deterministic: NONE_HASH = hash_fn(seed), same across all code paths.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32))
when PYTHONHASHSEED is not set. All block hashes are chained from
NONE_HASH, so D's hashes never match C's hashes.
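
Why a different NONE_HASH breaks every match can be shown with a toy hash
chain. This is an illustration only: chain_hashes is a hypothetical helper
built on hashlib.sha256, not vLLM's actual block hashing.

```python
import hashlib


def chain_hashes(token_ids, block_size, none_hash):
    """Hash each full block together with its parent hash, seeded by none_hash."""
    hashes = []
    parent = none_hash
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


tokens = list(range(10))
c_hashes = chain_hashes(tokens, 4, b"\x00" * 32)  # stands in for C's NONE_HASH
d_hashes = chain_hashes(tokens, 4, b"\x01" * 32)  # stands in for D's NONE_HASH
```

Identical tokens yield disjoint hash lists because the seed feeds the first
block and every later block inherits the difference through its parent.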
Fix: C's bootstrap server now accepts token_ids and does the prefix
cache lookup locally using C's own hash function and block pool.
No cross-instance hash matching needed.
New flow: D sends prompt token_ids → C computes hashes on C's side →
C looks up in C's own BlockPool → returns block_ids.
Also: module-level _shared_block_pool for scheduler→bootstrap bridge,
prompt_token_ids passed through PullReqMeta, test script added.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of 0 cache hits on offloaded requests identified:
- Hash table sync IS working (scheduler→metadata→worker→bootstrap)
- But D's query_blocks returns no matches → hash format mismatch
between D's request.block_hashes and C's synced hashes
The gap: offloaded TTFT (12.4s) ≈ co-located TTFT (12.0s) because
D does FULL cold prefill (cache_hit=0), not partial prefill with
RDMA-read cached blocks.
Next: debug hash format mismatch between D and C.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Old cost model: offload_cost = colocated_cost + RDMA_overhead, so offload
was always 0.1s more expensive. Result: only 19/117 HEAVY offloaded.
New: colocated_cost includes interference penalty when C_s has decode
requests: penalty = prefill_time × min(num_requests, 3) × 0.3.
Offload now wins when C_s has 1+ concurrent request.
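
A sketch of the comparison, assuming both sides share the same queue and
prefill base as in the old formula and using the 0.1s RDMA overhead from
the old model. Function names are illustrative.

```python
def colocated_cost(queue_s, prefill_s, num_requests):
    """New colocated cost: adds an interference penalty for decode traffic."""
    penalty = prefill_s * min(num_requests, 3) * 0.3
    return queue_s + prefill_s + penalty


def offload_cost(queue_s, prefill_s, rdma_overhead_s=0.1):
    """Offload cost: same base work plus the RDMA transfer overhead."""
    return queue_s + prefill_s + rdma_overhead_s
```

Under these assumptions offload wins whenever
prefill_time * min(num_requests, 3) * 0.3 exceeds the RDMA overhead. For a
1.0s prefill with one concurrent request the penalty is 0.3s against 0.1s
of overhead, so offload is cheaper; with zero concurrent requests the
penalty vanishes and colocated wins, as before.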
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The assertion `assert RequestStatus.is_finished(req.status)` at
scheduler.py:2109 fires when a partial-remote-prefill request
receives `finished_recving` while in RUNNING state (local prefill
already started before RDMA read completed).
This was the root cause of the 67% error rate: the failed assertion
crashed EngineCore with a "fatal error", killing the vLLM instance.
Fix: Replace assertion with debug log for non-WAITING, non-finished
requests. kv_both no-offload baseline confirmed 0 errors, proving
the crash was from our scheduler patch, not kv_both instability.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Retry on ConnectError to handle kv_both connection instability.
With RDMA_overhead=0.1s, offload triggers when C_s has just 700 tokens
pending (0.1s queue), vs 38k tokens (5.4s) with the old 2.0s estimate.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%)
because they were cold (cache_ratio=0). But with direct RDMA read,
D reads C's cached blocks via RDMA regardless of cache ratio — the
gate was protecting against the OLD flow (C does prefill + push).
Also fixed cost model: offload_cost now reflects direct read reality:
OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)
Offload wins when C_s queue > RDMA_overhead (~2s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace the global session_affinity dict with two namespace-isolated
ones (combined / prefill) so a session_id never indexes the wrong
instance list across mode switches. Keep `session_affinity` as a
read-only alias to the combined dict for any existing tooling.
- Add a startup _verify_vllm_patch() that scans
vllm.v1.core.sched.scheduler.Scheduler for the original
`assert req_id in self.requests` line. If the patch was not
re-applied after a vLLM upgrade we now print a loud warning at
lifespan startup instead of dying mid-experiment on a KV-transfer
abort race.
- Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/
MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/
CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton.
__main__ now mutates SETTINGS so CLI overrides survive even when the
module is imported as a library (e.g. by tests/) (D5).
- Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS.
- Add --cache-gate-ratio CLI flag and a real gate before the cost-model
branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and
fall back to colocated. cache_ratio is no longer a write-only field
(B4).
- P candidate selection penalises instances already running offloaded
HEAVY prefills, so back-to-back HEAVY requests don't pile onto the
same P (M2).
- bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the
proxy.
- Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload
penalty.
- tests/test_metrics.py asserts the new linear-interp _percentile against
hand-computed expected values (single value, two-value interpolation,
endpoints, numpy-equivalent linear default, on-integer rank).
- tests/test_proxy_pick.py exercises InstanceState LRU eviction and
move-to-end on hit, plus session-affinity stickiness, the overload
fallback, the active_p_offloads penalty, and lmetric scoring. The
proxy is loaded by file path with stub fastapi/uvicorn/httpx modules
so the suite runs without the FastAPI server deps installed.
- pyproject.toml gets a hatchling wheel target and a [tool.pytest]
section so `uv run --extra dev pytest` works out of the box.
Trace-driven dispatch is preserved by default (semaphore=None when the
flag is not set), but operators can now cap concurrent sessions to
reproduce session-admission scenarios from earlier sweeps without
artificial time compression.
Complete implementation of direct RDMA read for KV cache migration:
vLLM Mooncake connector (mooncake_connector.py):
- PullReqMeta: add direct_read flag + block_hashes
- MooncakeConnectorMetadata: add hash_table_updates/removals for
scheduler->worker block hash sync
- MooncakeConnectorScheduler: set_block_pool() to access BlockPool,
build_connector_meta() computes hash table deltas each step,
update_state_after_alloc() captures request block hashes for direct_read
- MooncakeConnectorWorker: _start_direct_read() + _direct_read_single()
implements D-side RDMA read via batch_transfer_sync_read, with
HTTP query/unpin to C's bootstrap server
Bootstrap server (mooncake_utils.py):
- POST /query_blocks: look up block hashes, return block_ids + GPU layout
- POST /unpin_blocks: release pin tracking
- set_worker_kv_info(): register GPU addresses at init
- update_hash_table(): receive scheduler deltas each step
Scheduler (scheduler.py):
- One-line hookup: pass block_pool to connector after KVCacheManager init
Proxy (cache_aware_proxy.py):
- _handle_direct_read_offload: sends request ONLY to D with
direct_read=True + remote_bootstrap_addr. No request to C at all.
- C's scheduler is completely uninvolved (0 GPU time on C)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
M1: cached_blocks was a plain set with a "trim half via list slicing"
eviction. CPython does not guarantee set iteration order, so the trim
discarded an arbitrary half of the entries — completely unlike vLLM's
LRU and a known contributor to the router's cache_hit estimate
diverging from real APC. Replace with an OrderedDict-backed LRU:
move_to_end on hits, popitem(last=False) on overflow. Capacity exposed
as CACHE_CAPACITY_BLOCKS module constant (200000 by default).
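
A minimal sketch of the OrderedDict-backed LRU described in M1. The
BlockLRU class and its touch method are illustrative names rather than
the proxy's actual code; the default capacity is the 200000 stated above.

```python
from collections import OrderedDict


class BlockLRU:
    def __init__(self, capacity=200_000):
        self.capacity = capacity
        self._blocks = OrderedDict()  # oldest entry first, newest last

    def touch(self, block_hash):
        """Record a block access; return True if it was already cached."""
        hit = block_hash in self._blocks
        if hit:
            self._blocks.move_to_end(block_hash)  # refresh recency
        else:
            self._blocks[block_hash] = None
            while len(self._blocks) > self.capacity:
                self._blocks.popitem(last=False)  # evict least recently used
        return hit

    def __contains__(self, block_hash):
        return block_hash in self._blocks

    def __len__(self):
        return len(self._blocks)
```

Unlike the set-based trim, the evicted entry is always the one touched
longest ago, so the estimate degrades in the same order a real LRU would.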
M5: streamed responses decrement load counters in their generator's
finally block. If a client disconnects before consuming the body the
generator is never entered and the decrement is lost, causing
ongoing_tokens / num_requests / pending_prefill_tokens to drift
negative under load. Add a 60s background reconcile_loop that clamps
those counters at zero as a safety net. Started in lifespan, cancelled
on shutdown. Does not replace proper vLLM exact-state syncing.
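
The safety net in M5 could look roughly like this. The Counters dataclass
and clamp_counters helper are hypothetical stand-ins for the per-instance
state; only the counter names and the 60s interval come from the text
above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Counters:
    ongoing_tokens: int = 0
    num_requests: int = 0
    pending_prefill_tokens: int = 0


def clamp_counters(c):
    """Reset any negative counter to zero; return how many were fixed."""
    fixed = 0
    for field in ("ongoing_tokens", "num_requests", "pending_prefill_tokens"):
        if getattr(c, field) < 0:
            setattr(c, field, 0)
            fixed += 1
    return fixed


async def reconcile_loop(instances, interval_s=60.0):
    """Background task: clamp drifted counters until cancelled."""
    while True:
        await asyncio.sleep(interval_s)
        for c in instances:
            clamp_counters(c)
```

The task would be created with asyncio.create_task at startup and
cancelled on shutdown. It only hides the symptom of drift; it cannot
recover the true load.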
The previous implementation used round((n-1) * pct). Python's round() uses
banker's rounding, so at p50 on an even-length array it returned one of
the two middle elements instead of interpolating (e.g. p50 of [1,2,3,4]
returned 3 instead of 2.5). p50 in the summary JSONs was skewed as a
result. Match numpy.percentile's default linear interpolation between the
two adjacent sorted values.
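
A sketch of linear-interpolation percentiles next to the old rounding
rule, assuming pct is a fraction in [0, 1]. The function names are
illustrative and do not reproduce the repo's _percentile.

```python
import math


def percentile(values, pct):
    """Linear interpolation between the two nearest sorted values."""
    if not values:
        raise ValueError("percentile of empty sequence")
    s = sorted(values)
    rank = (len(s) - 1) * pct
    lo, hi = math.floor(rank), math.ceil(rank)
    if lo == hi:
        return float(s[lo])  # rank lands exactly on an element
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)


def old_percentile(values, pct):
    """Previous behaviour: round the rank, so no interpolation happens."""
    s = sorted(values)
    return s[round((len(s) - 1) * pct)]
```

For [1, 2, 3, 4] at pct=0.5 the rank is 1.5: the old rule rounds it to
index 2 and returns 3, while interpolation returns 2.5.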
Calls out that §3.1 (old random sampler, time-scale compression, 1 req/GPU
cap) and the early elastic v3 warm-vs-fresh runs are no longer current,
and that the "--max-inflight-sessions 64+" next-step text refers to a
flag that was removed and must be restored per FIXES.md §B2 before those
numbers can be reproduced. Points readers at §3.6/§3.7 as authoritative.
The hardcoded traces/sampled_1000req_seed42.jsonl no longer exists; switch
the default to the current sampled trace file w600_r0.0015_st30.jsonl and
let users override via --trace. Skip Part 4 cleanly when the file is
missing instead of relying on os.path.exists.
D2: run_benchmark.sh and run_experiments.sh still pass --time-scale and
--max-inflight-sessions to the replayer, but those flags were removed when
the project moved to trace-driven dispatch. The scripts cannot run as-is.
D3: ~25 ad-hoc analyze_* / compare_* / profile_* / final_* scripts and a
handful of single-experiment run_*.sh point at /home/admin/cpfs paths,
deleted output directories, or a sampled trace file that no longer exists.
Keep them in scripts/legacy/ for historical reference; the scripts that
remain in scripts/ (analyze_trace, analyze_breakdown, analyze_cache_hit,
analyze_eviction, compare_results, compute_roofline, sample_trace,
analyze_agentic_patterns, simulate_cache_policies, plus launch_*.sh,
gpu_monitor.sh, bench.sh) cover the current workflow.
Adds scripts/legacy/README.md to document the archival policy.
The default MODEL pointed at /home/admin/cpfs/... which never existed on
the public dev machines (other launch_*.sh and TODO.md use $HOME/models),
and the default TRACE pointed at traces/sampled_1000req_seed42.jsonl
which was deleted when the sampler moved to window+thin output. Update
both to the values the rest of the repo already standardized on.
B1: _inst_cumulative_tokens was written by pick_instance but never read
anywhere; delete the variable, global declaration, and per-call increment.
Load is already tracked via inst.ongoing_tokens.
D1: _send_prefill_async + the --fire-and-forget branch were unreachable
in practice (no launch/bench script enabled the flag) and broken even if
exercised: D-decode would fire before P registered the transfer_id,
guaranteeing a Mooncake 502. Collapse _handle_pd_sep to its synchronous
path and drop the CLI flag.
Captures the full review of bugs, fake/half-implemented features, dead
branches, and quality gaps found in cache_aware_proxy.py, replayer, and
the shell scripts. Each item has file:line, problem, fix, and verification
steps so any contributor can pick it up directly.
Added EXIT/INT/TERM traps to ensure vLLM, proxy, and gpu_monitor
processes are cleaned up even when bench.sh is killed externally.
Also includes gpu_monitor in cleanup_gpu pattern matching.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM Mooncake patch:
- get_num_new_matched_tokens: support remote_num_tokens parameter for
partial remote prefill (pull N tokens from remote, compute rest locally)
- update_state_after_alloc: only allocate receive blocks for external portion
Proxy _handle_heavy_offload rewrite:
- Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute)
- Step 2: D pulls cached blocks + does local prefill for new tokens + decodes
- C_s's blocks auto-freed by Mooncake delay_free after D confirms receipt
This enables true session migration: C_s releases cache, D takes over.
C_s's GPU is freed immediately (no compute), vs old approach where C_s
had to do full prefill (1-15s GPU occupancy).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Old gate: cache_ratio >= 0.3 (static, only 14% of HEAVY triggered)
New gate: offload when offload_cost < colocated_cost, where:
colocated_cost = queue(C_s) + prefill(new_tokens)
offload_cost = queue(P_idle) + prefill(P_tokens) + RDMA_overhead
Key changes:
- P is now least-loaded instance (not session-sticky C_s)
- Gate considers C_s queue depth dynamically
- Crossover: offload wins when C_s queue >= 38k tokens (~5.4s)
- Cold HEAVY requests CAN be offloaded if C_s is busy enough
- P accounting uses P's actual cache hit, not C_s's
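
The gate can be sketched as below. The function name is hypothetical,
queue and prefill are both modelled as tokens divided by a single prefill
throughput, and the throughput and RDMA overhead are left as parameters
because this message does not state them.

```python
def choose_placement(cs_queue_tokens, new_tokens, p_queue_tokens, p_tokens,
                     throughput_tok_s, rdma_overhead_s):
    """Return "offload" when the offload estimate beats the colocated one.

    cs_queue_tokens: tokens queued on the session-sticky instance C_s
    new_tokens:      tokens C_s would still have to prefill (after cache hit)
    p_queue_tokens:  tokens queued on the least-loaded instance P
    p_tokens:        tokens P would have to prefill, using P's own cache hit
    """
    colocated = (cs_queue_tokens + new_tokens) / throughput_tok_s
    offload = (p_queue_tokens + p_tokens) / throughput_tok_s + rdma_overhead_s
    return "offload" if offload < colocated else "colocated"
```

Because C_s's queue depth enters the colocated side directly, a cold
request (large p_tokens) can still be offloaded once C_s is busy enough,
and an idle C_s keeps the request regardless of the cache ratio.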
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>