When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line
per scheduler step with t_unix, worker_id, prefill/decode token
counts, n_running/n_waiting, preempted ids, and per-request phase
labels. No-op when the env var is unset, so production engines are
not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to
each per-engine launch so step logs end up at engine_${i}.jsonl.
Required by Batch 2 (PD-colo interference index) and Batch 5
(same-worker overlap attribution); engine /metrics polling cannot
provide per-step granularity.
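A minimal sketch of the env-gated JSONL emission described above. It is not the scheduler's actual code: the helper names `make_step_record` and `emit_step_log` and the field names other than `t_unix`, `worker_id`, `n_running`, and `n_waiting` are assumptions made for illustration.

```python
import json
import os
import time


def make_step_record(worker_id, prefill_tokens, decode_tokens,
                     n_running, n_waiting, preempted, phases):
    """Build one per-step record (field names partly hypothetical)."""
    return {
        "t_unix": time.time(),
        "worker_id": worker_id,
        "prefill_tokens": prefill_tokens,   # assumed field name
        "decode_tokens": decode_tokens,     # assumed field name
        "n_running": n_running,
        "n_waiting": n_waiting,
        "preempted": list(preempted),       # assumed field name
        "phases": dict(phases),             # request id -> phase label (assumed)
    }


def emit_step_log(record, env_var="AGENTIC_STEP_LOG_PATH"):
    """Append the record as one JSON line; do nothing if the env var is unset."""
    path = os.environ.get(env_var)
    if not path:
        return False  # no-op path: nothing is opened or written
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True
```

Opening the file in append mode on every step keeps the sketch simple; a real scheduler would more likely hold one file handle open for the engine's lifetime.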
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After a session migrated from C to D via offload, C's blocks were freed
to the LRU tail (most-recently-used position), making them the last to
be evicted. Since the session won't return to C, these blocks are dead
weight occupying cache capacity.
Now capture block IDs before _free_blocks and call evict_blocks to
remove them from the prefix cache hash table, so they can be reused
sooner for active sessions.
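The capture-then-evict ordering can be illustrated with a toy block pool. This is a sketch under stated assumptions, not vLLM's BlockPool: `ToyBlockPool` and `release_migrated_session` are hypothetical names, and the toy only shows that the hash entries disappear after eviction. It does not model how the real free queue is reordered.

```python
from collections import OrderedDict


class ToyBlockPool:
    """Free queue ordered from head (reused first) to tail (reused last)."""

    def __init__(self, num_blocks):
        self.free_queue = OrderedDict((b, None) for b in range(num_blocks))
        self.hash_to_block = {}
        self.block_to_hash = {}

    def allocate(self):
        block_id, _ = self.free_queue.popitem(last=False)
        self.evict([block_id])  # a reused block loses any old hash entry
        return block_id

    def cache(self, block_id, block_hash):
        self.hash_to_block[block_hash] = block_id
        self.block_to_hash[block_id] = block_hash

    def free(self, block_ids):
        for b in block_ids:
            self.free_queue[b] = None  # appended at the tail, hash entry kept

    def evict(self, block_ids):
        for b in block_ids:
            h = self.block_to_hash.pop(b, None)
            if h is not None:
                self.hash_to_block.pop(h, None)


def release_migrated_session(pool, session_blocks):
    """Capture IDs first, because freeing drops the session's block list."""
    block_ids = list(session_blocks)
    pool.free(session_blocks)
    session_blocks.clear()
    pool.evict(block_ids)  # remove dead entries from the prefix-cache table
    return block_ids
```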
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates output tokens that differ from the trace. Subsequent
turns therefore had wrong prefix tokens, causing cache misses and invalid
experimental measurements.
- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
  to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
  prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
  track failed_recving_block_ids for error recovery
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The standalone hash computation in estimate_hit produced hashes that did
not match those in the hash_table (synced from the scheduler). Root cause
unclear (possibly pickle serialization differences or hash chain state).
Fix: delegate to _lookup_by_tokens, which is known to work (push_blocks
uses it).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A: Add /estimate_hit endpoint to bootstrap server for real-time cache
probing. Proxy queries this before committing to PUSH, eliminating the
24% of PUSH requests that matched zero blocks (shadow cache divergence).
C: Add _handle_cached_prefill_offload: C (cache source) does fast
cached prefill → KV to Mooncake → D pulls and decodes.
Replaces broken direct_read PUSH where D waited for RDMA transfer
while occupying KV blocks without doing compute.
Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RDMA READ (batch_transfer_sync_read) fails on GPU memory because
batch_register_memory only sets IBV_ACCESS_REMOTE_WRITE.
New approach: D sends /push_blocks to C's bootstrap with token_ids
+ D's GPU addresses. C's bootstrap:
1. Looks up matching blocks in synced hash table (640/640 verified)
2. Uses C's TransferEngine.batch_transfer_sync_write to PUSH blocks
directly into D's GPU memory
3. Returns match count + push status
C's scheduler is still NOT involved (0 GPU compute on C).
The push uses C's worker thread + existing RDMA WRITE path (proven reliable).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause confirmed: NONE_HASH = os.urandom(32) differs between
scheduler and bootstrap server even in the same process (init_none_hash
is called separately by each import path). Setting PYTHONHASHSEED makes
it deterministic: NONE_HASH = hash_fn(seed), same across all code paths.
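The effect of a random versus seeded NONE_HASH can be reproduced with a toy hash chain. This is a sketch only: SHA-256 stands in for hash_fn, and `chain_block_hashes`, `seeded_none_hash`, and the block size of 4 are illustrative assumptions rather than vLLM's actual hashing code.

```python
import hashlib
import os


def chain_block_hashes(token_ids, none_hash, block_size=4):
    """Hash each full block together with its parent hash, starting at none_hash."""
    hashes = []
    parent = none_hash
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def seeded_none_hash(seed):
    """Deterministic root hash derived from a seed string (stand-in for hash_fn(seed))."""
    return hashlib.sha256(seed.encode()).digest()


tokens = list(range(10))

# Two independent random roots: every chained hash differs for the same tokens.
random_a = chain_block_hashes(tokens, os.urandom(32))
random_b = chain_block_hashes(tokens, os.urandom(32))

# Two roots derived from the same seed: the chains are identical.
seeded_a = chain_block_hashes(tokens, seeded_none_hash("0"))
seeded_b = chain_block_hashes(tokens, seeded_none_hash("0"))
```

Because every block hash includes its parent, a mismatch at the root propagates to all blocks, which is why no lookup can succeed across two randomly initialized roots.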
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32))
when PYTHONHASHSEED is not set. All block hashes are chained from
NONE_HASH, so D's hashes never match C's hashes.
Fix: C's bootstrap server now accepts token_ids and does the prefix
cache lookup locally using C's own hash function and block pool.
No cross-instance hash matching needed.
New flow: D sends prompt token_ids → C computes hashes on C's side →
C looks up in C's own BlockPool → returns block_ids.
Also: module-level _shared_block_pool for scheduler→bootstrap bridge,
prompt_token_ids passed through PullReqMeta, test script added.
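The new flow can be sketched as a token-based lookup that runs entirely on the cache source. The class `ToyCacheSource`, its methods, and the block size are hypothetical, and SHA-256 stands in for C's real hash function. The point is only that D sends token IDs and never needs to know C's root hash.

```python
import hashlib
import os


class ToyCacheSource:
    """Stand-in for C: owns a private random root hash and a hash -> block_id table."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self._none_hash = os.urandom(32)  # never shared with other instances
        self._hash_to_block = {}

    def _hashes(self, token_ids):
        parent = self._none_hash
        for i in range(len(token_ids) // self.block_size):
            block = token_ids[i * self.block_size:(i + 1) * self.block_size]
            parent = hashlib.sha256(parent + repr(block).encode()).digest()
            yield parent

    def cache_prompt(self, token_ids, first_block_id=0):
        """Record the full blocks of a prompt as cached, with consecutive block IDs."""
        for offset, h in enumerate(self._hashes(token_ids)):
            self._hash_to_block.setdefault(h, first_block_id + offset)

    def lookup_by_tokens(self, token_ids):
        """Return block IDs for the longest cached prefix, stopping at the first miss."""
        block_ids = []
        for h in self._hashes(token_ids):
            block_id = self._hash_to_block.get(h)
            if block_id is None:
                break
            block_ids.append(block_id)
        return block_ids
```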
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of 0 cache hits on offloaded requests identified:
- Hash table sync IS working (scheduler→metadata→worker→bootstrap)
- But D's query_blocks returns no matches → hash format mismatch
between D's request.block_hashes and C's synced hashes
The gap: offloaded TTFT (12.4s) ≈ co-located TTFT (12.0s) because
D does FULL cold prefill (cache_hit=0), not partial prefill with
RDMA-read cached blocks.
Next: debug hash format mismatch between D and C.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The assertion `assert RequestStatus.is_finished(req.status)` at
scheduler.py:2109 fires when a partial-remote-prefill request
receives `finished_recving` while in RUNNING state (local prefill
already started before RDMA read completed).
This was the root cause of the 67% error rate: EngineCore crashed with a
"fatal error" assertion, killing the vLLM instance.
Fix: Replace assertion with debug log for non-WAITING, non-finished
requests. kv_both no-offload baseline confirmed 0 errors, proving
the crash was from our scheduler patch, not kv_both instability.
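The shape of the fix can be sketched as a three-way branch. This is not the code in scheduler.py: the `Status` enum, the function `on_finished_recving`, and the returned action strings are all hypothetical, and the real scheduler has more request states than shown here.

```python
import enum
import logging

logger = logging.getLogger("toy_scheduler")


class Status(enum.Enum):
    WAITING = "waiting"      # still waiting for remote KV blocks
    RUNNING = "running"      # local prefill already started
    FINISHED = "finished"    # finished or aborted


def on_finished_recving(req_id, status):
    """Decide what to do when a KV-receive completion arrives for a request."""
    if status is Status.WAITING:
        return "mark_ready"
    if status is Status.FINISHED:
        return "free_blocks"
    # Previously an assertion required FINISHED here and crashed the engine
    # when the request was already RUNNING. Now it is logged and skipped.
    logger.debug("finished_recving for %s in state %s, ignoring", req_id, status)
    return "ignore"
```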
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete implementation of direct RDMA read for KV cache migration:
vLLM Mooncake connector (mooncake_connector.py):
- PullReqMeta: add direct_read flag + block_hashes
- MooncakeConnectorMetadata: add hash_table_updates/removals for
  scheduler->worker block hash sync
- MooncakeConnectorScheduler: set_block_pool() to access BlockPool,
  build_connector_meta() computes hash table deltas each step,
  update_state_after_alloc() captures request block hashes for direct_read
- MooncakeConnectorWorker: _start_direct_read() + _direct_read_single()
  implements D-side RDMA read via batch_transfer_sync_read, with
  HTTP query/unpin to C's bootstrap server
Bootstrap server (mooncake_utils.py):
- POST /query_blocks: look up block hashes, return block_ids + GPU layout
- POST /unpin_blocks: release pin tracking
- set_worker_kv_info(): register GPU addresses at init
- update_hash_table(): receive scheduler deltas each step
Scheduler (scheduler.py):
- One-line hookup: pass block_pool to connector after KVCacheManager init
Proxy (cache_aware_proxy.py):
- _handle_direct_read_offload: sends request ONLY to D with
  direct_read=True + remote_bootstrap_addr. No request to C at all.
- C's scheduler is completely uninvolved (0 GPU time on C)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vLLM Mooncake patch:
- get_num_new_matched_tokens: support remote_num_tokens parameter for
  partial remote prefill (pull N tokens from remote, compute rest locally)
- update_state_after_alloc: only allocate receive blocks for external portion
Proxy _handle_heavy_offload rewrite:
- Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute)
- Step 2: D pulls cached blocks + does local prefill for new tokens + decodes
- C_s's blocks auto-freed by Mooncake delay_free after D confirms receipt
This enables true session migration: C_s releases cache, D takes over.
C_s's GPU is freed immediately (no compute), unlike the old approach,
where C_s had to do a full prefill (1-15s GPU occupancy).
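The split between pulled and locally computed tokens can be sketched as simple arithmetic. This is an illustration, not the patched connector code: `split_prefill`, `PrefillSplit`, block-aligned pulls, and a block-aligned local cache are all assumptions.

```python
from typing import NamedTuple


class PrefillSplit(NamedTuple):
    external_tokens: int   # tokens pulled from the remote instance
    recv_blocks: int       # receive blocks allocated for the external portion
    compute_tokens: int    # tokens prefilled locally on D


def split_prefill(num_prompt_tokens, num_local_cached, remote_num_tokens, block_size):
    """Split a prompt into remote-pulled and locally computed portions.

    Assumes only whole blocks are pulled and num_local_cached is block-aligned.
    """
    remote_end = min(remote_num_tokens, num_prompt_tokens)
    remote_end -= remote_end % block_size            # round down to full blocks
    external = max(0, remote_end - num_local_cached)
    covered = max(remote_end, num_local_cached)      # tokens that need no compute
    return PrefillSplit(
        external_tokens=external,
        recv_blocks=external // block_size,
        compute_tokens=num_prompt_tokens - covered,
    )
```

For example, `split_prefill(1000, 0, 640, 16)` pulls 640 tokens into 40 receive blocks and leaves 360 tokens for local prefill.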
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
third_party/vllm/ now tracked in git for direct patch management.
Based on the vLLM v0.18.1 release, with one patch applied:
vllm/v1/core/sched/scheduler.py:
Replace fatal assert with graceful skip when KV transfer callback
arrives for an already-aborted request during PD disaggregated serving.
Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>