20 Commits

Author SHA1 Message Date
657cd36f3d Gate evict_sent_blocks behind VLLM_EVICT_SENT_BLOCKS
Fork commit e13391e unconditionally evicts sent blocks from the prefix
cache on every KV transfer. That is correct only for session MIGRATION
(source won't see the session again); for plain PD-disagg producer->
consumer transfers it destroys cross-turn producer reuse and contaminates
PD reuse experiments. Default OFF; enable for migration runs via
VLLM_EVICT_SENT_BLOCKS=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:18:59 +08:00
5816aad731 A3: vLLM scheduler patch for step-level JSONL log
When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line
per scheduler step with t_unix, worker_id, prefill/decode token
counts, n_running/n_waiting, preempted ids, and per-request phase
labels. No-op when the env var is unset, so production engines are
not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to
each per-engine launch so step logs end up at engine_${i}.jsonl.

Required by Batch 2 (PD-colo interference index) and Batch 5
(same-worker overlap attribution); engine /metrics polling cannot
provide per-step granularity.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:19:11 +08:00
e13391eeab Evict migrated blocks from prefix cache after KV send completes
After a session migrates from C to D via offload, C's blocks were freed
to the LRU tail (most-recently-used position), making them the last to
be evicted. Since the session won't return to C, these blocks are dead
weight occupying cache capacity.

Now capture block IDs before _free_blocks and call evict_blocks to
remove them from the prefix cache hash table, so they can be reused
sooner for active sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 16:56:34 +08:00
9cebdb6b9b Fix multi-turn replay fidelity: track realized output tokens across all components
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates different output tokens. Subsequent turns had wrong
prefix tokens, causing cache misses and invalid experimental measurements.

- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
  to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
  prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
  track failed_recving_block_ids for error recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 14:47:51 +08:00
cc4a9c91e7 Fix estimate_hit: reuse _lookup_by_tokens instead of reimplementing hash
The standalone hash computation in estimate_hit produced different hashes
than the hash_table (synced from scheduler). Root cause unclear (possibly
pickle serialization differences or hash chain state). Fix: delegate to
_lookup_by_tokens which is proven to work (push_blocks uses it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 12:41:53 +08:00
cdf83493ab Fix A+C: real cache sync + cached-prefill-on-C architecture
A: Add /estimate_hit endpoint to bootstrap server for real-time cache
   probing. Proxy queries this before committing to PUSH, eliminating
   24% zero-match PUSH requests (shadow cache divergence).

C: Add _handle_cached_prefill_offload: C (cache source) does fast
   cached prefill → KV to Mooncake → D pulls and decodes.
   Replaces broken direct_read PUSH where D waited for RDMA transfer
   while occupying KV blocks without doing compute.

Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 11:22:38 +08:00
e3a1d70cf2 Switch from RDMA READ to bootstrap-triggered PUSH
RDMA READ (batch_transfer_sync_read) fails on GPU memory because
batch_register_memory only sets IBV_ACCESS_REMOTE_WRITE.

New approach: D sends /push_blocks to C's bootstrap with token_ids
+ D's GPU addresses. C's bootstrap:
1. Looks up matching blocks in synced hash table (640/640 verified)
2. Uses C's TransferEngine.batch_transfer_sync_write to PUSH blocks
   directly into D's GPU memory
3. Returns match count + push status

C's scheduler is still NOT involved (0 GPU compute on C).
The push uses C's worker thread + existing RDMA WRITE path (proven reliable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:47:49 +08:00
0bb6a67ed3 Fix: use sha256 (default) not sha256_cbor for block hash computation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:36:05 +08:00
08d5e12838 Fix NONE_HASH import: use module ref instead of from-import (value binding bug)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:32:19 +08:00
7e91b83d88 Set PYTHONHASHSEED=42 for elastic mode to ensure consistent block hashes
Root cause confirmed: NONE_HASH = os.urandom(32) differs between
scheduler and bootstrap server even in the same process (init_none_hash
called separately by each import path). PYTHONHASHSEED makes it
deterministic: NONE_HASH = hash_fn(seed), same across all code paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:27:52 +08:00
ee2301ae17 Fix: token lookup condition should check hash_table not block_pool
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:21:49 +08:00
0c88609caa Fix: use synced hash table + sha256_cbor for token-based lookup (same process NONE_HASH)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:18:47 +08:00
0500350849 Fix hash mismatch: token-based lookup instead of cross-instance hash matching
Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32))
when PYTHONHASHSEED is not set. All block hashes are chained from
NONE_HASH, so D's hashes never match C's hashes.

Fix: C's bootstrap server now accepts token_ids and does the prefix
cache lookup locally using C's own hash function and block pool.
No cross-instance hash matching needed.

New flow: D sends prompt token_ids → C computes hashes on C's side →
C looks up in C's own BlockPool → returns block_ids.

Also: module-level _shared_block_pool for scheduler→bootstrap bridge,
prompt_token_ids passed through PullReqMeta, test script added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:14:33 +08:00
a1f30e5fce Add hash_table_sync logging + gap analysis
Root cause of 0 cache hits on offloaded requests identified:
- Hash table sync IS working (scheduler→metadata→worker→bootstrap)
- But D's query_blocks returns no matches → hash format mismatch
  between D's request.block_hashes and C's synced hashes

The gap: offloaded TTFT (12.4s) ≈ co-located TTFT (12.0s) because
D does FULL cold prefill (cache_hit=0), not partial prefill with
RDMA-read cached blocks.

Next: debug hash format mismatch between D and C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 00:38:14 +08:00
29b901b145 Fix scheduler assertion crash on partial remote prefill finished_recving
The assertion `assert RequestStatus.is_finished(req.status)` at
scheduler.py:2109 fires when a partial-remote-prefill request
receives `finished_recving` while in RUNNING state (local prefill
already started before RDMA read completed).

This was the root cause of 67% error rate: EngineCore crashed with
"fatal error" assertion, killing the vLLM instance.

Fix: Replace assertion with debug log for non-WAITING, non-finished
requests. kv_both no-offload baseline confirmed 0 errors, proving
the crash was from our scheduler patch, not kv_both instability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 23:33:26 +08:00
23788f7cd5 Fix: import field from dataclasses for PullReqMeta
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:29:24 +08:00
a7df84bd3b Direct RDMA read: D reads cached KV from C's GPU without C's scheduler
Complete implementation of direct RDMA read for KV cache migration:

vLLM Mooncake connector (mooncake_connector.py):
- PullReqMeta: add direct_read flag + block_hashes
- MooncakeConnectorMetadata: add hash_table_updates/removals for
  scheduler->worker block hash sync
- MooncakeConnectorScheduler: set_block_pool() to access BlockPool,
  build_connector_meta() computes hash table deltas each step,
  update_state_after_alloc() captures request block hashes for direct_read
- MooncakeConnectorWorker: _start_direct_read() + _direct_read_single()
  implements D-side RDMA read via batch_transfer_sync_read, with
  HTTP query/unpin to C's bootstrap server

Bootstrap server (mooncake_utils.py):
- POST /query_blocks: look up block hashes, return block_ids + GPU layout
- POST /unpin_blocks: release pin tracking
- set_worker_kv_info(): register GPU addresses at init
- update_hash_table(): receive scheduler deltas each step

Scheduler (scheduler.py):
- One-line hookup: pass block_pool to connector after KVCacheManager init

Proxy (cache_aware_proxy.py):
- _handle_direct_read_offload: sends request ONLY to D with
  direct_read=True + remote_bootstrap_addr. No request to C at all.
- C's scheduler is completely uninvolved (0 GPU time on C)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:02:13 +08:00
ea5149726c Partial remote prefill: C_s exports cache, D computes new tokens locally
vLLM Mooncake patch:
- get_num_new_matched_tokens: support remote_num_tokens parameter for
  partial remote prefill (pull N tokens from remote, compute rest locally)
- update_state_after_alloc: only allocate receive blocks for external portion

Proxy _handle_heavy_offload rewrite:
- Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute)
- Step 2: D pulls cached blocks + does local prefill for new tokens + decodes
- C_s's blocks auto-freed by Mooncake delay_free after D confirms receipt

This enables true session migration: C_s releases cache, D takes over.
C_s's GPU is freed immediately (no compute), vs old approach where C_s
had to do full prefill (1-15s GPU occupancy).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 20:04:13 +08:00
3bc37cc6d5 PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix
Experiments run:
- Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise)
- PS V1 (cold prefill): REJECTED — PS always slower than cached C
- PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s) — PS bottleneck
- V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal
- H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD
  TTFT=11.5s due to RDMA. HEAVY_COLO improved 10.5% from better balance.
- H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s)

Key findings:
- Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead
- 92% of HEAVY are turn-1 cold → offloading cold requests always loses
- GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT)
- RDMA transfer negates the routing benefit for offloaded requests

Code changes:
- bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark)
- cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing
- mooncake_connector.py: elif→if fix (allow dual prefill+decode flags)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 02:14:37 +08:00
445e491123 Add vLLM v0.18.1 source tree with KV transfer abort fix
third_party/vllm/ now tracked in git for direct patch management.
Based on vLLM v0.18.1 release with one patch applied:

  vllm/v1/core/sched/scheduler.py:
    Replace fatal assert with graceful skip when KV transfer callback
    arrives for an already-aborted request during PD disaggregated serving.

Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:30:38 +08:00