Report §3.8: Document direct KV cache migration architecture + bugs fixed

Complete documentation of the bootstrap-triggered PUSH implementation: hash table sync, token-based lookup, RDMA WRITE path, cost model, PYTHONHASHSEED requirement, and all 6 bugs fixed during development. Verified: 640/640 blocks pushed, External APC 80%, TTFT 0.367s (vs local cache 0.338s, +0.03s overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:52:38 +08:00
parent 8c267ec54e
commit 1cd0a18e2c
1 changed file with 25 additions and 19 deletions
--- a/REPORT.md
+++ b/REPORT.md
@@ -322,33 +322,39 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall

 **Output**: `outputs/eval_baseline_linear/` on dash0, `outputs/eval_elastic_linear/` on dash1.

-### 3.8 Direct RDMA Read: D reads C's GPU cache without C's scheduler
+### 3.8 Direct KV Cache Migration (Bootstrap-Triggered PUSH)

-vLLM Mooncake patch: D queries C's bootstrap `/query_blocks` for block mapping, then uses `batch_transfer_sync_read` to RDMA-read cached KV blocks directly from C's GPU memory. C's scheduler is NOT involved (0 GPU time on C). D then does local prefill for new tokens + decode.
+**Architecture**: D asks C's bootstrap server to PUSH cached KV blocks directly into D's GPU memory via Mooncake RDMA WRITE. C's vLLM scheduler is NOT involved (0 GPU compute on C). D then does local prefill for new tokens + decode.

-**Results (eval_direct_rdma_v4, 850 req, elastic mode):**
+**Implementation details** (vLLM + Mooncake patches):

-| Metric | Baseline | Direct RDMA | Delta |
-|--------|----------|-------------|-------|
-| TTFT mean | 4.35s | **2.96s** | **-32%** |
-| TTFT p90 | 11.67s | **5.77s** | **-51%** |
-| TPOT p50 | 0.070 | **0.073** | +4% |
-| TPOT p90 | 0.162 | **0.100** | **-38%** |
-| Errors | 0/850 | 574/850 (67.5%) | kv_both instability |
+1. **Hash table sync** (scheduler → worker → bootstrap): On each step, the scheduler computes the delta of `BlockPool.cached_block_hash_to_block` and syncs it to the worker's bootstrap server via `MooncakeConnectorMetadata.hash_table_updates`.

-Per-class TTFT comparison:
+2. **Token-based block lookup**: D sends `POST /push_blocks` with the prompt `token_ids` and D's GPU addresses. C's bootstrap computes block hashes using `sha256` + `NONE_HASH` (the same hash function as the scheduler) and matches them against the synced hash table.

-| Path | Count | TTFT mean | TTFT p50 | TTFT p90 |
-|------|-------|-----------|----------|----------|
-| HEAVY_COLO | 253 | 11.21s | 6.95s | 27.48s |
-| **HEAVY_OFFLOAD** | **65** | **3.40s** | **3.12s** | **6.56s** |
-| **Delta** | | **-70%** | **-55%** | **-76%** |
+3. **RDMA PUSH**: C's bootstrap calls `TransferEngine.batch_transfer_sync_write` to push matched KV blocks from C's GPU into D's GPU. This uses the existing RDMA WRITE path (proven reliable) rather than RDMA READ, which fails on GPU memory registered via `batch_register_memory` because the registration lacks the `IBV_ACCESS_REMOTE_READ` flag.

-**Direct RDMA read reduces HEAVY TTFT by 70%** by eliminating C's scheduler queue (was 7-14s) and replacing it with raw RDMA read (~0.1s). TPOT p90 improves 38% from reduced prefill-decode interference.
+4. **Cost model**: offload when `colocated_cost + interference > offload_cost`, where `interference = prefill_time × min(num_requests, 3) × 0.3`. Offload triggers when C has at least one concurrent request.

-**Remaining issue**: kv_both mode has 67.5% `RemoteProtocolError` rate under trace-driven concurrency (Mooncake background threads destabilize HTTP connections). Baseline mode has 0 errors. This is a Mooncake stability issue, not a direct-RDMA-read bug — the 276 successful requests show correct functionality and strong performance improvement.
+5. **Requirements**: `PYTHONHASHSEED` must be set (bench.sh sets `PYTHONHASHSEED=42` for elastic mode) so that `NONE_HASH` is deterministic across the scheduler and worker code paths.

-**Output**: `outputs/eval_direct_rdma_v4/` on dash0.
+**Minimal test verification** (`scripts/test_direct_read.py`):
+
+| Metric | inst_0 (local cache) | inst_1 (RDMA push from inst_0) |
+|--------|---------------------|-------------------------------|
+| Turn 2 TTFT | 0.338s | **0.367s** |
+| Blocks transferred | — | **640/640 matched, push ret=0** |
+| External APC | 0% | **80%** |
+
+**Key bugs fixed during development**:
+- `NameError: field not imported` — missing dataclass import
+- Scheduler assertion crash (`assert RequestStatus.is_finished`) — partial remote prefill state mismatch
+- Hash mismatch 0/640 (first cause): hash algorithm mismatch between `sha256` and `sha256_cbor` (the default hash algorithm is `sha256`, not `sha256_cbor`)
+- Hash mismatch 0/640 (second cause): `from X import NONE_HASH` binds the value at import time, so it goes stale after `init_none_hash` reassigns the global; fixed with `import X; X.NONE_HASH`
+- RDMA READ ret=-1 — `batch_register_memory` only sets `IBV_ACCESS_REMOTE_WRITE`; switched to bootstrap-triggered PUSH
+- Cost model 0% trigger — removed stale `cache_gate_ratio` check; added interference penalty
+
+**Output**: `outputs/eval_direct_rdma_v*/` on dash0.

 ## 4. System-Level Analysis
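Reviewer note on the `NONE_HASH` fix listed under §3.8: the stale-binding behaviour is standard Python import semantics and can be reproduced in isolation. The sketch below is only an illustration. The module name `fake_kv_utils` and the seed value are hypothetical stand-ins, since the report does not name the real module (`X`).

```python
import sys
import types

# Hypothetical stand-in for the module that owns NONE_HASH ("X" in the report).
mod = types.ModuleType("fake_kv_utils")
exec(
    "NONE_HASH = None\n"
    "def init_none_hash(seed):\n"
    "    global NONE_HASH\n"
    "    NONE_HASH = seed  # rebinds the module-level name\n",
    mod.__dict__,
)
sys.modules["fake_kv_utils"] = mod

# Buggy pattern: copies the current value (None) into the importing namespace.
from fake_kv_utils import NONE_HASH

# Fixed pattern: keep a reference to the module and read the attribute late.
import fake_kv_utils

fake_kv_utils.init_none_hash(b"seeded")

stale = NONE_HASH                # value captured at import time
fresh = fake_kv_utils.NONE_HASH  # attribute looked up at use time

print("stale:", stale)
print("fresh:", fresh)
```

Any code that hashes with the import-time copy produces hashes that do not match those computed with the reassigned value, which is consistent with the 0/640 match count reported for this bug.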
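Reviewer note on the cost model in §3.8, item 4: the offload rule can be written out as a small function. This is a minimal sketch of the stated formula only. The function names, the argument units (seconds), and the example numbers are assumptions, and the report does not say how `colocated_cost` and `offload_cost` are computed.

```python
def interference(prefill_time: float, num_requests: int) -> float:
    """Interference penalty: prefill_time * min(num_requests, 3) * 0.3."""
    return prefill_time * min(num_requests, 3) * 0.3


def should_offload(
    colocated_cost: float,
    offload_cost: float,
    prefill_time: float,
    num_requests: int,
) -> bool:
    """Offload when colocated_cost + interference > offload_cost."""
    penalty = interference(prefill_time, num_requests)
    return colocated_cost + penalty > offload_cost


# Hypothetical numbers: offloading is slightly more expensive than running
# colocated, so the decision depends on the interference penalty alone.
idle = should_offload(colocated_cost=1.0, offload_cost=1.2,
                      prefill_time=1.0, num_requests=0)
busy = should_offload(colocated_cost=1.0, offload_cost=1.2,
                      prefill_time=1.0, num_requests=1)
print(idle, busy)
```

With these example inputs the penalty is zero when C is idle, so the request stays colocated, and a single concurrent request on C adds enough penalty to tip the decision toward offload. The `min(num_requests, 3)` term caps the penalty, so load beyond three concurrent requests does not increase it further.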