Report §3.8: Document direct KV cache migration architecture + bugs fixed

Complete documentation of bootstrap-triggered PUSH implementation:
hash table sync, token-based lookup, RDMA WRITE path, cost model,
PYTHONHASHSEED requirement, and all 6 bugs fixed during development.

Verified: 640/640 blocks pushed, External APC 80%, TTFT 0.367s
(vs local cache 0.338s, +0.03s overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:52:38 +08:00
parent 8c267ec54e
commit 1cd0a18e2c

@@ -322,33 +322,39 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall
**Output**: `outputs/eval_baseline_linear/` on dash0, `outputs/eval_elastic_linear/` on dash1.
### 3.8 Direct RDMA Read: D reads C's GPU cache without C's scheduler
### 3.8 Direct KV Cache Migration (Bootstrap-Triggered PUSH)
vLLM Mooncake patch: D queries C's bootstrap `/query_blocks` for block mapping, then uses `batch_transfer_sync_read` to RDMA-read cached KV blocks directly from C's GPU memory. C's scheduler is NOT involved (0 GPU time on C). D then does local prefill for new tokens + decode.
**Architecture**: D asks C's bootstrap server to PUSH cached KV blocks directly into D's GPU memory via Mooncake RDMA WRITE. C's vLLM scheduler is NOT involved (0 GPU compute on C). D then does local prefill for new tokens + decode.
**Results (eval_direct_rdma_v4, 850 req, elastic mode):**
**Implementation details** (vLLM + Mooncake patches):
| Metric | Baseline | Direct RDMA | Delta |
|--------|----------|-------------|-------|
| TTFT mean | 4.35s | **2.96s** | **-32%** |
| TTFT p90 | 11.67s | **5.77s** | **-51%** |
| TPOT p50 | 0.070 | **0.073** | +4% |
| TPOT p90 | 0.162 | **0.100** | **-38%** |
| Errors | 0/850 | 574/850 (67.5%) | kv_both instability |
1. **Hash table sync** (scheduler to worker bootstrap): On each step, the scheduler computes the delta of `BlockPool.cached_block_hash_to_block` and syncs it to the worker's bootstrap server via `MooncakeConnectorMetadata.hash_table_updates`.
Per-class TTFT comparison:
2. **Token-based block lookup**: D sends `POST /push_blocks` with the prompt `token_ids` and D's GPU addresses. C's bootstrap computes block hashes using `sha256` + `NONE_HASH` (the same hash function as the scheduler) and matches them against the synced hash table.
| Path | Count | TTFT mean | TTFT p50 | TTFT p90 |
|------|-------|-----------|----------|----------|
| HEAVY_COLO | 253 | 11.21s | 6.95s | 27.48s |
| **HEAVY_OFFLOAD** | **65** | **3.40s** | **3.12s** | **6.56s** |
| **Delta** | | **-70%** | **-55%** | **-76%** |
3. **RDMA PUSH**: C's bootstrap calls `TransferEngine.batch_transfer_sync_write` to push matched KV blocks from C's GPU into D's GPU. This uses the existing RDMA WRITE path (proven reliable), not RDMA READ, which fails on GPU memory registered via `batch_register_memory` because the `IBV_ACCESS_REMOTE_READ` flag is not set.
**Direct RDMA read reduces HEAVY TTFT by 70%** by eliminating C's scheduler queue (was 7-14s) and replacing it with raw RDMA read (~0.1s). TPOT p90 improves 38% from reduced prefill-decode interference.
4. **Cost model**: `offload when colocated_cost + interference > offload_cost`, where `interference = prefill_time × min(num_requests, 3) × 0.3`. Offload triggers when C has 1+ concurrent request.
**Remaining issue**: kv_both mode has a 67.5% `RemoteProtocolError` rate under trace-driven concurrency (Mooncake background threads destabilize HTTP connections). Baseline mode has 0 errors. This is a Mooncake stability issue, not a direct-RDMA-read bug; the 276 successful requests show correct functionality and a strong performance improvement.
5. **Requirements**: `PYTHONHASHSEED` must be set (bench.sh sets `PYTHONHASHSEED=42` for elastic mode) to ensure deterministic `NONE_HASH` across scheduler/worker code paths.
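
The cost model in item 4 can be sketched in a few lines of Python. This is only an illustration of the stated formula, not the patched vLLM code: the function names, the constant names, and the example costs are all hypothetical, and only the cap of 3 and the 0.3 factor come from the text above.

```python
# Illustrative sketch of the offload decision from item 4.
# All names here are hypothetical; only the formula is from the report.

MAX_COUNTED_REQUESTS = 3    # cap on concurrent requests counted (from the formula)
INTERFERENCE_FACTOR = 0.3   # per-request interference weight (from the formula)


def interference(prefill_time: float, num_requests: int) -> float:
    """interference = prefill_time x min(num_requests, 3) x 0.3"""
    return prefill_time * min(num_requests, MAX_COUNTED_REQUESTS) * INTERFERENCE_FACTOR


def should_offload(colocated_cost: float, offload_cost: float,
                   prefill_time: float, num_requests: int) -> bool:
    """Offload when colocated_cost + interference > offload_cost."""
    return colocated_cost + interference(prefill_time, num_requests) > offload_cost


if __name__ == "__main__":
    # Made-up costs in seconds, chosen only to show the penalty tipping the decision.
    for n in (0, 1, 5):
        print(n, should_offload(colocated_cost=1.0, offload_cost=1.2,
                                prefill_time=1.0, num_requests=n))
```

With these made-up costs, an idle C keeps the request colocated, while a single concurrent request adds enough interference penalty to push the decision toward offload. Because of the `min(..., 3)` cap, the penalty stops growing past three concurrent requests.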
**Output**: `outputs/eval_direct_rdma_v4/` on dash0.
**Minimal test verification** (`scripts/test_direct_read.py`):
| Metric | inst_0 (local cache) | inst_1 (RDMA push from inst_0) |
|--------|---------------------|-------------------------------|
| Turn 2 TTFT | 0.338s | **0.367s** |
| Blocks transferred | | **640/640 matched, push ret=0** |
| External APC | 0% | **80%** |
**Key bugs fixed during development**:
- `NameError: field not imported`: missing dataclass import
- Scheduler assertion crash (`assert RequestStatus.is_finished`): partial remote prefill state mismatch
- Hash mismatch (0/640): `sha256` vs `sha256_cbor` (default hash algo is `sha256`, not `sha256_cbor`)
- Hash mismatch (0/640): `from X import NONE_HASH` creates stale value binding after `init_none_hash` reassigns the global; fixed with `import X; X.NONE_HASH`
- RDMA READ ret=-1: `batch_register_memory` only sets `IBV_ACCESS_REMOTE_WRITE`; switched to bootstrap-triggered PUSH
- Cost model 0% trigger: removed stale `cache_gate_ratio` check; added interference penalty
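
The stale `NONE_HASH` binding in the list above is a general Python import pitfall and can be reproduced in isolation. The sketch below uses a hypothetical in-memory module named `hashmod` as a stand-in for `X`; its `init_none_hash` only mimics the reassignment described in the bullet and is not the real vLLM function.

```python
import sys
import types

# Hypothetical stand-in for module X: a global that an init function reassigns.
hashmod = types.ModuleType("hashmod")
exec(
    "NONE_HASH = None\n"
    "def init_none_hash(value):\n"
    "    global NONE_HASH\n"
    "    NONE_HASH = value\n",
    hashmod.__dict__,
)
sys.modules["hashmod"] = hashmod

# Pattern that caused the bug: copies the *current* object into a local name.
from hashmod import NONE_HASH as stale

# Pattern used in the fix: keep a reference to the module, look up the attribute late.
import hashmod as live

# The global is reassigned after both imports have run.
hashmod.init_none_hash(b"seeded")

print("from-import binding:", stale)           # still the import-time value
print("module attribute:   ", live.NONE_HASH)  # sees the reassigned value
```

The `from ... import` form binds a name to whatever object the global referred to at import time, so a later reassignment inside the module is invisible to it. Looking the attribute up through the module object on each use avoids this.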
**Output**: `outputs/eval_direct_rdma_v*/` on dash0.
## 4. System-Level Analysis