Report §3.9: Unified routing final results — TTFT -25%, E2E -7%

850/850, 0 errors. Single argmin(latency) with soft affinity.
116 PUSH_MIGRATE (all with cache, avg 25k tokens), 723 LOCAL.
TPOT p90 +15% tradeoff from kv_both overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 03:15:32 +08:00
parent 97f4fe5164
commit 2b9eae0d54


@@ -356,6 +356,38 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall
**Output**: `outputs/eval_direct_rdma_v*/` on dash0.
### 3.9 Unified Routing (Final Design)
Replaced two-phase routing (pick_instance offload gate) with a single `argmin(expected_latency)` over all candidate instances:
```
latency(D) = queue(D) + prefill(D) + transfer(D)
- Local cache: prefill = (input - local_hit) / throughput, transfer = 0
- PUSH from C: prefill = (input - push_hit) / throughput, transfer = 0.1s
- Cold: prefill = input / throughput, transfer = 0
```
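A minimal Python sketch of the cost model above, for illustration only. The `Candidate` structure, its field names, and the `route` helper are hypothetical and are not taken from the actual router; the two constants are the measured values quoted in this section.

```python
from dataclasses import dataclass
from typing import List

PREFILL_THROUGHPUT = 7000.0  # tok/s, measured value quoted in this report
RDMA_OVERHEAD = 0.1          # seconds, measured value quoted in this report


@dataclass
class Candidate:
    """One routing option for a request (hypothetical structure)."""
    instance: str
    queue_s: float      # estimated queueing delay on this instance
    cached_tokens: int  # tokens covered by a local hit or a pushed cache (0 = cold)
    needs_push: bool    # True if the cache must be pushed over RDMA


def expected_latency(c: Candidate, input_tokens: int) -> float:
    # Tokens not covered by a cache hit still have to be prefilled.
    remaining = max(input_tokens - c.cached_tokens, 0)
    prefill = remaining / PREFILL_THROUGHPUT
    # Only the PUSH case pays the fixed RDMA cost.
    transfer = RDMA_OVERHEAD if c.needs_push else 0.0
    return c.queue_s + prefill + transfer


def route(candidates: List[Candidate], input_tokens: int) -> Candidate:
    # Single argmin over all candidates; no separate offload gate.
    return min(candidates, key=lambda c: expected_latency(c, input_tokens))
```

Under these assumptions, a busy cold instance and an idle instance that receives a pushed cache are compared on one scale, so the same `min` call produces both LOCAL and PUSH_MIGRATE decisions.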
Session affinity is a soft preference: reuse the session's last instance if its cost is within 2× of the global best.
Only 2 measured parameters: `prefill_throughput=7000 tok/s`, `rdma_overhead=0.1s`.
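The soft-affinity rule can be sketched as follows. This is an illustrative sketch, not the router's code: the function name, the `costs` mapping, and the use of a non-strict `<=` comparison are assumptions, since the report does not say whether the 2× bound is strict.

```python
from typing import Dict, Optional


def pick_with_affinity(
    costs: Dict[str, float],
    last_instance: Optional[str],
    slack: float = 2.0,
) -> str:
    """Return the instance to use, given per-instance expected latency in seconds."""
    best = min(costs, key=costs.get)
    # Stay on the session's previous instance when it is not much worse
    # than the global best (assumed comparison: cost <= slack * best cost).
    if last_instance in costs and costs[last_instance] <= slack * costs[best]:
        return last_instance
    return best
```

With `costs = {"a": 1.0, "b": 1.8, "c": 5.0}`, a session last served by `"b"` stays there, while one last served by `"c"` moves to `"a"`.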
**Results (eval_unified_v3, 850/850, 0 errors):**
| Metric | Baseline | **Unified v3** | Delta |
|--------|----------|---------------|-------|
| TTFT mean | 4.35s | **3.24s** | **-25.5%** |
| TTFT p50 | 0.95s | **0.78s** | **-17.9%** |
| TTFT p90 | 12.47s | **7.79s** | **-37.5%** |
| TPOT p90 | 0.177 | 0.204 | +14.9% |
| E2E mean | 19.10s | **17.69s** | **-7.4%** |
| E2E p50 | 6.44s | **5.48s** | **-14.9%** |
**Routing**: 723 LOCAL + 116 PUSH_MIGRATE (13.8%). All 116 pushes had cache (avg 25k tokens), with no cold offloads. The unified cost model naturally avoids cold migration because `cold + RDMA > cold`: with no cache to reuse, RDMA adds overhead without reducing prefill.
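As a rough check of that argument, the sketch below compares a push against a cold prefill on the same destination instance, using the two measured parameters from this section. The helper name `push_gain_s` is hypothetical, and the comparison assumes equal queueing delay for both options.

```python
PREFILL_THROUGHPUT = 7000.0  # tok/s, measured value quoted in this report
RDMA_OVERHEAD = 0.1          # seconds, measured value quoted in this report


def push_gain_s(push_hit_tokens: int) -> float:
    """Seconds saved by a push vs. a cold prefill on the same instance.

    Assumes both options see the same queueing delay, so only the
    prefill saving and the fixed RDMA cost differ.
    """
    return push_hit_tokens / PREFILL_THROUGHPUT - RDMA_OVERHEAD


# Cache size at which the prefill saving equals the RDMA cost.
break_even_tokens = PREFILL_THROUGHPUT * RDMA_OVERHEAD
```

A push with no cache hit always loses by the RDMA overhead, while a push carrying the observed average of about 25k tokens sits far above the break-even point of roughly 700 tokens that this model implies.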
**Tradeoff**: TPOT p90 +15% from kv_both background threads + PUSH operations. In exchange: TTFT p90 -38% and E2E p50 -15%.
**Output**: `outputs/eval_unified_v3/` on dash0.
## 4. System-Level Analysis
### 4.1 Elastic P2P Does Not Improve Single-Machine Performance