Report §3.9: Unified routing final results — TTFT -25%, E2E -7%

850/850, 0 errors. Single argmin(latency) with soft affinity. 116 PUSH_MIGRATE (all with cache, avg 25k tokens), 723 LOCAL. TPOT p90 +15% tradeoff from kv_both overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 03:15:32 +08:00
parent 97f4fe5164
commit 2b9eae0d54
1 changed files with 32 additions and 0 deletions
--- a/REPORT.md
+++ b/REPORT.md
@@ -356,6 +356,38 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall
 **Output**: `outputs/eval_direct_rdma_v*/` on dash0.
 ### 3.9 Unified Routing (Final Design)
 Replaced the two-phase routing (pick_instance → offload gate) with a single `argmin(expected_latency)` over candidate instances:
 ```
 latency(D) = queue(D) + prefill(D) + transfer(D)
  - Local cache: prefill = (input - local_hit) / throughput, transfer = 0
  - PUSH from C: prefill = (input - push_hit) / throughput, transfer = 0.1s
  - Cold:        prefill = input / throughput, transfer = 0
 ```
 Session affinity is a soft preference: a session stays on its last instance if that instance's cost is ≤ 2× the global best.
 Only 2 measured parameters: `prefill_throughput=7000 tok/s`, `rdma_overhead=0.1s`.
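
The cost model and the soft-affinity rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the router's actual code: the `Candidate` fields, the function names, and the `"local"` / `"push"` / `"cold"` labels are hypothetical, and the queue estimate is assumed to be supplied by the caller. Only the two constants and the 2× threshold come from the report.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

PREFILL_THROUGHPUT = 7000.0  # tok/s (measured, from the report)
RDMA_OVERHEAD = 0.1          # seconds per PUSH (measured, from the report)
AFFINITY_FACTOR = 2.0        # keep last instance if cost <= 2x global best


@dataclass
class Candidate:
    # Hypothetical per-instance view of one request.
    name: str
    queue_s: float      # estimated queueing delay on this instance
    local_hit: int = 0  # prompt tokens already cached on this instance
    push_hit: int = 0   # prompt tokens a PUSH from a peer would provide


def expected_latency(c: Candidate, input_tokens: int) -> Tuple[float, str]:
    """Cheapest way to serve the request on c: (latency in seconds, mode)."""
    local = c.queue_s + (input_tokens - c.local_hit) / PREFILL_THROUGHPUT
    push = c.queue_s + (input_tokens - c.push_hit) / PREFILL_THROUGHPUT + RDMA_OVERHEAD
    if push < local:
        return push, "push"
    # With no cache anywhere, push == local + RDMA_OVERHEAD, so it never wins.
    return local, ("local" if c.local_hit > 0 else "cold")


def route(cands: Sequence[Candidate], input_tokens: int,
          last: Optional[str] = None) -> Tuple[str, float, str]:
    """Single argmin over instances, then apply the soft-affinity override."""
    costs = {c.name: expected_latency(c, input_tokens) for c in cands}
    best = min(costs, key=lambda n: costs[n][0])
    # Soft affinity: stay put unless the last instance is more than 2x worse.
    if last in costs and costs[last][0] <= AFFINITY_FACTOR * costs[best][0]:
        best = last
    latency, mode = costs[best]
    return best, latency, mode


# Example: A holds the cache but is busy; B is idle and can receive a PUSH.
cands = [Candidate("A", queue_s=2.0, local_hit=21000),
         Candidate("B", queue_s=0.0, push_hit=21000)]
print(route(cands, 28000, last="A"))
```

In this example A costs 2.0 + 7000/7000 = 3.0 s and B costs 7000/7000 + 0.1 = 1.1 s. Since 3.0 s exceeds 2 × 1.1 s, affinity is overridden and the request is pushed to B. With A's queue at 1.0 s instead, its cost of 2.0 s falls within the 2× bound and the session stays on A.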
 **Results (eval_unified_v3, 850/850, 0 errors):**
 | Metric | Baseline | **Unified v3** | Delta |
 |--------|----------|---------------|-------|
 | TTFT mean | 4.35s | **3.24s** | **-25.5%** |
 | TTFT p50 | 0.95s | **0.78s** | **-17.9%** |
 | TTFT p90 | 12.47s | **7.79s** | **-37.5%** |
 | TPOT p90 | 0.177 | 0.204 | +14.9% |
 | E2E mean | 19.10s | **17.69s** | **-7.4%** |
 | E2E p50 | 6.44s | **5.48s** | **-14.9%** |
 **Routing**: 723 LOCAL + 116 PUSH_MIGRATE (13.8% of these 839 decisions). All 116 pushes had cache (avg 25k tokens), so there were no cold offloads. The unified cost model naturally avoids cold migration because `cold + RDMA > cold`: RDMA adds transfer overhead without reducing prefill.
 **Tradeoff**: TPOT p90 +15% from kv_both background threads + PUSH operations. In exchange: TTFT p90 -37.5%, E2E p50 -14.9%.
 **Output**: `outputs/eval_unified_v3/` on dash0.
 ## 4. System-Level Analysis
 ### 4.1 Elastic P2P Does Not Improve Single-Machine Performance