Gahow Wang 6b255fad91 Unified routing: single argmin(expected_latency) over all instances
Replace two-phase routing (pick_instance → offload gate) with a single
cost function evaluated per instance:

  latency(D) = queue(D) + prefill_time(D) + transfer_cost(D)

  - If D has local cache: prefill = (input - local_hit) / throughput
  - If D can receive PUSH from cache source: prefill = (input - push_hit) / throughput + rdma
  - Otherwise: prefill = input / throughput (cold)

Choose argmin(latency). If the winner needs PUSH → trigger migration.

Removed:
- WARM/MEDIUM/HEAVY classification (no routing purpose)
- heavy_threshold, overload_factor, max_offload_inflight, cache_gate_ratio
- Interference penalty magic number (0.3)
- Separate pick_instance + offload gate stages

Only 2 measured parameters remain:
- prefill_throughput = 7000 tokens/s (H20 measured)
- rdma_overhead_s = 0.1s (RDMA PUSH measured)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 02:21:34 +08:00
Description
No description provided
48 MiB
Languages
Python 82.9%
Shell 17.1%