6b255fad91a7bc50d754fd94f16b00f4f009356e
Replace two-phase routing (pick_instance → offload gate) with a single cost function evaluated per instance: latency(D) = queue(D) + prefill_time(D) + transfer_cost(D) - If D has local cache: prefill = (input - local_hit) / throughput - If D can receive PUSH from cache source: prefill = (input - push_hit) / throughput + rdma - Otherwise: prefill = input / throughput (cold) Choose argmin(latency). If the winner needs PUSH → trigger migration. Removed: - WARM/MEDIUM/HEAVY classification (no routing purpose) - heavy_threshold, overload_factor, max_offload_inflight, cache_gate_ratio - Interference penalty magic number (0.3) - Separate pick_instance + offload gate stages Only 2 measured parameters remain: - prefill_throughput = 7000 tokens/s (H20 measured) - rdma_overhead_s = 0.1s (RDMA PUSH measured) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%