5c66f500fc8d53e7c994e6dff93b30516588e7c0
The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%) because they were cold (cache_ratio=0). But with direct RDMA read, D reads C's cached blocks via RDMA regardless of cache ratio — the gate was protecting against the OLD flow (C does prefill + push). Also fixed cost model: offload_cost now reflects direct read reality: OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive) NEW: D_queue + RDMA_read + D_local_prefill(new_tokens) Offload wins when C_s queue > RDMA_overhead (~2s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%