Gahow Wang 5c66f500fc Fix offload gate: remove cache_gate for direct RDMA read, fix cost model
The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%)
because they were cold (cache_ratio=0). But with direct RDMA read,
D reads C's cached blocks via RDMA regardless of cache ratio — the
gate was protecting against the OLD flow (C does prefill + push).

Also fixed cost model: offload_cost now reflects direct read reality:
  OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
  NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)

Offload wins when C_s queue > RDMA_overhead (~2s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:01:43 +08:00
Description
No description provided
48 MiB
Languages
Python 82.9%
Shell 17.1%