Fix offload gate: remove cache_gate for direct RDMA read, fix cost model

The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (74%)
because they were cold (cache_ratio=0). With direct RDMA read, D reads
C's cached blocks via RDMA regardless of cache ratio, so the gate only
protected against the OLD flow (C does prefill + push).

Also fixed cost model: offload_cost now reflects direct read reality:
  OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
  NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)

Offload wins when C_s queue > D_queue + RDMA_overhead (~2s); the
new-token prefill term is identical on both sides and cancels.
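
A minimal standalone sketch of the comparison, for checking the break-even
point by hand. The CostInputs container and function names are hypothetical
and not part of the router; it assumes the same prefill throughput on C_s
and D, as the diff does via SETTINGS.prefill_throughput.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Hypothetical container; field names are illustrative only."""
    cs_pending_tokens: int     # tokens queued for prefill on C_s
    d_pending_tokens: int      # tokens queued for prefill on D
    new_tokens: int            # tokens not covered by C_s's cache
    prefill_throughput: float  # tokens/second, assumed equal on C_s and D
    rdma_overhead_s: float     # fixed cost of the direct RDMA read, seconds


def colocated_cost(c: CostInputs) -> float:
    # Queue on C_s, then prefill only the new tokens on C_s.
    cs_queue = c.cs_pending_tokens / c.prefill_throughput
    return cs_queue + c.new_tokens / c.prefill_throughput


def offload_cost(c: CostInputs) -> float:
    # Queue on D, read C_s's cached blocks over RDMA, prefill new tokens on D.
    d_queue = c.d_pending_tokens / c.prefill_throughput
    return d_queue + c.rdma_overhead_s + c.new_tokens / c.prefill_throughput


def should_offload(c: CostInputs) -> bool:
    # Equivalent to: cs_queue > d_queue + rdma_overhead_s.
    return offload_cost(c) < colocated_cost(c)


busy = CostInputs(5000, 1000, 500, 1000.0, 2.0)
idle = CostInputs(2500, 1000, 500, 1000.0, 2.0)
print(should_offload(busy), should_offload(idle))
```

With a 2s RDMA overhead, the busy case (5s queue on C_s against 1s on D)
offloads, while the idle case (2.5s against 1s) stays co-located.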

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:01:43 +08:00
parent 23788f7cd5
commit 5c66f500fc

@@ -380,27 +380,21 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
 remaining = [c for c in combined_instances if c is not best_inst and c is not p_candidate]
 d_candidate = min(remaining, key=lambda c: c.ongoing_tokens) if remaining else p_candidate
-# Cost model: compare co-located vs offload expected latency
+# Cost model: compare co-located vs direct-RDMA-read offload
 # Co-located: queue on C_s + prefill new tokens on C_s
 cs_queue = best_inst.pending_prefill_tokens / SETTINGS.prefill_throughput
 colocated_cost = cs_queue + estimated_new / SETTINGS.prefill_throughput
-# Offload: prefill on P (may or may not have cache) + RDMA + decode start
-p_queue = p_candidate.pending_prefill_tokens / SETTINGS.prefill_throughput
-p_cache_hit = p_candidate.estimate_cache_hit(token_ids) if token_ids else 0
-p_new_tokens = max(0, input_length - p_cache_hit)
-offload_cost = p_queue + p_new_tokens / SETTINGS.prefill_throughput + SETTINGS.rdma_overhead_s
+# Direct RDMA read: D reads C_s's cached blocks via RDMA + D prefills new tokens locally
+# D's queue + RDMA read time + D local prefill of new tokens only
+d_queue = d_candidate.pending_prefill_tokens / SETTINGS.prefill_throughput
+offload_cost = d_queue + SETTINGS.rdma_overhead_s + estimated_new / SETTINGS.prefill_throughput
 breakdown["cache_ratio"] = cache_ratio
 breakdown["colocated_cost"] = round(colocated_cost, 2)
 breakdown["offload_cost"] = round(offload_cost, 2)
-# H4 cache-ratio gate: if C_s does not have a meaningful cached prefix,
-# offload pays full RDMA without saving prefill compute, so block it.
-# Set --cache-gate-ratio 0.0 to disable, 1.0 to never offload.
-if cache_ratio < SETTINGS.cache_gate_ratio:
-    offload_reason = "cache_gate_%.2f<%.2f" % (cache_ratio, SETTINGS.cache_gate_ratio)
-elif current_offloads >= SETTINGS.max_offload_inflight:
+if current_offloads >= SETTINGS.max_offload_inflight:
     offload_reason = "cap_reached_%d" % current_offloads
 elif offload_cost < colocated_cost:
     use_offload = True