Fix offload gate: remove cache_gate for direct RDMA read, fix cost model
The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (~74%) because they were cold (cache_ratio=0). But with direct RDMA read, D reads C's cached blocks via RDMA regardless of cache ratio; the gate was protecting against the OLD flow (C does prefill + push).

Also fixed cost model: offload_cost now reflects direct read reality:
  OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
  NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)

Offload wins when C_s queue > D queue + RDMA_overhead (~2s), i.e. when C_s queue exceeds ~2s if D is idle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
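The comparison described in the message can be sketched as a small standalone model. This is only an illustration under stated assumptions: the `CostInputs` container and its field names are hypothetical (not the project's API), prefill throughput is assumed equal on C_s and D, and the numbers in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Hypothetical container; field names are illustrative only."""
    cs_pending_tokens: int     # tokens already queued for prefill on C_s
    d_pending_tokens: int      # tokens already queued for prefill on D
    new_tokens: int            # request tokens not covered by C_s's cache
    prefill_throughput: float  # tokens per second (assumed equal on C_s and D)
    rdma_overhead_s: float     # fixed cost of the direct RDMA read, in seconds


def colocated_cost(x: CostInputs) -> float:
    # Wait in C_s's queue, then prefill only the new tokens on C_s.
    cs_queue = x.cs_pending_tokens / x.prefill_throughput
    return cs_queue + x.new_tokens / x.prefill_throughput


def offload_cost(x: CostInputs) -> float:
    # Wait in D's queue, read C_s's cached blocks over RDMA,
    # then prefill only the new tokens locally on D.
    d_queue = x.d_pending_tokens / x.prefill_throughput
    return d_queue + x.rdma_overhead_s + x.new_tokens / x.prefill_throughput


def offload_wins(x: CostInputs) -> bool:
    # The new-token prefill term appears on both sides, so this reduces to
    # cs_queue > d_queue + rdma_overhead_s.
    return offload_cost(x) < colocated_cost(x)


# Illustrative numbers: a 4 s queue on C_s versus a 1 s queue on D.
busy = CostInputs(40_000, 10_000, 5_000, 10_000.0, 2.0)
# Same request, but C_s's queue is only 2 s deep.
light = CostInputs(20_000, 10_000, 5_000, 10_000.0, 2.0)
```

With these made-up inputs, `offload_wins(busy)` is true (3.5 s offloaded versus 4.5 s co-located) and `offload_wins(light)` is false (3.5 s versus 2.5 s). The cache ratio does not appear anywhere in the comparison, which is why a cold request is no longer penalized.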
@@ -380,27 +380,21 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
     remaining = [c for c in combined_instances if c is not best_inst and c is not p_candidate]
     d_candidate = min(remaining, key=lambda c: c.ongoing_tokens) if remaining else p_candidate
 
-    # Cost model: compare co-located vs offload expected latency
+    # Cost model: compare co-located vs direct-RDMA-read offload
     # Co-located: queue on C_s + prefill new tokens on C_s
     cs_queue = best_inst.pending_prefill_tokens / SETTINGS.prefill_throughput
     colocated_cost = cs_queue + estimated_new / SETTINGS.prefill_throughput
 
-    # Offload: prefill on P (may or may not have cache) + RDMA + decode start
-    p_queue = p_candidate.pending_prefill_tokens / SETTINGS.prefill_throughput
-    p_cache_hit = p_candidate.estimate_cache_hit(token_ids) if token_ids else 0
-    p_new_tokens = max(0, input_length - p_cache_hit)
-    offload_cost = p_queue + p_new_tokens / SETTINGS.prefill_throughput + SETTINGS.rdma_overhead_s
+    # Direct RDMA read: D reads C_s's cached blocks via RDMA + D prefills new tokens locally
+    # D's queue + RDMA read time + D local prefill of new tokens only
+    d_queue = d_candidate.pending_prefill_tokens / SETTINGS.prefill_throughput
+    offload_cost = d_queue + SETTINGS.rdma_overhead_s + estimated_new / SETTINGS.prefill_throughput
 
     breakdown["cache_ratio"] = cache_ratio
     breakdown["colocated_cost"] = round(colocated_cost, 2)
     breakdown["offload_cost"] = round(offload_cost, 2)
 
-    # H4 cache-ratio gate: if C_s does not have a meaningful cached prefix,
-    # offload pays full RDMA without saving prefill compute, so block it.
-    # Set --cache-gate-ratio 0.0 to disable, 1.0 to never offload.
-    if cache_ratio < SETTINGS.cache_gate_ratio:
-        offload_reason = "cache_gate_%.2f<%.2f" % (cache_ratio, SETTINGS.cache_gate_ratio)
-    elif current_offloads >= SETTINGS.max_offload_inflight:
+    if current_offloads >= SETTINGS.max_offload_inflight:
         offload_reason = "cap_reached_%d" % current_offloads
     elif offload_cost < colocated_cost:
         use_offload = True
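After this change the gate reduces to two checks: the in-flight cap, then the cost comparison. The following is a minimal standalone sketch of that order of checks, not the project's code. The function name `decide_offload` and the reason strings `"offload_cheaper"` and `"colocated_cheaper"` are hypothetical, since the hunk ends before the remaining branches are shown; only the `cap_reached_%d` format comes from the diff.

```python
from typing import Tuple


def decide_offload(
    colocated_cost: float,
    offload_cost: float,
    current_offloads: int,
    max_offload_inflight: int,
) -> Tuple[bool, str]:
    """Return (use_offload, reason) using the post-change order of checks."""
    # 1. Respect the cap on concurrent offloads before looking at costs.
    if current_offloads >= max_offload_inflight:
        return False, "cap_reached_%d" % current_offloads
    # 2. Offload only when it is strictly cheaper than staying co-located.
    if offload_cost < colocated_cost:
        return True, "offload_cheaper"  # hypothetical reason label
    return False, "colocated_cheaper"   # hypothetical reason label


# A request may offload when that is cheaper and the cap is not reached.
allowed = decide_offload(4.5, 3.5, current_offloads=0, max_offload_inflight=4)
# The same request is held back once the cap is reached.
capped = decide_offload(4.5, 3.5, current_offloads=4, max_offload_inflight=4)
```

Note that `cache_ratio` is not an argument here: in the new flow it is still recorded in `breakdown` for diagnostics but no longer influences the decision.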