Fix offload gate: remove cache_gate for direct RDMA read, fix cost model

The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (74%) because they were cold (cache_ratio=0). With direct RDMA read, D reads C_s's cached blocks via RDMA regardless of cache ratio, so the gate was only protecting against the OLD flow (C does prefill + push).

Also fixed the cost model so that offload_cost reflects the direct-read path:
  OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
  NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)

Offload wins when C_s queue > D_queue + RDMA_overhead (~2s), because the prefill of new tokens costs the same on both sides and cancels out.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:01:43 +08:00
parent 23788f7cd5
commit 5c66f500fc
1 changed file with 6 additions and 12 deletions
--- a/scripts/cache_aware_proxy.py
+++ b/scripts/cache_aware_proxy.py
@@ -380,27 +380,21 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
        remaining = [c for c in combined_instances if c is not best_inst and c is not p_candidate]
        d_candidate = min(remaining, key=lambda c: c.ongoing_tokens) if remaining else p_candidate

-        # Cost model: compare co-located vs offload expected latency
+        # Cost model: compare co-located vs direct-RDMA-read offload
        # Co-located: queue on C_s + prefill new tokens on C_s
        cs_queue = best_inst.pending_prefill_tokens / SETTINGS.prefill_throughput
        colocated_cost = cs_queue + estimated_new / SETTINGS.prefill_throughput

-        # Offload: prefill on P (may or may not have cache) + RDMA + decode start
-        p_queue = p_candidate.pending_prefill_tokens / SETTINGS.prefill_throughput
-        p_cache_hit = p_candidate.estimate_cache_hit(token_ids) if token_ids else 0
-        p_new_tokens = max(0, input_length - p_cache_hit)
-        offload_cost = p_queue + p_new_tokens / SETTINGS.prefill_throughput + SETTINGS.rdma_overhead_s
+        # Direct RDMA read: D reads C_s's cached blocks via RDMA + D prefills new tokens locally
+        # D's queue + RDMA read time + D local prefill of new tokens only
+        d_queue = d_candidate.pending_prefill_tokens / SETTINGS.prefill_throughput
+        offload_cost = d_queue + SETTINGS.rdma_overhead_s + estimated_new / SETTINGS.prefill_throughput

        breakdown["cache_ratio"] = cache_ratio
        breakdown["colocated_cost"] = round(colocated_cost, 2)
        breakdown["offload_cost"] = round(offload_cost, 2)

-        # H4 cache-ratio gate: if C_s does not have a meaningful cached prefix,
-        # offload pays full RDMA without saving prefill compute, so block it.
-        # Set --cache-gate-ratio 0.0 to disable, 1.0 to never offload.
-        if cache_ratio < SETTINGS.cache_gate_ratio:
-            offload_reason = "cache_gate_%.2f<%.2f" % (cache_ratio, SETTINGS.cache_gate_ratio)
-        elif current_offloads >= SETTINGS.max_offload_inflight:
+        if current_offloads >= SETTINGS.max_offload_inflight:
            offload_reason = "cap_reached_%d" % current_offloads
        elif offload_cost < colocated_cost:
            use_offload = True
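
The decision this hunk leaves in place can be sketched as a standalone function, which makes the break-even condition easy to check by hand. This is a minimal sketch, not the proxy's actual code: the `Instance` dataclass, the constant values for throughput and the in-flight cap, the `decide` helper, and the "offload_cheaper" / "colocated_cheaper" reason strings are all assumptions made for illustration. Only the two cost formulas, the `cap_reached_%d` reason, and the ~2s RDMA overhead come from the commit.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a combined instance tracked by the proxy."""
    pending_prefill_tokens: int


# Assumed values for illustration; the real ones are read from SETTINGS.
PREFILL_THROUGHPUT = 10_000.0  # tokens per second (assumed)
RDMA_OVERHEAD_S = 2.0          # seconds, the "~2s" quoted in the commit message
MAX_OFFLOAD_INFLIGHT = 4       # assumed cap on concurrent offloads


def colocated_cost(cs: Instance, estimated_new: int) -> float:
    """Queue on C_s plus prefill of the new tokens on C_s."""
    cs_queue = cs.pending_prefill_tokens / PREFILL_THROUGHPUT
    return cs_queue + estimated_new / PREFILL_THROUGHPUT


def offload_cost(d: Instance, estimated_new: int) -> float:
    """Queue on D, plus the RDMA read, plus prefill of the new tokens on D."""
    d_queue = d.pending_prefill_tokens / PREFILL_THROUGHPUT
    return d_queue + RDMA_OVERHEAD_S + estimated_new / PREFILL_THROUGHPUT


def decide(cs: Instance, d: Instance, estimated_new: int, current_offloads: int):
    """Return (use_offload, reason). cache_ratio is deliberately not an input."""
    if current_offloads >= MAX_OFFLOAD_INFLIGHT:
        return False, "cap_reached_%d" % current_offloads
    if offload_cost(d, estimated_new) < colocated_cost(cs, estimated_new):
        return True, "offload_cheaper"      # hypothetical reason string
    return False, "colocated_cheaper"       # hypothetical reason string


# A cold request (cache_ratio = 0) facing a busy C_s and a lightly loaded D.
busy_cs = Instance(pending_prefill_tokens=50_000)  # 5.0 s of queue
idle_d = Instance(pending_prefill_tokens=10_000)   # 1.0 s of queue
print(decide(busy_cs, idle_d, estimated_new=4_000, current_offloads=0))
```

Because the `estimated_new` term appears in both costs, it cancels: under these assumptions the request is offloaded exactly when the C_s queue exceeds the D queue by more than the RDMA overhead, whatever the cache ratio is.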