Hybrid routing: session-sticky + load-aware override achieves best results

Session affinity for KV reuse, with a load-aware override when the pinned instance has ongoing_tokens > 2x the average. Combines the APC of sticky routing with the latency of load-based routing.

Results (1000 req, TP=1 DP=8 combined):

                              TTFT50   TPOT90   E2E50    APC
Old cache-aware               0.731    0.073    4.480    44.7%
Balanced session-sticky       0.953    0.079    5.520    48.7%
Hybrid (sticky+load-aware)    0.737    0.072    4.487    49.4%   <- BEST

Hybrid improves APC by 4.7pp over the old cache-aware router with no meaningful latency regression (TTFT50 and E2E50 stay within 1% of that baseline, and TPOT90 is slightly lower). Session-sticky provides KV reuse; load-aware override prevents hotspots.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
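The deltas quoted in the results above can be re-derived from the table itself. The snippet below is only an arithmetic check: the numbers are copied from the commit message, while the dictionary keys and the compare helper are hypothetical names introduced for illustration.

```python
# Figures copied from the results table in the commit message.
# The key names are illustrative, not identifiers from the repository.
results = {
    "old_cache_aware": {"ttft50": 0.731, "tpot90": 0.073, "e2e50": 4.480, "apc": 44.7},
    "balanced_sticky": {"ttft50": 0.953, "tpot90": 0.079, "e2e50": 5.520, "apc": 48.7},
    "hybrid": {"ttft50": 0.737, "tpot90": 0.072, "e2e50": 4.487, "apc": 49.4},
}


def compare(candidate: str, baseline: str = "old_cache_aware") -> dict:
    """Relative latency change (percent) and APC change (percentage points)."""
    cand, base = results[candidate], results[baseline]
    out = {}
    for metric in ("ttft50", "tpot90", "e2e50"):
        out[metric + "_pct"] = round(100.0 * (cand[metric] - base[metric]) / base[metric], 2)
    out["apc_pp"] = round(cand["apc"] - base["apc"], 1)
    return out


hybrid_vs_old = compare("hybrid")
sticky_vs_old = compare("balanced_sticky")
print("hybrid vs old:", hybrid_vs_old)
print("sticky vs old:", sticky_vs_old)
```

Under this reading, hybrid keeps TTFT50 and E2E50 within 1% of the old router while gaining 4.7pp APC, whereas balanced session-sticky pays roughly 30% on TTFT50 for a 4.0pp gain.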
2026-05-22 02:53:44 +08:00
parent efe984477a
commit 012d73f596
1 changed files with 21 additions and 25 deletions
--- a/scripts/cache_aware_proxy.py
+++ b/scripts/cache_aware_proxy.py
@@ -70,41 +70,37 @@ _inst_cumulative_tokens: list[int] = []
 def pick_instance(instances: list[InstanceState], token_ids: list[int] | None,
                   session_id: str | None, input_length: int,
                   affinity: dict[str, int]) -> tuple[InstanceState, int]:
-    """Session-sticky + KV-size balanced placement.
-    Turn 2+: session affinity (sticky to same instance for KV reuse).
-    Turn 1 (new session): place on instance with least cumulative token load
-    (greedy bin packing), with cache-hit tiebreak.
+    """Session-sticky with load-aware override.
+    Turn 2+: use session affinity UNLESS pinned instance is overloaded
+    (ongoing_tokens > 2x average), in which case pick least-loaded.
+    Turn 1: pick instance with best score (load + cache combined).
     """
     global _inst_cumulative_tokens
     if not _inst_cumulative_tokens:
         _inst_cumulative_tokens = [0] * len(instances)
-    # Session affinity for turn 2+
+    avg_load = max(sum(i.ongoing_tokens for i in instances) / len(instances), 1.0)
+    OVERLOAD_FACTOR = 2.0
+    # Session affinity for turn 2+ (with load override)
     if session_id and session_id in affinity:
         idx = affinity[session_id]
         if idx < len(instances):
-            return instances[idx], idx
+            inst = instances[idx]
+            # Stick if not overloaded
+            if inst.ongoing_tokens <= avg_load * OVERLOAD_FACTOR:
+                return inst, idx
+            # Overloaded: fall through to score-based selection
-    # New session: balanced placement
-    # Primary: least cumulative tokens (long-term balance)
-    # Secondary: cache hit (tiebreak for prefix reuse)
-    min_load = min(_inst_cumulative_tokens)
-    # Candidates within 10% of min load
-    threshold = min_load + max(min_load * 0.1, 10000)
-    candidates = [i for i in range(len(instances))
-                  if _inst_cumulative_tokens[i] <= threshold]
-    if not candidates:
-        candidates = list(range(len(instances)))
-    # Among candidates, pick best cache hit
-    best_idx = candidates[0]
-    best_hit = 0
-    for i in candidates:
-        hit = instances[i].estimate_cache_hit(token_ids)
-        if hit > best_hit:
-            best_hit = hit
+    # Score = ongoing_tokens - ALPHA * cache_hit_tokens
+    # Balances load (lower is better) with cache affinity (higher hit is better)
+    best_idx, best_score = 0, float("inf")
+    for i, inst in enumerate(instances):
+        cache_hit = inst.estimate_cache_hit(token_ids)
+        score = inst.ongoing_tokens - CACHE_HIT_ALPHA * cache_hit
+        if score < best_score:
+            best_score = score
             best_idx = i
     _inst_cumulative_tokens[best_idx] += input_length
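
The routing rule in this diff can be exercised outside the proxy with a small stand-in. What follows is a minimal sketch, not the code from scripts/cache_aware_proxy.py: FakeInstance and pick are hypothetical names, the prefix-match cache estimate is a toy, and the CACHE_HIT_ALPHA value is assumed because the diff does not show where that constant is defined.

```python
from __future__ import annotations

from dataclasses import dataclass, field

CACHE_HIT_ALPHA = 1.0  # ASSUMED weight; the real value is not visible in the diff
OVERLOAD_FACTOR = 2.0  # from the diff: override when load exceeds 2x the average


@dataclass
class FakeInstance:
    """Hypothetical stand-in for InstanceState (illustration only)."""
    ongoing_tokens: int = 0
    cached_prefix: list[int] = field(default_factory=list)

    def estimate_cache_hit(self, token_ids: list[int] | None) -> int:
        # Toy estimate: length of the common prefix with the cached tokens.
        hit = 0
        for cached, wanted in zip(self.cached_prefix, token_ids or []):
            if cached != wanted:
                break
            hit += 1
        return hit


def pick(instances: list[FakeInstance], token_ids: list[int] | None,
         session_id: str | None, affinity: dict[str, int]) -> int:
    """Return the chosen instance index: sticky unless the pinned one is overloaded."""
    avg_load = max(sum(i.ongoing_tokens for i in instances) / len(instances), 1.0)
    if session_id and session_id in affinity:
        idx = affinity[session_id]
        if idx < len(instances) and instances[idx].ongoing_tokens <= avg_load * OVERLOAD_FACTOR:
            return idx  # pinned instance is within the load budget: stay sticky
    # New session, or pinned instance overloaded: lowest score wins.
    best_idx, best_score = 0, float("inf")
    for i, inst in enumerate(instances):
        score = inst.ongoing_tokens - CACHE_HIT_ALPHA * inst.estimate_cache_hit(token_ids)
        if score < best_score:
            best_score, best_idx = score, i
    return best_idx


prompt = [1, 2, 3, 4]
affinity = {"s1": 0}

# Average load 90, threshold 180: the pinned instance (100) is kept.
balanced = [FakeInstance(100, prompt), FakeInstance(80), FakeInstance(90)]
sticky_choice = pick(balanced, prompt, "s1", affinity)

# Average load 400, threshold 800: the pinned instance (1000) is overridden.
hot = [FakeInstance(1000, prompt), FakeInstance(100), FakeInstance(100)]
override_choice = pick(hot, prompt, "s1", affinity)

# New session: a 4-token cache hit outweighs a 2-token load difference.
cache_choice = pick([FakeInstance(100, prompt), FakeInstance(98)], prompt, None, affinity)

print(sticky_choice, override_choice, cache_choice)
```

One property of the 2x rule worth keeping in mind: with only two instances the override can never fire, because a single instance cannot exceed twice the mean of two non-negative loads. With the DP=8 setup from the commit message, the pinned instance can reach at most 8x the average, so the threshold is reachable there.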