Hybrid routing: session-sticky + load-aware override achieves best results

Route turn 2+ requests by session affinity for KV reuse, but override
the pin when the pinned instance has ongoing_tokens > 2x the average
across instances. This combines the APC hit rate of sticky routing with
the latency of load-based routing.

Results (1000 req, TP=1 DP=8 combined):
                              TTFT50  TPOT90  E2E50   APC
  Old cache-aware             0.731   0.073   4.480   44.7%
  Balanced session-sticky     0.953   0.079   5.520   48.7%
  Hybrid (sticky+load-aware)  0.737   0.072   4.487   49.4%  <- BEST

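Hybrid improves APC by 4.7pp over the old cache-aware router
(44.7% -> 49.4%) with latency essentially unchanged (TTFT50
0.731 -> 0.737, E2E50 4.480 -> 4.487, TPOT90 0.073 -> 0.072).
Session-sticky routing provides KV reuse; the load-aware override
prevents hotspots.
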
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -70,41 +70,37 @@ _inst_cumulative_tokens: list[int] = []
 def pick_instance(instances: list[InstanceState], token_ids: list[int] | None,
                   session_id: str | None, input_length: int,
                   affinity: dict[str, int]) -> tuple[InstanceState, int]:
-    """Session-sticky + KV-size balanced placement.
+    """Session-sticky with load-aware override.
 
-    Turn 2+: session affinity (sticky to same instance for KV reuse).
-    Turn 1 (new session): place on instance with least cumulative token load
-    (greedy bin packing), with cache-hit tiebreak.
+    Turn 2+: use session affinity UNLESS pinned instance is overloaded
+    (ongoing_tokens > 2x average), in which case pick least-loaded.
+    Turn 1: pick instance with best score (load + cache combined).
     """
     global _inst_cumulative_tokens
     if not _inst_cumulative_tokens:
         _inst_cumulative_tokens = [0] * len(instances)
 
-    # Session affinity for turn 2+
+    avg_load = max(sum(i.ongoing_tokens for i in instances) / len(instances), 1.0)
+    OVERLOAD_FACTOR = 2.0
+
+    # Session affinity for turn 2+ (with load override)
     if session_id and session_id in affinity:
         idx = affinity[session_id]
         if idx < len(instances):
-            return instances[idx], idx
+            inst = instances[idx]
+            # Stick if not overloaded
+            if inst.ongoing_tokens <= avg_load * OVERLOAD_FACTOR:
+                return inst, idx
+            # Overloaded: fall through to score-based selection
 
-    # New session: balanced placement
-    # Primary: least cumulative tokens (long-term balance)
-    # Secondary: cache hit (tiebreak for prefix reuse)
-    min_load = min(_inst_cumulative_tokens)
-    # Candidates within 10% of min load
-    threshold = min_load + max(min_load * 0.1, 10000)
-    candidates = [i for i in range(len(instances))
-                  if _inst_cumulative_tokens[i] <= threshold]
-
-    if not candidates:
-        candidates = list(range(len(instances)))
-
-    # Among candidates, pick best cache hit
-    best_idx = candidates[0]
-    best_hit = 0
-    for i in candidates:
-        hit = instances[i].estimate_cache_hit(token_ids)
-        if hit > best_hit:
-            best_hit = hit
+    # Score = ongoing_tokens - ALPHA * cache_hit_tokens
+    # Balances load (lower is better) with cache affinity (higher hit is better)
+    best_idx, best_score = 0, float("inf")
+    for i, inst in enumerate(instances):
+        cache_hit = inst.estimate_cache_hit(token_ids)
+        score = inst.ongoing_tokens - CACHE_HIT_ALPHA * cache_hit
+        if score < best_score:
+            best_score = score
             best_idx = i
 
     _inst_cumulative_tokens[best_idx] += input_length
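
The routing order in the diff above (stick to the pinned instance unless it is overloaded, otherwise take the lowest score) can be tried in isolation with the minimal sketch below. It is not the repository's code: `ToyInstance`, its `cached_prefix_len` field, the `pick` helper, and the returned reason strings are hypothetical stand-ins, and `CACHE_HIT_ALPHA = 0.5` is an assumed value, since the real constant is defined outside this hunk.

```python
from dataclasses import dataclass

CACHE_HIT_ALPHA = 0.5  # ASSUMED weight; the real constant lives outside the hunk
OVERLOAD_FACTOR = 2.0  # same factor as in the diff


@dataclass
class ToyInstance:
    """Hypothetical stand-in for InstanceState with only the fields used here."""
    ongoing_tokens: int
    cached_prefix_len: int = 0  # pretend number of cached prefix tokens

    def estimate_cache_hit(self, token_ids):
        # Toy estimate: a hit can never be longer than the prompt itself.
        return min(self.cached_prefix_len, len(token_ids or []))


def pick(instances, token_ids, session_id, affinity):
    """Return (index, reason), following the sticky-then-score order of the diff."""
    avg_load = max(sum(i.ongoing_tokens for i in instances) / len(instances), 1.0)
    pinned_overloaded = False
    if session_id and session_id in affinity:
        idx = affinity[session_id]
        if idx < len(instances):
            if instances[idx].ongoing_tokens <= avg_load * OVERLOAD_FACTOR:
                return idx, "sticky"
            pinned_overloaded = True
    # New session or overloaded pin: the lowest score wins (first index on ties).
    scores = [
        inst.ongoing_tokens - CACHE_HIT_ALPHA * inst.estimate_cache_hit(token_ids)
        for inst in instances
    ]
    best = min(range(len(instances)), key=scores.__getitem__)
    return best, ("override" if pinned_overloaded else "scored")


prompt = list(range(4000))
affinity = {"s1": 0, "s2": 1}

# Average load is 3000, so the overload threshold is 6000 ongoing tokens.
hot = [ToyInstance(9000, 4000), ToyInstance(1000), ToyInstance(1000), ToyInstance(1000)]
overridden = pick(hot, prompt, "s1", affinity)  # pinned instance 0 holds 9000 > 6000
sticky = pick(hot, prompt, "s2", affinity)      # pinned instance 1 holds 1000 <= 6000

# New session: the cache bonus (0.5 * 4000) outweighs instance 0's higher load.
warm = [ToyInstance(3000, 4000), ToyInstance(2000), ToyInstance(2000), ToyInstance(2000)]
fresh = pick(warm, prompt, None, affinity)
```

With these toy numbers, session `s1` is moved off instance 0 even though that instance holds its cached prefix, because the score 9000 - 0.5 * 4000 = 7000 is still higher than the 1000 of the idle instances. A larger assumed `CACHE_HIT_ALPHA` would shift that trade-off toward cache reuse.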