Documents each pick_instance_* function from cache_aware_proxy.py in pseudocode so the policy semantics can be cited without re-reading implementation details. Covers lmetric (main baseline), load_only (no cache / no affinity control), sticky (hard affinity control), unified (gated affinity + LMetric fallback), and capped (lmetric on a per-session turn-capped trace). Includes a decision matrix that maps each policy to whether it uses session affinity, cache awareness, load awareness, and overload break, plus a one-liner per control explaining what comparison isolates which factor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6.1 KiB
B3 Routing Policies — Pseudocode
Reference: scripts/cache_aware_proxy.py. All five policies share the
same per-worker state machine; only the per-request pick_instance_*
function differs.
Shared per-instance state
inst.url
inst.ongoing_tokens # sum of input_length across in-flight reqs
inst.pending_prefill_tokens
inst.ongoing_decode_tokens
inst.num_requests # waiting + running
inst.cached_blocks # LRU set of 512-token block hashes
inst.estimate_cache_hit(tokens) -> int
# longest prefix of `tokens` (in BLOCK_SIZE
# chunks) currently in cached_blocks
Each pick is one round-trip on every routing decision; counters are mutated when a request starts/finishes, not inside the picker.
1. lmetric — main baseline
Pure per-request LMetric scoring. No session affinity, no overload-break logic.
def pick_lmetric(instances, token_ids, input_length):
best, best_score = None, +inf
for inst in instances:
cache_hit = inst.estimate_cache_hit(token_ids)
new_prefill = max(0, input_length - cache_hit)
p_tokens = inst.pending_prefill_tokens + new_prefill
bs = inst.num_requests
score = p_tokens * bs
if score < best_score:
best, best_score = inst, score
return best
Intuition: prefer the instance where the expected new prefill cost
times the running batch size is smallest. Cache hit reduces
new_prefill, so cache-warm workers win at equal load.
2. load_only — B3 control (no cache, no affinity)
def pick_load_only(instances):
return min(instances, key=lambda inst: inst.num_requests)
Ties: Python min returns the first-seen, so when num_requests is
equal across all instances (e.g. fresh start), pick always lands on
instances[0]. This produces unintentional stickiness at low
concurrency — the B3 lmetric/load_only comparison reads APC=54.1%
for load_only partly because of that.
3. sticky — B3 control (hard affinity)
Once a session is bound, never break the binding under any load.
def pick_sticky(instances, session_id, affinity):
if session_id in affinity:
return instances[affinity[session_id]] # unconditional
chosen = min(instances, key=lambda i: i.num_requests)
affinity[session_id] = index_of(chosen)
return chosen
This is the upper bound on locality and the worst case on hot-spot behavior — a single heavy session pins one worker forever.
4. unified — hybrid affinity + LMetric fallback
Sticks to the affinity worker only when the cache is genuinely warm and the affinity worker is not overloaded; otherwise falls back to LMetric with a 4-key tie-breaker.
def pick_unified(instances, token_ids, input_length, session_id, affinity):
avg_reqs = max(mean(inst.num_requests for inst in instances), 1.0)
# Affinity gate (both must hold)
if session_id in affinity:
a = instances[affinity[session_id]]
a_hit_ratio = a.estimate_cache_hit(token_ids) / max(input_length, 1)
if a_hit_ratio > 0.5 \
and a.num_requests <= avg_reqs * OVERLOAD_FACTOR:
return a # stick
# LMetric fallback with multi-key tie-break
keys = []
for inst in instances:
cache_hit = inst.estimate_cache_hit(token_ids)
new_prefill = max(0, input_length - cache_hit)
p_tokens = inst.pending_prefill_tokens + new_prefill
bs = inst.num_requests
score = p_tokens * bs
keys.append((score, new_prefill, bs, idx_of(inst)))
best_3tuple = min(k[:3] for k in keys)
tied = [k for k in keys if k[:3] == best_3tuple]
if len(tied) > 1:
# Round-robin among ties so brand-new traffic doesn't pin
# instance 0 when BS=0 across the board.
winner = tied[_rr_counter % len(tied)]
_rr_counter += 1
else:
winner = tied[0]
return instances[winner.idx]
Tie-break ordering: score (LMetric primary), then new_prefill
(prefer the most cache-warm at equal score), then num_requests
(prefer least-loaded), then a global round-robin counter.
OVERLOAD_FACTOR defaults to 2.0; when the affinity worker is
above 2× average load, the picker treats it as overloaded and steers
away.
5. capped — lmetric on a session-mass-capped trace
Not a new picker. The picker is the same pick_lmetric from §1; the
input trace is preprocessed.
def build_capped_trace(input_path, output_path, MAX_TURNS=8):
by_session = group_by_session_id(load(input_path))
capped = []
for sid, turns in by_session.items():
turns.sort_by(lambda t: (t.turn_id, t.timestamp))
capped.extend(turns[:MAX_TURNS])
capped.sort_by(timestamp) # restore wall-clock order
write_jsonl(capped, output_path)
# At run time:
trace = build_capped_trace("w600_r0.0015_st30.jsonl")
picker = pick_lmetric
For this trace MAX_TURNS=8 truncates the heavy-tail sessions (full
trace turns/session p90=1, p99=18, max=3091). The intent is to
isolate "what does LMetric look like when no session is heavy
enough to hot-spot a worker?" — comparing capped vs lmetric is the
session-mass ablation.
Decision matrix
| session affinity | cache aware | load aware | overload break | |
|---|---|---|---|---|
lmetric |
✗ | ✓ (via cache_hit → new_prefill) |
✓ (num_requests BS factor) |
n/a |
load_only |
✗ | ✗ | ✓ (num_requests only) |
n/a |
sticky |
✓ (hard) | ✗ (relies on physical hits, not scoring) | only on first turn | never |
unified |
✓ (gated) | ✓ | ✓ | gate: cache_ratio>0.5 AND num_req ≤ 2× avg |
capped |
same as lmetric; the trace itself is truncated |
What each control isolates
lmetricvsload_only→ contribution of cache awareness alone.lmetricvssticky→ contribution of session affinity vs per-request LMetric scoring at the cost of hot-spot.lmetricvsunified→ did adding gated session affinity help?lmetricvscapped→ how much of the residual hot-spot inlmetricis driven by heavy-tail sessions specifically?