B3 policies: pseudocode reference for the five-policy sweep

Documents each pick_instance_* function from cache_aware_proxy.py in
pseudocode so the policy semantics can be cited without re-reading
implementation details. Covers lmetric (main baseline), load_only
(no cache / no affinity control), sticky (hard affinity control),
unified (gated affinity + LMetric fallback), and capped (lmetric on
a per-session turn-capped trace).

Includes a decision matrix that maps each policy to whether it uses
session affinity, cache awareness, load awareness, and overload
break, plus a one-liner per control explaining what comparison
isolates which factor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 19:57:02 +08:00
parent 123a74a4b9
commit 08530b3915

@@ -0,0 +1,167 @@
# B3 Routing Policies — Pseudocode
Reference: `scripts/cache_aware_proxy.py`. All five policies share the
same per-worker state machine; only the per-request `pick_instance_*`
function differs.
## Shared per-instance state
```text
inst.url
inst.ongoing_tokens              # sum of input_length across in-flight reqs
inst.pending_prefill_tokens
inst.ongoing_decode_tokens
inst.num_requests                # waiting + running
inst.cached_blocks               # LRU set of 512-token block hashes
inst.estimate_cache_hit(tokens) -> int
                                 # longest prefix of `tokens` (in BLOCK_SIZE
                                 # chunks) currently in cached_blocks
```
A picker is called once per routing decision. Per-instance counters are
mutated when a request starts or finishes, not inside the picker.
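To make the state listing concrete, here is a minimal Python sketch of one instance record. It is an illustration under stated assumptions, not the code in `scripts/cache_aware_proxy.py`: the prefix-chained hashing in `block_hashes`, the `remember` helper, and the omission of LRU eviction are all ours.

```python
from collections import OrderedDict
from dataclasses import dataclass, field

BLOCK_SIZE = 512  # tokens per cache block, as in the state listing


def block_hashes(tokens):
    """Hash each full BLOCK_SIZE chunk chained with its prefix (assumed scheme)."""
    hashes, prev = [], 0
    for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
        prev = hash((prev, tuple(tokens[start:start + BLOCK_SIZE])))
        hashes.append(prev)
    return hashes


@dataclass
class Instance:
    url: str
    pending_prefill_tokens: int = 0
    num_requests: int = 0
    # OrderedDict used as an LRU set; eviction is left out of this sketch.
    cached_blocks: OrderedDict = field(default_factory=OrderedDict)

    def remember(self, tokens):
        """Hypothetical helper: mark the full blocks of `tokens` as cached."""
        for h in block_hashes(tokens):
            self.cached_blocks[h] = True
            self.cached_blocks.move_to_end(h)

    def estimate_cache_hit(self, tokens):
        """Number of leading tokens covered by consecutive cached blocks."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self.cached_blocks:
                break
            hit += BLOCK_SIZE
        return hit


inst = Instance("http://worker-0")   # placeholder URL
prompt = list(range(1300))
inst.remember(prompt[:1024])         # two full blocks become warm
warm_tokens = inst.estimate_cache_hit(prompt)
```

Because each block hash is chained with its prefix, a prompt that shares later blocks but differs in the first block counts as a full miss, which matches the "longest prefix" wording above.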
## 1. `lmetric` — main baseline
Pure per-request LMetric scoring. No session affinity, no
overload-break logic.
```text
def pick_lmetric(instances, token_ids, input_length):
    best, best_score = None, +inf
    for inst in instances:
        cache_hit   = inst.estimate_cache_hit(token_ids)
        new_prefill = max(0, input_length - cache_hit)
        p_tokens    = inst.pending_prefill_tokens + new_prefill
        bs          = inst.num_requests
        score       = p_tokens * bs
        if score < best_score:
            best, best_score = inst, score
    return best
```
Intuition: prefer the instance where the total pending prefill (existing
backlog plus the uncached part of the new request) times the current
request count is smallest. A cache hit reduces `new_prefill`, so
cache-warm workers win at equal load.
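A small worked example may help here. The sketch below runs the scoring loop on two workers with identical load, one of which already holds most of the prompt. `FakeInstance` and its `cached_prefix` field are stand-ins invented for the example; only the scoring arithmetic follows the pseudocode.

```python
from dataclasses import dataclass


@dataclass
class FakeInstance:
    name: str
    pending_prefill_tokens: int
    num_requests: int
    cached_prefix: int  # tokens of the incoming prompt assumed to be cached

    def estimate_cache_hit(self, token_ids):
        return min(self.cached_prefix, len(token_ids))


def pick_lmetric(instances, token_ids, input_length):
    best, best_score = None, float("inf")
    for inst in instances:
        cache_hit = inst.estimate_cache_hit(token_ids)
        new_prefill = max(0, input_length - cache_hit)
        score = (inst.pending_prefill_tokens + new_prefill) * inst.num_requests
        if score < best_score:
            best, best_score = inst, score
    return best


prompt = list(range(4096))
cold = FakeInstance("cold", pending_prefill_tokens=1000, num_requests=4, cached_prefix=0)
warm = FakeInstance("warm", pending_prefill_tokens=1000, num_requests=4, cached_prefix=3072)

# cold: (1000 + 4096) * 4 = 20384
# warm: (1000 + 1024) * 4 = 8096
chosen = pick_lmetric([cold, warm], prompt, len(prompt))
```

One consequence of the multiplicative form: when every instance has `num_requests == 0`, every score is 0 regardless of cache state, and the strict `<` keeps the first instance in the list.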
## 2. `load_only` — B3 control (no cache, no affinity)
```text
def pick_load_only(instances):
    return min(instances, key=lambda inst: inst.num_requests)
```
Ties: Python's `min` returns the first minimal element, so when
`num_requests` is equal across all instances (e.g. on a fresh start),
the pick always lands on `instances[0]`. This produces unintentional
stickiness at low concurrency, which is part of why the B3
lmetric/load_only comparison reads APC=54.1% for load_only.
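The tie behavior is easy to reproduce. The toy run below assumes strictly sequential traffic (each request finishes before the next one is routed), so every pick sees all counters at zero; the `SimpleNamespace` workers are placeholders for real instances.

```python
from collections import Counter
from types import SimpleNamespace


def pick_load_only(instances):
    return min(instances, key=lambda inst: inst.num_requests)


instances = [SimpleNamespace(name=f"w{i}", num_requests=0) for i in range(4)]

picks = Counter()
for _ in range(100):
    inst = pick_load_only(instances)
    inst.num_requests += 1  # request starts
    inst.num_requests -= 1  # ...and finishes before the next pick
    picks[inst.name] += 1
```

All 100 picks land on `w0`, so at low concurrency `load_only` behaves like an accidental single-worker affinity policy.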
## 3. `sticky` — B3 control (hard affinity)
Once a session is bound, never break the binding under any load.
```text
def pick_sticky(instances, session_id, affinity):
    if session_id in affinity:
        return instances[affinity[session_id]]    # unconditional
    chosen = min(instances, key=lambda i: i.num_requests)
    affinity[session_id] = index_of(chosen)
    return chosen
```
This is the upper bound on locality and the worst case on hot-spot
behavior — a single heavy session pins one worker forever.
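The hot-spot effect can be seen in a few lines. This is a toy run under simplifying assumptions: in-flight requests never complete, the heavy session issues 50 turns back to back, and the workers are bare `SimpleNamespace` objects rather than real instances.

```python
from types import SimpleNamespace


def pick_sticky(instances, session_id, affinity):
    if session_id in affinity:
        return instances[affinity[session_id]]  # unconditional
    idx = min(range(len(instances)), key=lambda i: instances[i].num_requests)
    affinity[session_id] = idx
    return instances[idx]


instances = [SimpleNamespace(num_requests=0) for _ in range(3)]
affinity = {}

# One heavy session with 50 turns, then three single-turn sessions.
for sid in ["heavy"] * 50 + ["a", "b", "c"]:
    pick_sticky(instances, sid, affinity).num_requests += 1

loads = [inst.num_requests for inst in instances]
```

New sessions still spread by load, but the heavy session's binding is never revisited, so worker 0 ends with 50 of the 53 requests.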
## 4. `unified` — hybrid affinity + LMetric fallback
Sticks to the affinity worker only when the cache is genuinely warm
and the affinity worker is not overloaded; otherwise falls back to
LMetric with a 4-key tie-breaker.
```text
def pick_unified(instances, token_ids, input_length, session_id, affinity):
    avg_reqs = max(mean(inst.num_requests for inst in instances), 1.0)

    # Affinity gate (both must hold)
    if session_id in affinity:
        a = instances[affinity[session_id]]
        a_hit_ratio = a.estimate_cache_hit(token_ids) / max(input_length, 1)
        if a_hit_ratio > 0.5 \
           and a.num_requests <= avg_reqs * OVERLOAD_FACTOR:
            return a    # stick

    # LMetric fallback with multi-key tie-break
    keys = []
    for inst in instances:
        cache_hit   = inst.estimate_cache_hit(token_ids)
        new_prefill = max(0, input_length - cache_hit)
        p_tokens    = inst.pending_prefill_tokens + new_prefill
        bs          = inst.num_requests
        score       = p_tokens * bs
        keys.append((score, new_prefill, bs, index_of(inst)))

    best_3tuple = min(k[:3] for k in keys)
    tied = [k for k in keys if k[:3] == best_3tuple]
    if len(tied) > 1:
        # Round-robin among ties so brand-new traffic doesn't pin
        # instance 0 when BS=0 across the board.
        winner = tied[_rr_counter % len(tied)]
        _rr_counter += 1
    else:
        winner = tied[0]
    return instances[winner[3]]    # 4th element of the key is the instance index
```
Tie-break ordering: `score` (LMetric primary), then `new_prefill`
(prefer the most cache-warm at equal score), then `num_requests`
(prefer least-loaded), then a global round-robin counter.
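The tie-break can be exercised in isolation. The sketch below takes precomputed key tuples in the order `(score, new_prefill, num_requests, index)`; the helper name `pick_with_tiebreak` and the use of `itertools.count` for the global counter are choices made for this example.

```python
from itertools import count

_rr = count()  # global round-robin counter; advances only when a tie occurs


def pick_with_tiebreak(keys):
    """keys: (score, new_prefill, num_requests, index) tuples; returns an index."""
    best = min(k[:3] for k in keys)
    tied = [k for k in keys if k[:3] == best]
    if len(tied) > 1:
        return tied[next(_rr) % len(tied)][3]
    return tied[0][3]


# Fresh start: three idle, cache-cold workers tie on all three keys.
idle = [(0, 4096, 0, i) for i in range(3)]
idle_picks = [pick_with_tiebreak(idle) for _ in range(6)]

# Idle workers again, but worker 1 is cache-warm: it wins on new_prefill.
warm_pick = pick_with_tiebreak([(0, 4096, 0, 0), (0, 1024, 0, 1)])
```

The second case shows why `new_prefill` is a separate key: with `num_requests == 0` the LMetric score collapses to 0 for every worker, and without the second key cache warmth would be ignored.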
`OVERLOAD_FACTOR` defaults to 2.0; when the affinity worker is
above 2× average load, the picker treats it as overloaded and steers
away.
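The affinity gate can also be read as a pure predicate. The following sketch factors it out so the boundary cases are easy to check; `should_stick` and `HIT_RATIO_THRESHOLD` are names introduced for illustration, while the 0.5 and 2.0 constants come from the pseudocode.

```python
from statistics import mean

OVERLOAD_FACTOR = 2.0
HIT_RATIO_THRESHOLD = 0.5  # the 0.5 constant in the gate; the name is ours


def should_stick(cache_hit, input_length, affinity_reqs, all_reqs):
    """True when the affinity worker is cache-warm and not overloaded."""
    avg_reqs = max(mean(all_reqs), 1.0)  # floor avoids a zero threshold when idle
    hit_ratio = cache_hit / max(input_length, 1)
    return (hit_ratio > HIT_RATIO_THRESHOLD
            and affinity_reqs <= avg_reqs * OVERLOAD_FACTOR)


warm_balanced = should_stick(3072, 4096, affinity_reqs=3, all_reqs=[3, 2, 4])
cold = should_stick(1024, 4096, affinity_reqs=3, all_reqs=[3, 2, 4])
overloaded = should_stick(3072, 4096, affinity_reqs=9, all_reqs=[9, 1, 2])
```

Two details are worth noting. The hit-ratio comparison is strict, so a prompt that is exactly half cached does not stick. The `max(..., 1.0)` floor means that on a nearly idle cluster the affinity worker may hold up to 2 requests before it counts as overloaded.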
## 5. `capped` — `lmetric` on a session-mass-capped trace
Not a new picker. The picker is the same `pick_lmetric` from §1; the
input trace is preprocessed.
```text
def build_capped_trace(input_path, output_path, MAX_TURNS=8):
    by_session = group_by_session_id(load(input_path))
    capped = []
    for sid, turns in by_session.items():
        turns.sort_by(lambda t: (t.turn_id, t.timestamp))
        capped.extend(turns[:MAX_TURNS])
    capped.sort_by(timestamp)    # restore wall-clock order
    write_jsonl(capped, output_path)
    return capped

# At run time (output_path: wherever the capped trace is written):
trace  = build_capped_trace("w600_r0.0015_st30.jsonl", output_path)
picker = pick_lmetric
```
For this trace `MAX_TURNS=8` truncates the heavy-tail sessions (full
trace turns/session p90=1, p99=18, max=3091). The intent is to
isolate "what does LMetric look like when no session is heavy
enough to hot-spot a worker?" — comparing capped vs lmetric is the
session-mass ablation.
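A runnable version of the capping step, operating on in-memory records instead of JSONL files, might look like this. It assumes each record is a dict with `session_id`, `turn_id`, and `timestamp` keys, which mirrors the field names in the pseudocode but is not confirmed by the trace format itself.

```python
from collections import defaultdict


def cap_sessions(records, max_turns=8):
    """Keep the first `max_turns` turns of each session, in wall-clock order."""
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)

    capped = []
    for turns in by_session.values():
        turns.sort(key=lambda t: (t["turn_id"], t["timestamp"]))
        capped.extend(turns[:max_turns])

    capped.sort(key=lambda t: t["timestamp"])  # restore wall-clock order
    return capped


# Synthetic data: one 20-turn session plus one single-turn session.
records = (
    [{"session_id": "heavy", "turn_id": i, "timestamp": 10 * i} for i in range(20)]
    + [{"session_id": "light", "turn_id": 0, "timestamp": 35}]
)
capped = cap_sessions(records)
heavy_kept = sum(1 for r in capped if r["session_id"] == "heavy")
```

The heavy session is cut from 20 turns to 8 while the light session passes through unchanged, and the final sort interleaves the two by timestamp again.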
## Decision matrix
| | session affinity | cache aware | load aware | overload break |
|---|---|---|---|---|
| `lmetric` | ✗ | ✓ (via `cache_hit` → `new_prefill`) | ✓ (`num_requests` BS factor) | n/a |
| `load_only` | ✗ | ✗ | ✓ (`num_requests` only) | n/a |
| `sticky` | ✓ (hard) | ✗ (relies on physical hits, not scoring) | only on first turn | **never** |
| `unified` | ✓ (gated) | ✓ | ✓ | gate: `a_hit_ratio > 0.5` AND `num_requests ≤ 2 × avg_reqs` |
| `capped` | same as `lmetric`; the trace itself is truncated | | | |
## What each control isolates
- `lmetric` vs `load_only` → contribution of cache awareness alone.
- `lmetric` vs `sticky` → what hard session affinity gains over
per-request LMetric scoring, and what it costs in hot-spotting.
- `lmetric` vs `unified` → did adding gated session affinity help?
- `lmetric` vs `capped` → how much of the residual hot-spot in
`lmetric` is driven by heavy-tail sessions specifically?