B3 policies: pseudocode reference for the five-policy sweep
Documents each pick_instance_* function from cache_aware_proxy.py in pseudocode so the policy semantics can be cited without re-reading implementation details. Covers lmetric (main baseline), load_only (no cache / no affinity control), sticky (hard affinity control), unified (gated affinity + LMetric fallback), and capped (lmetric on a per-session turn-capped trace).

Includes a decision matrix that maps each policy to whether it uses session affinity, cache awareness, load awareness, and overload break, plus a one-liner per control explaining which comparison isolates which factor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
167
analysis/characterization/b3_policies_pseudocode.md
Normal file
@@ -0,0 +1,167 @@
# B3 Routing Policies — Pseudocode

Reference: `scripts/cache_aware_proxy.py`. All five policies share the
same per-worker state machine; only the per-request `pick_instance_*`
function differs.

## Shared per-instance state

```text
inst.url
inst.ongoing_tokens          # sum of input_length across in-flight reqs
inst.pending_prefill_tokens
inst.ongoing_decode_tokens
inst.num_requests            # waiting + running
inst.cached_blocks           # LRU set of 512-token block hashes
inst.estimate_cache_hit(tokens) -> int
                             # token length of the longest prefix of
                             # `tokens` (in BLOCK_SIZE chunks) currently
                             # in cached_blocks
```

Each picker runs once per routing decision. Counters are mutated when a
request starts or finishes, not inside the picker.

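The state listing above can be made concrete with a small Python sketch. This is only an illustration under stated assumptions, not the code in `cache_aware_proxy.py`: the chained SHA-256 block hashing, the `max_blocks` capacity, and the `record_blocks` helper are hypothetical choices made so the example runs on its own.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
import hashlib

BLOCK_SIZE = 512  # tokens per cache block, per the state listing


def block_hashes(tokens):
    """Yield one chained hash per full BLOCK_SIZE chunk of `tokens`.

    Chaining the previous digest into each hash makes a block's identity
    depend on its whole prefix (an assumption; the proxy may hash differently).
    """
    prev = b""
    for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = list(tokens[start:start + BLOCK_SIZE])
        prev = hashlib.sha256(prev + repr(chunk).encode()).digest()
        yield prev


@dataclass
class Instance:
    url: str
    ongoing_tokens: int = 0
    pending_prefill_tokens: int = 0
    ongoing_decode_tokens: int = 0
    num_requests: int = 0
    max_blocks: int = 1024  # hypothetical LRU capacity
    cached_blocks: OrderedDict = field(default_factory=OrderedDict)

    def record_blocks(self, tokens):
        """Mark every full block of `tokens` as cached, evicting LRU entries."""
        for h in block_hashes(tokens):
            self.cached_blocks[h] = True
            self.cached_blocks.move_to_end(h)
            while len(self.cached_blocks) > self.max_blocks:
                self.cached_blocks.popitem(last=False)

    def estimate_cache_hit(self, tokens):
        """Return the token length of the longest cached block-aligned prefix."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self.cached_blocks:
                break
            hit += BLOCK_SIZE
        return hit


inst = Instance(url="http://worker-0")
prompt = list(range(1100))
inst.record_blocks(prompt)
print(inst.estimate_cache_hit(prompt))  # 1024: two full blocks, the 76-token tail is ignored
```

Because matching is block-aligned, a prompt shorter than `BLOCK_SIZE` always reports a hit of 0 in this sketch.
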
## 1. `lmetric` — main baseline

Pure per-request LMetric scoring. No session affinity, no
overload-break logic.

```text
def pick_lmetric(instances, token_ids, input_length):
    best, best_score = None, +inf
    for inst in instances:
        cache_hit   = inst.estimate_cache_hit(token_ids)
        new_prefill = max(0, input_length - cache_hit)
        p_tokens    = inst.pending_prefill_tokens + new_prefill
        bs          = inst.num_requests
        score       = p_tokens * bs
        if score < best_score:
            best, best_score = inst, score
    return best
```

Intuition: prefer the instance where the total pending prefill (the
existing backlog plus this request's uncached tokens) times the current
request count is smallest. A cache hit reduces `new_prefill`, so
cache-warm workers win at equal, nonzero load. An idle instance
(`num_requests == 0`) scores 0 regardless of its cache state, and the
strict `<` means the first instance seen wins any tie.

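A worked example makes the scoring arithmetic easier to check. The sketch below is a runnable rendering of the pseudocode; `FakeInstance` and its `cached_prefix_tokens` field are hypothetical stand-ins for the proxy's real per-instance state.

```python
from dataclasses import dataclass


@dataclass
class FakeInstance:
    """Hypothetical stand-in for the proxy's per-instance state."""
    name: str
    pending_prefill_tokens: int
    num_requests: int
    cached_prefix_tokens: int  # what estimate_cache_hit will report

    def estimate_cache_hit(self, token_ids):
        return min(self.cached_prefix_tokens, len(token_ids))


def lmetric_score(inst, token_ids, input_length):
    cache_hit = inst.estimate_cache_hit(token_ids)
    new_prefill = max(0, input_length - cache_hit)
    return (inst.pending_prefill_tokens + new_prefill) * inst.num_requests


def pick_lmetric(instances, token_ids, input_length):
    # min() keeps the first minimal element, matching the strict `<` loop.
    return min(instances, key=lambda i: lmetric_score(i, token_ids, input_length))


tokens = list(range(4096))
warm = FakeInstance("warm", pending_prefill_tokens=1000, num_requests=4, cached_prefix_tokens=3072)
cold = FakeInstance("cold", pending_prefill_tokens=1000, num_requests=4, cached_prefix_tokens=0)
idle = FakeInstance("idle", pending_prefill_tokens=0, num_requests=0, cached_prefix_tokens=0)

print(lmetric_score(warm, tokens, 4096))  # (1000 + 1024) * 4 = 8096
print(lmetric_score(cold, tokens, 4096))  # (1000 + 4096) * 4 = 20384
print(pick_lmetric([cold, warm], tokens, 4096).name)        # warm
print(pick_lmetric([cold, warm, idle], tokens, 4096).name)  # idle: score is 0 when num_requests == 0
```

The last line shows a property of the score as written: multiplying by `num_requests` means an idle worker beats any busy worker, however warm the busy worker's cache is.
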
## 2. `load_only` — B3 control (no cache, no affinity)

```text
def pick_load_only(instances):
    return min(instances, key=lambda inst: inst.num_requests)
```

Ties: Python's `min` returns the first minimal element it sees, so when
`num_requests` is equal across all instances (e.g. on a fresh start),
the pick always lands on `instances[0]`. This produces unintentional
stickiness at low concurrency, which is part of why the B3
`lmetric`/`load_only` comparison reads APC=54.1% for `load_only`.

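The tie behavior is easy to reproduce. The following sketch assumes strictly sequential traffic (each request finishes before the next one arrives) and uses `SimpleNamespace` objects as hypothetical stand-ins for instances.

```python
from collections import Counter
from types import SimpleNamespace

instances = [SimpleNamespace(name=f"w{i}", num_requests=0) for i in range(4)]


def pick_load_only(instances):
    return min(instances, key=lambda inst: inst.num_requests)


# Sequential traffic: each request finishes before the next arrives,
# so num_requests is back to 0 on every worker at pick time.
picks = Counter()
for _ in range(100):
    chosen = pick_load_only(instances)
    chosen.num_requests += 1   # request starts
    picks[chosen.name] += 1
    chosen.num_requests -= 1   # request finishes before the next pick

print(picks)  # Counter({'w0': 100})
```

Every request lands on `w0`, so that worker's cache stays warm even though the policy never looks at cache state.
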
## 3. `sticky` — B3 control (hard affinity)

Once a session is bound, never break the binding under any load.

```text
def pick_sticky(instances, session_id, affinity):
    if session_id in affinity:
        return instances[affinity[session_id]]    # unconditional
    chosen = min(instances, key=lambda i: i.num_requests)
    affinity[session_id] = index_of(chosen)
    return chosen
```

This is the upper bound on locality and the worst case for hot-spot
behavior: a single heavy session stays pinned to one worker forever,
because load is consulted only on the session's first turn.

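The hot-spot effect can be seen in a few lines. This is a minimal sketch, assuming a burst of 50 turns from one session with no completions in between; the `SimpleNamespace` instances are hypothetical stand-ins.

```python
from types import SimpleNamespace


def pick_sticky(instances, session_id, affinity):
    if session_id in affinity:
        return instances[affinity[session_id]]  # never re-evaluated
    idx = min(range(len(instances)), key=lambda i: instances[i].num_requests)
    affinity[session_id] = idx
    return instances[idx]


instances = [SimpleNamespace(name=f"w{i}", num_requests=0) for i in range(3)]
affinity = {}

# One heavy session issues 50 concurrent turns; none finish during the burst.
for _ in range(50):
    pick_sticky(instances, "heavy", affinity).num_requests += 1

# A light session arrives afterwards and binds to the least-loaded worker.
pick_sticky(instances, "light", affinity).num_requests += 1

print([inst.num_requests for inst in instances])  # [50, 1, 0]
print(affinity)  # {'heavy': 0, 'light': 1}
```

New sessions avoid the hot worker, but the heavy session itself never moves, which is the behavior this control is meant to expose.
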
## 4. `unified` — hybrid affinity + LMetric fallback

Sticks to the affinity worker only when the cache is warm (estimated
hit ratio above 0.5) and the affinity worker is not overloaded;
otherwise falls back to LMetric with a 4-key tie-breaker.

```text
def pick_unified(instances, token_ids, input_length, session_id, affinity):
    avg_reqs = max(mean(inst.num_requests for inst in instances), 1.0)

    # Affinity gate (both must hold)
    if session_id in affinity:
        a = instances[affinity[session_id]]
        a_hit_ratio = a.estimate_cache_hit(token_ids) / max(input_length, 1)
        if a_hit_ratio > 0.5 \
           and a.num_requests <= avg_reqs * OVERLOAD_FACTOR:
            return a                                    # stick

    # LMetric fallback with multi-key tie-break
    keys = []
    for idx, inst in enumerate(instances):
        cache_hit   = inst.estimate_cache_hit(token_ids)
        new_prefill = max(0, input_length - cache_hit)
        p_tokens    = inst.pending_prefill_tokens + new_prefill
        bs          = inst.num_requests
        score       = p_tokens * bs
        keys.append((score, new_prefill, bs, idx))

    best_3tuple = min(k[:3] for k in keys)
    tied = [k for k in keys if k[:3] == best_3tuple]
    if len(tied) > 1:
        # Round-robin among ties so brand-new traffic doesn't pin
        # instance 0 when BS=0 across the board.
        winner = tied[_rr_counter % len(tied)]
        _rr_counter += 1                                # module-level counter
    else:
        winner = tied[0]
    return instances[winner[3]]                         # last key field is idx
```


Tie-break ordering: `score` (LMetric primary), then `new_prefill`
(prefer the most cache-warm at equal score), then `num_requests`
(prefer least-loaded), then a global round-robin counter.

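The ordering above relies on lexicographic tuple comparison for the first three keys and a counter for the fourth. Here is a minimal runnable sketch of just that step; the `break_ties` helper and the `itertools.count` counter are hypothetical names chosen for illustration.

```python
import itertools

_rr = itertools.count()  # global round-robin counter (hypothetical name)


def break_ties(keys):
    """keys: list of (score, new_prefill, num_requests, idx) tuples."""
    best = min(k[:3] for k in keys)            # lexicographic on the first 3 fields
    tied = [k for k in keys if k[:3] == best]
    if len(tied) == 1:
        return tied[0][3]
    return tied[next(_rr) % len(tied)][3]      # counter advances only on ties


# Fresh start: every worker is idle and cold, so all three keys tie.
fresh = [(0, 4096, 0, 0), (0, 4096, 0, 1), (0, 4096, 0, 2)]
print([break_ties(fresh) for _ in range(6)])  # [0, 1, 2, 0, 1, 2]

# Equal score, but worker 1 needs less new prefill, so it wins outright.
warm = [(0, 4096, 0, 0), (0, 1024, 0, 1), (0, 4096, 0, 2)]
print(break_ties(warm))  # 1
```

The second case is where `unified` differs from plain `lmetric` on idle workers: all scores are 0, but the `new_prefill` key still lets the cache-warm worker win.
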

`OVERLOAD_FACTOR` defaults to 2.0. When the affinity worker's
`num_requests` exceeds 2× the fleet average (with the average floored at
1.0), the picker treats it as overloaded and falls back to LMetric
scoring, which usually steers the request to another worker.

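The gate can be checked with concrete numbers. The sketch below isolates the two gate conditions from the pseudocode; `FakeInstance`, `affinity_gate`, and the named `HIT_RATIO_THRESHOLD` constant are hypothetical and exist only to make the example self-contained.

```python
from dataclasses import dataclass
from statistics import mean

OVERLOAD_FACTOR = 2.0
HIT_RATIO_THRESHOLD = 0.5  # the 0.5 literal from the gate, named here for clarity


@dataclass
class FakeInstance:
    num_requests: int
    cached_prefix_tokens: int  # hypothetical: what estimate_cache_hit reports

    def estimate_cache_hit(self, token_ids):
        return min(self.cached_prefix_tokens, len(token_ids))


def affinity_gate(instances, a, token_ids, input_length):
    """Return True when the affinity worker `a` should keep the session."""
    avg_reqs = max(mean(i.num_requests for i in instances), 1.0)
    hit_ratio = a.estimate_cache_hit(token_ids) / max(input_length, 1)
    warm = hit_ratio > HIT_RATIO_THRESHOLD
    not_overloaded = a.num_requests <= avg_reqs * OVERLOAD_FACTOR
    return warm and not_overloaded


tokens = list(range(4096))

# Warm (75% hit), 6 requests vs. a fleet average of 4: limit is 8, so stick.
fleet = [FakeInstance(6, 3072), FakeInstance(3, 0), FakeInstance(3, 0)]
print(affinity_gate(fleet, fleet[0], tokens, 4096))  # True

# Same cache state, 10 requests vs. an average of 4: 10 > 8, so fall back.
fleet = [FakeInstance(10, 3072), FakeInstance(1, 0), FakeInstance(1, 0)]
print(affinity_gate(fleet, fleet[0], tokens, 4096))  # False

# Lightly loaded, but only 25% of the prompt is cached: fall back.
fleet = [FakeInstance(1, 1024), FakeInstance(1, 0), FakeInstance(1, 0)]
print(affinity_gate(fleet, fleet[0], tokens, 4096))  # False
```

One consequence of including the affinity worker in the mean: with only two instances, `a.num_requests <= 2.0 * mean` always holds, so the overload break cannot fire at the default factor.
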
## 5. `capped` — `lmetric` on a session-mass-capped trace

Not a new picker. The picker is the same `pick_lmetric` from §1; the
input trace is preprocessed.

```text
def build_capped_trace(input_path, output_path, MAX_TURNS=8):
    by_session = group_by_session_id(load(input_path))
    capped = []
    for sid, turns in by_session.items():
        turns.sort_by(lambda t: (t.turn_id, t.timestamp))
        capped.extend(turns[:MAX_TURNS])
    capped.sort_by(timestamp)       # restore wall-clock order
    write_jsonl(capped, output_path)
    return capped

# At run time (output_path: wherever the capped trace is written):
trace  = build_capped_trace("w600_r0.0015_st30.jsonl", output_path)
picker = pick_lmetric
```

For this trace `MAX_TURNS=8` truncates the heavy-tail sessions (full
trace turns/session p90=1, p99=18, max=3091), so at least 90% of
sessions pass through untouched. The intent is to isolate "what does
LMetric look like when no session is heavy enough to hot-spot a
worker?" Comparing `capped` vs `lmetric` is the session-mass ablation.

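The preprocessing step can be written as runnable Python in a few lines. This is a sketch under assumptions: it expects one JSON object per line with `session_id`, `turn_id`, and `timestamp` fields (names inferred from the pseudocode, not confirmed against the real trace schema), and the demo data is synthetic.

```python
import json
import os
import tempfile
from collections import defaultdict


def build_capped_trace(input_path, output_path, max_turns=8):
    """Keep only the first `max_turns` turns of each session, in time order.

    Assumes JSONL records with `session_id`, `turn_id`, and `timestamp`
    fields (field names are an assumption based on the pseudocode).
    Returns the number of records written.
    """
    by_session = defaultdict(list)
    with open(input_path) as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                by_session[rec["session_id"]].append(rec)

    capped = []
    for turns in by_session.values():
        turns.sort(key=lambda t: (t["turn_id"], t["timestamp"]))
        capped.extend(turns[:max_turns])

    capped.sort(key=lambda t: t["timestamp"])  # restore wall-clock order
    with open(output_path, "w") as f:
        for rec in capped:
            f.write(json.dumps(rec) + "\n")
    return len(capped)


# Tiny synthetic demo: session "a" has 12 turns, session "b" has 2.
rows = [{"session_id": "a", "turn_id": i, "timestamp": float(i)} for i in range(12)]
rows += [{"session_id": "b", "turn_id": i, "timestamp": i + 0.5} for i in range(2)]
with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "in.jsonl"), os.path.join(tmp, "out.jsonl")
    with open(src, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in rows)
    kept = build_capped_trace(src, dst)
    with open(dst) as f:
        out = [json.loads(line) for line in f]

print(kept)  # 10: 8 turns kept from "a", 2 from "b"
```

Sorting by `timestamp` at the end matters because the per-session grouping destroys the interleaving between sessions that the replay depends on.
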
## Decision matrix

| policy | session affinity | cache aware | load aware | overload break |
|---|---|---|---|---|
| `lmetric` | ✗ | ✓ (via `cache_hit` → `new_prefill`) | ✓ (`num_requests` BS factor) | n/a |
| `load_only` | ✗ | ✗ | ✓ (`num_requests` only) | n/a |
| `sticky` | ✓ (hard) | ✗ (relies on physical hits, not scoring) | only on first turn | **never** |
| `unified` | ✓ (gated) | ✓ | ✓ | gate: `a_hit_ratio > 0.5` AND `num_requests ≤ 2× avg` |
| `capped` | same as `lmetric`; the trace itself is truncated | | | |

## What each control isolates

- `lmetric` vs `load_only` → contribution of cache awareness (note that
  `load_only` also drops the `pending_prefill_tokens` term, so the load
  signal differs as well).
- `lmetric` vs `sticky` → session affinity versus per-request LMetric
  scoring, with hot-spotting as the cost of affinity.
- `lmetric` vs `unified` → did adding gated session affinity help?
- `lmetric` vs `capped` → how much of the residual hot-spotting in
  `lmetric` is driven by heavy-tail sessions specifically?