B3 policies: pseudocode reference for the five-policy sweep

Documents each pick_instance_* function from cache_aware_proxy.py in pseudocode so the policy semantics can be cited without re-reading implementation details. Covers lmetric (main baseline), load_only (no cache / no affinity control), sticky (hard affinity control), unified (gated affinity + LMetric fallback), and capped (lmetric on a per-session turn-capped trace). Includes a decision matrix that maps each policy to whether it uses session affinity, cache awareness, load awareness, and overload break, plus a one-liner per control explaining what comparison isolates which factor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 19:57:02 +08:00
parent 123a74a4b9
commit 08530b3915
1 changed files with 167 additions and 0 deletions
--- a/analysis/characterization/b3_policies_pseudocode.md
+++ b/analysis/characterization/b3_policies_pseudocode.md
@@ -0,0 +1,167 @@
+# B3 Routing Policies — Pseudocode
+
+Reference: `scripts/cache_aware_proxy.py`. All five policies share the
+same per-worker state machine; only the per-request `pick_instance_*`
+function differs.
+
+## Shared per-instance state
+
+```text
+inst.url
+inst.ongoing_tokens        # sum of input_length across in-flight reqs
+inst.pending_prefill_tokens
+inst.ongoing_decode_tokens
+inst.num_requests          # waiting + running
+inst.cached_blocks         # LRU set of 512-token block hashes
+inst.estimate_cache_hit(tokens) -> int
+                           # longest prefix of `tokens` (in BLOCK_SIZE
+                           # chunks) currently in cached_blocks
+```
+
+Each pick is one round-trip on every routing decision; counters are
+mutated when a request starts/finishes, not inside the picker.
+
+## 1. `lmetric` — main baseline
+
+Pure per-request LMetric scoring. No session affinity, no
+overload-break logic.
+
+```text
+def pick_lmetric(instances, token_ids, input_length):
+    best, best_score = None, +inf
+    for inst in instances:
+        cache_hit   = inst.estimate_cache_hit(token_ids)
+        new_prefill = max(0, input_length - cache_hit)
+        p_tokens    = inst.pending_prefill_tokens + new_prefill
+        bs          = inst.num_requests
+        score       = p_tokens * bs
+        if score < best_score:
+            best, best_score = inst, score
+    return best
+```
+
+Intuition: prefer the instance where the expected new prefill cost
+times the running batch size is smallest. Cache hit reduces
+`new_prefill`, so cache-warm workers win at equal load.
+
+## 2. `load_only` — B3 control (no cache, no affinity)
+
+```text
+def pick_load_only(instances):
+    return min(instances, key=lambda inst: inst.num_requests)
+```
+
+Ties: Python `min` returns the first-seen, so when `num_requests` is
+equal across all instances (e.g. fresh start), pick always lands on
+`instances[0]`. This produces unintentional stickiness at low
+concurrency — the B3 lmetric/load_only comparison reads APC=54.1%
+for load_only partly because of that.
+
+## 3. `sticky` — B3 control (hard affinity)
+
+Once a session is bound, never break the binding under any load.
+
+```text
+def pick_sticky(instances, session_id, affinity):
+    if session_id in affinity:
+        return instances[affinity[session_id]]  # unconditional
+    chosen = min(instances, key=lambda i: i.num_requests)
+    affinity[session_id] = index_of(chosen)
+    return chosen
+```
+
+This is the upper bound on locality and the worst case on hot-spot
+behavior — a single heavy session pins one worker forever.
+
+## 4. `unified` — hybrid affinity + LMetric fallback
+
+Sticks to the affinity worker only when the cache is genuinely warm
+and the affinity worker is not overloaded; otherwise falls back to
+LMetric with a 4-key tie-breaker.
+
+```text
+def pick_unified(instances, token_ids, input_length, session_id, affinity):
+    avg_reqs = max(mean(inst.num_requests for inst in instances), 1.0)
+
+    # Affinity gate (both must hold)
+    if session_id in affinity:
+        a = instances[affinity[session_id]]
+        a_hit_ratio = a.estimate_cache_hit(token_ids) / max(input_length, 1)
+        if a_hit_ratio > 0.5 \
+                and a.num_requests <= avg_reqs * OVERLOAD_FACTOR:
+            return a   # stick
+
+    # LMetric fallback with multi-key tie-break
+    keys = []
+    for inst in instances:
+        cache_hit   = inst.estimate_cache_hit(token_ids)
+        new_prefill = max(0, input_length - cache_hit)
+        p_tokens    = inst.pending_prefill_tokens + new_prefill
+        bs          = inst.num_requests
+        score       = p_tokens * bs
+        keys.append((score, new_prefill, bs, idx_of(inst)))
+
+    best_3tuple = min(k[:3] for k in keys)
+    tied = [k for k in keys if k[:3] == best_3tuple]
+    if len(tied) > 1:
+        # Round-robin among ties so brand-new traffic doesn't pin
+        # instance 0 when BS=0 across the board.
+        winner = tied[_rr_counter % len(tied)]
+        _rr_counter += 1
+    else:
+        winner = tied[0]
+    return instances[winner.idx]
+```
+
+Tie-break ordering: `score` (LMetric primary), then `new_prefill`
+(prefer the most cache-warm at equal score), then `num_requests`
+(prefer least-loaded), then a global round-robin counter.
+
+`OVERLOAD_FACTOR` defaults to 2.0; when the affinity worker is
+above 2× average load, the picker treats it as overloaded and steers
+away.
+
+## 5. `capped` — `lmetric` on a session-mass-capped trace
+
+Not a new picker. The picker is the same `pick_lmetric` from §1; the
+input trace is preprocessed.
+
+```text
+def build_capped_trace(input_path, output_path, MAX_TURNS=8):
+    by_session = group_by_session_id(load(input_path))
+    capped = []
+    for sid, turns in by_session.items():
+        turns.sort_by(lambda t: (t.turn_id, t.timestamp))
+        capped.extend(turns[:MAX_TURNS])
+    capped.sort_by(timestamp)        # restore wall-clock order
+    write_jsonl(capped, output_path)
+
+# At run time:
+trace      = build_capped_trace("w600_r0.0015_st30.jsonl")
+picker     = pick_lmetric
+```
+
+For this trace `MAX_TURNS=8` truncates the heavy-tail sessions (full
+trace turns/session p90=1, p99=18, max=3091). The intent is to
+isolate "what does LMetric look like when no session is heavy
+enough to hot-spot a worker?" — comparing capped vs lmetric is the
+session-mass ablation.
+
+## Decision matrix
+
+|  | session affinity | cache aware | load aware | overload break |
+|---|---|---|---|---|
+| `lmetric` | ✗ | ✓ (via `cache_hit` → `new_prefill`) | ✓ (`num_requests` BS factor) | n/a |
+| `load_only` | ✗ | ✗ | ✓ (`num_requests` only) | n/a |
+| `sticky` | ✓ (hard) | ✗ (relies on physical hits, not scoring) | only on first turn | **never** |
+| `unified` | ✓ (gated) | ✓ | ✓ | gate: `cache_ratio>0.5` AND `num_req ≤ 2× avg` |
+| `capped` | same as `lmetric`; the trace itself is truncated | | | |
+
+## What each control isolates
+
+- `lmetric` vs `load_only` → contribution of cache awareness alone.
+- `lmetric` vs `sticky` → contribution of session affinity vs
+  per-request LMetric scoring at the cost of hot-spot.
+- `lmetric` vs `unified` → did adding gated session affinity help?
+- `lmetric` vs `capped` → how much of the residual hot-spot in
+  `lmetric` is driven by heavy-tail sessions specifically?