Files

Gahow Wang 08530b3915 B3 policies: pseudocode reference for the five-policy sweep

Documents each pick_instance_* function from cache_aware_proxy.py in
pseudocode so the policy semantics can be cited without re-reading
implementation details. Covers lmetric (main baseline), load_only
(no cache / no affinity control), sticky (hard affinity control),
unified (gated affinity + LMetric fallback), and capped (lmetric on
a per-session turn-capped trace).

Includes a decision matrix that maps each policy to whether it uses
session affinity, cache awareness, load awareness, and overload
break, plus a one-liner per control explaining what comparison
isolates which factor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 19:57:02 +08:00

6.1 KiB

Raw Permalink Blame History

B3 Routing Policies — Pseudocode

Reference: scripts/cache_aware_proxy.py. All five policies share the same per-worker state machine; only the per-request pick_instance_* function differs.

Shared per-instance state

inst.url
inst.ongoing_tokens        # sum of input_length across in-flight reqs
inst.pending_prefill_tokens
inst.ongoing_decode_tokens
inst.num_requests          # waiting + running
inst.cached_blocks         # LRU set of 512-token block hashes
inst.estimate_cache_hit(tokens) -> int
                           # longest prefix of `tokens` (in BLOCK_SIZE
                           # chunks) currently in cached_blocks

Each pick is one round-trip on every routing decision; counters are mutated when a request starts/finishes, not inside the picker.

1. `lmetric` — main baseline

Pure per-request LMetric scoring. No session affinity, no overload-break logic.

def pick_lmetric(instances, token_ids, input_length):
    best, best_score = None, +inf
    for inst in instances:
        cache_hit   = inst.estimate_cache_hit(token_ids)
        new_prefill = max(0, input_length - cache_hit)
        p_tokens    = inst.pending_prefill_tokens + new_prefill
        bs          = inst.num_requests
        score       = p_tokens * bs
        if score < best_score:
            best, best_score = inst, score
    return best

Intuition: prefer the instance where the expected new prefill cost times the running batch size is smallest. Cache hit reduces new_prefill, so cache-warm workers win at equal load.

2. `load_only` — B3 control (no cache, no affinity)

def pick_load_only(instances):
    return min(instances, key=lambda inst: inst.num_requests)

Ties: Python min returns the first-seen, so when num_requests is equal across all instances (e.g. fresh start), pick always lands on instances[0]. This produces unintentional stickiness at low concurrency — the B3 lmetric/load_only comparison reads APC=54.1% for load_only partly because of that.

3. `sticky` — B3 control (hard affinity)

Once a session is bound, never break the binding under any load.

def pick_sticky(instances, session_id, affinity):
    if session_id in affinity:
        return instances[affinity[session_id]]  # unconditional
    chosen = min(instances, key=lambda i: i.num_requests)
    affinity[session_id] = index_of(chosen)
    return chosen

This is the upper bound on locality and the worst case on hot-spot behavior — a single heavy session pins one worker forever.

4. `unified` — hybrid affinity + LMetric fallback

Sticks to the affinity worker only when the cache is genuinely warm and the affinity worker is not overloaded; otherwise falls back to LMetric with a 4-key tie-breaker.

def pick_unified(instances, token_ids, input_length, session_id, affinity):
    avg_reqs = max(mean(inst.num_requests for inst in instances), 1.0)

    # Affinity gate (both must hold)
    if session_id in affinity:
        a = instances[affinity[session_id]]
        a_hit_ratio = a.estimate_cache_hit(token_ids) / max(input_length, 1)
        if a_hit_ratio > 0.5 \
                and a.num_requests <= avg_reqs * OVERLOAD_FACTOR:
            return a   # stick

    # LMetric fallback with multi-key tie-break
    keys = []
    for inst in instances:
        cache_hit   = inst.estimate_cache_hit(token_ids)
        new_prefill = max(0, input_length - cache_hit)
        p_tokens    = inst.pending_prefill_tokens + new_prefill
        bs          = inst.num_requests
        score       = p_tokens * bs
        keys.append((score, new_prefill, bs, idx_of(inst)))

    best_3tuple = min(k[:3] for k in keys)
    tied = [k for k in keys if k[:3] == best_3tuple]
    if len(tied) > 1:
        # Round-robin among ties so brand-new traffic doesn't pin
        # instance 0 when BS=0 across the board.
        winner = tied[_rr_counter % len(tied)]
        _rr_counter += 1
    else:
        winner = tied[0]
    return instances[winner.idx]

Tie-break ordering: score (LMetric primary), then new_prefill (prefer the most cache-warm at equal score), then num_requests (prefer least-loaded), then a global round-robin counter.

OVERLOAD_FACTOR defaults to 2.0; when the affinity worker is above 2× average load, the picker treats it as overloaded and steers away.

5. `capped` — `lmetric` on a session-mass-capped trace

Not a new picker. The picker is the same pick_lmetric from §1; the input trace is preprocessed.

def build_capped_trace(input_path, output_path, MAX_TURNS=8):
    by_session = group_by_session_id(load(input_path))
    capped = []
    for sid, turns in by_session.items():
        turns.sort_by(lambda t: (t.turn_id, t.timestamp))
        capped.extend(turns[:MAX_TURNS])
    capped.sort_by(timestamp)        # restore wall-clock order
    write_jsonl(capped, output_path)

# At run time:
trace      = build_capped_trace("w600_r0.0015_st30.jsonl")
picker     = pick_lmetric

For this trace MAX_TURNS=8 truncates the heavy-tail sessions (full trace turns/session p90=1, p99=18, max=3091). The intent is to isolate "what does LMetric look like when no session is heavy enough to hot-spot a worker?" — comparing capped vs lmetric is the session-mass ablation.

Decision matrix

	session affinity	cache aware	load aware	overload break
`lmetric`	✗	✓ (via `cache_hit` → `new_prefill`)	✓ (`num_requests` BS factor)	n/a
`load_only`	✗	✗	✓ (`num_requests` only)	n/a
`sticky`	✓ (hard)	✗ (relies on physical hits, not scoring)	only on first turn	never
`unified`	✓ (gated)	✓	✓	gate: `cache_ratio>0.5` AND `num_req ≤ 2× avg`
`capped`	same as `lmetric`; the trace itself is truncated

What each control isolates

lmetric vs load_only → contribution of cache awareness alone.
lmetric vs sticky → contribution of session affinity vs per-request LMetric scoring at the cost of hot-spot.
lmetric vs unified → did adding gated session affinity help?
lmetric vs capped → how much of the residual hot-spot in lmetric is driven by heavy-tail sessions specifically?

6.1 KiB Raw Permalink Blame History Unescape Escape

B3 Routing Policies — Pseudocode

Shared per-instance state

1. lmetric — main baseline

2. load_only — B3 control (no cache, no affinity)

3. sticky — B3 control (hard affinity)

4. unified — hybrid affinity + LMetric fallback

5. capped — lmetric on a session-mass-capped trace