cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the LMetric fallback score) and a v3 anti-hotspot recent-migration penalty (effective_load = num_req + recent-migration count over a sliding window), preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by ~20%. Runners/analyzer for the b3 trace replay included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unified routing ablation: A (tighter affinity) + B (decode-aware LMetric)
Goal: judge whether unified (cache-aware hybrid affinity + LMetric fallback)
has enough headroom to surpass v3 migration-based routing on agentic
workloads, without invoking PD-sep migration.
Workload / baseline
- Trace: w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions)
- Hardware: 8 × H100 (dash0), Qwen3-Coder-30B-A3B, TP=1, max_model_len=200000
- Trace replay through cache_aware_proxy.py with policy unified (b3_replay_20260527_0114/unified/reference)
| Metric (ms) | baseline (overload_factor=2.0) |
|---|---|
| TTFT p50 | 520 |
| TTFT p90 | 8781 |
| TTFT p99 | 47647 |
| TPOT p90 | 17.8 |
| E2E p90 | 19989 |
| E2E p99 | 85841 |
Reference points we're trying to beat / match:
- v3 fixed rotation (cache-blind picker): TTFT p90 = 10828
- v3 + Mechanism B (cache-rich picker): TTFT p90 = 9711
- All v3 variants are 11–23% worse than the unified baseline on TTFT p90.
Tail-source diagnostic on baseline
Decision split, baseline unified:
| Decision | n | TTFT mean | TTFT p90 | TTFT p99 |
|---|---|---|---|---|
| affinity | 852 | 3183 | 7011 | 47432 |
| lmetric_fallback | 362 | 4285 | 12083 | 46036 |
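A split like the table above can be recomputed from per-request records. The following is a minimal sketch, not the analyzer's actual code: the record fields `decision` and `ttft_ms`, the helper names, and the nearest-rank percentile definition are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile (assumed definition; the real analyzer may differ)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def decision_split(records):
    """Group TTFT samples by routing decision and summarize each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["decision"]].append(rec["ttft_ms"])
    return {
        decision: {
            "n": len(ttfts),
            "mean": sum(ttfts) / len(ttfts),
            "p90": percentile(ttfts, 90),
            "p99": percentile(ttfts, 99),
        }
        for decision, ttfts in groups.items()
    }


# Toy records for illustration only; these are not values from the trace.
records = [
    {"decision": "affinity", "ttft_ms": 400},
    {"decision": "affinity", "ttft_ms": 600},
    {"decision": "affinity", "ttft_ms": 9000},
    {"decision": "lmetric_fallback", "ttft_ms": 1200},
    {"decision": "lmetric_fallback", "ttft_ms": 15000},
]
split = decision_split(records)
for decision, stats in split.items():
    print(decision, stats)
```

With groups this small the percentiles collapse onto the maximum, which is why the real split is only meaningful at the hundreds-of-requests scale reported here.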
Long-tail (>20s, n=65):
- 40 / 65 came from affinity decisions
- 25 / 65 came from lmetric_fallback
For the 40 slow affinity reqs:
- only 12 / 40 were actually overloaded at decision time (aff_num_req > avg_num_req)
- overload ratio at decision: mean=0.93, p50=0.87
- most slow affinity reqs looked fine when the picker stuck; load piled on after dispatch.
This is a snapshot-based-routing limitation. Tightening
overload_factor only helps the genuine cases above the new threshold —
expected to be a 5-10% improvement at best.
Direction A — tighten affinity overflow
Hypothesis. overload_factor=2.0 lets the picker stick to affinity
even when affinity.num_req is up to 2× the cluster average. Reducing to
1.3 forces earlier overflow to LMetric fallback, escaping busy affinity
hosts before the tail blows up.
Change. Single CLI flag: --overload-factor 1.3. No code change.
Run. unified_of13_20260527_1532/unified/.
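To make the threshold concrete, here is a minimal sketch of the stick-or-overflow test described in the hypothesis. The function name and the exact comparison (`<=`) are assumptions; the proxy's real check is not shown in this document.

```python
def sticks_to_affinity(aff_num_req, avg_num_req, overload_factor):
    """Assumed rule: stay on the affinity instance unless its request count
    exceeds overload_factor times the cluster average."""
    return aff_num_req <= overload_factor * avg_num_req


avg = 10.0
# Overload ratios 0.93, 1.5 and 2.5 relative to the cluster average.
for aff in (9.3, 15.0, 25.0):
    print(
        f"ratio={aff / avg:.2f}",
        f"of=2.0 sticks={sticks_to_affinity(aff, avg, 2.0)}",
        f"of=1.3 sticks={sticks_to_affinity(aff, avg, 1.3)}",
    )
```

Only the middle case changes behaviour between the two settings. A request with ratio 0.93, the mean for the slow affinity requests in the diagnostic, sticks under both, which is consistent with the expectation that A helps only the genuinely overloaded cases.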
A vs baseline
| Metric (ms) | baseline (of=2.0) | A (of=1.3) | Δ |
|---|---|---|---|
| TTFT p50 | 520 | 495 | −5% |
| TTFT p90 | 8781 | 8730 | ≈0 |
| TTFT p99 | 47647 | 43059 | −10% |
| TPOT p50 | 7.9 | 8.0 | ≈0 |
| TPOT p90 | 17.8 | 15.5 | −13% |
| E2E p50 | 1761 | 1824 | +4% |
| E2E p90 | 19989 | 18407 | −8% |
| E2E p99 | 85841 | 71396 | −17% |
TTFT p90 is essentially unchanged but the deeper tail (p99) and TPOT both improved meaningfully. Net: A alone gives roughly −10% to −17% on the long tail without hurting medians.
Decision split, A vs baseline
| Decision | baseline n / p90 | A n / p90 | Δ p90 |
|---|---|---|---|
| affinity | 852 / 7011 | 817 / 5817 | −17% ✅ |
| lmetric_fallback | 362 / 12083 | 397 / 15360 | +27% ⚠️ |
The picker now sticks to affinity 35 fewer times. The remaining affinity decisions are higher-quality (no longer "barely-fitting" cases), so their p90 drops 17%.
But the 35 extra reqs that got pushed into fallback got slower: fallback p90 went from 12083 → 15360. The LMetric scorer is selecting a worse instance for them.
Per-worker TTFT under A (of=1.3)
port 8000: n= 94 mean=4424 p90=12290 port 8004: n=192 mean=2597 p90=6968
port 8001: n= 135 mean=2779 p90= 5553 port 8005: n=202 mean=3102 p90=6113
port 8002: n= 88 mean=5827 p90=15804 port 8006: n=136 mean=4006 p90=10899
port 8003: n= 217 mean=2674 p90= 4598 port 8007: n=150 mean=3648 p90= 7025
The per-port request range is the same as baseline (88..217), but under A more ports sit in the middle of that range. Port 8002 remains slow (p90 15.8s): its cache pool seems to keep getting cold work routed there by LMetric.
Why A alone isn't enough
LMetric scorer (unified_hybrid fallback path):
score = (pending_prefill_tokens + new_uncached_tokens) * num_requests
This ignores ongoing_decode_tokens entirely. An instance with no
pending prefill but 200k tokens currently in decode looks "ideal"
(its pending-prefill term is 0, so it scores the same as an instance
with no decode at all), yet a new request landing there waits behind
slow decode iterations caused by the large-batch KV reads.
A pushes more requests into fallback, but fallback can't tell which instance is actually free. Direction B is therefore a mandatory companion.
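The blind spot can be shown with a small sketch of the score as written above. The `Instance` dataclass and the function name `lmetric_score` are hypothetical stand-ins; only the formula itself comes from this document.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    pending_prefill_tokens: int
    num_requests: int
    ongoing_decode_tokens: int  # tracked, but never read by the original score


def lmetric_score(inst, new_uncached_tokens):
    """Original LMetric fallback score: lower is better."""
    return (inst.pending_prefill_tokens + new_uncached_tokens) * inst.num_requests


quiet = Instance(pending_prefill_tokens=0, num_requests=2, ongoing_decode_tokens=0)
decoding = Instance(pending_prefill_tokens=0, num_requests=2, ongoing_decode_tokens=200_000)

# Both instances get the same score for a request with 4000 uncached tokens,
# even though one of them is running 200k tokens of decode.
print(lmetric_score(quiet, 4000), lmetric_score(decoding, 4000))
```

Because the two scores are equal, the picker has no basis for preferring the quiet instance, which is the gap Direction B targets.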
Direction B — decode-aware LMetric
Hypothesis. Adding a decode-load penalty to the LMetric score lets fallback distinguish "no prefill queued but heavy decode running" from "truly idle". This should restore fallback p90 to the baseline level (≤ 12s).
Change.
score = (pending_prefill + new + lmetric_decode_weight * ongoing_decode_tokens) * num_requests
- lmetric_decode_weight=0.0 ⇒ original LMetric (control)
- lmetric_decode_weight=0.01 ⇒ first experiment (rationale: 1 decode token in batch costs ~0.01 prefill-token-equivalent in scheduler iter time on H100 + Qwen3-30B-A3B)
CLI: --lmetric-decode-weight 0.01. Setting in code:
cache_aware_proxy.py:Settings.lmetric_decode_weight.
Run. unified_of13_lmw001_20260527_1628/unified/.
A+B vs baseline / A
| Metric (ms) | baseline | A (of=1.3) | A+B (of=1.3, lmw=0.01) | Δ vs baseline |
|---|---|---|---|---|
| TTFT p50 | 520 | 495 | 514 | −1% |
| TTFT p90 | 8781 | 8730 | 8421 | −4% ✅ |
| TTFT p99 | 47647 | 43059 | 44800 | −6% |
| TPOT p50 | 7.9 | 8.0 | 7.9 | ≈0 |
| TPOT p90 | 17.8 | 15.5 | 15.7 | −12% |
| E2E p50 | 1761 | 1824 | 1870 | +6% |
| E2E p90 | 19989 | 18407 | 21064 | +5% ⚠️ |
| E2E p99 | 85841 | 71396 | 64344 | −25% ✅ |
Long-tail counts:
thresh baseline A A+B v3 MechB
> 5000ms 170 173 170 177
> 10000ms 105 109 109 119
> 20000ms 65 64 59 78
> 30000ms 41 40 37 50
> 50000ms 8 5 6 14
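A+B has the fewest requests above 20s and 30s and ties baseline above 5s. It is marginally worse than baseline above 10s (109 vs 105) and marginally worse than A above 50s (6 vs 5).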
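Counts like these are plain threshold tallies over per-request TTFT. A minimal sketch follows; the function name and the toy input are assumptions for illustration and are not taken from the analyzer or the trace.

```python
def tail_counts(ttfts_ms, thresholds_ms=(5000, 10000, 20000, 30000, 50000)):
    """Count how many requests have TTFT strictly above each threshold."""
    return {t: sum(1 for v in ttfts_ms if v > t) for t in thresholds_ms}


# Toy TTFT samples in milliseconds, for illustration only.
sample = [520, 6000, 12000, 25000, 47000, 60000]
counts = tail_counts(sample)
for threshold, n in counts.items():
    print(f"> {threshold}ms: {n}")
```

Comparing variants on counts rather than on a single percentile makes it easier to see where in the tail a change helps, since p90 and p99 can move in opposite directions.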
Decision split (A+B vs A)
| Decision | A (of=1.3) | A+B | Note |
|---|---|---|---|
| affinity p90 | 5817 | 5836 | ≈ same |
| fallback p90 | 15360 | 13501 | B recovered some of A's fallback regression |
B partially fixed fallback's selection (−12% on fallback p90 vs A alone), but still worse than baseline (12083).
Per-worker TTFT (A+B)
port 8000: n=134 mean=3495 p90=10967 port 8004: n=136 mean=3102 p90= 7906
port 8001: n=143 mean=2981 p90=10189 port 8005: n=179 mean=1624 p90= 2735
port 8002: n=221 mean=2355 p90= 3502 port 8006: n=137 mean=5356 p90= 9628
port 8003: n=146 mean=3932 p90=10729 port 8007: n=118 mean=5210 p90=26798 ← new hotspot
A+B trades the baseline's 8002 hotspot (p90=35s) for a new 8007 hotspot (p90=26.8s). Lower amplitude but hotspot survives.
Why 8007 became a hotspot under A+B — found a bug in B
8007 in A+B: 118 reqs, 53% affinity / 47% fallback (vs other ports 60–77% affinity), cache_hit_mean=50.5% (lowest).
Top-10 slowest at 8007: all are big-prompt (100k+ tokens) fallback decisions
with cached_tokens=0 (cold prefill). LMetric is pushing many cold-prefill
fallbacks to 8007.
Looking at the B formula:
decode_pen = lmetric_decode_weight * ongoing_decode_tokens
score = (pending_prefill + new + decode_pen) * num_requests # ← BUG
When num_requests = 0, the entire score (including decode penalty) zeros
out. So an idle-but-decoding host (num_req=0 because its last prefill
finished but decode is still running) looks like score=0, beating every
busy host.
Fix (B'): multiply by max(num_requests, 1):
score = (pending_prefill + new + decode_pen) * max(num_requests, 1)
Now idle hosts with high decode load get score = decode_pen × 1 = real nonzero penalty, beating zero-load hosts only when decode is small.
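The effect of the fix can be checked with a small sketch of both formulas. The function names `score_b` and `score_b_prime` and the example token counts are hypothetical; the formulas and the 0.01 weight come from the text above.

```python
def score_b(pending_prefill, new, ongoing_decode_tokens, num_requests,
            lmetric_decode_weight=0.01):
    """Direction B as first implemented: multiplies by num_requests."""
    decode_pen = lmetric_decode_weight * ongoing_decode_tokens
    return (pending_prefill + new + decode_pen) * num_requests


def score_b_prime(pending_prefill, new, ongoing_decode_tokens, num_requests,
                  lmetric_decode_weight=0.01):
    """Direction B' fix: multiplies by max(num_requests, 1)."""
    decode_pen = lmetric_decode_weight * ongoing_decode_tokens
    return (pending_prefill + new + decode_pen) * max(num_requests, 1)


# A 100k-token cold prompt offered to two hosts:
#   host X: num_requests=0 but 200k tokens still in decode
#   host Y: num_requests=1 and no decode load
x_buggy = score_b(0, 100_000, 200_000, 0)
y_buggy = score_b(0, 100_000, 0, 1)
x_fixed = score_b_prime(0, 100_000, 200_000, 0)
y_fixed = score_b_prime(0, 100_000, 0, 1)

print("B :", x_buggy, y_buggy)   # X scores zero and wins
print("B':", x_fixed, y_fixed)   # X now carries its decode penalty
```

Under B the decoding host scores zero and always wins. Under B' its score includes the 2000-token-equivalent decode penalty, so the host without decode load is preferred in this example.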
A+B' — re-run with the fix
Run. unified_of13_lmw001_v2_20260527_1724/unified/.
| Metric (ms) | baseline | A+B (BUG) | A+B' (fix) | Δ vs baseline |
|---|---|---|---|---|
| TTFT p50 | 520 | 514 | 485 | −7% |
| TTFT p90 | 8781 | 8421 | 8287 | −5.6% ✅ |
| TTFT p99 | 47647 | 44800 | 41876 | −12% ✅ |
| TPOT p90 | 17.8 | 15.7 | 17.5 | −2% |
| E2E p90 | 19989 | 21064 | 20625 | +3% |
| E2E p99 | 85841 | 64344 | 77827 | −9% |
A+B' is the best of the variants so far on TTFT p90 (8287) and TTFT p99 (41876). Its long-tail counts (>30s, >50s) are also the best across variants.
vs v3 reference points:
| Policy | TTFT p90 | TPOT p90 | E2E p99 |
|---|---|---|---|
| A+B' | 8287 | 17.5 | 77827 |
| v3 fixed (cache-blind) | 10828 | 21.0 | 47610 |
| v3 + Mech B | 9711 | 18.3 | 84492 |
A+B' beats v3 Mech B by 15% TTFT p90 with no migration overhead.
Per-worker (A+B' fixed)
8000: n=158 p90= 5688 8004: n=189 p90= 4249
8001: n=159 p90= 7323 8005: n=116 p90=14598
8002: n=114 p90= 8726 8006: n=180 p90= 6198
8003: n=173 p90= 6715 8007: n=125 p90=22242 ← still hot
A+B' redistributed load more evenly (114..189) but 8007 still has p90=22s.
8007 deep-dive in A+B'
8007: n=125, affinity=69 (55%), fallback=56 (45%), cache_hit_mean=lowest
Top-15 slow at 8007:
- 7 of them are session 1313181 turns 9–14 (130k+ tokens each, agentic long context, ~50% cache hit)
- Several others are cold-start turn-1 of large-prompt sessions
- First two slow reqs arrived 0.7 s apart — strong hint of concurrent picker race
Iteration 3: race-condition fix
Diagnosis. In _handle_combined:
chosen, best_idx, decision = pick_instance_unified_hybrid(...) # sync
# ... sync breakdown updates ...
return await _handle_local_request(...) # ← control enters _handle_local_request
# reservation happens only later, inside that coroutine
Strictly, return await async_func(...) does not itself yield to the
event loop: awaiting a coroutine runs its body synchronously until the
first real suspension point. The reservation
(chosen.pending_prefill_tokens += new, etc.) lives inside
_handle_local_request rather than next to the pick, so any suspension
reached before it runs opens a window between the picker and the
reservation where another coroutine can run and re-pick the same instance.
When two big-prompt reqs arrive within milliseconds, both run pick → both pick the "free" 8007 → both yield → both reserve. Result: 8007 gets back-to-back 130k-token cold prefills, each waiting for the other.
Fix. Move the reservation before the await, inside _handle_combined:
# Race fix: reserve atomically with pick, before any await.
chosen.ongoing_tokens += input_length
chosen.pending_prefill_tokens += estimated_new
chosen.num_requests += 1
return await _handle_local_request(..., _pre_reserved=True)
_handle_local_request skips its own reservation when _pre_reserved=True.
PD-sep paths are unaffected (they have their own reservation).
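The pick/reserve window and the fix can be reproduced in a few lines of asyncio. This is a minimal sketch under stated assumptions, not the proxy's code: the `Instance` class, the handler names, and the least-loaded picker are hypothetical, and `asyncio.sleep(0)` is only a stand-in for whatever point returns control to the event loop between pick and reservation.

```python
import asyncio


class Instance:
    def __init__(self, name):
        self.name = name
        self.num_requests = 0


def pick(instances):
    """Synchronous picker: least-loaded instance, ties broken by list order."""
    return min(instances, key=lambda inst: inst.num_requests)


async def handle_racy(instances, picked):
    chosen = pick(instances)
    await asyncio.sleep(0)      # stand-in suspension between pick and reserve
    chosen.num_requests += 1    # reservation lands after the other handler picked
    picked.append(chosen.name)


async def handle_fixed(instances, picked):
    chosen = pick(instances)
    chosen.num_requests += 1    # reserve in the same synchronous stretch as the pick
    await asyncio.sleep(0)      # suspension happens once the load is visible
    picked.append(chosen.name)


async def run(handler):
    instances = [Instance("8006"), Instance("8007")]
    picked = []
    # Two requests arriving at effectively the same time.
    await asyncio.gather(handler(instances, picked), handler(instances, picked))
    return sorted(picked)


racy = asyncio.run(run(handle_racy))
fixed = asyncio.run(run(handle_fixed))
print("racy :", racy)    # both requests land on the same instance
print("fixed:", fixed)   # requests are spread across the two instances
```

No lock is needed in the fixed version because asyncio runs one coroutine at a time on a single thread: as long as pick and reserve have no suspension point between them, they are atomic with respect to other handlers.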
Run. Pending — unified_of13_lmw001_racefix_*. Hypothesis: 8007 p90
drops to within ±3s of cluster median, since concurrent picks for the
same "free" instance no longer happen.
A+B'+RaceFix — results
Run. unified_of13_lmw001_racefix_20260527_1821/unified/.
| Metric (ms) | baseline | A+B' | A+B'+RF | Δ vs baseline |
|---|---|---|---|---|
| TTFT p50 | 520 | 485 | 478 | −8% |
| TTFT p90 | 8781 | 8287 | 7770 | −11.5% ✅ |
| TTFT p99 | 47647 | 41876 | 42447 | −11% |
| TPOT p90 | 17.8 | 17.5 | 18.0 | +1% |
| E2E p90 | 19989 | 20625 | 18418 | −8% |
| E2E p99 | 85841 | 77827 | 71227 | −17% |
vs v3 reference:
- A+B'+RF TTFT p90 = 7770ms, vs v3 Mech B 9711ms → −20% ✅
Long-tail counts (best across all variants):
> 5s: 170 → 158 > 30s: 41 → 33
>10s: 105 → 103 > 50s: 8 → 4
>20s: 65 → 57 >100s: 0 → 0
Decision split — race fix mainly helped affinity
| Decision | baseline | A+B'+RF |
|---|---|---|
| affinity p90 | 7011 | 5042 ✅ (−28%) |
| fallback p90 | 12083 | 13944 (+15%) |
The race condition was hurting affinity decisions the most. When two concurrent reqs both stuck to a "free-looking" affinity instance, they piled up and inflated affinity's tail. The fix removed this collision.
Per-worker
8000: n=86 p90=11541 8004: n=150 p90=11906
8001: n=186 p90= 8307 8005: n=109 p90= 4798
8002: n=105 p90=14540 8006: n=183 p90= 6258
8003: n=264 p90= 3079 8007: n=131 p90=21850 ← still hot
Per-port request counts now span 86..264, so the race fix did disperse routing
8007 still hot — but it's workload-inherent, not a routing bug
Top sessions on 8007:
session 1279412: n=22 mean= 2208 max=18985 decisions: 91% affinity
session 1313181: n=17 mean=17399 max=49089 decisions: 65% affinity
session 1262354: n=15 mean= 622 max= 2325 decisions: 87% affinity
session 1342921: n= 7 mean=17817 max=55589 decisions: 86% affinity
session 1260327: n= 8 mean= 1636 max= 5382 decisions: 75% affinity
session 1268831: n= 5 mean= 1443 max= 2673 decisions: 80% affinity
Sessions 1313181 and 1342921 are long agentic contexts: 100k–130k tokens per turn with ~50% cache hit (i.e. roughly 50k–65k new tokens to prefill per turn). Even on a perfectly load-balanced instance, each turn is 7–15s of pure compute.
Forcing these sessions to spread across instances would mean cold prefill every turn (0% cache hit) → each turn becomes 20–30s instead of 7–15s. Spreading is net-negative.
→ The 8007 p90=22s is the floor imposed by these sessions' structure, not by routing policy. Unified is at its ceiling for this workload.
Final ranking and take-aways
| Policy | TTFT p90 (ms) | Δ vs baseline | Notes |
|---|---|---|---|
| baseline unified (of=2.0) | 8781 | — | reference |
| A (of=1.3) | 8730 | ≈0 | affinity p90 -17%, fallback p90 +27% |
| A+B (of=1.3, lmw=0.01, BUG) | 8421 | −4% | 8007 hotspot from *num_req zeroing bug |
| A+B' (formula fix) | 8287 | −5.6% | Bug fixed, still 8007 mild hotspot |
| A+B'+RaceFix | 7770 | −11.5% ✅ | Best unified variant |
| v3 fixed | 10828 | +23% | PD-sep migration, cache-blind picker |
| v3 + Mech B | 9711 | +11% | PD-sep + cache-rich target picker |
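The Δ column and the step-over-step contribution of each fix can be recomputed from the TTFT p90 values in the table above. The sketch below uses only those four numbers; the function name is hypothetical.

```python
def step_and_cumulative(series):
    """For each variant after the first, return (name, step %, cumulative %).

    step %       = change relative to the previous variant
    cumulative % = change relative to the first (baseline) variant
    """
    baseline = series[0][1]
    prev = baseline
    rows = []
    for name, value in series[1:]:
        step = (value / prev - 1) * 100
        total = (value / baseline - 1) * 100
        rows.append((name, round(step, 1), round(total, 1)))
        prev = value
    return rows


ttft_p90_ms = [
    ("baseline", 8781),
    ("A", 8730),
    ("A+B'", 8287),
    ("A+B'+RaceFix", 7770),
]
rows = step_and_cumulative(ttft_p90_ms)
for name, step, total in rows:
    print(f"{name}: step {step:+.1f}%, cumulative {total:+.1f}%")
```

The steps multiply rather than add: 0.994 × 0.949 × 0.938 ≈ 0.885, which matches the −11.5% cumulative figure.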
Conclusions
- Unified path beats v3 PD-sep on this workload by 20%+ TTFT p90. PD-sep migration's fixed cost (src prefill + dst first-token waiting on loaded scheduler) outweighs any decode-time savings for short-output agentic turns.
- Three orthogonal fixes compound for an 11.5% TTFT p90 win:
  - A (overload_factor=1.3): tighter affinity overflow → −0.6% vs baseline, but much cleaner affinity decisions (p90 −17%)
  - B' (lmetric_decode_weight=0.01 with max(num_req,1)): decode-aware fallback → −5.1% vs A
  - RaceFix (atomic reserve before await): kills concurrent-pick collisions → −6.2% vs A+B'
- Race condition was the biggest single hidden bug. Awaiting a coroutine does not by itself yield to the event loop, but any suspension point between the pick and a reservation done later in the body of async_func lets concurrent picks run before the reservation takes effect. This affects ANY async dispatch with separate pick/reserve steps, so it is worth checking other routing policies.
- 8007 p90=22s is workload-inherent. Sessions with 100k+ token turns at 50% cache hit cannot finish faster than 7–15s per turn regardless of routing. Forcing spread would hurt rather than help.
- Migration (v3) is not necessary when unified routing is tuned well. Save the PD-sep mechanism for cases where it can be proven net-positive (e.g. very-long-output sessions on extremely overloaded prefill hosts) and use unified A+B'+RaceFix as the default.