
Gahow Wang 67fcec7933 Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot

cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the
LMetric fallback score) and a v3 anti-hotspot recent-migration penalty
(effective_load = num_req + recent-migration count over a sliding window),
preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents
the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix
sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by
~20%. Runners/analyzer for the b3 trace replay included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 11:52:44 +08:00


Unified routing ablation: A (tighter affinity) + B (decode-aware LMetric)

Goal: judge whether unified (cache-aware hybrid affinity + LMetric fallback) has enough headroom to surpass v3 migration-based routing on agentic workloads, without invoking PD-sep migration.

Workload / baseline

Trace: w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions)
Hardware: 8 × H100 (dash0), Qwen3-Coder-30B-A3B, TP=1, max_model_len=200000
Trace replay through cache_aware_proxy.py with policy unified
b3_replay_20260527_0114/unified/ reference

Metric (ms)	baseline (`overload_factor=2.0`)
TTFT p50	520
TTFT p90	8781
TTFT p99	47647
TPOT p90	17.8
E2E p90	19989
E2E p99	85841

Reference points we're trying to beat / match:

v3 fixed rotation (cache-blind picker): TTFT p90 = 10828
v3 + Mechanism B (cache-rich picker): TTFT p90 = 9711
All v3 variants are 11–23% worse than the unified baseline on TTFT p90.

Tail-source diagnostic on baseline

Decision split, baseline unified:

Decision	n	TTFT mean	TTFT p90	TTFT p99
affinity	852	3183	7011	47432
lmetric_fallback	362	4285	12083	46036

Long-tail (>20s, n=65):

40 / 65 came from affinity decisions
25 / 65 came from lmetric_fallback

For the 40 slow affinity reqs:

only 12 / 40 were actually overloaded at decision time (aff_num_req > avg_num_req)
overload ratio at decision: mean=0.93, p50=0.87
most slow affinity reqs looked fine when the picker stuck — load piled on after dispatch.

This is a snapshot-based-routing limitation. Tightening overload_factor only helps the genuine cases above the new threshold — expected to be a 5-10% improvement at best.
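The slow-affinity diagnostic above can be reproduced with a few lines over per-request logs. This is a minimal sketch: the record fields (`decision`, `ttft_ms`, `aff_num_req`, `avg_num_req`) and the sample values are assumptions for illustration, not the analyzer's actual schema or data.

```python
from statistics import mean, median

# Hypothetical per-request records; field names and values are illustrative only.
records = [
    {"decision": "affinity", "ttft_ms": 25000, "aff_num_req": 3, "avg_num_req": 4.0},
    {"decision": "affinity", "ttft_ms": 31000, "aff_num_req": 6, "avg_num_req": 4.0},
    {"decision": "lmetric_fallback", "ttft_ms": 22000, "aff_num_req": 5, "avg_num_req": 4.0},
    {"decision": "affinity", "ttft_ms": 900, "aff_num_req": 2, "avg_num_req": 4.0},
]

def slow_affinity_overload(records, slow_ms=20000):
    """Summarize how overloaded the affinity host was when slow requests were routed."""
    slow = [r for r in records
            if r["decision"] == "affinity" and r["ttft_ms"] > slow_ms]
    # Overload ratio at decision time: affinity load relative to cluster average.
    ratios = [r["aff_num_req"] / r["avg_num_req"] for r in slow if r["avg_num_req"] > 0]
    return {
        "n_slow": len(slow),
        "n_overloaded": sum(1 for x in ratios if x > 1.0),
        "ratio_mean": mean(ratios) if ratios else None,
        "ratio_p50": median(ratios) if ratios else None,
    }

summary = slow_affinity_overload(records)
```

A ratio below 1.0 for a slow request means the affinity host looked no busier than average when the picker chose it, which is the "load piled on after dispatch" case.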

Direction A — tighten affinity overflow

Hypothesis. overload_factor=2.0 lets the picker stick to affinity even when affinity.num_req is up to 2× the cluster average. Reducing to 1.3 forces earlier overflow to LMetric fallback, escaping busy affinity hosts before the tail blows up.

Change. Single CLI flag: --overload-factor 1.3. No code change.
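As a rough sketch of the gate this flag controls, the check below keeps the affinity instance while its request count stays within `overload_factor` times the cluster average. The function name and the exact comparison (`<=` rather than `<`) are assumptions; the real picker in cache_aware_proxy.py may apply further conditions.

```python
def stick_to_affinity(aff_num_req: int, avg_num_req: float, overload_factor: float) -> bool:
    """Return True if the affinity host is still within the allowed overload band."""
    return aff_num_req <= overload_factor * avg_num_req

# Affinity host holds 6 requests while the cluster average is 4 (ratio 1.5).
keep_at_2_0 = stick_to_affinity(6, 4.0, overload_factor=2.0)  # threshold is 8.0
keep_at_1_3 = stick_to_affinity(6, 4.0, overload_factor=1.3)  # threshold is about 5.2
```

With a ratio of 1.5 the request stays on affinity at 2.0 but overflows to the LMetric fallback at 1.3.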

Run. unified_of13_20260527_1532/unified/.

A vs baseline

Metric (ms)	baseline (of=2.0)	A (of=1.3)	Δ
TTFT p50	520	495	−5%
TTFT p90	8781	8730	≈0
TTFT p99	47647	43059	−10%
TPOT p50	7.9	8.0	≈0
TPOT p90	17.8	15.5	−13%
E2E p50	1761	1824	+4%
E2E p90	19989	18407	−8%
E2E p99	85841	71396	−17%

TTFT p90 is essentially unchanged, but the deeper tail (p99) and TPOT p90 both improved meaningfully. Net: A alone gives roughly −10% to −17% on the long tail while medians stay within ±5%.

Decision split, A vs baseline

Decision	baseline n / p90	A n / p90	Δ p90
affinity	852 / 7011	817 / 5817	−17% ✅
lmetric_fallback	362 / 12083	397 / 15360	+27% ⚠️

The picker now sticks to affinity 35 fewer times. The remaining affinity decisions are higher-quality (no longer "barely-fitting" cases), so their p90 drops 17%.

But the 35 extra reqs that got pushed into fallback got slower: fallback p90 went from 12083 → 15360. The LMetric scorer is selecting a worse instance for them.

Per-worker TTFT under A (of=1.3)

port 8000: n=  94  mean=4424  p90=12290    port 8004: n=192  mean=2597  p90=6968
port 8001: n= 135  mean=2779  p90= 5553    port 8005: n=202  mean=3102  p90=6113
port 8002: n=  88  mean=5827  p90=15804    port 8006: n=136  mean=4006  p90=10899
port 8003: n= 217  mean=2674  p90= 4598    port 8007: n=150  mean=3648  p90= 7025

The per-port request range is the same as baseline (88..217 reqs/port), but under A more ports sit in the middle of that range. Port 8002 remains slow (p90 15.8s): LMetric appears to keep routing cold work to it.

Why A alone isn't enough

LMetric scorer (unified_hybrid fallback path):

score = (pending_prefill_tokens + new_uncached_tokens) * num_requests

This ignores ongoing_decode_tokens entirely. An instance with no pending prefill but 200k tokens currently in decode looks "ideal" (score=0×num_req=0) — yet a new request landing there waits behind slow decode iters caused by the large batch KV reads.

A pushes more requests into fallback, but fallback can't tell which instance is actually free. → Direction B is mandatory companion.

Direction B — decode-aware LMetric

Hypothesis. Adding a decode-load penalty to the LMetric score lets fallback distinguish "no prefill queued but heavy decode running" from "truly idle". This should bring fallback p90 back to the baseline level (≤ 12s).

Change.

score = (pending_prefill + new + lmetric_decode_weight * ongoing_decode_tokens) * num_requests

lmetric_decode_weight=0.0 ⇒ original LMetric (control)
lmetric_decode_weight=0.01 ⇒ first experiment (rationale: 1 decode token in batch costs ~0.01 prefill-token-equivalent in scheduler iter time on H100 + Qwen3-30B-A3B)

CLI: --lmetric-decode-weight 0.01. Setting in code: cache_aware_proxy.py:Settings.lmetric_decode_weight.
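To illustrate how the weight changes the fallback ranking, here is a small sketch that scores two made-up hosts with the formula above and picks the minimum. The dict layout, the helper names, and the token counts are assumptions for illustration; only the scoring expression comes from this document.

```python
def decode_aware_score(inst, new_uncached_tokens, lmetric_decode_weight):
    """Score from the Direction B formula; lower is better."""
    decode_pen = lmetric_decode_weight * inst["ongoing_decode_tokens"]
    return (inst["pending_prefill_tokens"] + new_uncached_tokens + decode_pen) * inst["num_requests"]

def pick_fallback(instances, new_uncached_tokens, lmetric_decode_weight):
    """Return the name of the lowest-scoring instance."""
    best = min(instances,
               key=lambda i: decode_aware_score(i, new_uncached_tokens, lmetric_decode_weight))
    return best["name"]

# Hypothetical hosts: one with no queued prefill but 200k tokens in decode,
# one with a little queued prefill and nothing decoding.
instances = [
    {"name": "heavy_decode", "pending_prefill_tokens": 0,
     "ongoing_decode_tokens": 200_000, "num_requests": 2},
    {"name": "light", "pending_prefill_tokens": 1_500,
     "ongoing_decode_tokens": 0, "num_requests": 2},
]

control = pick_fallback(instances, 1_000, lmetric_decode_weight=0.0)    # decode ignored
weighted = pick_fallback(instances, 1_000, lmetric_decode_weight=0.01)  # decode penalized
```

At weight 0.0 the heavy-decode host wins because its decode load is invisible; at 0.01 the 200k decode tokens count as roughly 2,000 prefill-token-equivalents and the lighter host wins.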

Run. unified_of13_lmw001_20260527_1628/unified/.

A+B vs baseline / A

Metric (ms)	baseline	A (of=1.3)	A+B (of=1.3, lmw=0.01)	Δ vs baseline
TTFT p50	520	495	514	−1%
TTFT p90	8781	8730	8421	−4% ✅
TTFT p99	47647	43059	44800	−6%
TPOT p50	7.9	8.0	7.9	≈0
TPOT p90	17.8	15.5	15.7	−12%
E2E p50	1761	1824	1870	+6%
E2E p90	19989	18407	21064	+5% ⚠️
E2E p99	85841	71396	64344	−25% ✅

Long-tail counts:

thresh       baseline      A      A+B   v3 MechB
>  5000ms        170      173      170        177
> 10000ms        105      109      109        119
> 20000ms         65       64       59         78
> 30000ms         41       40       37         50
> 50000ms          8        5        6         14

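A+B has the fewest requests above 20s and 30s. It ties baseline at 5s (170), is slightly worse than baseline at 10s (109 vs 105), and is marginally worse than A at 50s (6 vs 5).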
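The long-tail counts are plain threshold counts over per-request TTFT. A minimal sketch, assuming TTFT values are available as a list in milliseconds (the sample list is made up):

```python
def long_tail_counts(ttfts_ms, thresholds_ms=(5000, 10000, 20000, 30000, 50000)):
    """Count requests whose TTFT strictly exceeds each threshold."""
    return {t: sum(1 for x in ttfts_ms if x > t) for t in thresholds_ms}

# Illustrative values only, not taken from any run.
sample = [400, 5200, 12000, 21000, 35000, 61000]
counts = long_tail_counts(sample)
```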

Decision split (A+B vs A)

Decision	A (of=1.3)	A+B	Note
affinity p90	5817	5836	≈ same
fallback p90	15360	13501	B recovered some of A's fallback regression

B partially fixed fallback's selection (−12% on fallback p90 vs A alone), but still worse than baseline (12083).

Per-worker TTFT (A+B)

port 8000: n=134  mean=3495  p90=10967      port 8004: n=136  mean=3102  p90= 7906
port 8001: n=143  mean=2981  p90=10189      port 8005: n=179  mean=1624  p90= 2735
port 8002: n=221  mean=2355  p90= 3502      port 8006: n=137  mean=5356  p90= 9628
port 8003: n=146  mean=3932  p90=10729      port 8007: n=118  mean=5210  p90=26798  ← new hotspot

A+B trades the baseline's 8002 hotspot (p90=35s) for a new 8007 hotspot (p90=26.8s). Lower amplitude but hotspot survives.

Why 8007 became a hotspot under A+B — found a bug in B

8007 in A+B: 118 reqs, 53% affinity / 47% fallback (vs other ports 60–77% affinity), cache_hit_mean=50.5% (lowest).

Top-10 slowest at 8007: all are big-prompt (100k+ tokens) fallback decisions with cached_tokens=0 (cold prefill). LMetric is pushing many cold-prefill fallbacks to 8007.

Looking at the B formula:

decode_pen = lmetric_decode_weight * ongoing_decode_tokens
score = (pending_prefill + new + decode_pen) * num_requests   # ← BUG

When num_requests = 0, the entire score (including decode penalty) zeros out. So an idle-but-decoding host (num_req=0 because its last prefill finished but decode is still running) looks like score=0, beating every busy host.

Fix (B'): multiply by max(num_requests, 1):

score = (pending_prefill + new + decode_pen) * max(num_requests, 1)

Now a host with num_requests = 0 but heavy decode gets score = (pending_prefill + new + decode_pen) × 1, a real nonzero penalty, so it can beat other hosts only when its decode load is small.
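The effect of the max(num_requests, 1) change is easiest to see with two hosts that both report num_requests = 0. The sketch below uses made-up token counts and hypothetical function names; the two scoring expressions are the B and B' formulas from this section.

```python
def score_b(pending_prefill, new, ongoing_decode_tokens, num_requests, w=0.01):
    """Original B formula: the whole sum is multiplied by num_requests."""
    decode_pen = w * ongoing_decode_tokens
    return (pending_prefill + new + decode_pen) * num_requests

def score_b_fixed(pending_prefill, new, ongoing_decode_tokens, num_requests, w=0.01):
    """B' formula: multiplier is clamped to at least 1."""
    decode_pen = w * ongoing_decode_tokens
    return (pending_prefill + new + decode_pen) * max(num_requests, 1)

new = 1_000  # uncached tokens of the incoming request (illustrative)

# Both hosts have num_requests = 0; only one is still decoding 200k tokens.
b_decoding = score_b(0, new, 200_000, 0)
b_idle = score_b(0, new, 0, 0)
fixed_decoding = score_b_fixed(0, new, 200_000, 0)
fixed_idle = score_b_fixed(0, new, 0, 0)
```

Under B both hosts score zero and are indistinguishable; under B' the decoding host carries its penalty and the truly idle host ranks first.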

A+B' — re-run with the fix

Run. unified_of13_lmw001_v2_20260527_1724/unified/.

Metric (ms)	baseline	A+B (BUG)	A+B' (fix)	Δ vs baseline
TTFT p50	520	514	485	−7%
TTFT p90	8781	8421	8287	−5.6% ✅
TTFT p99	47647	44800	41876	−12% ✅
TPOT p90	17.8	15.7	17.5	−2%
E2E p90	19989	21064	20625	+3%
E2E p99	85841	64344	77827	−9%

A+B' best of all variants on TTFT p90 (8287) and TTFT p99 (41876). Long-tail counts (>30s, >50s) also best across variants.

vs v3 reference points:

	TTFT p90	TPOT p90	E2E p99
A+B'	8287	17.5	77827
v3 fixed (cache-blind)	10828	21.0	47610
v3 + Mech B	9711	18.3	84492

A+B' beats v3 Mech B by 15% TTFT p90 with no migration overhead.

Per-worker (A+B' fixed)

8000: n=158  p90= 5688      8004: n=189  p90= 4249
8001: n=159  p90= 7323      8005: n=116  p90=14598
8002: n=114  p90= 8726      8006: n=180  p90= 6198
8003: n=173  p90= 6715      8007: n=125  p90=22242   ← still hot

A+B' redistributed load more evenly (114..189) but 8007 still has p90=22s.

8007 deep-dive in A+B'

8007: n=125, affinity=69 (55%), fallback=56 (45%), cache_hit_mean=lowest

Top-15 slow at 8007:

7 of them are session 1313181 turns 9–14 (130k+ tokens each, agentic long context, ~50% cache hit)
Several others are cold-start turn-1 of large-prompt sessions
First two slow reqs arrived 0.7 s apart — strong hint of concurrent picker race

Iteration 3: race-condition fix

Diagnosis. In _handle_combined:

chosen, best_idx, decision = pick_instance_unified_hybrid(...)  # sync
# ... sync breakdown updates ...
return await _handle_local_request(...)   # ← reservation happens inside,
                                          #   after the picker has returned

The reservation (chosen.pending_prefill_tokens += new, etc.) lives inside _handle_local_request, not next to the pick. Awaiting a coroutine does not by itself hand control to the event loop: the body runs synchronously up to its first real suspension point. The window therefore opens whenever _handle_local_request suspends before it reaches the reservation (an await on pending I/O or a lock, or the call being scheduled as a separate task). During that window another coroutine can run and re-pick the same instance.

When two big-prompt reqs arrive within milliseconds, both run pick → both pick the "free" 8007 → both yield → both reserve. Result: 8007 gets back-to-back 130k-token cold prefills, each waiting for the other.

Fix. Move the reservation before the await, inside _handle_combined:

# Race fix: reserve atomically with pick, before any await.
chosen.ongoing_tokens += input_length
chosen.pending_prefill_tokens += estimated_new
chosen.num_requests += 1
return await _handle_local_request(..., _pre_reserved=True)

_handle_local_request skips its own reservation when _pre_reserved=True. PD-sep paths are unaffected (they have their own reservation).
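The pick/reserve race can be reproduced in isolation with plain asyncio. This is a toy sketch, not the proxy's code: `Instance`, `pick`, and the two handlers are hypothetical, and `asyncio.sleep(0)` stands in for whatever suspension occurs between the pick and the reservation in the real handler.

```python
import asyncio

class Instance:
    def __init__(self, name):
        self.name = name
        self.num_requests = 0

def pick(instances):
    # Least-loaded picker; ties go to the first instance in the list.
    return min(instances, key=lambda i: i.num_requests)

async def handle_racy(instances, picked):
    chosen = pick(instances)
    await asyncio.sleep(0)      # suspension between pick and reserve
    chosen.num_requests += 1    # reservation lands after the other handler has picked
    picked.append(chosen.name)

async def handle_fixed(instances, picked):
    chosen = pick(instances)
    chosen.num_requests += 1    # reserve in the same synchronous step as the pick
    await asyncio.sleep(0)
    picked.append(chosen.name)

async def run(handler):
    instances = [Instance("8006"), Instance("8007")]
    picked = []
    # Two requests arriving at effectively the same time.
    await asyncio.gather(handler(instances, picked), handler(instances, picked))
    return picked

racy = asyncio.run(run(handle_racy))
fixed = asyncio.run(run(handle_fixed))
```

In the racy variant both handlers see zero load everywhere and choose the same instance; with the reservation moved ahead of the first suspension, the second handler sees the updated count and goes elsewhere.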

Run. Pending — unified_of13_lmw001_racefix_*. Hypothesis: 8007 p90 drops to within ±3s of cluster median, since concurrent picks for the same "free" instance no longer happen.

A+B'+RaceFix — results

Run. unified_of13_lmw001_racefix_20260527_1821/unified/.

Metric (ms)	baseline	A+B'	A+B'+RF	Δ vs baseline
TTFT p50	520	485	478	−8%
TTFT p90	8781	8287	7770	−11.5% ✅
TTFT p99	47647	41876	42447	−11%
TPOT p90	17.8	17.5	18.0	+1%
E2E p90	19989	20625	18418	−8%
E2E p99	85841	77827	71227	−17%

vs v3 reference:

A+B'+RF TTFT p90 = 7770ms, vs v3 Mech B 9711ms → −20% ✅

Long-tail counts (best across all variants):

> 5s:   170 → 158         > 30s:   41 → 33
>10s:   105 → 103         > 50s:    8 →  4
>20s:    65 →  57         >100s:    0 →  0

Decision split — race fix mainly helped affinity

Decision	baseline	A+B'+RF
affinity p90	7011	5042 ✅ (−28%)
fallback p90	12083	13944 (+15%)

The race-condition was hurting affinity decisions the most. When two concurrent reqs both stuck to a "free-looking" affinity instance, they piled up and inflated affinity's tail. Fix removed this collision.

Per-worker

8000: n=86   p90=11541      8004: n=150  p90=11906
8001: n=186  p90= 8307      8005: n=109  p90= 4798
8002: n=105  p90=14540      8006: n=183  p90= 6258
8003: n=264  p90= 3079      8007: n=131  p90=21850   ← still hot
Request counts now span 86..264 per port (114..189 under A+B'), so the race fix visibly redistributed routing.

8007 still hot — but it's workload-inherent, not a routing bug

Top sessions on 8007:

session 1279412: n=22  mean= 2208  max=18985  decisions: 91% affinity
session 1313181: n=17  mean=17399  max=49089  decisions: 65% affinity
session 1262354: n=15  mean=  622  max= 2325  decisions: 87% affinity
session 1342921: n= 7  mean=17817  max=55589  decisions: 86% affinity
session 1260327: n= 8  mean= 1636  max= 5382  decisions: 75% affinity
session 1268831: n= 5  mean= 1443  max= 2673  decisions: 80% affinity

Sessions 1313181 and 1342921 are long agentic contexts: 100k–130k tokens per turn with ~50% cache hit (i.e. 50k new tokens prefill per turn). Even on a perfectly load-balanced instance, each turn is 7–15s of pure compute.

Forcing these sessions to spread across instances would mean cold prefill every turn (0% cache hit) → each turn becomes 20–30s instead of 7–15s. Spreading is net-negative.

→ The 8007 p90=22s is the floor imposed by these sessions' structure, not by routing policy. Unified is at its ceiling for this workload.

Final ranking and take-aways

Policy	TTFT p90 (ms)	Δ vs baseline	Notes
baseline unified (of=2.0)	8781	—	reference
A (of=1.3)	8730	≈0	affinity p90 -17%, fallback p90 +27%
A+B (of=1.3, lmw=0.01, BUG)	8421	−4%	8007 hotspot from `*num_req` zeroing bug
A+B' (formula fix)	8287	−5.6%	Bug fixed, still 8007 mild hotspot
A+B'+RaceFix	7770	−11.5% ✅	Best unified variant
v3 fixed	10828	+23%	PD-sep migration, cache-blind picker
v3 + Mech B	9711	+11%	PD-sep + cache-rich target picker

Conclusions

Unified path beats v3 PD-sep on this workload by 20%+ TTFT p90. PD-sep migration's fixed cost (src prefill + dst first-token waiting on loaded scheduler) outweighs any decode-time savings for short-output agentic turns.
Three orthogonal fixes compound for an 11.5% TTFT p90 win (each step delta is relative to the previous variant):
- A (overload_factor=1.3): tighter affinity overflow → −0.6% (8781 → 8730), but much cleaner affinity decisions (p90 −17%)
- B' (lmetric_decode_weight=0.01 with max(num_req,1)): decode-aware fallback → −5.1% (8730 → 8287)
- RaceFix (atomic reserve before await): kills concurrent-pick collisions → −6.2% (8287 → 7770)
Race condition was the biggest single hidden bug. The pick ran in _handle_combined but the reservation ran inside the awaited _handle_local_request, so any suspension reached before the reservation let a concurrent handler pick the same instance. (Awaiting a coroutine does not itself yield; its body runs synchronously until its first suspension point.) This affects ANY async dispatch with separate pick/reserve steps, so it is worth checking other routing policies.
8007 p90=22s is workload-inherent. Sessions with 100k+ token turns at 50% cache hit cannot finish faster than 7–15s per turn regardless of routing. Forcing spread would hurt rather than help.
Migration (v3) is not necessary when unified routing is tuned well. Save the PD-sep mechanism for cases where it can be proven net-positive (e.g. very-long-output sessions on extremely overloaded prefill hosts) and use unified A+B'+RaceFix as the default.
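The per-step contribution of each fix can be recomputed from the TTFT p90 values in the final ranking table. A small sketch, with each delta taken relative to the previous variant (the helper name is hypothetical):

```python
# TTFT p90 (ms) per variant, in the order the fixes were stacked.
ttft_p90_ms = [
    ("baseline", 8781),
    ("A", 8730),
    ("A+B'", 8287),
    ("A+B'+RaceFix", 7770),
]

def stepwise_deltas(series):
    """Percent change of each variant relative to the one before it."""
    out = []
    for (_, prev), (name, cur) in zip(series, series[1:]):
        out.append((name, round(100.0 * (cur - prev) / prev, 1)))
    return out

steps = stepwise_deltas(ttft_p90_ms)
first, last = ttft_p90_ms[0][1], ttft_p90_ms[-1][1]
total = round(100.0 * (last - first) / first, 1)
```

The step deltas multiply rather than add: 0.994 × 0.949 × 0.938 ≈ 0.885, which matches the overall −11.5%.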

Direction A+B — run pending

(Will be filled when unified_of13_lmw001_*/unified/ finishes.)
