agentic-kvc/microbench/connector_tax/cache_sweep/UNIFIED_ABLATION.md
Gahow Wang 67fcec7933 Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot
cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the
LMetric fallback score) and a v3 anti-hotspot recent-migration penalty
(effective_load = num_req + recent-migration count over a sliding window),
preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents
the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix
sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by
~20%. Runners/analyzer for the b3 trace replay included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:52:44 +08:00


Unified routing ablation: A (tighter affinity) + B (decode-aware LMetric)

Goal: judge whether unified (cache-aware hybrid affinity + LMetric fallback) has enough headroom to surpass v3 migration-based routing on agentic workloads, without invoking PD-sep migration.

Workload / baseline

  • Trace: w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions)
  • Hardware: 8 × H100 (dash0), Qwen3-Coder-30B-A3B, TP=1, max_model_len=200000
  • Trace replay through cache_aware_proxy.py with policy unified
  • b3_replay_20260527_0114/unified/ reference
Metric (ms)   baseline (overload_factor=2.0)
TTFT p50      520
TTFT p90      8781
TTFT p99      47647
TPOT p90      17.8
E2E p90       19989
E2E p99       85841

Reference points we're trying to beat / match:

  • v3 fixed rotation (cache-blind picker): TTFT p90 = 10828
  • v3 + Mechanism B (cache-rich picker): TTFT p90 = 9711
  • All v3 variants are 10-23% worse than the unified baseline on TTFT p90.

Tail-source diagnostic on baseline

Decision split, baseline unified:

Decision            n     TTFT mean   TTFT p90   TTFT p99
affinity            852   3183        7011       47432
lmetric_fallback    362   4285        12083      46036

Long-tail (>20s, n=65):

  • 40 / 65 came from affinity decisions
  • 25 / 65 came from lmetric_fallback

For the 40 slow affinity reqs:

  • only 12 / 40 were actually overloaded at decision time (aff_num_req > avg_num_req)
  • overload ratio at decision: mean=0.93, p50=0.87
  • most slow affinity reqs looked fine when the picker stuck — load piled on after dispatch.

This is a snapshot-based-routing limitation. Tightening overload_factor only helps the genuine cases above the new threshold — expected to be a 5-10% improvement at best.


Direction A — tighten affinity overflow

Hypothesis. overload_factor=2.0 lets the picker stick to affinity even when affinity.num_req is up to 2× the cluster average. Reducing to 1.3 forces earlier overflow to LMetric fallback, escaping busy affinity hosts before the tail blows up.

Change. Single CLI flag: --overload-factor 1.3. No code change.
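The effect of the threshold can be sketched with a small gate function. This is a minimal illustration, not the proxy's actual code: the function name, the `<=` boundary, and the handling of an empty cluster are assumptions. It shows why a request with an overload ratio of about 0.93 (the mean for the slow affinity requests above) sticks under both settings, while only ratios between 1.3 and 2.0 change behaviour.

```python
def sticks_to_affinity(aff_num_req: float, avg_num_req: float,
                       overload_factor: float) -> bool:
    """Hypothetical affinity gate: stay on the affinity host unless it holds
    more than overload_factor times the cluster-average request count."""
    if avg_num_req <= 0:
        return True  # assumption: an empty cluster never counts as overloaded
    return aff_num_req <= overload_factor * avg_num_req


# Overload ratios (aff_num_req / avg_num_req), with the average normalised to 1.0.
for ratio in (0.93, 1.5, 2.5):
    at_2_0 = sticks_to_affinity(ratio, 1.0, overload_factor=2.0)
    at_1_3 = sticks_to_affinity(ratio, 1.0, overload_factor=1.3)
    print(f"ratio={ratio}: of=2.0 sticks={at_2_0}, of=1.3 sticks={at_1_3}")
```

Only the middle case (ratio 1.5) flips from affinity to fallback when the factor is tightened, which matches the expectation that tightening helps only the genuinely overloaded decisions.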

Run. unified_of13_20260527_1532/unified/.

A vs baseline

Metric (ms)   baseline (of=2.0)   A (of=1.3)   Δ
TTFT p50      520                 495          -5%
TTFT p90      8781                8730         ≈0
TTFT p99      47647               43059        -10%
TPOT p50      7.9                 8.0          ≈0
TPOT p90      17.8                15.5         -13%
E2E p50       1761                1824         +4%
E2E p90       19989               18407        -8%
E2E p99       85841               71396        -17%

TTFT p90 is essentially unchanged, but the deeper tail (p99) and TPOT p90 both improved meaningfully. Net: A alone improves the long tail by roughly 10% to 17% (TTFT p99, E2E p99) without materially hurting medians (TTFT p50 -5%, E2E p50 +4%).

Decision split, A vs baseline

Decision            baseline n / p90   A n / p90     Δ p90
affinity            852 / 7011         817 / 5817    -17%
lmetric_fallback    362 / 12083        397 / 15360   +27% ⚠️

The picker now sticks to affinity 35 fewer times. The remaining affinity decisions are higher-quality (no longer "barely-fitting" cases), so their p90 drops 17%.

But the 35 extra reqs that got pushed into fallback got slower: fallback p90 went from 12083 → 15360. The LMetric scorer is selecting a worse instance for them.

Per-worker TTFT under A (of=1.3)

port 8000: n=  94  mean=4424  p90=12290    port 8004: n=192  mean=2597  p90=6968
port 8001: n= 135  mean=2779  p90= 5553    port 8005: n=202  mean=3102  p90=6113
port 8002: n=  88  mean=5827  p90=15804    port 8006: n=136  mean=4006  p90=10899
port 8003: n= 217  mean=2674  p90= 4598    port 8007: n=150  mean=3648  p90= 7025

Compared to baseline (88..217 reqs/port), A spans the same range (88..217) but with more ports in the middle of the distribution. Port 8002 remains slow (p90 15.8s): LMetric seems to keep routing cold work to it.

Why A alone isn't enough

LMetric scorer (unified_hybrid fallback path):

score = (pending_prefill_tokens + new_uncached_tokens) * num_requests

This ignores ongoing_decode_tokens entirely. An instance with no pending prefill but 200k tokens currently in decode scores the same as a truly idle one (its queue term is 0, and the whole score is 0 when num_req=0), yet a new request landing there waits behind slow decode iterations caused by the large batch KV reads.

A pushes more requests into fallback, but fallback can't tell which instance is actually free. → Direction B is a mandatory companion.
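The blind spot can be reproduced with a few lines. The sketch below implements the score formula quoted above; the `Instance` dataclass and its field names are illustrative assumptions, not the proxy's real data model.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Illustrative per-worker state (field names are assumptions)."""
    name: str
    pending_prefill_tokens: int
    num_requests: int
    ongoing_decode_tokens: int  # tracked, but never read by the original score


def lmetric_score(inst: Instance, new_uncached_tokens: int) -> int:
    # Original LMetric: queued prefill work scaled by request count.
    return (inst.pending_prefill_tokens + new_uncached_tokens) * inst.num_requests


decoding = Instance("decoding", pending_prefill_tokens=0, num_requests=2,
                    ongoing_decode_tokens=200_000)
idle = Instance("idle", pending_prefill_tokens=0, num_requests=2,
                ongoing_decode_tokens=0)

new = 10_000
print(lmetric_score(decoding, new), lmetric_score(idle, new))
```

Both instances receive the same score even though one is running 200k tokens of decode, so the scorer has no basis for preferring the idle one.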


Direction B — decode-aware LMetric

Hypothesis. Adding a decode-load penalty to the LMetric score lets fallback distinguish "no prefill queued but heavy decode running" from "truly idle". This should restore fallback p90 to the baseline level (≤ 12s).

Change.

score = (pending_prefill + new + lmetric_decode_weight * ongoing_decode_tokens) * num_requests
  • lmetric_decode_weight=0.0 ⇒ original LMetric (control)
  • lmetric_decode_weight=0.01 ⇒ first experiment (rationale: 1 decode token in batch costs ~0.01 prefill-token-equivalent in scheduler iter time on H100 + Qwen3-30B-A3B)

CLI: --lmetric-decode-weight 0.01. Setting in code: cache_aware_proxy.py:Settings.lmetric_decode_weight.

Run. unified_of13_lmw001_20260527_1628/unified/.

A+B vs baseline / A

Metric (ms)   baseline   A (of=1.3)   A+B (of=1.3, lmw=0.01)   Δ vs baseline
TTFT p50      520        495          514                      -1%
TTFT p90      8781       8730         8421                     -4%
TTFT p99      47647      43059        44800                    -6%
TPOT p50      7.9        8.0          7.9                      ≈0
TPOT p90      17.8       15.5         15.7                     -12%
E2E p50       1761       1824         1870                     +6%
E2E p90       19989      18407        21064                    +5% ⚠️
E2E p99       85841      71396        64344                    -25%

Long-tail counts:

thresh       baseline      A      A+B   v3 MechB
>  5000ms        170      173      170        177
> 10000ms        105      109      109        119
> 20000ms         65       64       59         78
> 30000ms         41       40       37         50
> 50000ms          8        5        6         14

A+B has the fewest requests above the 20s and 30s thresholds and ties baseline at 5s. It is marginally worse than baseline at 10s (109 vs 105) and than A at 50s (6 vs 5).

Decision split (A+B vs A)

Decision       A (of=1.3)   A+B     Note
affinity p90   5817         5836    ≈ same
fallback p90   15360        13501   B recovered some of A's fallback regression

B partially fixed fallback's selection (-12% on fallback p90 vs A alone), but fallback is still worse than baseline (12083).

Per-worker TTFT (A+B)

port 8000: n=134  mean=3495  p90=10967      port 8004: n=136  mean=3102  p90= 7906
port 8001: n=143  mean=2981  p90=10189      port 8005: n=179  mean=1624  p90= 2735
port 8002: n=221  mean=2355  p90= 3502      port 8006: n=137  mean=5356  p90= 9628
port 8003: n=146  mean=3932  p90=10729      port 8007: n=118  mean=5210  p90=26798  ← new hotspot

A+B trades the baseline's 8002 hotspot (p90=35s) for a new 8007 hotspot (p90=26.8s). Lower amplitude but hotspot survives.

Why 8007 became a hotspot under A+B — found a bug in B

8007 in A+B: 118 reqs, 53% affinity / 47% fallback (vs other ports 60-77% affinity), cache_hit_mean=50.5% (lowest).

Top-10 slowest at 8007: all are big-prompt (100k+ tokens) fallback decisions with cached_tokens=0 (cold prefill). LMetric is pushing many cold-prefill fallbacks to 8007.

Looking at the B formula:

decode_pen = lmetric_decode_weight * ongoing_decode_tokens
score = (pending_prefill + new + decode_pen) * num_requests   # ← BUG

When num_requests = 0, the entire score (including decode penalty) zeros out. So an idle-but-decoding host (num_req=0 because its last prefill finished but decode is still running) looks like score=0, beating every busy host.

Fix (B'): multiply by max(num_requests, 1):

score = (pending_prefill + new + decode_pen) * max(num_requests, 1)

Now a host with num_req=0 but high decode load gets score = (pending_prefill + new + decode_pen) × 1, so its decode penalty is no longer zeroed out. It can beat a zero-load host only when its decode load is small.
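The difference between B and B' can be checked numerically. The sketch below implements both formulas as written above; the function names and the three example hosts are illustrative assumptions, with `w` standing in for lmetric_decode_weight.

```python
def score_b(pending_prefill, new, ongoing_decode_tokens, num_requests, w=0.01):
    """Direction B as first implemented: the whole sum is scaled by num_requests."""
    decode_pen = w * ongoing_decode_tokens
    return (pending_prefill + new + decode_pen) * num_requests


def score_b_prime(pending_prefill, new, ongoing_decode_tokens, num_requests, w=0.01):
    """Direction B': clamp the multiplier so num_requests=0 cannot zero the score."""
    decode_pen = w * ongoing_decode_tokens
    return (pending_prefill + new + decode_pen) * max(num_requests, 1)


new = 100_000  # cold 100k-token prompt, nothing cached
hosts = {
    # name: (pending_prefill, ongoing_decode_tokens, num_requests)
    "decoding, num_req=0": (0, 200_000, 0),
    "truly idle":          (0, 0, 0),
    "one light request":   (0, 0, 1),
}

for name, (pending, decode, n) in hosts.items():
    print(f"{name:22s} B={score_b(pending, new, decode, n):>10.1f}  "
          f"B'={score_b_prime(pending, new, decode, n):>10.1f}")
```

Under B the decoding host scores 0 and ties with the truly idle host while beating the lightly loaded one. Under B' it carries its decode penalty and ranks behind both.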

A+B' — re-run with the fix

Run. unified_of13_lmw001_v2_20260527_1724/unified/.

Metric (ms)   baseline   A+B (BUG)   A+B' (fix)   Δ vs baseline
TTFT p50      520        514         485          -7%
TTFT p90      8781       8421        8287         -5.6%
TTFT p99      47647      44800       41876        -12%
TPOT p90      17.8       15.7        17.5         -2%
E2E p90       19989      21064       20625        +3%
E2E p99       85841      64344       77827        -9%

A+B' is the best of the variants so far on TTFT p90 (8287) and TTFT p99 (41876). Its long-tail counts (>30s, >50s) are also the best across variants.

vs v3 reference points:

Policy                   TTFT p90   TPOT p90   E2E p99
A+B'                     8287       17.5       77827
v3 fixed (cache-blind)   10828      21.0       47610
v3 + Mech B              9711       18.3       84492

A+B' beats v3 Mech B by 15% TTFT p90 with no migration overhead.

Per-worker (A+B' fixed)

8000: n=158  p90= 5688      8004: n=189  p90= 4249
8001: n=159  p90= 7323      8005: n=116  p90=14598
8002: n=114  p90= 8726      8006: n=180  p90= 6198
8003: n=173  p90= 6715      8007: n=125  p90=22242   ← still hot

A+B' redistributed load more evenly (114..189) but 8007 still has p90=22s.

8007 deep-dive in A+B'

8007: n=125, affinity=69 (55%), fallback=56 (45%), cache_hit_mean=lowest

Top-15 slow at 8007:

  • 7 of them are session 1313181 turns 9-14 (130k+ tokens each, agentic long context, ~50% cache hit)
  • Several others are cold-start turn-1 of large-prompt sessions
  • First two slow reqs arrived 0.7 s apart — strong hint of concurrent picker race

Iteration 3: race-condition fix

Diagnosis. In _handle_combined:

chosen, best_idx, decision = pick_instance_unified_hybrid(...)  # sync
# ... sync breakdown updates ...
return await _handle_local_request(...)   # ← await yields here
                                          #   THEN reservation happens

The reservation (chosen.pending_prefill_tokens += new, etc.) lives inside _handle_local_request, not next to the picker. Note that awaiting a coroutine does not by itself hand control to the event loop: the callee's body runs synchronously until its first real suspension point. The race window is therefore any suspension that occurs after the picker returns and before the reservation executes; during it, another coroutine can run the picker and choose the same instance.

When two big-prompt reqs arrive within milliseconds, both run pick → both pick the "free" 8007 → both yield → both reserve. Result: 8007 gets back-to-back 130k-token cold prefills, each waiting for the other.

Fix. Move the reservation before the await, inside _handle_combined:

# Race fix: reserve atomically with pick, before any await.
chosen.ongoing_tokens += input_length
chosen.pending_prefill_tokens += estimated_new
chosen.num_requests += 1
return await _handle_local_request(..., _pre_reserved=True)

_handle_local_request skips its own reservation when _pre_reserved=True. PD-sep paths are unaffected (they have their own reservation).
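The collision and the fix can be reproduced in a self-contained asyncio sketch. Everything here is illustrative: `Instance`, `pick`, and the two dispatch functions are hypothetical stand-ins for the proxy's code, and `asyncio.sleep(0)` stands in for whatever suspension separates the pick from the reservation.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    num_requests: int = 0


def pick(instances):
    # Least-loaded pick; ties resolve to the first instance in the list.
    return min(instances, key=lambda inst: inst.num_requests)


async def dispatch_racy(instances, picks):
    chosen = pick(instances)
    await asyncio.sleep(0)       # suspension between pick and reservation
    chosen.num_requests += 1     # reservation lands after others have picked
    picks.append(chosen.name)


async def dispatch_reserved(instances, picks):
    chosen = pick(instances)
    chosen.num_requests += 1     # reserve in the same synchronous step as the pick
    await asyncio.sleep(0)
    picks.append(chosen.name)


async def run(dispatch):
    instances = [Instance("w0"), Instance("w1")]
    picks = []
    # Two requests arriving at effectively the same time.
    await asyncio.gather(dispatch(instances, picks), dispatch(instances, picks))
    return sorted(picks)


racy = asyncio.run(run(dispatch_racy))
fixed = asyncio.run(run(dispatch_reserved))
print("racy: ", racy)
print("fixed:", fixed)
```

In the racy variant both requests see an unreserved first worker and land on it; with the reservation made before the first await, the second request sees the updated count and goes elsewhere.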

Run. Pending — unified_of13_lmw001_racefix_*. Hypothesis: 8007 p90 drops to within ±3s of cluster median, since concurrent picks for the same "free" instance no longer happen.


A+B'+RaceFix — results

Run. unified_of13_lmw001_racefix_20260527_1821/unified/.

Metric (ms)   baseline   A+B'    A+B'+RF   Δ vs baseline
TTFT p50      520        485     478       -8%
TTFT p90      8781       8287    7770      -11.5%
TTFT p99      47647      41876   42447     -11%
TPOT p90      17.8       17.5    18.0      +1%
E2E p90       19989      20625   18418     -8%
E2E p99       85841      77827   71227     -17%

vs v3 reference:

  • A+B'+RF TTFT p90 = 7770ms, vs v3 Mech B 9711ms → -20%

Long-tail counts (best across all variants):

> 5s:   170 → 158         > 30s:   41 → 33
>10s:   105 → 103         > 50s:    8 →  4
>20s:    65 →  57         >100s:    0 →  0

Decision split — race fix mainly helped affinity

Decision       baseline   A+B'+RF
affinity p90   7011       5042 (-28%)
fallback p90   12083      13944 (+15%)

The race-condition was hurting affinity decisions the most. When two concurrent reqs both stuck to a "free-looking" affinity instance, they piled up and inflated affinity's tail. Fix removed this collision.

Per-worker

8000: n=86   p90=11541      8004: n=150  p90=11906
8001: n=186  p90= 8307      8005: n=109  p90= 4798
8002: n=105  p90=14540      8006: n=183  p90= 6258
8003: n=264  p90= 3079      8007: n=131  p90=21850   ← still hot
Per-port n now spans 86..264; the race fix did disperse routing.

8007 still hot — but it's workload-inherent, not a routing bug

Top sessions on 8007:

session 1279412: n=22  mean= 2208  max=18985  decisions: 91% affinity
session 1313181: n=17  mean=17399  max=49089  decisions: 65% affinity
session 1262354: n=15  mean=  622  max= 2325  decisions: 87% affinity
session 1342921: n= 7  mean=17817  max=55589  decisions: 86% affinity
session 1260327: n= 8  mean= 1636  max= 5382  decisions: 75% affinity
session 1268831: n= 5  mean= 1443  max= 2673  decisions: 80% affinity

Sessions 1313181 and 1342921 are long agentic contexts: 100k-130k tokens per turn with ~50% cache hit (i.e. ~50k new tokens of prefill per turn). Even on a perfectly load-balanced instance, each turn is 7-15s of pure compute.

Forcing these sessions to spread across instances would mean cold prefill every turn (0% cache hit) → each turn becomes 20-30s instead of 7-15s. Spreading is net-negative.

→ The 8007 p90=22s is the floor imposed by these sessions' structure, not by routing policy. Unified is at its ceiling for this workload.


Final ranking and take-aways

Policy                        TTFT p90 (ms)   Δ vs baseline   Notes
baseline unified (of=2.0)     8781                            reference
A (of=1.3)                    8730            ≈0              affinity p90 -17%, fallback p90 +27%
A+B (of=1.3, lmw=0.01, BUG)   8421            -4%             8007 hotspot from *num_req zeroing bug
A+B' (formula fix)            8287            -5.6%           Bug fixed, still 8007 mild hotspot
A+B'+RaceFix                  7770            -11.5%          Best unified variant
v3 fixed                      10828           +23%            PD-sep migration, cache-blind picker
v3 + Mech B                   9711            +11%            PD-sep + cache-rich target picker
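The percentage changes in this table can be recomputed from the TTFT p90 values alone. The short sketch below does so, reporting each unified variant both against the previous variant and against baseline; the helper name `pct_change` and the dictionary layout are illustrative choices.

```python
ttft_p90_ms = {
    "baseline": 8781,
    "A": 8730,
    "A+B'": 8287,
    "A+B'+RaceFix": 7770,
    "v3 + Mech B": 9711,
}


def pct_change(new: float, old: float) -> float:
    """Signed percentage change from old to new (negative means faster)."""
    return 100.0 * (new - old) / old


unified_order = ["baseline", "A", "A+B'", "A+B'+RaceFix"]
rows = []
for prev, cur in zip(unified_order, unified_order[1:]):
    step = pct_change(ttft_p90_ms[cur], ttft_p90_ms[prev])
    cumulative = pct_change(ttft_p90_ms[cur], ttft_p90_ms["baseline"])
    rows.append((cur, round(step, 1), round(cumulative, 1)))
    print(f"{cur:14s} step {step:+.1f}%  vs baseline {cumulative:+.1f}%")

vs_v3 = pct_change(ttft_p90_ms["A+B'+RaceFix"], ttft_p90_ms["v3 + Mech B"])
print(f"A+B'+RaceFix vs v3 + Mech B: {vs_v3:+.1f}%")
```

Separating the per-step change from the cumulative change makes it clear which number is being quoted when a single fix is credited with a given improvement.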

Conclusions

  1. Unified path beats v3 PD-sep on this workload by 20%+ TTFT p90. PD-sep migration's fixed cost (src prefill + dst first-token waiting on loaded scheduler) outweighs any decode-time savings for short-output agentic turns.

  2. Three orthogonal fixes compound for an 11.5% TTFT p90 win (each step measured against the previous variant):

    • A (overload_factor=1.3): tighter affinity overflow → -0.6%, but much cleaner affinity decisions (p90 -17%)
    • B' (lmetric_decode_weight=0.01 with max(num_req,1)): decode-aware fallback → -5.1% (8730 → 8287)
    • RaceFix (atomic reserve before await): kills concurrent-pick collisions → -6.2% (8287 → 7770)
  3. Race condition was the biggest single hidden bug. The reservation ran inside the awaited _handle_local_request rather than alongside the pick, so any suspension between the two let concurrent requests pick the same "free" instance. (Awaiting a coroutine does not itself yield; the window opens at the first real suspension point that precedes the reservation.) This affects ANY async dispatch with separate pick/reserve steps, so other routing policies are worth checking.

  4. 8007 p90=22s is workload-inherent. Sessions with 100k+ token turns at 50% cache hit cannot finish faster than 7-15s per turn regardless of routing. Forcing spread would hurt rather than help.

  5. Migration (v3) is not necessary when unified routing is tuned well. Save the PD-sep mechanism for cases where it can be proven net-positive (e.g. very-long-output sessions on extremely overloaded prefill hosts) and use unified A+B'+RaceFix as the default.

