Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot
cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the LMetric fallback score) and a v3 anti-hotspot recent-migration penalty (effective_load = num_req + recent-migration count over a sliding window), preventing back-to-back migration clustering.

UNIFIED_ABLATION.md documents the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by ~20%. Runners/analyzer for the b3 trace replay included.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
431
microbench/connector_tax/cache_sweep/UNIFIED_ABLATION.md
Normal file
@@ -0,0 +1,431 @@
# Unified routing ablation: A (tighter affinity) + B (decode-aware LMetric)

Goal: judge whether `unified` (cache-aware hybrid affinity + LMetric fallback)
has enough headroom to surpass v3 migration-based routing on agentic
workloads, without invoking PD-sep migration.

## Workload / baseline

- Trace: `w600_r0.0015_st30.jsonl` (1214 reqs, 274 sessions)
- Hardware: 8 × H100 (dash0), Qwen3-Coder-30B-A3B, TP=1, max_model_len=200000
- Trace replay through `cache_aware_proxy.py` with policy `unified`
- Baseline reference run: `b3_replay_20260527_0114/unified/`

| Metric (ms) | baseline (`overload_factor=2.0`) |
|---|---:|
| TTFT p50 | 520 |
| TTFT p90 | **8781** |
| TTFT p99 | 47647 |
| TPOT p90 | 17.8 |
| E2E p90 | 19989 |
| E2E p99 | 85841 |

Reference points we're trying to beat / match:
- v3 fixed rotation (cache-blind picker): TTFT p90 = 10828 ms
- v3 + Mechanism B (cache-rich picker): TTFT p90 = 9711 ms
- Both v3 variants are 11–23% worse than the `unified` baseline on TTFT p90.

## Tail-source diagnostic on baseline

Decision split, baseline unified:

| Decision | n | TTFT mean (ms) | TTFT p90 (ms) | TTFT p99 (ms) |
|---|---:|---:|---:|---:|
| affinity | 852 | 3183 | 7011 | 47432 |
| lmetric_fallback | 362 | 4285 | 12083 | 46036 |

Long-tail (>20s, n=65):
- 40 / 65 came from `affinity` decisions
- 25 / 65 came from `lmetric_fallback`

For the 40 slow `affinity` reqs:
- only 12 / 40 were actually overloaded at decision time (`aff_num_req > avg_num_req`)
- overload ratio at decision: mean=0.93, p50=0.87
- **most slow affinity reqs looked fine when the picker stuck; load piled
  on after dispatch**.

This is a snapshot-based-routing limitation. Tightening
`overload_factor` only helps the genuine cases above the new threshold,
expected to be a 5–10% improvement at best.
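
The overload check behind the 12 / 40 figure can be sketched as below. This is a minimal illustration, not the analysis code that produced the numbers: the record layout (`decision`, `ttft_ms`) and the helper name `slow_affinity_overload` are assumptions, and only `aff_num_req` and `avg_num_req` are names taken from the text above.

```python
from statistics import mean, median

SLOW_MS = 20_000  # long-tail threshold used above (>20 s)


def slow_affinity_overload(records, slow_ms=SLOW_MS):
    """Summarise how loaded the affinity host looked when slow requests were routed.

    Each record is a dict with hypothetical keys: decision, ttft_ms,
    aff_num_req (requests on the affinity host at decision time) and
    avg_num_req (cluster average at decision time).
    """
    slow = [r for r in records
            if r["decision"] == "affinity" and r["ttft_ms"] > slow_ms]
    # A ratio above 1 means the affinity host was busier than the cluster average.
    ratios = [r["aff_num_req"] / max(r["avg_num_req"], 1e-9) for r in slow]
    return {
        "n_slow": len(slow),
        "n_overloaded": sum(ratio > 1.0 for ratio in ratios),
        "ratio_mean": mean(ratios) if ratios else None,
        "ratio_p50": median(ratios) if ratios else None,
    }


# Toy records, not taken from the trace.
demo = [
    {"decision": "affinity", "ttft_ms": 25_000, "aff_num_req": 3, "avg_num_req": 4.0},
    {"decision": "affinity", "ttft_ms": 31_000, "aff_num_req": 6, "avg_num_req": 4.0},
    {"decision": "affinity", "ttft_ms": 900, "aff_num_req": 9, "avg_num_req": 4.0},
    {"decision": "lmetric_fallback", "ttft_ms": 40_000, "aff_num_req": 2, "avg_num_req": 4.0},
]
summary = slow_affinity_overload(demo)
```

A mean ratio below 1.0, as in the 0.93 reported above, means the typical slow affinity request was routed to a host that looked lighter than average at decision time.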

---

## Direction A: tighten affinity overflow

**Hypothesis.** `overload_factor=2.0` lets the picker stick to affinity
even when `affinity.num_req` is up to 2× the cluster average. Reducing to
1.3 forces earlier overflow to LMetric fallback, escaping busy affinity
hosts before the tail blows up.
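
A minimal sketch of the overflow rule described in the hypothesis, assuming the check is "affinity host holds more than `overload_factor` times the cluster-average request count". The function name `should_overflow` and its signature are illustrative and not taken from `cache_aware_proxy.py`.

```python
def should_overflow(aff_num_req, all_num_req, overload_factor):
    """Return True when the picker should leave the affinity host.

    Assumed rule: overflow to the LMetric fallback once the affinity host
    holds more than overload_factor times the cluster-average request count.
    """
    avg = sum(all_num_req) / len(all_num_req)
    return aff_num_req > overload_factor * avg


# Eight workers; the affinity host (first entry) holds 6 requests.
loads = [6, 4, 3, 5, 4, 3, 4, 3]  # cluster average = 4.0
stick_at_2_0 = not should_overflow(6, loads, overload_factor=2.0)  # 6 <= 8.0
overflow_at_1_3 = should_overflow(6, loads, overload_factor=1.3)   # 6 > 5.2
```

With the same snapshot, the host that is 1.5× the average is kept under `overload_factor=2.0` and abandoned under `1.3`, which is the behaviour Direction A tests.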

**Change.** Single CLI flag: `--overload-factor 1.3`. No code change.

**Run.** `unified_of13_20260527_1532/unified/`.

### A vs baseline

| Metric (ms) | baseline (of=2.0) | A (of=1.3) | Δ |
|---|---:|---:|---:|
| TTFT p50 | 520 | 495 | −5% |
| TTFT p90 | 8781 | 8730 | ≈0 |
| TTFT p99 | 47647 | 43059 | −10% |
| TPOT p50 | 7.9 | 8.0 | ≈0 |
| TPOT p90 | 17.8 | **15.5** | **−13%** |
| E2E p50 | 1761 | 1824 | +4% |
| E2E p90 | 19989 | 18407 | −8% |
| E2E p99 | 85841 | **71396** | **−17%** |

TTFT p90 is essentially unchanged but the **deeper tail (p99) and
TPOT both improved meaningfully**. Net: A alone gives roughly −10% to
−17% on the long tail, with medians moving by only a few percent
(E2E p50 +4%).

### Decision split, A vs baseline

| Decision | baseline n / p90 | A n / p90 | Δ p90 |
|---|---|---|---|
| affinity | 852 / 7011 | 817 / **5817** | **−17%** ✅ |
| lmetric_fallback | 362 / 12083 | 397 / **15360** | **+27%** ⚠️ |

The picker now sticks to affinity 35 fewer times. The remaining affinity
decisions are higher-quality (no longer "barely-fitting" cases), so their
p90 drops 17%.

But fallback **got slower** once it absorbed those 35 extra reqs:
fallback p90 went from 12083 to 15360 ms. The LMetric scorer appears to be
selecting worse instances for them.

### Per-worker TTFT under A (of=1.3)

```
port 8000: n= 94 mean=4424 p90=12290   port 8004: n=192 mean=2597 p90= 6968
port 8001: n=135 mean=2779 p90= 5553   port 8005: n=202 mean=3102 p90= 6113
port 8002: n= 88 mean=5827 p90=15804   port 8006: n=136 mean=4006 p90=10899
port 8003: n=217 mean=2674 p90= 4598   port 8007: n=150 mean=3648 p90= 7025
```

The per-port range is the same as baseline (88..217 reqs/port), but A
fills in the middle of the distribution, so load is somewhat more even.
Port 8002 remains slow (p90 15.8 s): LMetric seems to keep routing cold
work to it.

### Why A alone isn't enough

LMetric scorer (`unified_hybrid` fallback path):

```python
score = (pending_prefill_tokens + new_uncached_tokens) * num_requests
```

This **ignores `ongoing_decode_tokens`** entirely. An instance with no
pending prefill but 200k tokens currently in decode looks as good as a
truly idle one (its pending-prefill term is 0, so only the new request's
uncached tokens count), yet a new request landing there waits behind
slow decode iterations caused by the large batch KV reads.

A pushes more requests into fallback, but fallback can't tell which
instance is actually free, so Direction B is a mandatory companion.
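
The blind spot can be reproduced with a few lines. This is a sketch built only from the score formula above; the `Instance` dataclass is a hypothetical stand-in for the proxy's per-instance state, and the token counts are made up.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical container; field names follow the formula in the text.
    name: str
    pending_prefill_tokens: int
    ongoing_decode_tokens: int
    num_requests: int


def lmetric_score(inst, new_uncached_tokens):
    # Original fallback score; lower is better.
    # ongoing_decode_tokens never enters the computation.
    return (inst.pending_prefill_tokens + new_uncached_tokens) * inst.num_requests


idle = Instance("idle", pending_prefill_tokens=0,
                ongoing_decode_tokens=0, num_requests=2)
decoding = Instance("decoding", pending_prefill_tokens=0,
                    ongoing_decode_tokens=200_000, num_requests=2)

new_tokens = 10_000
scores = {inst.name: lmetric_score(inst, new_tokens) for inst in (idle, decoding)}
```

Both instances receive the same score here, so the picker has no basis for preferring the one without 200k tokens in decode.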

---

## Direction B: decode-aware LMetric

**Hypothesis.** Adding a decode-load penalty to the LMetric score lets
fallback distinguish "no prefill queued but heavy decode running" from
"truly idle". This should bring fallback p90 back to the baseline level
(≤ 12 s).

**Change.**
```python
score = (pending_prefill + new + lmetric_decode_weight * ongoing_decode_tokens) * num_requests
```
- `lmetric_decode_weight=0.0` ⇒ original LMetric (control)
- `lmetric_decode_weight=0.01` ⇒ first experiment (rationale: 1 decode token
  in batch costs ~0.01 prefill-token-equivalent in scheduler iter time
  on H100 + Qwen3-Coder-30B-A3B)

CLI: `--lmetric-decode-weight 0.01`. Setting in code:
`cache_aware_proxy.py:Settings.lmetric_decode_weight`.

**Run.** `unified_of13_lmw001_20260527_1628/unified/`.

### A+B vs baseline / A

| Metric (ms) | baseline | A (of=1.3) | A+B (of=1.3, lmw=0.01) | Δ vs baseline |
|---|---:|---:|---:|---:|
| TTFT p50 | 520 | 495 | 514 | −1% |
| **TTFT p90** | 8781 | 8730 | **8421** | **−4%** ✅ |
| TTFT p99 | 47647 | 43059 | 44800 | −6% |
| TPOT p50 | 7.9 | 8.0 | 7.9 | ≈0 |
| TPOT p90 | 17.8 | 15.5 | 15.7 | −12% |
| E2E p50 | 1761 | 1824 | 1870 | +6% |
| E2E p90 | 19989 | 18407 | **21064** | **+5%** ⚠️ |
| E2E p99 | 85841 | 71396 | **64344** | **−25%** ✅ |

Long-tail counts:

```
thresh      baseline     A    A+B   v3 MechB
>  5000ms      170     173    170      177
> 10000ms      105     109    109      119
> 20000ms       65      64     59       78
> 30000ms       41      40     37       50
> 50000ms        8       5      6       14
```

Among the unified variants, A+B has the fewest requests above 20 s and 30 s
(59 and 37), ties baseline above 5 s (170), trails baseline above 10 s
(109 vs 105), and is marginally worse than A above 50 s (6 vs 5).
v3 MechB is worst at every threshold.
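
A minimal sketch of how such threshold counts can be produced from a list of per-request latencies. The helper name `long_tail_counts` is an assumption, the sample values are toy data rather than trace data, and treating the thresholds as TTFT in milliseconds is also an assumption.

```python
def long_tail_counts(latencies_ms,
                     thresholds_ms=(5_000, 10_000, 20_000, 30_000, 50_000)):
    """Count requests whose latency strictly exceeds each threshold."""
    return {t: sum(v > t for v in latencies_ms) for t in thresholds_ms}


# Toy latency samples in milliseconds, not taken from the trace.
sample = [480, 5_200, 9_900, 12_000, 21_500, 33_000, 61_000]
counts = long_tail_counts(sample)

for threshold, n in counts.items():
    print(f"> {threshold:>6}ms {n:>4}")
```

Because the counts are cumulative, a policy can win at one threshold and lose at another, which is why the table is read row by row rather than summarised by a single number.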

### Decision split (A+B vs A)

| Decision | A (of=1.3) | A+B | Note |
|---|---|---|---|
| affinity p90 | 5817 | 5836 | ≈ same |
| fallback p90 | **15360** | **13501** | B recovered some of A's fallback regression |

B partially fixed fallback's selection (−12% on fallback p90 vs A alone),
but fallback is still worse than baseline (12083).

### Per-worker TTFT (A+B)

```
port 8000: n=134 mean=3495 p90=10967   port 8004: n=136 mean=3102 p90= 7906
port 8001: n=143 mean=2981 p90=10189   port 8005: n=179 mean=1624 p90= 2735
port 8002: n=221 mean=2355 p90= 3502   port 8006: n=137 mean=5356 p90= 9628
port 8003: n=146 mean=3932 p90=10729   port 8007: n=118 mean=5210 p90=26798  ← new hotspot
```

A+B trades the baseline's 8002 hotspot (p90=35s) for a new 8007 hotspot
(p90=26.8s). The amplitude is lower, but a hotspot survives.

### Why 8007 became a hotspot under A+B: **found a bug in B**

8007 in A+B: 118 reqs, **53% affinity / 47% fallback** (vs other ports
60–77% affinity), **cache_hit_mean=50.5% (lowest)**.

Top-10 slowest at 8007: all are big-prompt (100k+ tokens) fallback decisions
with `cached_tokens=0` (cold prefill). LMetric is pushing many cold-prefill
fallbacks to 8007.

Looking at the B formula:

```python
decode_pen = lmetric_decode_weight * ongoing_decode_tokens
score = (pending_prefill + new + decode_pen) * num_requests    # ← BUG
```

When `num_requests = 0`, the entire score (including the decode penalty)
zeros out. So an idle-but-decoding host (num_req=0 because its last prefill
finished but decode is still running) looks like score=0, beating every
busy host.

**Fix (B'):** multiply by `max(num_requests, 1)`:

```python
score = (pending_prefill + new + decode_pen) * max(num_requests, 1)
```

Now a host with `num_requests = 0` but a high decode load gets
score = (pending_prefill + new + decode_pen) × 1, a real nonzero penalty,
so it stays competitive with truly idle hosts only when its decode load
is small.
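
The difference between B and B' is easy to check numerically. The sketch below transcribes the two formulas from the text; the function names `score_b` and `score_b_prime` and the example token counts are illustrative assumptions.

```python
def score_b(pending_prefill, new, ongoing_decode_tokens, num_requests,
            lmetric_decode_weight=0.01):
    # First version of B: the whole sum is multiplied by num_requests.
    decode_pen = lmetric_decode_weight * ongoing_decode_tokens
    return (pending_prefill + new + decode_pen) * num_requests


def score_b_prime(pending_prefill, new, ongoing_decode_tokens, num_requests,
                  lmetric_decode_weight=0.01):
    # B': clamp the multiplier so the sum survives when num_requests is 0.
    decode_pen = lmetric_decode_weight * ongoing_decode_tokens
    return (pending_prefill + new + decode_pen) * max(num_requests, 1)


# A host with no counted requests but 200k tokens still in decode,
# receiving a cold 100k-token prompt.
host = dict(pending_prefill=0, new=100_000,
            ongoing_decode_tokens=200_000, num_requests=0)
buggy = score_b(**host)        # the zero multiplier wipes out every term
fixed = score_b_prime(**host)  # 100_000 + 0.01 * 200_000
```

Under B the decoding host scores zero and wins every comparison; under B' it carries the cost of the new prompt plus its decode penalty.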

### A+B': re-run with the fix

**Run.** `unified_of13_lmw001_v2_20260527_1724/unified/`.

| Metric (ms) | baseline | A+B (BUG) | A+B' (fix) | Δ vs baseline |
|---|---:|---:|---:|---:|
| TTFT p50 | 520 | 514 | **485** | −7% |
| **TTFT p90** | 8781 | 8421 | **8287** | **−5.6%** ✅ |
| TTFT p99 | 47647 | 44800 | **41876** | **−12%** ✅ |
| TPOT p90 | 17.8 | 15.7 | 17.5 | −2% |
| E2E p90 | 19989 | 21064 | 20625 | +3% |
| E2E p99 | 85841 | 64344 | 77827 | −9% |

A+B' is **best of all variants on TTFT p90 (8287) and TTFT p99 (41876)**.
Long-tail counts (>30s, >50s) are also best across variants.

vs v3 reference points:

| | TTFT p90 | TPOT p90 | E2E p99 |
|---|---:|---:|---:|
| **A+B'** | **8287** | 17.5 | 77827 |
| v3 fixed (cache-blind) | 10828 | 21.0 | 47610 |
| v3 + Mech B | 9711 | 18.3 | 84492 |

A+B' **beats v3 Mech B by 15% TTFT p90** with no migration overhead.

### Per-worker (A+B' fixed)

```
8000: n=158 p90= 5688    8004: n=189 p90= 4249
8001: n=159 p90= 7323    8005: n=116 p90=14598
8002: n=114 p90= 8726    8006: n=180 p90= 6198
8003: n=173 p90= 6715    8007: n=125 p90=22242  ← still hot
```

A+B' redistributed load more evenly (114..189) but **8007 still has p90=22s**.

### 8007 deep-dive in A+B'

```
8007: n=125, affinity=69 (55%), fallback=56 (45%), cache_hit_mean=lowest
```

Top-15 slow at 8007:
- 7 of them are session **1313181** turns 9–14 (130k+ tokens each, agentic
  long context, ~50% cache hit)
- Several others are cold-start turn-1 of large-prompt sessions
- First two slow reqs arrived **0.7 s apart**, a strong hint of a concurrent
  picker race

### Iteration 3: race-condition fix

**Diagnosis.** In `_handle_combined`:

```python
chosen, best_idx, decision = pick_instance_unified_hybrid(...)   # sync
# ... sync breakdown updates ...
return await _handle_local_request(...)   # ← reservation happens inside this call,
                                          #   not in the same sync step as the pick
```

The pick is synchronous, but the reservation
(`chosen.pending_prefill_tokens += new`, etc.) lives in
`_handle_local_request`, on the far side of the `await`. Awaiting a
coroutine does not by itself hand control to the event loop (the callee's
body starts running immediately), so the gap opens at the first real
suspension point reached before the reservation executes. Whenever that
happens, there is a **window where another coroutine can run and re-pick
the same instance**.

When two big-prompt reqs arrive within milliseconds, both run pick →
both pick the "free" 8007 → both reserve only afterwards. Result: 8007 gets
back-to-back 130k-token cold prefills, each waiting for the other.

**Fix.** Move the reservation **before** the await, inside `_handle_combined`:

```python
# Race fix: reserve atomically with pick, before any await.
chosen.ongoing_tokens += input_length
chosen.pending_prefill_tokens += estimated_new
chosen.num_requests += 1
return await _handle_local_request(..., _pre_reserved=True)
```

`_handle_local_request` skips its own reservation when `_pre_reserved=True`.
PD-sep paths are unaffected (they have their own reservation).
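
The pick/reserve race can be reproduced in isolation with plain `asyncio`. This is a toy model, not the proxy's code: the `Instance` class, the least-loaded `pick`, and both dispatch functions are hypothetical, and `await asyncio.sleep(0)` stands in for whatever suspension separates pick and reservation in the real handler.

```python
import asyncio


class Instance:
    def __init__(self, name):
        self.name = name
        self.num_requests = 0


def pick(instances):
    # Synchronous picker: least-loaded instance, ties broken by list order.
    return min(instances, key=lambda inst: inst.num_requests)


async def dispatch_late_reserve(instances, picks):
    chosen = pick(instances)
    await asyncio.sleep(0)      # stand-in for a suspension before reserving
    chosen.num_requests += 1    # reservation lands after other picks have run
    picks.append(chosen.name)


async def dispatch_early_reserve(instances, picks):
    chosen = pick(instances)
    chosen.num_requests += 1    # reserve in the same synchronous step as the pick
    await asyncio.sleep(0)      # the suspension now comes after the reservation
    picks.append(chosen.name)


async def run(dispatch):
    instances = [Instance("8006"), Instance("8007")]
    picks = []
    # Two requests arriving at effectively the same time.
    await asyncio.gather(dispatch(instances, picks), dispatch(instances, picks))
    return sorted(picks)


late = asyncio.run(run(dispatch_late_reserve))    # both requests share one instance
early = asyncio.run(run(dispatch_early_reserve))  # one request per instance
```

In the late-reserve variant both coroutines read `num_requests == 0` before either increments it, so they collide on the same instance. Reserving before the first suspension makes pick and reserve one uninterruptible step on a single-threaded event loop.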

**Run.** `unified_of13_lmw001_racefix_*`. Hypothesis: 8007 p90
drops to within ±3s of cluster median, since concurrent picks for the
same "free" instance no longer happen.

---

## A+B'+RaceFix: results

**Run.** `unified_of13_lmw001_racefix_20260527_1821/unified/`.

| Metric (ms) | baseline | A+B' | A+B'+RF | Δ vs baseline |
|---|---:|---:|---:|---:|
| TTFT p50 | 520 | 485 | **478** | −8% |
| **TTFT p90** | 8781 | 8287 | **7770** | **−11.5%** ✅ |
| TTFT p99 | 47647 | 41876 | **42447** | −11% |
| TPOT p90 | 17.8 | 17.5 | 18.0 | +1% |
| E2E p90 | 19989 | 20625 | **18418** | −8% |
| E2E p99 | 85841 | 77827 | **71227** | −17% |

vs v3 reference:
- **A+B'+RF TTFT p90 = 7770ms, vs v3 Mech B 9711ms → −20%** ✅

Long-tail counts, baseline → A+B'+RF (best across all variants):
```
> 5s: 170 → 158     > 30s: 41 → 33
>10s: 105 → 103     > 50s:  8 →  4
>20s:  65 →  57     >100s:  0 →  0
```

### Decision split: race fix mainly helped affinity

| Decision | baseline | A+B'+RF |
|---|---:|---:|
| affinity p90 | 7011 | **5042** ✅ (−28%) |
| fallback p90 | 12083 | 13944 (+15%) |

The race condition was hurting affinity decisions the most. When two
concurrent reqs both stuck to a "free-looking" affinity instance, they
piled up and inflated affinity's tail. The fix removed this collision.

### Per-worker

```
8000: n= 86 p90=11541    8004: n=150 p90=11906
8001: n=186 p90= 8307    8005: n=109 p90= 4798
8002: n=105 p90=14540    8006: n=183 p90= 6258
8003: n=264 p90= 3079    8007: n=131 p90=21850  ← still hot
spread now 86..264: race fix did disperse routing
```

### 8007 still hot, but it's **workload-inherent, not a routing bug**

Top sessions on 8007:
```
session 1279412: n=22 mean= 2208 max=18985 decisions: 91% affinity
session 1313181: n=17 mean=17399 max=49089 decisions: 65% affinity
session 1262354: n=15 mean=  622 max= 2325 decisions: 87% affinity
session 1342921: n= 7 mean=17817 max=55589 decisions: 86% affinity
session 1260327: n= 8 mean= 1636 max= 5382 decisions: 75% affinity
session 1268831: n= 5 mean= 1443 max= 2673 decisions: 80% affinity
```

Sessions 1313181 and 1342921 are **long agentic contexts**: 100k–130k tokens
per turn with ~50% cache hit (i.e. roughly 50k–65k new prefill tokens per
turn). Even on a perfectly load-balanced instance, each turn is 7–15s of
pure compute.

Forcing these sessions to spread across instances would mean **cold prefill
every turn (0% cache hit)** → each turn becomes 20–30s instead of 7–15s.
Spreading is **net-negative**.
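
The token arithmetic behind this argument is short enough to write down. The helper `uncached_prefill_tokens` is an illustrative assumption; it only restates that the prefill work for a turn is the prompt length times the uncached fraction, and it says nothing about how seconds scale with tokens.

```python
def uncached_prefill_tokens(prompt_tokens, cache_hit_rate):
    """Tokens that must actually be prefilled for one turn."""
    return round(prompt_tokens * (1.0 - cache_hit_rate))


for prompt in (100_000, 130_000):
    warm = uncached_prefill_tokens(prompt, 0.5)  # session stays on its affinity host
    cold = uncached_prefill_tokens(prompt, 0.0)  # session forced onto a cold host
    print(f"prompt={prompt}: warm={warm} cold={cold} ratio={cold / warm:.1f}x")
```

At a 50% hit rate, moving a session to a cold host doubles the tokens to prefill per turn, which is in line with the 7–15 s versus 20–30 s estimate above.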

→ The 8007 p90=22s is the floor imposed by these sessions' structure,
not by routing policy. Unified is at its ceiling for this workload.

---

## Final ranking and take-aways

| Policy | TTFT p90 (ms) | Δ vs baseline | Notes |
|---|---:|---:|---|
| baseline unified (of=2.0) | 8781 | n/a | reference |
| A (of=1.3) | 8730 | ≈0 | affinity p90 −17%, fallback p90 +27% |
| A+B (of=1.3, lmw=0.01, BUG) | 8421 | −4% | 8007 hotspot from `*num_req` zeroing bug |
| A+B' (formula fix) | 8287 | −5.6% | Bug fixed, 8007 still a mild hotspot |
| **A+B'+RaceFix** | **7770** | **−11.5%** ✅ | **Best unified variant** |
| v3 fixed | 10828 | +23% | PD-sep migration, cache-blind picker |
| v3 + Mech B | 9711 | +11% | PD-sep + cache-rich target picker |
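
The Δ column and the per-step contributions can be recomputed from the TTFT p90 values in the table. The sketch below uses only those numbers; the variable and function names are illustrative.

```python
ttft_p90_ms = {
    "baseline": 8781,
    "A": 8730,
    "A+B'": 8287,
    "A+B'+RaceFix": 7770,
    "v3 fixed": 10828,
    "v3 + Mech B": 9711,
}


def pct_delta(new, ref):
    """Relative change of new against ref, in percent."""
    return (new / ref - 1.0) * 100.0


base = ttft_p90_ms["baseline"]
vs_baseline = {k: round(pct_delta(v, base), 1) for k, v in ttft_p90_ms.items()}

# Contribution of each successive fix, expressed in percent of the baseline value.
chain = ["baseline", "A", "A+B'", "A+B'+RaceFix"]
steps = {
    later: round((ttft_p90_ms[later] - ttft_p90_ms[earlier]) / base * 100.0, 1)
    for earlier, later in zip(chain, chain[1:])
}

best_vs_mech_b = round(
    pct_delta(ttft_p90_ms["A+B'+RaceFix"], ttft_p90_ms["v3 + Mech B"]), 1)
```

Expressing each step against the same baseline makes the three contributions add up to the overall change, which a chain of step-relative percentages would not.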

### Conclusions

1. **Unified path beats v3 PD-sep on this workload by 20%+ TTFT p90.**
   PD-sep migration's fixed cost (src prefill + dst first-token waiting on
   loaded scheduler) outweighs any decode-time savings for short-output
   agentic turns.

2. **Three orthogonal fixes compound for an 11.5% TTFT p90 win** (each step
   expressed against the baseline's 8781 ms):
   - A (`overload_factor=1.3`): tighter affinity overflow → −0.6% but
     much cleaner affinity decisions (p90 −17%)
   - B' (`lmetric_decode_weight=0.01` with `max(num_req,1)`): decode-aware
     fallback → a further −5.0% (8730 → 8287)
   - RaceFix (atomic reserve before await): kills concurrent-pick
     collisions → a further −5.9% (8287 → 7770)

3. **Race condition was the biggest single hidden bug.** When the
   reservation lives in an awaited callee rather than next to the pick,
   any suspension reached before it lets other coroutines pick against
   stale load, so the reservation does not take effect in time to deter
   concurrent picks. (Awaiting a coroutine does not itself yield to the
   event loop; the window comes from the first real suspension before the
   reservation.) This affects ANY async dispatch with separate
   pick/reserve steps, so other routing policies are worth checking.

4. **8007 p90=22s is workload-inherent.** Sessions with 100k+ token turns
   at 50% cache hit cannot finish faster than 7–15s per turn regardless
   of routing. Forcing spread would hurt rather than help.

5. **Migration (v3) is not necessary** when unified routing is tuned
   well. Save the PD-sep mechanism for cases where it can be proven
   net-positive (e.g. very-long-output sessions on extremely overloaded
   prefill hosts) and use unified A+B'+RaceFix as the default.

169
microbench/connector_tax/cache_sweep/analyze_b3_replay.py
Executable file
@@ -0,0 +1,169 @@
#!/usr/bin/env python3
"""B3 5-policy re-test analyser.

Compute TTFT/TPOT/E2E mean/p50/p90/p99 for each policy from
metrics.jsonl, compare against the historical b3_policy_comparison.json
that drives fig_b3_latency_bars.png, and emit a side-by-side table
plus a new figure with the same layout as the original.

Usage:
    python analyze_b3_replay.py --root <outroot> [--old-data <path>] [--figure <path>]
"""

import argparse
import json
import statistics
from pathlib import Path


POLICIES = ["lmetric", "load_only", "sticky", "unified", "unified_v2"]


def pct(xs, p):
    """Percentile by floored rank (no interpolation); None for empty input."""
    if not xs:
        return None
    xs = sorted(xs)
    k = max(0, min(len(xs) - 1, int(p / 100.0 * (len(xs) - 1))))
    return xs[k]


def summarise(path):
    # One JSON object per non-empty line.
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    ok = [r for r in rows if not r.get("error")]
    # Source fields are in seconds; report everything in milliseconds.
    ttft = [r["ttft_s"] * 1000 for r in ok if r.get("ttft_s") is not None]
    # Truthiness check: drops both missing and zero TPOT values.
    tpot = [r["tpot_s"] * 1000 for r in ok if r.get("tpot_s")]
    e2e = [r["latency_s"] * 1000 for r in ok if r.get("latency_s") is not None]
    return {
        "n_total": len(rows),
        "n_ok": len(ok),
        "ttft_mean_ms": statistics.mean(ttft) if ttft else None,
        "ttft_p50_ms": pct(ttft, 50),
        "ttft_p90_ms": pct(ttft, 90),
        "ttft_p99_ms": pct(ttft, 99),
        "tpot_mean_ms": statistics.mean(tpot) if tpot else None,
        "tpot_p50_ms": pct(tpot, 50),
        "tpot_p90_ms": pct(tpot, 90),
        "tpot_p99_ms": pct(tpot, 99),
        "e2e_mean_ms": statistics.mean(e2e) if e2e else None,
        "e2e_p50_ms": pct(e2e, 50),
        "e2e_p90_ms": pct(e2e, 90),
        "e2e_p99_ms": pct(e2e, 99),
    }


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--root", type=Path, required=True)
    ap.add_argument("--old-data", type=Path,
                    default=Path("analysis/characterization/window_1_results/b3_policy_comparison.json"))
    ap.add_argument("--figure", type=Path, default=None)
    args = ap.parse_args()

    new = {}
    for p in POLICIES:
        path = args.root / p / "metrics.jsonl"
        if not path.exists():
            print(f"MISSING: {path}")
            continue
        new[p] = summarise(path)

    old = {}
    if args.old_data.exists():
        with open(args.old_data) as f:
            d = json.load(f)
        for r in d.get("rows", []):
            old[r["policy"]] = {
                "ttft_p50_ms": r["ttft_p50_s"] * 1000,
                "ttft_p90_ms": r["ttft_p90_s"] * 1000,
                "ttft_p99_ms": r["ttft_p99_s"] * 1000,
                "tpot_p90_ms": r["tpot_p90_s"] * 1000,
                "e2e_p90_ms": r.get("e2e_p90_s", 0) * 1000,
            }

    def fmt(v):
        return f"{v:.0f}" if v is not None else "-"

    def pctd(a, b):
        # Relative drift of b against a, as a signed percentage string.
        if a is None or b is None or a == 0:
            return "-"
        return f"{(b/a-1)*100:+.1f}%"

    # Headline table
    print(f"\n# NEW: today's re-test")
    print(f"{'policy':<14}{'n_ok':>6}{'TTFTp50':>10}{'TTFTp90':>10}{'TTFTp99':>10}{'TPOTp90':>10}{'E2Ep90':>10}")
    print("-" * 70)
    for p in POLICIES:
        if p not in new:
            continue
        r = new[p]
        print(f"{p:<14}{r['n_ok']:>6}{fmt(r['ttft_p50_ms']):>9}ms{fmt(r['ttft_p90_ms']):>9}ms{fmt(r['ttft_p99_ms']):>9}ms{fmt(r['tpot_p90_ms']):>9}ms{fmt(r['e2e_p90_ms']):>9}ms")

    print(f"\n# OLD: window_1_results/b3_policy_comparison.json")
    print(f"{'policy':<14}{'TTFTp50':>10}{'TTFTp90':>10}{'TTFTp99':>10}{'TPOTp90':>10}{'E2Ep90':>10}")
    print("-" * 60)
    for p in POLICIES:
        if p not in old:
            continue
        r = old[p]
        print(f"{p:<14}{fmt(r['ttft_p50_ms']):>9}ms{fmt(r['ttft_p90_ms']):>9}ms{fmt(r['ttft_p99_ms']):>9}ms{fmt(r['tpot_p90_ms']):>9}ms{fmt(r['e2e_p90_ms']):>9}ms")

    print(f"\n# DRIFT: today vs old (same policy)")
    print(f"{'policy':<14}{'ΔTTFTp50':>10}{'ΔTTFTp90':>10}{'ΔTTFTp99':>10}{'ΔTPOTp90':>10}{'ΔE2Ep90':>10}")
    print("-" * 60)
    for p in POLICIES:
        if p not in new or p not in old:
            continue
        n, o = new[p], old[p]
        print(f"{p:<14}{pctd(o['ttft_p50_ms'], n['ttft_p50_ms']):>10}"
              f"{pctd(o['ttft_p90_ms'], n['ttft_p90_ms']):>10}"
              f"{pctd(o['ttft_p99_ms'], n['ttft_p99_ms']):>10}"
              f"{pctd(o['tpot_p90_ms'], n['tpot_p90_ms']):>10}"
              f"{pctd(o['e2e_p90_ms'], n['e2e_p90_ms']):>10}")

    # Relative ordering check
    def ranks(values_dict, key):
        items = [(p, r[key]) for p, r in values_dict.items() if r.get(key)]
        items.sort(key=lambda x: x[1])
        return [p for p, _ in items]

    print(f"\n# TTFT p90 ranking (best → worst)")
    for label, src in [("OLD", old), ("NEW", new)]:
        if src:
            order = ranks(src, "ttft_p90_ms")
            print(f"  {label}: {' < '.join(order)}")

    out = {"new": new, "old": old}
    out_path = args.root / "b3_replay_summary.json"
    out_path.write_text(json.dumps(out, indent=2))
    print(f"\nWrote {out_path}")

    # Bar plot (matplotlib); any failure here only skips the figure.
    if not args.figure:
        args.figure = args.root / "fig_b3_latency_bars_new.png"
    try:
        import matplotlib
        matplotlib.use("Agg")
        import matplotlib.pyplot as plt

        pols = [p for p in POLICIES if p in new]
        metrics = [("TTFT p90 (s)", "ttft_p90_ms", 1000),
                   ("TPOT p90 (ms)", "tpot_p90_ms", 1),
                   ("E2E p90 (s)", "e2e_p90_ms", 1000)]
        colors = {"lmetric": "tab:blue", "load_only": "tab:orange",
                  "sticky": "tab:green", "unified": "tab:red",
                  "unified_v2": "tab:purple"}
        fig, axes = plt.subplots(1, 3, figsize=(14, 4.5))
        for ax, (label, key, div) in zip(axes, metrics):
            vals = [new[p][key] / div for p in pols]
            bars = ax.bar(pols, vals,
                          color=[colors.get(p, "gray") for p in pols],
                          edgecolor="black", linewidth=0.5)
            ax.set_title(label)
            ax.tick_params(axis="x", rotation=20)
            for b, v in zip(bars, vals):
                ax.text(b.get_x() + b.get_width() / 2, v, f"{v:.1f}",
                        ha="center", va="bottom", fontsize=9)
            ax.grid(alpha=0.3, axis="y")
        fig.suptitle(f"B3 5-policy re-test ({args.root.name})")
        fig.tight_layout()
        fig.savefig(args.figure, dpi=120)
        print(f"Wrote {args.figure}")
    except Exception as e:
        print(f"(figure skipped: {e})")


if __name__ == "__main__":
    main()
94
microbench/connector_tax/cache_sweep/run_b3_replay.sh
Executable file
@@ -0,0 +1,94 @@
#!/usr/bin/env bash
# B3 routing-policy reproducibility re-test.
#
# Re-runs the 5 routing policies from fig_b3_latency_bars.png on the same
# trace, in a single same-day session, to check whether the ordering
# (unified < load_only < sticky etc.) still holds today.
#
# Policies (in run order):
#   lmetric     plain - cache-aware P_tokens × BS
#   load_only   plain - pure min-num_requests
#   sticky      plain - hard session affinity
#   unified     plain - hybrid affinity + LMetric fallback
#   unified_v2  Mooncake kv_both + selective PD-sep (with DR-fix applied)
#
# unified_v2 is run with VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1 so we
# get the "best Mooncake state" we have today (DR-fix on top of the
# already-fixed mainline after e3a1d70 etc.). The other 4 policies don't
# load any connector so the patch is irrelevant.

set -uo pipefail

PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
DATE="$(date +%Y%m%d_%H%M)"
OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/b3_replay_${DATE}}"
PYTHON="$PROJ_DIR/.venv/bin/python"
DR_FIX_SCRIPT="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
VLLM_ROOT="${VLLM_ROOT:-$PROJ_DIR/.venv/lib/python3.12/site-packages/vllm}"

mkdir -p "$OUTROOT"
echo "=== B3 5-policy re-test ==="
echo "Trace : $TRACE"
echo "Out   : $OUTROOT"
echo "Order : lmetric → load_only → sticky → unified → unified_v2 (DR-fix on)"
echo ""

cleanup_all() {
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 5
    "$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
}
trap cleanup_all EXIT
cleanup_all

# Apply DR-fix once; it's env-gated so only unified_v2 (with env=1) sees it
echo "[stage 0] applying CT_DR_FIX (env-gated, only activates when VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1)"
"$PYTHON" "$DR_FIX_SCRIPT" --apply --vllm-root "$VLLM_ROOT"

run_policy() {
    local policy="$1"
    local skip_dr="$2"
    local rundir="$OUTROOT/$policy"
    mkdir -p "$rundir"

    echo ""
    echo "====== $policy ; DR_SYNC_DISABLED=$skip_dr ======"

    if [ "$skip_dr" = "1" ]; then
        export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
    else
        unset VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC
    fi

    bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "$policy" "$TRACE" "$rundir" \
        2>&1 | tee "$rundir/orchestrator.log" | tail -30
    # Exit status of the policy script itself, not of tee/tail.
    rc="${PIPESTATUS[0]}"
    if [ "$rc" != "0" ]; then
        echo "[FAIL] policy $policy rc=$rc"
    fi
    # Belt-and-braces cleanup between policies
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 10
    return 0
}

run_policy "lmetric"    "0"
run_policy "load_only"  "0"
run_policy "sticky"     "0"
run_policy "unified"    "0"
run_policy "unified_v2" "1"   # uses Mooncake kv_both; activate DR-fix

echo ""
echo "[stage Z] reverting CT_DR_FIX"
"$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT"

echo ""
echo "Done. Artifacts: $OUTROOT"
for p in lmetric load_only sticky unified unified_v2; do
    echo "  $p: $OUTROOT/$p/metrics.jsonl"
done
56
microbench/connector_tax/cache_sweep/run_unified_ablation.sh
Normal file
@@ -0,0 +1,56 @@
#!/usr/bin/env bash
# Single-policy trace replay for unified, with tunable overload-factor
# and LMetric decode weight.
# Used to test direction A (does tightening affinity overflow improve unified?)
# and direction B (decode-aware LMetric fallback).
#
# Usage:
#   OVERLOAD_FACTOR=1.3 bash run_unified_ablation.sh
#   OVERLOAD_FACTOR=1.0 bash run_unified_ablation.sh
#   OVERLOAD_FACTOR=1.3 LMETRIC_DECODE_WEIGHT=0.01 bash run_unified_ablation.sh
#
# Output: $PROJ_DIR/outputs/unified_${TAG}_${DATE}/unified/
#   (TAG defaults to of<OF>, plus _lmw<LMW> when the decode weight is nonzero)

set -uo pipefail

PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
OF="${OVERLOAD_FACTOR:-1.3}"
LMW="${LMETRIC_DECODE_WEIGHT:-0.0}"
TAG_DEFAULT="of${OF/./}"
# Append the decode-weight tag only when LMW is not a spelling of zero (0, 0., 0.0, ...).
if [ "$(printf '%s' "$LMW" | grep -v '^0\.\?0*$' || true)" != "" ]; then
    TAG_DEFAULT="${TAG_DEFAULT}_lmw${LMW/./}"
fi
TAG="${TAG:-$TAG_DEFAULT}"
DATE="$(date +%Y%m%d_%H%M)"
OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/unified_${TAG}_${DATE}}"

mkdir -p "$OUTROOT"
echo "=== unified ablation: overload_factor=$OF ==="
echo "Trace : $TRACE"
echo "Out   : $OUTROOT"
echo ""

cleanup() {
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 3
}
trap cleanup EXIT
cleanup

cfg_dir="$OUTROOT/unified"
mkdir -p "$cfg_dir"

export EXTRA_PROXY_ARGS="--overload-factor $OF --lmetric-decode-weight $LMW"

echo ""
echo "====== unified ; overload_factor=$OF lmetric_decode_weight=$LMW ======"
bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "unified" "$TRACE" "$cfg_dir" \
    2>&1 | tee "$cfg_dir/orchestrator.log" | tail -30

pkill -9 -f cache_aware_proxy 2>/dev/null || true
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "EngineCore" 2>/dev/null || true
sleep 5

echo ""
echo "Done. Artifacts: $OUTROOT/unified/metrics.jsonl"
64
microbench/connector_tax/cache_sweep/run_v3_norot_replay.sh
Normal file
@@ -0,0 +1,64 @@
#!/usr/bin/env bash
# Trace replay for unified_v3 WITHOUT affinity rotation.
#
# This is the #2 follow-up to cache_miss_audit: a prior run showed that v3 with
# rotation gets a 9.5% cache hit rate on the post-migration next turn, vs 80.6%
# for unified. Hypothesis: rotation destroys prefix cache locality. Test by
# keeping the session affinity on prefill_host even after migration (i.e., the
# same behavior as unified for the post-migration write), so only the
# *current* turn's decode is migrated.
#
# Applies CT_DR_FIX (Mooncake DR sync disabled).
# Output: $PROJ_DIR/outputs/b3_v3_norot_${DATE}/unified_v3/

set -uo pipefail

PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
DATE="$(date +%Y%m%d_%H%M)"
OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/b3_v3_norot_${DATE}}"
PYTHON="$PROJ_DIR/.venv/bin/python"
DR_FIX_SCRIPT="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
VLLM_ROOT="${VLLM_ROOT:-$PROJ_DIR/.venv/lib/python3.12/site-packages/vllm}"

mkdir -p "$OUTROOT"
echo "=== unified_v3 (no rotation) trace replay ==="
echo "Trace : $TRACE"
echo "Out : $OUTROOT"
echo ""

cleanup_all() {
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 5
    "$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
}
trap cleanup_all EXIT
cleanup_all

echo "[stage 0] applying CT_DR_FIX (env-gated)"
"$PYTHON" "$DR_FIX_SCRIPT" --apply --vllm-root "$VLLM_ROOT"

cfg_dir="$OUTROOT/unified_v3"
mkdir -p "$cfg_dir"

export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1
export EXTRA_PROXY_ARGS="--v3-rotate-affinity 0"

echo ""
echo "====== unified_v3 (no rotation) ; DR_SYNC_DISABLED=1 ======"
bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "unified_v3" "$TRACE" "$cfg_dir" \
    2>&1 | tee "$cfg_dir/orchestrator.log" | tail -30

pkill -9 -f cache_aware_proxy 2>/dev/null || true
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "EngineCore" 2>/dev/null || true
sleep 5

echo ""
echo "[stage Z] reverting CT_DR_FIX"
"$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT"

echo ""
echo "Done. Artifacts: $OUTROOT/unified_v3/metrics.jsonl"
66
microbench/connector_tax/cache_sweep/run_v3_replay.sh
Executable file
@@ -0,0 +1,66 @@
#!/usr/bin/env bash
# Trace replay for the new unified_v3 (offload-decode) policy.
#
# Runs the same trace as run_b3_replay.sh on a single policy:
#   unified_v3: prefill on session-affinity host (uses prefix cache),
#               decode migrated to a low-load target via Mooncake
#               KV transfer (kv_role=kv_both). Session affinity rotates
#               to decode_target after migration so next turn lands
#               where the KV now lives.
#
# Applies CT_DR_FIX so the run uses the "best Mooncake state" we have
# today (post-e3a1d70 + DR sync skipped).
#
# Usage: bash run_v3_replay.sh

set -uo pipefail

PROJ_DIR="${PROJ_DIR:-/home/admin/cpfs/wjh/agentic-kv}"
TRACE="${TRACE:-$PROJ_DIR/traces/w600_r0.0015_st30.jsonl}"
DATE="$(date +%Y%m%d_%H%M)"
OUTROOT="${OUTROOT:-$PROJ_DIR/outputs/b3_v3_${DATE}}"
PYTHON="$PROJ_DIR/.venv/bin/python"
DR_FIX_SCRIPT="$PROJ_DIR/microbench/connector_tax/cache_sweep/apply_direct_read_fix.py"
VLLM_ROOT="${VLLM_ROOT:-$PROJ_DIR/.venv/lib/python3.12/site-packages/vllm}"

mkdir -p "$OUTROOT"
echo "=== unified_v3 (offload-decode) trace replay ==="
echo "Trace : $TRACE"
echo "Out : $OUTROOT"
echo ""

cleanup_all() {
    pkill -9 -f cache_aware_proxy 2>/dev/null || true
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 5
    "$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT" 2>/dev/null || true
}
trap cleanup_all EXIT
cleanup_all

echo "[stage 0] applying CT_DR_FIX (env-gated)"
"$PYTHON" "$DR_FIX_SCRIPT" --apply --vllm-root "$VLLM_ROOT"

cfg_dir="$OUTROOT/unified_v3"
mkdir -p "$cfg_dir"

# Activate the DR-fix env-gate (unified_v3 uses Mooncake kv_both)
export VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1

echo ""
echo "====== unified_v3 ; DR_SYNC_DISABLED=1 ======"
bash "$PROJ_DIR/scripts/b3_isolated_policy.sh" "unified_v3" "$TRACE" "$cfg_dir" \
    2>&1 | tee "$cfg_dir/orchestrator.log" | tail -30

pkill -9 -f cache_aware_proxy 2>/dev/null || true
pkill -9 -f "vllm serve" 2>/dev/null || true
pkill -9 -f "EngineCore" 2>/dev/null || true
sleep 5

echo ""
echo "[stage Z] reverting CT_DR_FIX"
"$PYTHON" "$DR_FIX_SCRIPT" --revert --vllm-root "$VLLM_ROOT"

echo ""
echo "Done. Artifacts: $OUTROOT/unified_v3/metrics.jsonl"
@@ -19,7 +19,7 @@ import os
 import time as _time
 import urllib.parse
 import uuid
-from collections import OrderedDict
+from collections import OrderedDict, deque
 from contextlib import asynccontextmanager
 from dataclasses import dataclass

@@ -103,6 +103,20 @@ class Settings:
     # auto-transfers only the missing portion (verified via
     # smoke_partial_transfer: cache-rich dst is 77% faster than
     # cold dst at 33k tokens, +512 ext).
+    # Anti-hotspot: picker scores effective_load = num_requests + (recent
+    # migrations received within window). Prevents clustering migrations on
+    # one instance in rapid succession (observed in Mech B run: inst_5 became
+    # a hotspot via post-rotation tail accumulation).
+    v3_recent_mig_window_s: float = 10.0  # sliding window
+    v3_recent_mig_weight: float = 1.0     # how many "virtual requests" each
+                                          # recent migration counts as
+
+    # Direction B knob: LMetric fallback adds decode-token penalty to score.
+    # score = (pending_prefill + new + lmetric_decode_weight * ongoing_decode_tok) * max(num_req, 1)
+    # Empirical iter-time slope on H100 + Qwen3-30B-A3B: each decode token in
+    # batch costs ~0.01 prefill-token-equivalent in scheduler time, so 0.01 is
+    # a reasonable starting weight. Set 0 to disable the decode penalty.
+    lmetric_decode_weight: float = 0.0

     # --- KV connector selection (governs PD-sep handshake) -------------
     # "mooncake": pre-baked kv_transfer_params (bootstrap_addr+engine_id+transfer_id).
@@ -187,6 +201,11 @@ class InstanceState:
         self.dp_size = 1
         # OrderedDict acts as an LRU keyed by block hash; value is unused.
         self.cached_blocks: OrderedDict[int, None] = OrderedDict()
+        # v3 anti-hotspot: timestamps (monotonic) when this instance was picked
+        # as a v3 migration target. Used to compute effective_load = num_req +
+        # recent-migration count over a sliding window, preventing back-to-back
+        # decisions from clustering on the same dst.
+        self.recent_mig_targeted_at: deque[float] = deque(maxlen=64)

     def estimate_cache_hit(self, token_ids: list[int] | None) -> int:
         if not token_ids or len(token_ids) < BLOCK_SIZE:
@@ -417,13 +436,24 @@ def pick_instance_unified_hybrid(
         decision["chosen_idx"] = a_idx
         return a_inst, a_idx, decision

-    keys: list[tuple[int, int, int, int]] = []
+    # Direction B: extend LMetric with decode-load awareness.
+    # Original score = (pending_prefill + new_uncached) * num_requests, which
+    # ignores ongoing decode work. A host with 200k decode tokens looks "ideal"
+    # (P_tokens=0) but its decode iters are slow due to large batch KV reads.
+    #
+    # First attempt (BUG): score = (p_tokens + decode_pen) * num_req; when
+    # num_req=0 the decode_pen is zeroed out, so idle-but-decoding hosts still
+    # look free and accumulate cold prefills (8007 hotspot in A+B v1 run).
+    #
+    # Fix: max(num_req, 1) so decode_pen contributes on idle hosts too.
+    keys: list[tuple[float, int, int, int]] = []
     for i, inst in enumerate(instances):
         cache_hit = inst.estimate_cache_hit(token_ids)
         new_prefill = max(0, input_length - cache_hit)
         p_tokens = inst.pending_prefill_tokens + new_prefill
+        decode_pen = SETTINGS.lmetric_decode_weight * inst.ongoing_decode_tokens
         bs = inst.num_requests
-        score = p_tokens * bs
+        score = (p_tokens + decode_pen) * max(bs, 1)
         keys.append((score, new_prefill, bs, i))

     best_triple = min(k[:3] for k in keys)
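Note on the `max(bs, 1)` clamp: the sketch below replays the arithmetic from the comment in this hunk on two made-up hosts. `Host` and `lmetric_score` are hypothetical stand-ins written for illustration, not names from `cache_aware_proxy.py`, and the token counts are invented.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical stand-in for InstanceState; only the fields the score reads."""
    pending_prefill_tokens: int
    ongoing_decode_tokens: int
    num_requests: int


def lmetric_score(host: Host, new_prefill: int, decode_weight: float,
                  clamp: bool = True) -> float:
    """Decode-aware LMetric score; clamp=False reproduces the first attempt."""
    p_tokens = host.pending_prefill_tokens + new_prefill
    decode_pen = decode_weight * host.ongoing_decode_tokens
    bs = max(host.num_requests, 1) if clamp else host.num_requests
    return (p_tokens + decode_pen) * bs


# One host whose request counter reads 0 while it still reports decode tokens,
# next to a genuinely idle host. Numbers are illustrative only.
decoding = Host(pending_prefill_tokens=0, ongoing_decode_tokens=200_000, num_requests=0)
idle = Host(pending_prefill_tokens=0, ongoing_decode_tokens=0, num_requests=0)

unclamped = [lmetric_score(h, 4096, 0.01, clamp=False) for h in (decoding, idle)]
clamped = [lmetric_score(h, 4096, 0.01) for h in (decoding, idle)]
print("unclamped:", unclamped)  # both hosts collapse to the same score
print("clamped:  ", clamped)    # the decode penalty now separates them
```

Without the clamp both scores are multiplied by zero, so the picker cannot tell the two hosts apart; with it, the decode penalty survives on the host that reports decode work.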
@@ -637,48 +667,80 @@ def pick_instance_unified_v3(
         )
         return prefill_host, prefill_idx, decision, None

-    # Gate 3: pick the lowest-load target that is materially less loaded
-    # than the prefill_host. Cache content irrelevant; KV ships over.
+    # Gate 3: pick the lowest-effective-load target. effective_load adds a
+    # penalty for recent migrations the instance has received (anti-hotspot).
+    now_mono = _time.monotonic()
+    cutoff = now_mono - SETTINGS.v3_recent_mig_window_s
+
+    def effective_load(inst):
+        # Drop expired entries lazily.
+        while inst.recent_mig_targeted_at and inst.recent_mig_targeted_at[0] < cutoff:
+            inst.recent_mig_targeted_at.popleft()
+        recent = len(inst.recent_mig_targeted_at)
+        return inst.num_requests + recent * SETTINGS.v3_recent_mig_weight
+
     threshold_loaded = max(1,
                            int(prefill_host.num_requests * SETTINGS.v3_target_load_ratio))
     candidates = [
         (i, inst) for i, inst in enumerate(instances)
         if i != prefill_idx
-        and inst.num_requests < threshold_loaded
-        and inst.num_requests <= prefill_host.num_requests - SETTINGS.v3_min_load_gap
+        and effective_load(inst) < threshold_loaded
+        and effective_load(inst) <= prefill_host.num_requests - SETTINGS.v3_min_load_gap
     ]
     if not candidates:
         decision["v3_reason"] = (
             f"no_low_load_target "
             f"(prefill_host.num_req={prefill_host.num_requests} "
-            f"threshold={threshold_loaded})"
+            f"threshold={threshold_loaded} "
+            f"eff_loads=[{','.join(f'{int(effective_load(i))}' for i in instances)}])"
         )
         return prefill_host, prefill_idx, decision, None

     # Mechanism B (v3_prefer_cache_target=True): rank candidates first by
-    # cache_hit DESC (more cache = less KV to transfer), then by load. vLLM
-    # auto-skips transferring overlapping prefix when dst's local cache
-    # matches; verified in smoke_partial_transfer: 77% faster on a 33k
-    # prompt when dst has the prefix already.
+    # cache_hit DESC (more cache = less KV to transfer), then by effective_load
+    # (which includes recent-migration penalty), then by ongoing_tokens.
     if SETTINGS.v3_prefer_cache_target:
         decode_target_idx, decode_target = min(
             candidates,
             key=lambda x: (-x[1].estimate_cache_hit(token_ids),
-                           x[1].num_requests, x[1].ongoing_tokens))
+                           effective_load(x[1]),
+                           x[1].ongoing_tokens))
     else:
         decode_target_idx, decode_target = min(
-            candidates, key=lambda x: (x[1].num_requests, x[1].ongoing_tokens))
+            candidates, key=lambda x: (effective_load(x[1]), x[1].ongoing_tokens))

     target_cache_hit = decode_target.estimate_cache_hit(token_ids)
+    target_recent_received = len(decode_target.recent_mig_targeted_at)
+    # Record this decision for the anti-hotspot accounting.
+    decode_target.recent_mig_targeted_at.append(now_mono)

     decision["v3_migrate"] = True
+    decision["v3_decision"] = "migrate_decode"
+    decision["v3_src_idx"] = prefill_idx
     decision["v3_target_idx"] = decode_target_idx
     decision["v3_target_num_req"] = decode_target.num_requests
     decision["v3_target_cache_hit"] = target_cache_hit
+    decision["v3_target_recent_received"] = target_recent_received
     decision["v3_prefill_num_req"] = prefill_host.num_requests
+    # Snapshot of src state at the moment of decision (for postmortem).
+    decision["v3_src_state"] = {
+        "num_requests": prefill_host.num_requests,
+        "ongoing_tokens": prefill_host.ongoing_tokens,
+        "ongoing_decode_tokens": prefill_host.ongoing_decode_tokens,
+        "pending_prefill_tokens": prefill_host.pending_prefill_tokens,
+    }
+    decision["v3_target_state"] = {
+        "num_requests": decode_target.num_requests,
+        "ongoing_tokens": decode_target.ongoing_tokens,
+        "ongoing_decode_tokens": decode_target.ongoing_decode_tokens,
+        "pending_prefill_tokens": decode_target.pending_prefill_tokens,
+        "cache_hit_estimate": target_cache_hit,
+        "recent_mig_received_in_window": target_recent_received,
+    }
     decision["v3_reason"] = (
         f"prefill_host.num_req={prefill_host.num_requests} busy; "
-        f"target.num_req={decode_target.num_requests} cache_hit={target_cache_hit}, "
+        f"target.num_req={decode_target.num_requests} cache_hit={target_cache_hit} "
+        f"recent_received={target_recent_received}, "
         f"transferring KV after prefill"
     )
     return prefill_host, prefill_idx, decision, (decode_target, decode_target_idx)
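Note on the anti-hotspot window: a minimal, self-contained sketch of the sliding-window accounting in this hunk, assuming each target's `num_requests` stays fixed during the burst (as it would before the migrated decodes are registered). `Target`, `pick`, and the timestamps are hypothetical; only the `effective_load` logic mirrors the diff.

```python
from collections import deque


class Target:
    """Hypothetical stand-in for InstanceState with only the anti-hotspot fields."""

    def __init__(self, num_requests: int) -> None:
        self.num_requests = num_requests
        self.recent_mig_targeted_at: deque = deque(maxlen=64)


def effective_load(inst: Target, now: float, window_s: float = 10.0,
                   weight: float = 1.0) -> float:
    """num_requests plus a penalty for migrations received within the window."""
    cutoff = now - window_s
    # Drop expired timestamps lazily, oldest first.
    while inst.recent_mig_targeted_at and inst.recent_mig_targeted_at[0] < cutoff:
        inst.recent_mig_targeted_at.popleft()
    return inst.num_requests + weight * len(inst.recent_mig_targeted_at)


def pick(targets: list, now: float) -> Target:
    """Choose the lowest effective load and record the decision on the winner."""
    best = min(targets, key=lambda t: effective_load(t, now))
    best.recent_mig_targeted_at.append(now)
    return best


a, b = Target(2), Target(2)
picks = [pick([a, b], now=t) for t in (0.0, 0.1, 0.2, 0.3)]
names = ["a" if p is a else "b" for p in picks]
load_after_window = effective_load(a, now=20.0)
print(names, load_after_window)
```

With the penalty the four back-to-back decisions alternate between the two targets; ranking on `num_requests` alone would send all four to `a`. Once the window has passed, the penalty decays and `a` is scored on its request count again.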
@@ -987,9 +1049,13 @@ async def _handle(request: Request, api: str):

 async def _handle_local_request(api, req_data, headers, token_ids, input_length,
                                 chosen: InstanceState, estimated_new: int,
-                                breakdown: dict):
+                                breakdown: dict, *, _pre_reserved: bool = False):
     breakdown.setdefault("route_class", "LOCAL")
     breakdown.setdefault("routed_to", chosen.url)
+    # Skip reservation when called from _handle_combined (it already reserved
+    # synchronously to close the picker→await race). When called directly
+    # from non-combined paths (PD-Sep, offload), reserve here for safety.
+    if not _pre_reserved:
         chosen.ongoing_tokens += input_length
         chosen.pending_prefill_tokens += estimated_new
         chosen.num_requests += 1
@@ -1180,9 +1246,19 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
             src_inst, chosen, breakdown,
             request_id=request_id)

+    # Race fix: reserve load on `chosen` BEFORE the `await` so concurrent
+    # picker calls in the same asyncio event-loop tick see the updated
+    # counters. Without this, two requests arriving back-to-back can both
+    # pick the same "free" instance and both end up running there
+    # simultaneously (observed as 8007 hotspot in A+B run).
+    chosen.ongoing_tokens += input_length
+    chosen.pending_prefill_tokens += estimated_new
+    chosen.num_requests += 1
+    breakdown.setdefault("route_class", "LOCAL")
+    breakdown.setdefault("routed_to", chosen.url)
     return await _handle_local_request(
         api, req_data, headers, token_ids, input_length,
-        chosen, estimated_new, breakdown)
+        chosen, estimated_new, breakdown, _pre_reserved=True)


 async def _handle_combined_pd_sep_v2(
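Note on the race fix: the toy below reproduces the pick-then-await race described in the comment above. It is a sketch, not the proxy's code path: `Inst`, `pick`, and `handle` are hypothetical names, and `asyncio.sleep(0)` stands in for whatever `await` sits between the picker and the counter update.

```python
import asyncio


class Inst:
    """Hypothetical instance with a single load counter."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.num_requests = 0


def pick(instances: list) -> Inst:
    # Least-loaded picker; ties go to the first instance.
    return min(instances, key=lambda i: i.num_requests)


async def handle(instances: list, reserve_before_await: bool, log: list) -> None:
    chosen = pick(instances)
    if reserve_before_await:
        chosen.num_requests += 1      # reservation visible to the next picker call
    await asyncio.sleep(0)            # yields to other handlers
    if not reserve_before_await:
        chosen.num_requests += 1      # too late: others have already picked
    log.append(chosen.name)


async def run(reserve_before_await: bool) -> list:
    instances = [Inst("a"), Inst("b")]
    log: list = []
    await asyncio.gather(
        *(handle(instances, reserve_before_await, log) for _ in range(4)))
    return sorted(log)


reserve_late = asyncio.run(run(False))
reserve_early = asyncio.run(run(True))
print("reserve after await: ", reserve_late)
print("reserve before await:", reserve_early)
```

When the counter is bumped only after the `await`, all four handlers see both instances at zero and pile onto `a`; reserving synchronously spreads them evenly across `a` and `b`.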
@@ -1545,6 +1621,10 @@ def parse_args():
                    help="Mechanism B: unified_v3 picks decode_target with the most"
                         " prefix cache among low-load candidates (default 1). Set 0"
                         " to fall back to pure-load tie-break (cache-blind).")
+    p.add_argument("--lmetric-decode-weight", type=float, default=0.0,
+                   help="Direction B: LMetric fallback adds this × ongoing_decode_tokens"
+                        " to the queue-depth score, so hosts with heavy decode load get"
+                        " penalised. 0 = original behavior; 0.01 is a reasonable start.")
     p.add_argument("--overload-factor", type=float, default=2.0,
                    help="Break session affinity when instance load > factor * avg")
     # The four flags below are accepted for bench.sh backward compatibility but
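Note on `--overload-factor`: the help text states the rule as "load > factor * avg". The sketch below only illustrates that inequality for the two settings in the ablation (2.0 baseline, 1.3 for direction A). The function name is hypothetical, and treating "load" as a per-instance request count is an assumption; the proxy's actual load measure is not shown in this diff.

```python
def should_break_affinity(loads: list, idx: int, overload_factor: float) -> bool:
    """True when instance idx is loaded beyond overload_factor times the fleet average."""
    avg = sum(loads) / len(loads)
    return loads[idx] > overload_factor * avg


# Eight instances, one slightly hotter than the rest (illustrative numbers).
loads = [6, 4, 4, 4, 4, 4, 4, 4]
baseline = should_break_affinity(loads, 0, 2.0)   # threshold 8.5
tight = should_break_affinity(loads, 0, 1.3)      # threshold 5.525
print(baseline, tight)
```

On these numbers the baseline factor keeps the session pinned to the hot instance, while 1.3 lets it overflow to the fallback picker.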
@@ -1585,11 +1665,13 @@ if __name__ == "__main__":
     SETTINGS.v3_rotate_affinity = bool(getattr(global_args, 'v3_rotate_affinity', 1))
     SETTINGS.connector_type = getattr(global_args, 'connector_type', 'mooncake')
     SETTINGS.v3_prefer_cache_target = bool(getattr(global_args, 'v3_prefer_cache_target', 1))
+    SETTINGS.lmetric_decode_weight = float(getattr(global_args, 'lmetric_decode_weight', 0.0))
     print("SETTINGS: throughput=%.0f rdma_overhead=%.2f offload=%s v3_rotate_affinity=%s "
-          "connector_type=%s v3_prefer_cache_target=%s" % (
+          "connector_type=%s v3_prefer_cache_target=%s lmetric_decode_weight=%.3f" % (
               SETTINGS.prefill_throughput, SETTINGS.rdma_overhead_s,
               getattr(global_args, 'offload', False),
               SETTINGS.v3_rotate_affinity,
               SETTINGS.connector_type,
-              SETTINGS.v3_prefer_cache_target))
+              SETTINGS.v3_prefer_cache_target,
+              SETTINGS.lmetric_decode_weight))
     uvicorn.run(app, host=global_args.host, port=global_args.port)