v2 exp(d): 5-policy routing under tracets vs thinktime — ranking flip

Extends exp(c) (dispatch ablation, 1 round-robin policy) to the full 5-policy
routing comparison, both modes on the SAME ttp trace (807 reqs, fresh vLLM/arm,
dash0 8xH20). Confirms exp(c)'s prediction and finds something stronger: the
dispatch mode FLIPS which policy wins.

- thinktime helps every policy but helps LPWL most (TTFT p90 -40%, E2E mean -31%
  vs -3..-16% for the rest): tracets bursts punish prefill-spreading.
- Ranking flip: tracets -> LPWL only ties unified_ab on TTFT p90 and is 3rd on
  E2E mean; thinktime -> LPWL is 1st on both (TTFT p90 -31%, best TPOT/balance,
  zero knobs) vs the tuned unified+A+B.
- => benchmark agentic routing with thinktime; tracets' burst artifact erases
  LPWL's advantage. Caveat n=1: the tracets ranking is run-sensitive (it does not
  reproduce dash1 lpwl_5policy_600s.md); the thinktime advantage is the robust
  signal (it appears in both environments).

README + grouped-bar fig (figs/exp_d_policy_dispatch.png) + bench_report
summaries in results/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 20:59:18 +08:00
parent 68f21bef23
commit 9b6091fe6e
5 changed files with 1788 additions and 0 deletions


@@ -0,0 +1,114 @@
# exp (d) — 5-policy routing under `tracets` vs `thinktime`
exp (c) showed the **dispatch mode** changes measured performance for a single
round-robin policy, and predicted: *"a cache-aware policy (LPWL) would lower the
latencies and likely **widen** the thinktime advantage."* exp (d) tests that with
the full routing comparison — and finds something stronger: **the dispatch mode
flips which policy wins.**

**Question.** Does the parameter-free LPWL still beat the tuned `unified+A+B`
baseline once we benchmark with the *faithful* `thinktime` load instead of the
`tracets` burst artifact?
## Setup
5 routing policies, each on its own **fresh vLLM (cold APC)** on dash0 8×H20,
Qwen3-Coder-30B-A3B, via `scripts/b3_isolated_policy.sh`. **Both dispatch modes
run on the *same* trace** `traces/w600_r0.0015_st30_first600s_ttp.jsonl` (807
reqs, 274 sessions) — the only variable is `REPLAY_DISPATCH_MODE`
(`tracets` ignores the `time_to_parent_chat` field, `thinktime` consumes it).
Analyzer: `scripts/bench_report.py` (summaries in `results/`).
- `leastwork`: **LPWL**, parameter-free (`pending_prefill + max(0, input - cache_hit)`)
- `unified_ab`: unified hybrid, tuned A+B (`of=1.3, lmw=0.01`)
- `unified_def`: unified hybrid, defaults (`of=2.0, lmw=0.0`)
- `lmetric`: P_tokens × BS, no affinity
- `sticky`: hard session affinity
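
The `leastwork` score above can drive placement with no tunable weights. Below is a minimal sketch of that idea, not the router's code. `Worker`, `cached_prefix`, `lpwl_score` and `pick_worker` are hypothetical names. The sketch assumes the router already has a per-worker estimate of how many prefix tokens of a session are resident in that worker's APC; how that estimate is obtained is not shown in this experiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Worker:
    """Hypothetical router-side view of one vLLM worker."""
    name: str
    pending_prefill: int  # prefill tokens already queued on this worker
    # session id -> prefix tokens assumed resident in this worker's APC
    cached_prefix: Dict[str, int] = field(default_factory=dict)


def lpwl_score(w: Worker, session: str, input_tokens: int) -> int:
    """pending_prefill + max(0, input - cache_hit) if the request lands on w."""
    cache_hit = w.cached_prefix.get(session, 0)
    return w.pending_prefill + max(0, input_tokens - cache_hit)


def pick_worker(workers: List[Worker], session: str, input_tokens: int) -> Worker:
    """Least total prefill work wins; ties go to the first worker in list order."""
    return min(workers, key=lambda w: lpwl_score(w, session, input_tokens))


workers = [Worker("w0", 2000, {"s1": 30000}), Worker("w1", 500)]
print(pick_worker(workers, "s1", 32000).name)  # w0 (2000 + 2000 < 500 + 32000)
print(pick_worker(workers, "s2", 8000).name)   # w1 (500 + 8000 < 2000 + 8000)
```

The same rule produces both affinity (a warm session stays where its prefix lives) and spreading (a cold request goes to the least-loaded worker). This is why the policy needs no knob trading one off against the other.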
## Result (ms; `figs/exp_d_policy_dispatch.png`)
| policy | mode | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal (max/min reqs per worker) |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| **LPWL** | tracets | 11099 | 9827 | 25366 | 93929 | 33 | 0.650 | **1.49×** |
| **LPWL** | **thinktime** | **6713** | **6788** | **17635** | 69946 | **18** | 0.676 | 1.94× |
| unified+A+B | tracets | 10783 | 8531 | 22063 | 75419 | 21 | 0.667 | 1.54× |
| unified+A+B | thinktime | 9736 | 7131 | 18690 | **63788** | 19 | 0.676 | 2.16× |
| unified default | tracets | 12997 | 8366 | 22819 | 82257 | 20 | 0.693 | 1.56× |
| unified default | thinktime | 11268 | 7975 | 24096 | 72334 | 22 | 0.693 | 2.91× |
| LMetric | tracets | 16492 | 10775 | 27791 | 99231 | 39 | 0.495 | 2.19× |
| LMetric | thinktime | 15607 | 9902 | 27819 | 73672 | 30 | 0.483 | 2.10× |
| sticky | tracets | 15236 | 10139 | 27974 | 82362 | 31 | 0.693 | 2.06× |
| sticky | thinktime | 14838 | 8663 | 24966 | 70933 | 24 | 0.694 | 2.48× |
### Finding 1 — `thinktime` helps every policy, but helps **LPWL the most**
Per-policy `tracets` → `thinktime` change (negative = thinktime better):

| policy | ΔTTFT p90 | ΔE2E mean | ΔTPOT p90 |
|---|---:|---:|---:|
| **LPWL** | **−40%** | **−31%** | **−45%** |
| unified+A+B | −10% | −16% | −10% |
| unified default | −13% | −5% | +10% |
| LMetric | −5% | −8% | −23% |
| sticky | −3% | −15% | −23% |

`tracets` collapses the inter-turn think time to ~0 (exp c). This manufactures bursts
→ peak concurrency → KV pressure → preemption. Those bursts punish exactly the
policy that spreads prefill most evenly across workers (LPWL keeps the tightest
request balance under `tracets`, 1.49×). Under a burst, spreading sacrifices
locality without the slack to amortize it. This is consistent with LPWL's APC
dropping to 0.650 under `tracets` vs 0.676 under `thinktime`. Remove the artifact
and LPWL's prefill-aware placement pays off.
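
The think-time collapse can be made concrete with a toy model of when a replayer issues a follow-up turn. This is a sketch under stated assumptions, not the harness code. It assumes turns within a session are serialized. Under `tracets`, a follow-up is sent at its trace timestamp, or as soon as its parent completes if that is later. Under `thinktime`, it is sent `time_to_parent_chat` seconds after the parent completes. What that field measures is an assumption here, and `send_time` is a hypothetical helper.

```python
from typing import Optional


def send_time(mode: str, trace_ts: float, parent_done: Optional[float],
              ttp: float) -> float:
    """Hypothetical model of when a replayer issues a turn (seconds)."""
    if parent_done is None:  # first turn of a session: both modes follow the trace clock
        return trace_ts
    if mode == "tracets":    # ignores time_to_parent_chat; only waits for the parent
        return max(trace_ts, parent_done)
    if mode == "thinktime":  # consumes time_to_parent_chat as think time after the parent
        return parent_done + ttp
    raise ValueError(f"unknown dispatch mode: {mode}")


# A follow-up was logged at t=15 s after 10 s of user think time, but in replay the
# parent only finishes at t=20 s (the replay is slower than the original system).
for mode in ("tracets", "thinktime"):
    t = send_time(mode, trace_ts=15.0, parent_done=20.0, ttp=10.0)
    print(mode, t, "think =", t - 20.0)
```

In this model, whenever the replay lags the original system, `tracets` gives every lagging session zero think time. Those sessions then fire their next turns back-to-back, which is the burst described above. `thinktime` keeps each session's recorded pause.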
### Finding 2 — the dispatch mode **flips the cross-policy ranking**
- **TTFT p90:** `tracets` → unified_ab (10.8s) ≈ LPWL (11.1s); LPWL only *ties*,
  and is even slightly behind. `thinktime` → **LPWL (6.7s)** < unified_ab (9.7s):
  LPWL is first, **−31%** vs the tuned baseline.
- **E2E mean:** `tracets` → unified_def (8.4s) < unified_ab (8.5s) < **LPWL (9.8s)**:
  LPWL is *3rd, behind both unified variants*. `thinktime` → **LPWL (6.8s)** <
  unified_ab (7.1s) < unified_def (8.0s): LPWL is **first**.

So under artificial `tracets` bursts, the parameter-free policy looks tied-or-worse.
Under the faithful `thinktime` load, it is the clear winner on TTFT and E2E, with
zero knobs and the tightest balance (1.94×).
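
The flip can be re-derived from the rounded numbers in the Result table. The sketch below is only a cross-check of the bullets above. The dictionary names are illustrative, and the values are copied from the table.

```python
# Values copied from the Result table above (ms).
E2E_MEAN = {
    "tracets":   {"LPWL": 9827, "unified_ab": 8531, "unified_def": 8366,
                  "LMetric": 10775, "sticky": 10139},
    "thinktime": {"LPWL": 6788, "unified_ab": 7131, "unified_def": 7975,
                  "LMetric": 9902, "sticky": 8663},
}
TTFT_P90 = {
    "tracets":   {"LPWL": 11099, "unified_ab": 10783, "unified_def": 12997,
                  "LMetric": 16492, "sticky": 15236},
    "thinktime": {"LPWL": 6713, "unified_ab": 9736, "unified_def": 11268,
                  "LMetric": 15607, "sticky": 14838},
}


def ranking(metric, mode):
    """Policies ordered best (lowest latency) first."""
    return sorted(metric[mode], key=metric[mode].get)


def rel_change(new, base):
    """Relative change; negative means `new` is faster than `base`."""
    return (new - base) / base


for name, metric in (("TTFT p90", TTFT_P90), ("E2E mean", E2E_MEAN)):
    for mode in ("tracets", "thinktime"):
        order = ranking(metric, mode)
        print(f"{name:<9} {mode:<9} LPWL rank {order.index('LPWL') + 1}: {order}")

lpwl, ab = TTFT_P90["thinktime"]["LPWL"], TTFT_P90["thinktime"]["unified_ab"]
print(f"thinktime TTFT p90, LPWL vs unified_ab: {rel_change(lpwl, ab):+.0%}")
```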
## Conclusion
**Benchmark agentic routing with `thinktime`. Under it, the parameter-free LPWL is
the best of the five policies.** Against the *tuned* `unified+A+B`, it has TTFT p90
−31%, E2E mean −5% and E2E p90 −6%, plus the best TPOT and the tightest balance.
The `tracets` burst artifact is precisely what erases that advantage; it even drops
LPWL to 3rd on E2E mean. This both confirms exp (c)'s prediction and provides
independent evidence for the GPU-hit-first routing story: faithful load rewards
keeping the active working set GPU-resident.
## Caveats
- **n = 1 per arm.** The `tracets` ranking here does **not** reproduce the earlier
  dash1 `analysis/lpwl_5policy_600s.md`, which saw LPWL win TTFT p90 by 31% *under
  tracets*; on dash0 `tracets` it is a tie. That is, **`tracets` rankings are
  run/harness-sensitive**. The robust signal is the `thinktime` advantage, which
  appears in *both* environments. Repeat ×3 to bound noise.
- LPWL's one persistent weak spot is **E2E p99** (thinktime 69.9s vs unified_ab
  63.8s). The cause is the structural HEAVY+ >50k decode tail, which is present
  under every policy and is not routing-fixable (see the κ-ablation in
  `lpwl_5policy_600s.md`).
- `thinktime` advantage is a capacity-slack effect; under saturation the modes
converge (exp c, N=6).
## Repro
```bash
# 1. annotate the full trace with time_to_parent_chat (dash0; once)
python scripts/add_ttp_streaming.py 051315-051317.jsonl 051315-051317-ttp.jsonl \
    051315-051317-raw.jsonl
# 2. resample (same seed reproduces traces/w600_r0.0015_st30.jsonl + the ttp field;
#    first600s = timestamp<600 filter)
python scripts/sample_trace.py --input 051315-051317-ttp.jsonl \
    --output traces/w600_r0.0015_st30_ttp.jsonl \
    --window-seconds 600 --sample-ratio 0.0015 --max-single-turn-ratio 0.30 --seed 42
# 3. run both modes x 5 policies (~3.5 h, fresh vLLM/arm)
TRACE_FILE=traces/w600_r0.0015_st30_first600s_ttp.jsonl \
    bash microbench/connector_tax/cache_sweep/run_5policy_both_modes.sh
# 4. report (once per dispatch mode; plot.py reads results/tracets.json and
#    results/thinktime.json) + plot
python scripts/bench_report.py --root outputs/policy5_600s_thinktime_<date> \
    --json v2/exp_d_policy_dispatch/results/thinktime.json \
    leastwork unified_ab unified_def lmetric sticky
#    ...repeat for the tracets output root with
#    --json v2/exp_d_policy_dispatch/results/tracets.json
python v2/exp_d_policy_dispatch/plot.py
```
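
Step 2's comment describes the `first600s` trace as a `timestamp<600` filter, but the block does not show that command. Below is a minimal sketch of such a filter. It assumes each JSONL line is one request object whose `timestamp` field is in seconds from the start of the sampled window. `first_n_seconds` is a hypothetical helper, not a script in this repo.

```python
import json


def first_n_seconds(src: str, dst: str, limit_s: float = 600.0) -> int:
    """Copy JSONL records with timestamp < limit_s from src to dst; return the count kept.

    Assumes one JSON object per line with a numeric 'timestamp' in seconds.
    """
    kept = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            if json.loads(line)["timestamp"] < limit_s:
                fout.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept
```

For example, `first_n_seconds("traces/w600_r0.0015_st30_ttp.jsonl", "traces/w600_r0.0015_st30_first600s_ttp.jsonl")` would write the trace consumed in step 3. The filter passes each line through unchanged, so the `time_to_parent_chat` field survives for `thinktime` dispatch.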


@@ -0,0 +1,68 @@
"""exp (d): 5-policy routing under tracets vs thinktime dispatch.

Shows the ranking FLIP: under the faithful `thinktime` load the parameter-free
LPWL (leastwork) is the clear winner, but under `tracets` (think-collapse bursts)
its advantage disappears (it ties unified_ab on TTFT p90 and *loses* on E2E mean).

Reads the two bench_report summaries; writes v2/figs/exp_d_policy_dispatch.png.

Usage: python v2/exp_d_policy_dispatch/plot.py
"""
import json
import os

import matplotlib

matplotlib.use("Agg")  # headless backend; must be selected before importing pyplot
import matplotlib.pyplot as plt

HERE = os.path.dirname(os.path.abspath(__file__))


def load(name):
    # bench_report summary: {arm: {"ttft_ms": {...}, "e2e_ms": {...}, ...}}
    with open(os.path.join(HERE, "results", name)) as f:
        return json.load(f)


TC = load("tracets.json")
TT = load("thinktime.json")

# canonical order: LPWL first; pretty labels
ARMS = ["leastwork", "unified_ab", "unified_def", "lmetric", "sticky"]
LABEL = {"leastwork": "LPWL\n(leastwork)", "unified_ab": "unified\n+A+B",
         "unified_def": "unified\ndefault", "lmetric": "LMetric", "sticky": "sticky"}
C_TC, C_TT = "#d62728", "#2ca02c"  # tracets red / thinktime green (match exp_c)


def panel(ax, key, sub, title, ylab):
    """Grouped bars of TC/TT[arm][key][sub] per arm, converted from ms to s."""
    tc = [TC[a][key][sub] / 1000.0 for a in ARMS]  # ms -> s
    tt = [TT[a][key][sub] / 1000.0 for a in ARMS]
    x = range(len(ARMS))
    w = 0.38
    b1 = ax.bar([i - w / 2 for i in x], tc, w, label="tracets (burst)", color=C_TC)
    b2 = ax.bar([i + w / 2 for i in x], tt, w, label="thinktime (faithful)", color=C_TT)
    for bars in (b1, b2):
        for r in bars:  # value label on top of each bar
            ax.text(r.get_x() + r.get_width() / 2, r.get_height(),
                    f"{r.get_height():.1f}", ha="center", va="bottom", fontsize=8)
    ax.set_xticks(list(x))
    ax.set_xticklabels([LABEL[a] for a in ARMS], fontsize=9)
    ax.set_ylabel(ylab)
    ax.set_title(title, fontsize=11)
    ax.grid(axis="y", alpha=.3)
    ax.set_ylim(0, max(tc + tt) * 1.18)
    # mark LPWL-thinktime as the winner (lowest green) in each panel
    ax.annotate("LPWL wins\nunder thinktime", xy=(0 + w / 2, tt[0]),
                xytext=(0.9, max(tc + tt) * 0.86), fontsize=8.5, color=C_TT,
                ha="left", arrowprops=dict(arrowstyle="->", color=C_TT, lw=1.3))
    return b1, b2


fig, (axL, axR) = plt.subplots(1, 2, figsize=(11.2, 4.6))
panel(axL, "ttft_ms", "p90", "TTFT p90 (lower = better)", "TTFT p90 (s)")
panel(axR, "e2e_ms", "mean", "E2E mean (lower = better)", "E2E mean (s)")
axL.legend(loc="upper left", fontsize=9)
fig.suptitle("5-policy routing: dispatch mode flips the ranking "
             "(LPWL is best under faithful thinktime, only ties/loses under tracets bursts)",
             fontsize=11.5)
fig.tight_layout(rect=(0, 0, 1, 0.95))
out = os.path.normpath(os.path.join(HERE, "..", "figs", "exp_d_policy_dispatch.png"))
os.makedirs(os.path.dirname(out), exist_ok=True)  # v2/figs may not exist yet
fig.savefig(out, dpi=140)
print("wrote", out)

# also print the deltas the README cites
print("\npolicy        TTFTp90 tc->tt      E2Emean tc->tt")
for a in ARMS:
    t1, t2 = TC[a]["ttft_ms"]["p90"], TT[a]["ttft_ms"]["p90"]
    e1, e2 = TC[a]["e2e_ms"]["mean"], TT[a]["e2e_ms"]["mean"]
    print(f"{a:<13} {t1/1000:5.1f}->{t2/1000:4.1f}s ({(t2-t1)/t1:+.0%})  "
          f"{e1/1000:5.1f}->{e2/1000:4.1f}s ({(e2-e1)/e1:+.0%})")


@@ -0,0 +1,803 @@
{
"leastwork": {
"n_total": 807,
"n_ok": 807,
"window_s": 986.1941225528717,
"ttft_ms": {
"n": 807,
"mean": 3043.454534307026,
"p50": 681.8344180064742,
"p90": 6712.89858900127,
"p99": 41146.725983999204
},
"tpot_ms": {
"n": 806,
"mean": 17.12884673518703,
"p50": 7.770131949655479,
"p90": 17.997618232737178,
"p99": 133.81680370757084
},
"e2e_ms": {
"n": 807,
"mean": 6787.973176127951,
"p50": 2026.8339599715546,
"p90": 17635.302426991984,
"p99": 69945.72682998842
},
"throughput": {
"decode_tps": 234.00362537409853,
"prefill_tps": 8660.302069020001,
"total_tps": 8894.305694394101,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 8540739
},
"apc": 0.6756355919409787,
"per_worker": {
"0": {
"n": 96,
"decode_tps": 48.631399136561754,
"prefill_tps": 812.7547930676582,
"ttft_p90_ms": 5368.347445008112,
"gpu_util_mean": 48.6875,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 111,
"decode_tps": 28.45180209284375,
"prefill_tps": 954.9580335787387,
"ttft_p90_ms": 3442.4916800053325,
"gpu_util_mean": 40.479166666666664,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 99,
"decode_tps": 35.558922120953866,
"prefill_tps": 901.7494422882478,
"ttft_p90_ms": 5583.948273997521,
"gpu_util_mean": 48.395833333333336,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 88,
"decode_tps": 20.717016592141224,
"prefill_tps": 1149.215934349922,
"ttft_p90_ms": 6448.1909119931515,
"gpu_util_mean": 38.020833333333336,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 124,
"decode_tps": 38.884839326290034,
"prefill_tps": 891.8842445776638,
"ttft_p90_ms": 4944.760143000167,
"gpu_util_mean": 40.020833333333336,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 110,
"decode_tps": 20.013301183451194,
"prefill_tps": 1581.959336729224,
"ttft_p90_ms": 27228.53080899222,
"gpu_util_mean": 78.19791666666667,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 64,
"decode_tps": 25.779914337947165,
"prefill_tps": 1114.0737658787832,
"ttft_p90_ms": 18414.893322013086,
"gpu_util_mean": 49.833333333333336,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 115,
"decode_tps": 15.966430583909537,
"prefill_tps": 1253.7065185497638,
"ttft_p90_ms": 9039.336649002507,
"gpu_util_mean": 39.5625,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {},
"gpu_captured": true,
"spread": {
"n_ratio": 1.9375,
"ttft_p90_ratio": 7.909541500751002,
"gpu_util_ratio": 2.0567123287671234,
"gpu_util_min": 38.020833333333336,
"gpu_util_max": 78.19791666666667
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 192.46459313074845,
"p50": 177.03324498143047,
"p90": 313.57523999758996,
"p99": 553.8838730135467
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 772.5742901807313,
"p50": 677.829442982329,
"p90": 1460.6262099987362,
"p99": 2101.3274399738293
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 2004.694984432952,
"p50": 1127.2326559992507,
"p90": 5081.04542500223,
"p99": 9901.586207997752
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 9134.502951365745,
"p50": 2167.4920289951842,
"p90": 28926.44312098855,
"p99": 49472.52169801504
}
}
}
},
"unified_ab": {
"n_total": 807,
"n_ok": 807,
"window_s": 986.5525379180908,
"ttft_ms": {
"n": 807,
"mean": 3592.357064001708,
"p50": 676.4678099716548,
"p90": 9736.127940996084,
"p99": 42370.66501099616
},
"tpot_ms": {
"n": 806,
"mean": 13.200466578008895,
"p50": 7.819523662692517,
"p90": 19.090397550442486,
"p99": 133.40408908212945
},
"e2e_ms": {
"n": 807,
"mean": 7131.188424004758,
"p50": 2037.0979200233705,
"p90": 18689.829077018658,
"p99": 63787.50272799516
},
"throughput": {
"decode_tps": 233.91861166055818,
"prefill_tps": 8640.029468666471,
"total_tps": 8873.948080327029,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 8523843
},
"apc": 0.6762772765819173,
"per_worker": {
"0": {
"n": 58,
"decode_tps": 29.088161954921237,
"prefill_tps": 930.9397773565431,
"ttft_p90_ms": 13273.868343996583,
"gpu_util_mean": 44.989583333333336,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 98,
"decode_tps": 24.162930086120934,
"prefill_tps": 1018.370498666148,
"ttft_p90_ms": 4365.537890000269,
"gpu_util_mean": 38.90625,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 110,
"decode_tps": 35.40713612040818,
"prefill_tps": 965.8167845888297,
"ttft_p90_ms": 4610.747697995976,
"gpu_util_mean": 52.114583333333336,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 102,
"decode_tps": 20.719626390233998,
"prefill_tps": 1126.5056419045684,
"ttft_p90_ms": 10947.632670984603,
"gpu_util_mean": 41.703125,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 99,
"decode_tps": 44.64435324746667,
"prefill_tps": 911.5449663712324,
"ttft_p90_ms": 4116.690531984204,
"gpu_util_mean": 42.671875,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 110,
"decode_tps": 29.724722072971574,
"prefill_tps": 918.851216898154,
"ttft_p90_ms": 4543.632891000016,
"gpu_util_mean": 40.864583333333336,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 125,
"decode_tps": 28.516474205589404,
"prefill_tps": 1522.1155917037186,
"ttft_p90_ms": 25507.55575299263,
"gpu_util_mean": 76.203125,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 105,
"decode_tps": 21.655207582846195,
"prefill_tps": 1245.884991177276,
"ttft_p90_ms": 20629.490054008784,
"gpu_util_mean": 47.276041666666664,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {
"lmetric_fallback": 389,
"affinity": 418
},
"gpu_captured": true,
"spread": {
"n_ratio": 2.1551724137931036,
"ttft_p90_ratio": 6.196131468910353,
"gpu_util_ratio": 1.9586345381526105,
"gpu_util_min": 38.90625,
"gpu_util_max": 76.203125
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 448.3382160131283,
"p50": 179.28761898656376,
"p90": 323.1771159917116,
"p99": 5748.067840992007
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 1455.8712874500252,
"p50": 685.6210659898352,
"p90": 1802.9974120145198,
"p99": 32571.255193004617
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 2672.607777120579,
"p50": 1117.918328003725,
"p90": 5214.129884989234,
"p99": 22190.210508997552
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 9472.201524545819,
"p50": 2150.3282230114564,
"p90": 28876.64386598044,
"p99": 48314.48572798399
}
}
}
},
"unified_def": {
"n_total": 807,
"n_ok": 807,
"window_s": 979.5575842857361,
"ttft_ms": {
"n": 807,
"mean": 4037.2454534798544,
"p50": 695.2703970018774,
"p90": 11267.881545994896,
"p99": 46221.317757997895
},
"tpot_ms": {
"n": 806,
"mean": 16.476541787288614,
"p50": 8.307468241425875,
"p90": 21.768670571627954,
"p99": 200.26358073773736
},
"e2e_ms": {
"n": 807,
"mean": 7974.606969135101,
"p50": 2098.1516239990015,
"p90": 24096.24872301356,
"p99": 72334.40188399982
},
"throughput": {
"decode_tps": 235.5890084484137,
"prefill_tps": 8253.263646460364,
"total_tps": 8488.852654908778,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 8084547
},
"apc": 0.6929610772463206,
"per_worker": {
"0": {
"n": 96,
"decode_tps": 39.88024862110074,
"prefill_tps": 791.1724766697671,
"ttft_p90_ms": 5825.010653992649,
"gpu_util_mean": 47.68586387434555,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 55,
"decode_tps": 17.82028977166094,
"prefill_tps": 910.7254277965683,
"ttft_p90_ms": 16298.377383005572,
"gpu_util_mean": 39.2565445026178,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 98,
"decode_tps": 27.174512685142215,
"prefill_tps": 1043.4608606959093,
"ttft_p90_ms": 9739.183520985534,
"gpu_util_mean": 40.83769633507853,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 103,
"decode_tps": 24.518211471470025,
"prefill_tps": 1003.2661844138513,
"ttft_p90_ms": 6705.797864007764,
"gpu_util_mean": 33.50785340314136,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 102,
"decode_tps": 49.593817432818994,
"prefill_tps": 689.9175820202374,
"ttft_p90_ms": 2474.3239340023138,
"gpu_util_mean": 45.246073298429316,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 112,
"decode_tps": 20.50823792523468,
"prefill_tps": 1346.1127974027988,
"ttft_p90_ms": 23553.059853002196,
"gpu_util_mean": 50.109947643979055,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 81,
"decode_tps": 19.51697409799575,
"prefill_tps": 990.0816609032532,
"ttft_p90_ms": 5961.234248999972,
"gpu_util_mean": 38.717277486910994,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 160,
"decode_tps": 36.57671644299036,
"prefill_tps": 1478.5266565579789,
"ttft_p90_ms": 17912.180206010817,
"gpu_util_mean": 85.15183246073299,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {
"lmetric_fallback": 349,
"affinity": 458
},
"gpu_captured": true,
"spread": {
"n_ratio": 2.909090909090909,
"ttft_p90_ratio": 9.518988006919619,
"gpu_util_ratio": 2.5412500000000002,
"gpu_util_min": 33.50785340314136,
"gpu_util_max": 85.15183246073299
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 594.1550390875225,
"p50": 196.222682017833,
"p90": 338.4021449892316,
"p99": 7637.84466200741
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 1386.6929373560054,
"p50": 662.5233909871895,
"p90": 1772.5210430216976,
"p99": 19121.71271801344
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 3761.512416864031,
"p50": 1186.4990000030957,
"p90": 7436.603061010828,
"p99": 37502.096537995385
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 9973.751859232492,
"p50": 2084.2301140073687,
"p90": 34646.72368601896,
"p99": 51783.358982007485
}
}
}
},
"lmetric": {
"n_total": 807,
"n_ok": 807,
"window_s": 1036.9893975257874,
"ttft_ms": {
"n": 807,
"mean": 4942.361280256006,
"p50": 1195.667241991032,
"p90": 15606.655231997138,
"p99": 46217.127193987835
},
"tpot_ms": {
"n": 806,
"mean": 19.707597229545165,
"p50": 9.35281297406689,
"p90": 30.177805961172382,
"p99": 232.18400578116416
},
"e2e_ms": {
"n": 807,
"mean": 9901.839828112516,
"p50": 3177.2723750036675,
"p90": 27819.4430010044,
"p99": 73672.06387300394
},
"throughput": {
"decode_tps": 222.5413302687709,
"prefill_tps": 13134.949144609054,
"total_tps": 13357.490474877826,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 13620803
},
"apc": 0.48270240989877555,
"per_worker": {
"0": {
"n": 121,
"decode_tps": 40.13348651326501,
"prefill_tps": 1973.9218210754154,
"ttft_p90_ms": 23894.41591600189,
"gpu_util_mean": 90.75247524752476,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 128,
"decode_tps": 44.98117349250917,
"prefill_tps": 1626.6328315647543,
"ttft_p90_ms": 5918.853377981577,
"gpu_util_mean": 64.96039603960396,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 109,
"decode_tps": 26.800659742819484,
"prefill_tps": 1578.2861463241723,
"ttft_p90_ms": 13917.768498009536,
"gpu_util_mean": 58.306930693069305,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 99,
"decode_tps": 19.435107097610242,
"prefill_tps": 1683.1715002723502,
"ttft_p90_ms": 16737.5574040052,
"gpu_util_mean": 59.16831683168317,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 116,
"decode_tps": 19.955845305048445,
"prefill_tps": 1884.7752972820501,
"ttft_p90_ms": 11347.276910004439,
"gpu_util_mean": 50.36138613861386,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 61,
"decode_tps": 12.497716978516857,
"prefill_tps": 1726.7611455549827,
"ttft_p90_ms": 31680.082703998778,
"gpu_util_mean": 55.93069306930693,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 88,
"decode_tps": 38.23472071614312,
"prefill_tps": 1208.0914259963265,
"ttft_p90_ms": 9533.787049993407,
"gpu_util_mean": 51.62871287128713,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 85,
"decode_tps": 20.50262042285856,
"prefill_tps": 1453.3089765390037,
"ttft_p90_ms": 14970.007644995349,
"gpu_util_mean": 51.757425742574256,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {},
"gpu_captured": true,
"spread": {
"n_ratio": 2.098360655737705,
"ttft_p90_ratio": 5.352402007768976,
"gpu_util_ratio": 1.8020249680526887,
"gpu_util_min": 50.36138613861386,
"gpu_util_max": 90.75247524752476
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 511.51012201982036,
"p50": 255.4193850082811,
"p90": 471.22472297633067,
"p99": 3532.1444049768616
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 1010.5527848093863,
"p50": 818.2104199950118,
"p90": 1878.1264800054487,
"p99": 4416.228823014535
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 3164.034748000338,
"p50": 2636.801838991232,
"p90": 7400.190736021614,
"p99": 9636.447697004769
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 15215.938255342222,
"p50": 12060.85875100689,
"p90": 36602.47571900254,
"p99": 52271.21993701439
}
}
}
},
"sticky": {
"n_total": 807,
"n_ok": 807,
"window_s": 994.9787130355835,
"ttft_ms": {
"n": 807,
"mean": 4455.946148958436,
"p50": 713.0627470032778,
"p90": 14838.208375993418,
"p99": 43174.81458699331
},
"tpot_ms": {
"n": 806,
"mean": 19.138733289320065,
"p50": 8.24416923684399,
"p90": 23.769559945071954,
"p99": 184.6952650922511
},
"e2e_ms": {
"n": 807,
"mean": 8663.490226920512,
"p50": 2352.715140004875,
"p90": 24966.471978026675,
"p99": 70932.61348700617
},
"throughput": {
"decode_tps": 231.93762537485247,
"prefill_tps": 8105.277926394779,
"total_tps": 8337.215551769632,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 8064579
},
"apc": 0.6937194318219754,
"per_worker": {
"0": {
"n": 156,
"decode_tps": 44.672312500428745,
"prefill_tps": 1949.5271351907093,
"ttft_p90_ms": 20576.009418989997,
"gpu_util_mean": 93.18041237113403,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 114,
"decode_tps": 44.75372127725863,
"prefill_tps": 929.0429914624127,
"ttft_p90_ms": 5498.717762995511,
"gpu_util_mean": 53.08247422680412,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 88,
"decode_tps": 29.785561853462614,
"prefill_tps": 904.2113044360427,
"ttft_p90_ms": 12234.77461998118,
"gpu_util_mean": 49.628865979381445,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 98,
"decode_tps": 29.982550992458386,
"prefill_tps": 1018.2680159145942,
"ttft_p90_ms": 16286.48554199026,
"gpu_util_mean": 44.123711340206185,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 110,
"decode_tps": 37.69427376549181,
"prefill_tps": 949.1017120546454,
"ttft_p90_ms": 6709.773182024946,
"gpu_util_mean": 45.7680412371134,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 99,
"decode_tps": 19.964246209244884,
"prefill_tps": 980.7747514746083,
"ttft_p90_ms": 14065.780322009232,
"gpu_util_mean": 36.324742268041234,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 79,
"decode_tps": 11.21531531660107,
"prefill_tps": 682.5723918534845,
"ttft_p90_ms": 4579.089447972365,
"gpu_util_mean": 22.288659793814432,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 63,
"decode_tps": 13.869643459906332,
"prefill_tps": 691.7796240082818,
"ttft_p90_ms": 18229.593775991816,
"gpu_util_mean": 30.762886597938145,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {},
"gpu_captured": true,
"spread": {
"n_ratio": 2.4761904761904763,
"ttft_p90_ratio": 4.493471825081102,
"gpu_util_ratio": 4.180619796484737,
"gpu_util_min": 22.288659793814432,
"gpu_util_max": 93.18041237113403
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 827.0193562525294,
"p50": 197.0047799986787,
"p90": 507.2060489910655,
"p99": 19187.98109301133
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 2624.659966896439,
"p50": 736.4085000008345,
"p90": 3899.43698499701,
"p99": 33760.123436979484
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 3807.600332329692,
"p50": 1086.1541359918192,
"p90": 9912.624888995197,
"p99": 40516.03257699753
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 9766.785228673292,
"p50": 2521.5582190139685,
"p90": 34039.37866198248,
"p99": 47948.314540000865
}
}
}
}
}


@@ -0,0 +1,803 @@
{
"leastwork": {
"n_total": 807,
"n_ok": 807,
"window_s": 1045.237051486969,
"ttft_ms": {
"n": 807,
"mean": 4029.99802739754,
"p50": 856.9921070011333,
"p90": 11099.306205986068,
"p99": 43400.520397000946
},
"tpot_ms": {
"n": 806,
"mean": 21.48754069144944,
"p50": 8.545071840088175,
"p90": 33.14273249998223,
"p99": 221.47291811146448
},
"e2e_ms": {
"n": 807,
"mean": 9827.03743343642,
"p50": 2474.214674992254,
"p90": 25365.627319028135,
"p99": 93929.44298699149
},
"throughput": {
"decode_tps": 220.78532297692573,
"prefill_tps": 8826.999566150593,
"total_tps": 9047.784889127519,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 9226307
},
"apc": 0.6495987515101673,
"per_worker": {
"0": {
"n": 95,
"decode_tps": 26.681028920976477,
"prefill_tps": 996.0917463831219,
"ttft_p90_ms": 5435.623251018114,
"gpu_util_mean": 40.30392156862745,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 94,
"decode_tps": 53.13244485640239,
"prefill_tps": 951.7659162433571,
"ttft_p90_ms": 10786.369787005242,
"gpu_util_mean": 61.27450980392157,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 89,
"decode_tps": 19.963892372849116,
"prefill_tps": 1100.5666115294052,
"ttft_p90_ms": 4944.386984978337,
"gpu_util_mean": 36.745098039215684,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 117,
"decode_tps": 23.37842881214075,
"prefill_tps": 1155.5378737117585,
"ttft_p90_ms": 6188.670104980702,
"gpu_util_mean": 40.76470588235294,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 108,
"decode_tps": 34.993018997907186,
"prefill_tps": 947.086594941991,
"ttft_p90_ms": 4642.632269999012,
"gpu_util_mean": 40.19607843137255,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 87,
"decode_tps": 10.190989675352892,
"prefill_tps": 1130.268964651918,
"ttft_p90_ms": 9265.34449501196,
"gpu_util_mean": 29.41176470588235,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 130,
"decode_tps": 20.925396749842136,
"prefill_tps": 1646.6073390256745,
"ttft_p90_ms": 38816.55501498608,
"gpu_util_mean": 81.27450980392157,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 87,
"decode_tps": 31.520122591454786,
"prefill_tps": 899.0745196633663,
"ttft_p90_ms": 7304.075189982541,
"gpu_util_mean": 42.26960784313726,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {},
"gpu_captured": true,
"spread": {
"n_ratio": 1.4942528735632183,
"ttft_p90_ratio": 8.3608937252733,
"gpu_util_ratio": 2.7633333333333336,
"gpu_util_min": 29.41176470588235,
"gpu_util_max": 81.27450980392157
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 2123.6411421857115,
"p50": 202.10654201218858,
"p90": 478.5369329911191,
"p99": 39781.0776779952
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 1228.5320352953054,
"p50": 757.9952410014812,
"p90": 1679.417210019892,
"p99": 15248.28791298205
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 2886.291041083525,
"p50": 1301.9832599966321,
"p90": 5246.098520990927,
"p99": 39812.045788014075
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 10579.37216416889,
"p50": 2986.336703004781,
"p90": 34044.874527986394,
"p99": 51031.28803099389
}
}
}
},
"unified_ab": {
"n_total": 807,
"n_ok": 807,
"window_s": 1081.3928244113922,
"ttft_ms": {
"n": 807,
"mean": 4003.8645795703064,
"p50": 745.6592369999271,
"p90": 10783.10890001012,
"p99": 46727.02033401583
},
"tpot_ms": {
"n": 806,
"mean": 18.129335403553664,
"p50": 8.004697278213117,
"p90": 20.508462421730655,
"p99": 188.8185092436804
},
"e2e_ms": {
"n": 807,
"mean": 8531.231209817679,
"p50": 2227.301309001632,
"p90": 22062.78157699853,
"p99": 75419.32771002757
},
"throughput": {
"decode_tps": 213.4034874196719,
"prefill_tps": 8110.4912128244405,
"total_tps": 8323.894700244113,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 8770627
},
"apc": 0.66690479182639,
"per_worker": {
"0": {
"n": 119,
"decode_tps": 36.19868665330458,
"prefill_tps": 1627.2449384503814,
"ttft_p90_ms": 28329.616097005783,
"gpu_util_mean": 85.30805687203791,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 94,
"decode_tps": 30.323855734708484,
"prefill_tps": 977.303507238684,
"ttft_p90_ms": 5139.202910999302,
"gpu_util_mean": 41.165876777251185,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 79,
"decode_tps": 13.535321919642842,
"prefill_tps": 1030.6661694437062,
"ttft_p90_ms": 16363.982771988958,
"gpu_util_mean": 35.014218009478675,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 122,
"decode_tps": 25.46530675842895,
"prefill_tps": 900.9917376974744,
"ttft_p90_ms": 5133.929038012866,
"gpu_util_mean": 35.85781990521327,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 101,
"decode_tps": 39.54437188287811,
"prefill_tps": 731.5889121334044,
"ttft_p90_ms": 8783.158237987664,
"gpu_util_mean": 45.843601895734594,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 109,
"decode_tps": 32.10951581715914,
"prefill_tps": 842.2332564447569,
"ttft_p90_ms": 4199.806818010984,
"gpu_util_mean": 37.32701421800948,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 89,
"decode_tps": 8.69804181021798,
"prefill_tps": 1146.4917946675243,
"ttft_p90_ms": 11112.522551004076,
"gpu_util_mean": 30.95734597156398,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 94,
"decode_tps": 27.528386843331813,
"prefill_tps": 853.9708967485094,
"ttft_p90_ms": 10584.918729990022,
"gpu_util_mean": 47.426540284360186,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {
"lmetric_fallback": 394,
"affinity": 413
},
"gpu_captured": true,
"spread": {
"n_ratio": 1.5443037974683544,
"ttft_p90_ratio": 6.745456951856325,
"gpu_util_ratio": 2.7556644213104713,
"gpu_util_min": 30.95734597156398,
"gpu_util_max": 85.30805687203791
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 473.4928183806124,
"p50": 193.41183800133877,
"p90": 326.81434298865497,
"p99": 5278.063865000149
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 1676.7254021373014,
"p50": 733.51754702162,
"p90": 1942.2162729897536,
"p99": 28329.616097005783
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 2520.2562936371373,
"p50": 1149.4361000077333,
"p90": 5139.202910999302,
"p99": 26739.575799991144
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 11092.085469873233,
"p50": 2945.923403982306,
"p90": 38718.26310700271,
"p99": 51830.85186799872
}
}
}
},
"unified_def": {
"n_total": 807,
"n_ok": 807,
"window_s": 912.5732414722443,
"ttft_ms": {
"n": 807,
"mean": 4275.594404255811,
"p50": 757.1689730102662,
"p90": 12997.265826008515,
"p99": 47988.61391301034
},
"tpot_ms": {
"n": 806,
"mean": 18.24678916181055,
"p50": 8.282395397119393,
"p90": 19.536432251223843,
"p99": 127.01842809143604
},
"e2e_ms": {
"n": 807,
"mean": 8365.99611452138,
"p50": 2119.7480160044506,
"p90": 22818.839199026115,
"p99": 82257.18197401147
},
"throughput": {
"decode_tps": 252.8816203592563,
"prefill_tps": 8854.578057717206,
"total_tps": 9107.459678076462,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 8080451
},
"apc": 0.6931166371592755,
"per_worker": {
"0": {
"n": 86,
"decode_tps": 48.0925782233062,
"prefill_tps": 878.7174152798008,
"ttft_p90_ms": 9217.135582002811,
"gpu_util_mean": 49.02808988764045,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 114,
"decode_tps": 36.67650822853761,
"prefill_tps": 962.1616765613936,
"ttft_p90_ms": 5333.489734999603,
"gpu_util_mean": 43.674157303370784,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 75,
"decode_tps": 21.302399759867672,
"prefill_tps": 1132.1228292133994,
"ttft_p90_ms": 16419.932407996384,
"gpu_util_mean": 44.235955056179776,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 94,
"decode_tps": 39.351362025545676,
"prefill_tps": 918.7383126064411,
"ttft_p90_ms": 14089.503638009774,
"gpu_util_mean": 57.30898876404494,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 103,
"decode_tps": 23.649608622297535,
"prefill_tps": 1200.1765449894706,
"ttft_p90_ms": 12823.167912021745,
"gpu_util_mean": 43.82022471910113,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 104,
"decode_tps": 25.253863419028313,
"prefill_tps": 1067.8562067279715,
"ttft_p90_ms": 18659.113589994377,
"gpu_util_mean": 52.449438202247194,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 114,
"decode_tps": 27.64600018218629,
"prefill_tps": 1278.7061322523903,
"ttft_p90_ms": 7502.1790039900225,
"gpu_util_mean": 42.62359550561798,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 117,
"decode_tps": 30.90929989848701,
"prefill_tps": 1416.0989400863393,
"ttft_p90_ms": 15775.390956026968,
"gpu_util_mean": 79.19101123595506,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {
"lmetric_fallback": 353,
"affinity": 454
},
"gpu_captured": true,
"spread": {
"n_ratio": 1.56,
"ttft_p90_ratio": 3.4984812040696216,
"gpu_util_ratio": 1.8579148543561357,
"gpu_util_min": 42.62359550561798,
"gpu_util_max": 79.19101123595506
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 187.7879224041902,
"p50": 171.855135995429,
"p90": 318.49737899028696,
"p99": 406.7676870035939
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 2014.1915133893583,
"p50": 737.6636109838728,
"p90": 2057.0135610178113,
"p99": 29051.10613000579
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 3298.1720407103326,
"p50": 1321.936836000532,
"p90": 7205.449934001081,
"p99": 36468.73455500463
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 10874.266077009788,
"p50": 2567.718550999416,
"p90": 35562.78318798286,
"p99": 61205.50292698317
}
}
}
},
"lmetric": {
"n_total": 807,
"n_ok": 807,
"window_s": 1077.4112372398376,
"ttft_ms": {
"n": 807,
"mean": 5152.329462028203,
"p50": 1274.1917690145783,
"p90": 16492.156354011968,
"p99": 46248.138958995696
},
"tpot_ms": {
"n": 806,
"mean": 24.0370380963572,
"p50": 10.534365684851883,
"p90": 38.95731354601127,
"p99": 231.34709527787183
},
"e2e_ms": {
"n": 807,
"mean": 10774.921962922257,
"p50": 3460.944951977581,
"p90": 27791.26176200225,
"p99": 99231.47636200883
},
"throughput": {
"decode_tps": 214.19212276939402,
"prefill_tps": 12337.071064946676,
"total_tps": 12551.263187716071,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 13292099
},
"apc": 0.49518609291339905,
"per_worker": {
"0": {
"n": 140,
"decode_tps": 40.62422823074453,
"prefill_tps": 2047.015033600442,
"ttft_p90_ms": 13910.764692991506,
"gpu_util_mean": 92.98571428571428,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 89,
"decode_tps": 23.129515582036454,
"prefill_tps": 1463.9739641483652,
"ttft_p90_ms": 17954.478895000648,
"gpu_util_mean": 60.319047619047616,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 92,
"decode_tps": 18.14256184117442,
"prefill_tps": 1663.4864553589516,
"ttft_p90_ms": 15653.674011002295,
"gpu_util_mean": 55.94761904761905,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 92,
"decode_tps": 43.91637878329201,
"prefill_tps": 1313.127198881315,
"ttft_p90_ms": 13397.495551005704,
"gpu_util_mean": 54.385714285714286,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 134,
"decode_tps": 21.903428500019192,
"prefill_tps": 1412.6889969137978,
"ttft_p90_ms": 21995.840935007436,
"gpu_util_mean": 51.51428571428571,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 88,
"decode_tps": 16.934109622562403,
"prefill_tps": 1474.5493132872787,
"ttft_p90_ms": 11013.645827013534,
"gpu_util_mean": 42.55238095238095,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 108,
"decode_tps": 30.606697665860118,
"prefill_tps": 1651.413070981298,
"ttft_p90_ms": 18613.971301994752,
"gpu_util_mean": 52.67619047619048,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 64,
"decode_tps": 18.935202543704882,
"prefill_tps": 1310.817031775228,
"ttft_p90_ms": 15090.364522009622,
"gpu_util_mean": 41.904761904761905,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {},
"gpu_captured": true,
"spread": {
"n_ratio": 2.1875,
"ttft_p90_ratio": 1.997144386199301,
"gpu_util_ratio": 2.2189772727272725,
"gpu_util_min": 41.904761904761905,
"gpu_util_max": 92.98571428571428
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 1789.4509492392028,
"p50": 262.9711049958132,
"p90": 1572.630943992408,
"p99": 24139.729285001522
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 1697.68317032391,
"p50": 920.4712719947565,
"p90": 2029.556789988419,
"p99": 22497.115491016302
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 3424.9239484553045,
"p50": 2699.253859987948,
"p90": 6799.459913017927,
"p99": 24401.87052200781
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 14137.37210560736,
"p50": 11013.645827013534,
"p90": 35319.35577199329,
"p99": 51099.66781400726
}
}
}
},
"sticky": {
"n_total": 807,
"n_ok": 807,
"window_s": 925.9680500030518,
"ttft_ms": {
"n": 807,
"mean": 4914.133386887214,
"p50": 906.6202020039782,
"p90": 15236.451414995827,
"p99": 46771.370519010816
},
"tpot_ms": {
"n": 806,
"mean": 22.62391467634138,
"p50": 9.728358783056292,
"p90": 30.957536839455965,
"p99": 231.30005976865783
},
"e2e_ms": {
"n": 807,
"mean": 10139.474540275223,
"p50": 2597.957960999338,
"p90": 27973.595037998166,
"p99": 82362.0547579776
},
"throughput": {
"decode_tps": 249.22350182518656,
"prefill_tps": 8743.631057219865,
"total_tps": 8992.854559045052,
"total_output_tokens": 230773,
"total_new_prefill_tokens": 8096323
},
"apc": 0.6925138424965755,
"per_worker": {
"0": {
"n": 136,
"decode_tps": 55.41875877880546,
"prefill_tps": 1434.2957081463262,
"ttft_p90_ms": 11807.82909199479,
"gpu_util_mean": 65.71270718232044,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"1": {
"n": 129,
"decode_tps": 38.47432964872015,
"prefill_tps": 1709.658340797809,
"ttft_p90_ms": 23309.77585897199,
"gpu_util_mean": 86.77900552486187,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"2": {
"n": 108,
"decode_tps": 47.74786775834683,
"prefill_tps": 1144.2727424520835,
"ttft_p90_ms": 30574.395572999492,
"gpu_util_mean": 63.50828729281768,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"3": {
"n": 78,
"decode_tps": 25.291369394357414,
"prefill_tps": 1245.5040970325047,
"ttft_p90_ms": 26070.17321899184,
"gpu_util_mean": 51.87845303867403,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"4": {
"n": 132,
"decode_tps": 40.39124244068328,
"prefill_tps": 941.3586138281199,
"ttft_p90_ms": 5989.128820016049,
"gpu_util_mean": 43.0828729281768,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"5": {
"n": 84,
"decode_tps": 22.365782491017637,
"prefill_tps": 1030.0301398054248,
"ttft_p90_ms": 14970.723142003408,
"gpu_util_mean": 41.43646408839779,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"6": {
"n": 66,
"decode_tps": 9.593203566765315,
"prefill_tps": 669.7250515262986,
"ttft_p90_ms": 13777.997018012684,
"gpu_util_mean": 20.994475138121548,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
},
"7": {
"n": 74,
"decode_tps": 9.940947746490457,
"prefill_tps": 568.7863636312983,
"ttft_p90_ms": 7027.451686997665,
"gpu_util_mean": 18.50828729281768,
"gpu_util_max": 100.0,
"gpu_mem_max_mb": 89575.0
}
},
"decisions": {},
"gpu_captured": true,
"spread": {
"n_ratio": 2.0606060606060606,
"ttft_p90_ratio": 5.104982125416625,
"gpu_util_ratio": 4.68865671641791,
"gpu_util_min": 18.50828729281768,
"gpu_util_max": 86.77900552486187
},
"per_class": {
"WARM<5k": {
"n": 92,
"ttft_ms": {
"n": 92,
"mean": 892.6154872479738,
"p50": 207.1402580186259,
"p90": 375.00955499126576,
"p99": 11232.500832004007
}
},
"MED5-20k": {
"n": 278,
"ttft_ms": {
"n": 278,
"mean": 2908.9762750470118,
"p50": 770.736623002449,
"p90": 4912.921145994915,
"p99": 42022.69450199674
}
},
"HEAVY20-50k": {
"n": 248,
"ttft_ms": {
"n": 248,
"mean": 4250.573046338691,
"p50": 1623.4680919733364,
"p90": 11137.098645005608,
"p99": 39817.45037299697
}
},
"HEAVY+>50k": {
"n": 189,
"ttft_ms": {
"n": 189,
"mean": 10691.78570601113,
"p50": 2671.919913002057,
"p90": 36922.92091701529,
"p99": 53025.03776800586
}
}
}
}
}

Binary file not shown (image, 81 KiB).