qwen27b-chat-0-8k Harness Fig18
Setup
- Workload: `qwen3.5-27b` chat, `0 <= input_length <= 8192`.
- Window: `chat_w20260311_1000`.
- Engine: dash0 internal vLLM, baseline aligned to `run_qwen27b.sh`.
- SLO: 95% pass rate, stepped TTFT `2s/4s/6s`, TPOT `<= 50ms`.
- Search metric: best-so-far feasible `request_rate_per_gpu`.
- Before-harness source: actual 12-trial run `.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`.
- After-harness source: strict harness replay over already measured run9 configs:
  - Iter 1 uses the measured baseline trial.
  - Iter 2 uses the current harness proposal after seeing only iter 1 history. It proposes `TP=2, DP=1`, whose performance is the measured run9 `trial-0004` result for the same config and spec.
  - Iter 3 uses the current harness proposal after seeing only baseline + `TP=2, DP=1`. With the strong-incumbent guard, it returns `should_stop=true`.
The replay is intentionally strict: the LLM prompt does not receive future `best_by_parallel_size` entries or later failed trials.
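For reference, the strict-replay loop reduces to a few lines. A minimal Python sketch, where `propose_next_config` (the harness's LLM call) and the `measured` lookup from config to its recorded run9 result are hypothetical stand-ins, not the real harness API:

```python
# Minimal sketch of the strict replay loop (hypothetical stand-ins for the
# real harness pieces: propose_next_config, measured, trial records).
def strict_replay(baseline_trial, measured, propose_next_config, max_iters=12):
    history = [baseline_trial]          # iter 1: the measured baseline trial
    for _ in range(2, max_iters + 1):
        # The proposer sees only trials that already happened in the replay,
        # never future best_by_parallel_size entries or later failed trials.
        proposal = propose_next_config(history)
        if proposal.should_stop:
            break
        # Score the proposal with the already measured run9 result for the
        # same config and spec; no new GPU trial is launched.
        history.append(measured[proposal.config])
    return history
```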
Fig18-Style Best-So-Far Curve
Unit: feasible `request_rate_per_gpu`. Infeasible trials leave the best-so-far value unchanged.
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Before harness, actual run9 | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| After harness, strict replay | 0.0350 | 0.2025 | 0.2025 (stop) | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
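The curves above are a simple fold over the trial sequence: only a feasible trial with a better per-GPU rate moves the best-so-far value. A minimal sketch, with `Trial` as a hypothetical per-iteration record:

```python
from dataclasses import dataclass

@dataclass
class Trial:
    feasible: bool                 # True iff the trial met the SLO (95% pass rate)
    request_rate_per_gpu: float    # search metric; ignored when infeasible

def best_so_far_curve(trials):
    """Fold trials into the Fig18-style curve: infeasible trials
    leave the best-so-far value unchanged."""
    curve, best = [], 0.0
    for t in trials:
        if t.feasible and t.request_rate_per_gpu > best:
            best = t.request_rate_per_gpu
        curve.append(best)
    return curve

def iterations_to_best(curve):
    """1-indexed iteration at which the curve first reaches its final value."""
    return curve.index(curve[-1]) + 1
```

On the run9 numbers this gives `iterations_to_best` of 4 for the before-harness curve and 2 for the strict replay, matching the convergence judgment below.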
Trial-Level Interpretation
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
|---|---|---|---|---|---|
| Before harness | baseline TP1/DP1, 0.0350 | `DP=2`, 0.0617 | `DP=4`, 0.0392, worse per GPU | `TP=2, DP=1`, 0.2025, best | runtime-only probes, all infeasible |
| After harness | baseline TP1/DP1, 0.0350 | `TP=2, DP=1`, 0.2025, best | `should_stop=true` | no GPU trial | no GPU trial |
Convergence Judgment
- Before harness reaches the final best value at iter 4.
- After harness reaches the same best value at iter 2.
- The speedup is `2x` by iterations-to-best: `4 -> 2`.
- The harness also avoids the post-best weak proposals: before harness spent iters 5-12 on infeasible runtime-only probes; after harness stops at iter 3.
Implementation Changes From This Check
- Added a strong-incumbent convergence guard:
  - if the latest trial is the incumbent,
  - and it improves `request_rate_per_gpu` by at least `3x` over the baseline,
  - then runtime-only probes require direct same-topology evidence; otherwise the LLM should return `should_stop=true`.
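A minimal sketch of this guard as one predicate; the field names (`runtime_only`, `same_topology_evidence`) are illustrative, not the real harness schema:

```python
IMPROVEMENT_THRESHOLD = 3.0  # "at least 3x over the baseline"

def strong_incumbent_should_stop(history, proposal):
    """Strong-incumbent convergence guard (illustrative field names).

    If the latest trial is the incumbent and beats the baseline by at
    least 3x on request_rate_per_gpu, runtime-only probes need direct
    same-topology evidence; without it, the study should stop.
    """
    baseline, latest = history[0], history[-1]
    incumbent = max(history, key=lambda t: t.request_rate_per_gpu)
    strong = (latest is incumbent and
              latest.request_rate_per_gpu >=
              IMPROVEMENT_THRESHOLD * baseline.request_rate_per_gpu)
    if strong and proposal.runtime_only and not proposal.same_topology_evidence:
        return True   # the LLM should return should_stop=true
    return False
```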
- Strengthened the MBT harness guard:
  - do not raise `max-num-batched-tokens` when the incumbent MBT already covers prompt p99, unless same-topology evidence proves prefill fragmentation.
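The MBT guard in the same sketch style, with hypothetical names for the quantities the harness tracks:

```python
def allow_mbt_increase(incumbent_mbt, prompt_p99_tokens, fragmentation_evidence):
    """MBT harness guard (hypothetical names): refuse to raise
    max-num-batched-tokens when the incumbent value already covers the
    prompt p99, unless same-topology evidence proves prefill fragmentation."""
    if incumbent_mbt >= prompt_p99_tokens and not fragmentation_evidence:
        return False
    return True
```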
- Made early-stop engine relaunch opt-in. A real r2 run showed that default relaunch changes warm-state behavior and makes full-chat results incomparable with run9, so the default remains drain-based for comparable production measurements.
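A sketch of the opt-in default, with hypothetical config and engine method names; drain stays the default so full-chat results remain comparable with run9:

```python
from dataclasses import dataclass

@dataclass
class EarlyStopConfig:
    # Opt-in only: relaunching on early stop resets warm state and makes
    # full-chat results incomparable across runs, so drain is the default.
    relaunch_engine: bool = False

def on_early_stop(engine, cfg: EarlyStopConfig):
    if cfg.relaunch_engine:
        engine.relaunch()   # explicit opt-in
    else:
        engine.drain()      # default: comparable production measurements
```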
- Added LLM empty-response retry to avoid crashing `study tune` on a transient empty streamed response.
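A minimal retry sketch, assuming a hypothetical `stream_completion` callable; the point is only that one empty streamed response no longer aborts the study:

```python
import time

def complete_with_retry(stream_completion, prompt, retries=3, backoff_s=2.0):
    """Retry on transient empty streamed responses instead of crashing."""
    for attempt in range(retries):
        text = stream_completion(prompt)
        if text and text.strip():
            return text
        time.sleep(backoff_s * (attempt + 1))   # simple linear backoff
    raise RuntimeError("LLM returned an empty response after retries")
```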
Remote Checks
- Local: `python3 -m compileall -q src tests` passed.
- Local: `PYTHONPATH=src python3 -m unittest tests.test_core_flow` passed, 63 tests.
- dash0: `python3 -m compileall -q src tests` passed.
- dash0: `PYTHONPATH=src python3 -m unittest discover -s tests -p "test_*.py"` passed, 63 tests.