qwen27b-chat-0-8k Harness Fig18

Setup

  • Workload: qwen3.5-27b chat, 0 <= input_length <= 8192.
  • Window: chat_w20260311_1000.
  • Engine: dash0 internal vLLM, baseline aligned to run_qwen27b.sh.
  • SLO: 95% pass rate, stepped TTFT 2s/4s/6s, TPOT <=50ms (see the feasibility sketch after this list).
  • Search metric: best-so-far feasible request_rate_per_gpu.
  • Before-harness source: actual 12-trial run .aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology.
  • After-harness source: strict harness replay over already measured run9 configs:
    • Iter 1 uses the measured baseline trial.
    • Iter 2 uses the current harness proposal after seeing only iter 1 history. It proposes TP=2, DP=1; its performance is taken from the measured run9 trial-0004 result for the same config and spec.
    • Iter 3 uses the current harness proposal after seeing only baseline + TP=2, DP=1. With the strong-incumbent guard, it returns should_stop=true.
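
A feasibility check in the spirit of this SLO might look like the sketch below. The 2048/4096 bucket boundaries are assumptions for illustration; the spec above only fixes the 2s/4s/6s TTFT steps, the TPOT <=50ms bound, and the 95% pass rate.

```python
# Illustrative feasibility check for the stepped SLO. The input-length
# bucket boundaries (2048/4096) are assumptions for this sketch.
def ttft_limit_s(input_len: int) -> float:
    if input_len <= 2048:
        return 2.0
    if input_len <= 4096:
        return 4.0
    return 6.0

def trial_is_feasible(requests: list[dict]) -> bool:
    # requests: [{"input_len": int, "ttft_s": float, "tpot_ms": float}, ...]
    passed = sum(
        r["ttft_s"] <= ttft_limit_s(r["input_len"]) and r["tpot_ms"] <= 50.0
        for r in requests
    )
    # A trial is feasible when at least 95% of its requests meet both bounds.
    return passed >= 0.95 * len(requests)
```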

The replay is intentionally strict: the LLM prompt does not receive future best_by_parallel_size entries or later failed trials.
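
A minimal sketch of this history-truncated replay loop; strict_replay, propose, and the measured-rate table are illustrative names, not the harness's actual API.

```python
# Hypothetical replay driver: at each iteration the proposer sees only the
# trials measured so far, never future run9 results.
from typing import Callable

def strict_replay(
    measured: dict[str, float],             # config key -> measured request_rate_per_gpu
    propose: Callable[[list[dict]], dict],  # LLM proposal, given only past history
    baseline_key: str,
    max_iters: int = 12,
) -> list[dict]:
    history = [{"config": baseline_key, "rate": measured[baseline_key]}]
    for _ in range(1, max_iters):
        proposal = propose(list(history))   # no future best_by_parallel_size, no later failures
        if proposal.get("should_stop"):
            break
        key = proposal["config"]
        if key not in measured:
            # Strictness: only configs already measured in run9 can be replayed.
            raise KeyError(f"config {key} was never measured in run9")
        history.append({"config": key, "rate": measured[key]})
    return history
```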

Fig18-Style Best-So-Far Curve

Unit: feasible request_rate_per_gpu. Infeasible trials leave the best-so-far value unchanged.
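
For reference, a small sketch of how the best-so-far curve is computed; infeasible trials contribute None and leave the running best unchanged.

```python
def best_so_far(rates: list[float | None]) -> list[float]:
    # rates[i]: feasible request_rate_per_gpu of trial i, or None if infeasible.
    best, curve = 0.0, []
    for rate in rates:
        if rate is not None:
            best = max(best, rate)
        curve.append(best)
    return curve

# Reproduces the before-harness row: infeasible iters 5-12 keep the curve flat.
assert best_so_far([0.0350, 0.0617, 0.0392, 0.2025] + [None] * 8) == (
    [0.0350, 0.0617, 0.0617] + [0.2025] * 9
)
```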

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Before harness, actual run9 | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |
| After harness, strict replay | 0.0350 | 0.2025 | 0.2025 (stop) | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2025 |

Trial-Level Interpretation

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
| --- | --- | --- | --- | --- | --- |
| Before harness | baseline TP=1, DP=1, 0.0350 | DP=2, 0.0617 | DP=4, 0.0392, worse per GPU | TP=2, DP=1, 0.2025, best | runtime-only probes, all infeasible |
| After harness | baseline TP=1, DP=1, 0.0350 | TP=2, DP=1, 0.2025, best | should_stop=true | no GPU trial | no GPU trial |

Convergence Judgment

  • Before harness reaches the final best value at iter 4.
  • After harness reaches the same best value at iter 2.
  • The speedup is 2x by iterations-to-best: 4 -> 2.
  • The harness also avoids the post-best weak proposals: the before-harness run spent iters 5-12 on infeasible runtime-only probes, while the after-harness replay stops at iter 3.
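
The 4 -> 2 comparison can be checked mechanically. A self-contained sketch (it re-derives the best-so-far fold from the earlier snippet):

```python
def iters_to_best(rates: list[float | None]) -> int:
    # 1-indexed iteration at which best-so-far first reaches its final value;
    # None marks an infeasible trial or a trial-free stop iteration.
    best, curve = 0.0, []
    for rate in rates:
        best = max(best, rate or 0.0)
        curve.append(best)
    return next(i for i, v in enumerate(curve, 1) if v == curve[-1])

before = [0.0350, 0.0617, 0.0392, 0.2025] + [None] * 8  # actual run9
after = [0.0350, 0.2025, None]                          # replay; iter 3 is the stop decision
assert (iters_to_best(before), iters_to_best(after)) == (4, 2)
```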

Implementation Changes From This Check

  • Added a strong-incumbent convergence guard (see the sketch after this list):
    • if the latest trial is the incumbent,
    • and it improves request_rate_per_gpu by at least 3x over the baseline,
    • then runtime-only probes require direct same-topology evidence; otherwise the LLM should return should_stop=true.
  • Strengthened the MBT harness guard:
    • do not raise max-num-batched-tokens when incumbent MBT already covers prompt p99 unless same-topology evidence proves prefill fragmentation.
  • Made early-stop engine relaunch opt-in. A real r2 run showed that default relaunch changes warm-state behavior and makes full-chat results incomparable with run9, so the default remains drain-based for comparable production measurements.
  • Added LLM empty-response retry to avoid crashing study tune on a transient empty streamed response.
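
A hypothetical condensation of the strong-incumbent guard above; the function shape and field names are illustrative, not the harness's actual schema.

```python
def strong_incumbent_should_stop(
    latest_config: str,
    incumbent_config: str,
    incumbent_rate: float,
    baseline_rate: float,
    has_same_topology_evidence: bool,
) -> bool:
    # Guard applies only when the latest trial is itself the incumbent and the
    # incumbent beats the baseline by at least 3x on request_rate_per_gpu.
    strong = latest_config == incumbent_config and incumbent_rate >= 3.0 * baseline_rate
    # Runtime-only probes against a strong incumbent need direct same-topology
    # evidence; without it, the LLM should return should_stop=true.
    return strong and not has_same_topology_evidence

# In the run9 replay, 0.2025 / 0.0350 is roughly 5.8x >= 3x, so iter 3 stops.
assert strong_incumbent_should_stop("tp2_dp1", "tp2_dp1", 0.2025, 0.0350, False)
```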

Local and Remote Checks

  • Local: python3 -m compileall -q src tests passed.
  • Local: PYTHONPATH=src python3 -m unittest tests.test_core_flow passed, 63 tests.
  • dash0: python3 -m compileall -q src tests passed.
  • dash0: PYTHONPATH=src python3 -m unittest discover -s tests -p "test_*.py" passed, 63 tests.