qwen235b Thinking Prefill Harness Test

Setup

Workload: qwen3-235b-a22b thinking trace, prefill-only replay with min_tokens=max_tokens=1.
Window: thinking_w20260327_1000.
SLO: 95% pass rate, stepped TTFT 3s/6s/9s.
Metric: best-so-far feasible request_rate_per_gpu.
Before-harness source: actual 12-trial run .aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology.
Harness test source: .aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427.

Result So Far

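The harness run was stopped once it had established the convergence result at iter 2 and the next proposal (iter 3, `EP=2`) had failed at launch. The useful comparison is already visible by iter 2. All values in the table are best-so-far feasible request_rate_per_gpu, in req/s/gpu.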

Variant	Iter 1	Iter 2	Iter 3	Iter 4	Iter 5	Iter 6	Iter 7	Iter 8	Iter 9	Iter 10	Iter 11	Iter 12
Before harness, actual run1	0.2029	0.2029	0.2029	0.2029	0.2029	0.3575	0.3575	0.3708	0.3708	0.3794	0.3794	0.3794
Harness, actual 2026-04-27 run	0.1892	0.3863	0.3863	0.3863	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a
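
As a minimal sketch of how a best-so-far row can be derived from per-trial results, the snippet below takes a running maximum and treats a failed trial as `None`. The function name `best_so_far` and the use of `None` for failures are illustrative assumptions, not part of the tuner; the input values are the three harness trials listed under Trial Details.

```python
from typing import List, Optional, Sequence


def best_so_far(trial_rates: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Running maximum of feasible rates; None marks a failed or infeasible trial."""
    best: Optional[float] = None
    curve: List[Optional[float]] = []
    for rate in trial_rates:
        # A failed trial never lowers the incumbent; it just carries it forward.
        if rate is not None and (best is None or rate > best):
            best = rate
        curve.append(best)
    return curve


# Harness trials 1-3: two measured rates, then a launch failure (hypothetical encoding).
harness_trials = [0.1892, 0.3863, None]
print(best_so_far(harness_trials))  # [0.1892, 0.3863, 0.3863]
```

A failed trial leaves the curve flat, which is why the before-harness row stays at 0.2029 through the DP2 and EP4 failures.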

Trial Details

Variant	Iter	Config	Result
Before harness	1	baseline `TP4/DP1/EP-off`, `MBT=8192`	`0.2029 req/s/gpu`
Before harness	2	`DP=2`, `MBT=4096`	runtime failure
Before harness	3	`DP=2`, `MBT=8192`	runtime failure
Before harness	4	`EP=4`	launch failure
Before harness	6	`TP8/DP1/EP-off`, `MBT=4096`	`0.3575 req/s/gpu`
Before harness	10	`TP8/DP1/EP-off`, `MBT=3712`	`0.3794 req/s/gpu`, best
Harness	1	baseline `TP4/DP1/EP-off`, `MBT=8192`	`0.1892 req/s/gpu`
Harness	2	`TP8/DP1/EP-off`, `MBT=8192`	`0.3863 req/s/gpu`, best so far
Harness	3	`TP8/DP1/EP=2`	launch failure

The harness baseline was about 7% lower than the original baseline (0.1892 vs 0.2029 req/s/gpu), but iter 2 still exceeded the original 12-trial best by about 2% (0.3863 vs 0.3794 req/s/gpu).

Convergence Judgment

Before harness reached its best at iter 10.
Harness reached a better result at iter 2.
Iterations-to-best improved from 10 to 2, a 5x improvement on this run.
The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to TP8/DP1.
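
The iterations-to-best figures above can be reproduced from the best-so-far rows of the results table. This is a sketch only: `iterations_to_best` is a hypothetical helper, and the two lists are the table rows copied by hand (the harness row is truncated at its last reported iteration).

```python
def iterations_to_best(curve):
    """1-based index of the first iteration whose best-so-far equals the final best."""
    final_best = curve[-1]
    return curve.index(final_best) + 1


# Best-so-far rows from the results table, in req/s/gpu.
before_curve = [0.2029] * 5 + [0.3575] * 2 + [0.3708] * 2 + [0.3794] * 3
harness_curve = [0.1892, 0.3863, 0.3863, 0.3863]

before_iters = iterations_to_best(before_curve)
harness_iters = iterations_to_best(harness_curve)
speedup = before_iters / harness_iters

print(before_iters, harness_iters, speedup)  # 10 2 5.0
```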

Follow-Up Optimization

The run also exposed a remaining weakness: after reaching the strong TP8/DP1 incumbent, the LLM proposed EP=2, which failed at launch. To address that, the harness was tightened after this test:

the strong-incumbent stop threshold was lowered from 3x to 1.8x over baseline;
expert parallelism (EP) is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.

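With the new guard, the intended behavior after this iter-2 result is should_stop=true unless a same-topology runtime harness has strong direct evidence. The iter-2 incumbent is about 2.04x the harness baseline (0.3863 vs 0.1892 req/s/gpu), which clears the new 1.8x threshold but would not have cleared the old 3x one.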
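
The two rules can be sketched as follows. This is an illustration under stated assumptions, not the harness code: the function names, the argument names, and the choice of `>=` for the threshold comparison are all hypothetical. Only the 1.8x and 3x thresholds and the measured rates come from this report.

```python
STRONG_INCUMBENT_RATIO = 1.8  # tightened from 3.0 after this test


def should_stop(baseline_rate, incumbent_rate,
                threshold=STRONG_INCUMBENT_RATIO,
                strong_direct_evidence=False):
    """Stop once the incumbent is a strong multiple of baseline, unless there is
    strong direct evidence for continuing (hypothetical flag)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    is_strong_incumbent = incumbent_rate / baseline_rate >= threshold
    return is_strong_incumbent and not strong_direct_evidence


def ep_allowed(introduces_ep, ttft_prefill_bottleneck, direct_positive_ep_evidence):
    """Reject proposals that introduce EP for a TTFT-prefill bottleneck
    when no direct positive EP evidence exists."""
    if introduces_ep and ttft_prefill_bottleneck and not direct_positive_ep_evidence:
        return False
    return True


baseline, incumbent = 0.1892, 0.3863  # harness iter 1 and iter 2, req/s/gpu
print(round(incumbent / baseline, 2))                  # 2.04
print(should_stop(baseline, incumbent))                # True  (new 1.8x threshold)
print(should_stop(baseline, incumbent, threshold=3.0)) # False (old 3x threshold)
print(ep_allowed(True, True, False))                   # False (the iter-3 EP=2 proposal)
```

Under these assumptions, the old 3x threshold lets the search continue past iter 2, which matches the observed EP=2 proposal at iter 3, while the 1.8x threshold would have stopped it.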

Run Status

The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
GPUs were freed after stopping the run.
