# qwen235b Thinking Prefill Harness Test

## Setup

- Workload: `qwen3-235b-a22b` thinking trace, prefill-only replay with `min_tokens=max_tokens=1`.
- Window: `thinking_w20260327_1000`.
- SLO: 95% pass rate, stepped TTFT `3s/6s/9s`.
- Metric: best-so-far feasible `request_rate_per_gpu`.
- Before-harness source: actual 12-trial run `.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`.
- Harness test source: `.aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427`.

## Result So Far

The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

## Trial Details

| Variant | Iter | Config | Result |
| --- | ---: | --- | --- |
| Before harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.2029 req/s/gpu` |
| Before harness | 2 | `DP=2`, `MBT=4096` | runtime failure |
| Before harness | 3 | `DP=2`, `MBT=8192` | runtime failure |
| Before harness | 4 | `EP=4` | launch failure |
| Before harness | 6 | `TP8/DP1/EP-off`, `MBT=4096` | `0.3575 req/s/gpu` |
| Before harness | 10 | `TP8/DP1/EP-off`, `MBT=3712` | `0.3794 req/s/gpu`, best |
| Harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.1892 req/s/gpu` |
| Harness | 2 | `TP8/DP1/EP-off`, `MBT=8192` | `0.3863 req/s/gpu`, best so far |
| Harness | 3 | `TP8/DP1/EP=2` | launch failure |

The harness baseline was slightly lower than the original baseline (`0.1892` vs `0.2029 req/s/gpu`), but iter 2 still exceeded the original 12-trial best (`0.3863` vs `0.3794 req/s/gpu`).

## Convergence Judgment

- Before harness reached its best at iter 10.
- Harness reached a better result at iter 2.
- Iterations-to-best improved from `10` to `2`, a `5x` improvement on this run.
- The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to `TP8/DP1`.

## Follow-Up Optimization

The run also exposed a remaining weakness: after reaching the strong `TP8/DP1` incumbent, the LLM proposed `EP=2`, which failed at launch.

To address that, the harness was tightened after this test:

- strong-incumbent stop threshold changed from `3x` to `1.8x` over baseline;
- expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.

With the new guard, the intended behavior after this iter-2 result is `should_stop=true` unless a same-topology runtime harness has strong direct evidence.

## Run Status

- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
- GPUs were freed after stopping the run.
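
The iterations-to-best figures and the effect of the tightened stop threshold can be checked with a short calculation. The Python sketch below is illustrative only and is not the harness's code: the function names (`iterations_to_best`, `incumbent_ratio`, `exceeds_stop_threshold`) are hypothetical, and it assumes the stop rule fires when the incumbent's `request_rate_per_gpu` is at least `threshold` times the baseline of the same run. The override for same-topology runtime evidence is not modeled. The numbers are copied from the best-so-far results table.

```python
# Best-so-far feasible request_rate_per_gpu per iteration, from the results table.
BEFORE = [0.2029] * 5 + [0.3575, 0.3575, 0.3708, 0.3708, 0.3794, 0.3794, 0.3794]
HARNESS = [0.1892, 0.3863, 0.3863, 0.3863]


def iterations_to_best(best_so_far):
    """Return the 1-based iteration at which the final best value first appears."""
    return best_so_far.index(max(best_so_far)) + 1


def incumbent_ratio(best_so_far):
    """Ratio of the best value to the iter-1 baseline of the same run."""
    return max(best_so_far) / best_so_far[0]


def exceeds_stop_threshold(best_so_far, threshold):
    # Assumption: the strong-incumbent rule fires when incumbent / baseline
    # is at least `threshold`. The real harness may define it differently.
    return incumbent_ratio(best_so_far) >= threshold


if __name__ == "__main__":
    before_iters = iterations_to_best(BEFORE)
    harness_iters = iterations_to_best(HARNESS)
    print(f"iterations-to-best: before={before_iters}, harness={harness_iters}")
    print(f"speedup in iterations: {before_iters / harness_iters:.1f}x")
    print(f"harness incumbent / baseline: {incumbent_ratio(HARNESS):.2f}x")
    for threshold in (3.0, 1.8):
        fired = exceeds_stop_threshold(HARNESS, threshold)
        print(f"threshold {threshold}x -> stop rule fires: {fired}")
```

Under this reading, the harness incumbent is about `2.04x` its own baseline, which falls below the old `3x` threshold but above the new `1.8x` one. That is consistent with the run continuing to the `EP=2` proposal under the old setting and with the intended `should_stop=true` under the new one.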