# qwen235b Thinking Prefill Harness Test

## Setup

- Workload: `qwen3-235b-a22b` thinking trace, prefill-only replay with `min_tokens=max_tokens=1`.
- Window: `thinking_w20260327_1000`.
- SLO: 95% pass rate, stepped TTFT `3s/6s/9s`.
- Metric: best-so-far feasible `request_rate_per_gpu`.
- Before-harness source: actual 12-trial run
  `.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`.
- Harness test source:
  `.aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427`.

## Result So Far

The harness run was stopped once it had established the convergence result at iter 2 and the next proposal (iter 3, `EP=2`) had failed at launch. The comparison with the original run is already clear at iter 2.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
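The best-so-far metric in the table is a running maximum over feasible trials. The sketch below illustrates that calculation on the first three harness trials listed under Trial Details. The function name `best_so_far` and the use of `None` for a failed trial are assumptions made for illustration, not the tuner's actual code.

```python
from typing import List, Optional


def best_so_far(results: List[Optional[float]]) -> List[Optional[float]]:
    """Running maximum over feasible trials.

    `None` marks a trial with no feasible result (launch or runtime failure);
    such a trial leaves the incumbent unchanged.
    """
    best: Optional[float] = None
    curve: List[Optional[float]] = []
    for rate in results:
        if rate is not None and (best is None or rate > best):
            best = rate
        curve.append(best)
    return curve


# Harness iters 1-3 from Trial Details: iter 3 (`EP=2`) failed at launch.
harness_trials = [0.1892, 0.3863, None]
print(best_so_far(harness_trials))  # [0.1892, 0.3863, 0.3863]
```

A failed trial therefore shows up in the table as a repeat of the previous value rather than as a gap.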

## Trial Details

| Variant | Iter | Config | Result |
| --- | ---: | --- | --- |
| Before harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.2029 req/s/gpu` |
| Before harness | 2 | `DP=2`, `MBT=4096` | runtime failure |
| Before harness | 3 | `DP=2`, `MBT=8192` | runtime failure |
| Before harness | 4 | `EP=4` | launch failure |
| Before harness | 6 | `TP8/DP1/EP-off`, `MBT=4096` | `0.3575 req/s/gpu` |
| Before harness | 10 | `TP8/DP1/EP-off`, `MBT=3712` | `0.3794 req/s/gpu`, best |
| Harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.1892 req/s/gpu` |
| Harness | 2 | `TP8/DP1/EP-off`, `MBT=8192` | `0.3863 req/s/gpu`, best so far |
| Harness | 3 | `TP8/DP1/EP=2` | launch failure |

The harness baseline was about 6.8% lower than the original baseline (`0.1892` vs `0.2029 req/s/gpu`), but iter 2 still exceeded the original 12-trial best by about 1.8% (`0.3863` vs `0.3794 req/s/gpu`).

## Convergence Judgment

- Before harness reached its best at iter 10.
- Harness reached a better result at iter 2.
- Iterations-to-best dropped from `10` to `2`, a `5x` reduction on this single run.
- The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to `TP8/DP1`.
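The iterations-to-best figures can be reproduced from the best-so-far rows of the results table. This is a minimal sketch: `iterations_to_best` is a hypothetical helper, and the two lists are copied from the table (the harness row is truncated at its last recorded value).

```python
from typing import List


def iterations_to_best(curve: List[float]) -> int:
    """1-based iteration at which a best-so-far curve first reaches its final value."""
    return curve.index(curve[-1]) + 1


# Best-so-far rows copied from the results table.
before = [0.2029] * 5 + [0.3575, 0.3575, 0.3708, 0.3708, 0.3794, 0.3794, 0.3794]
harness = [0.1892, 0.3863, 0.3863, 0.3863]

before_iters = iterations_to_best(before)    # 10
harness_iters = iterations_to_best(harness)  # 2
print(before_iters, harness_iters, before_iters / harness_iters)  # 10 2 5.0
```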

## Follow-Up Optimization

The run also exposed a remaining weakness: after reaching the strong `TP8/DP1` incumbent, the LLM proposed `EP=2`, which failed at launch. To address that, the harness was tightened after this test:

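- strong-incumbent stop threshold changed from `3x` to `1.8x` over baseline (the iter-2 incumbent is about `2.04x` the harness baseline, so it clears the new threshold but not the old one);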
- expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.

With the new guard, the intended behavior after this iter-2 result is `should_stop=true`, unless the harness has strong direct evidence supporting a further runtime change on the same topology.
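The tightened rules can be sketched as two small predicates. This is an illustration under stated assumptions, not the harness implementation: the function names, parameter names, and the `"ttft_prefill"` bottleneck label are hypothetical, and only the `3x`/`1.8x` thresholds and the iter-1/iter-2 rates come from this report.

```python
def should_stop(baseline: float, incumbent: float, threshold: float = 1.8,
                same_topology_evidence: bool = False) -> bool:
    """Stop once the incumbent is at least `threshold` times the baseline,
    unless there is strong direct evidence for a same-topology follow-up."""
    if same_topology_evidence:
        return False
    return incumbent >= threshold * baseline


def ep_allowed(bottleneck: str, has_positive_ep_evidence: bool) -> bool:
    """Guard: no expert parallelism for a TTFT-prefill bottleneck
    without direct positive EP evidence."""
    if bottleneck == "ttft_prefill":  # hypothetical label
        return has_positive_ep_evidence
    return True


baseline, incumbent = 0.1892, 0.3863  # harness iter 1 and iter 2
print(round(incumbent / baseline, 2))                    # 2.04
print(should_stop(baseline, incumbent, threshold=3.0))   # False (old threshold)
print(should_stop(baseline, incumbent, threshold=1.8))   # True  (new threshold)
print(ep_allowed("ttft_prefill", has_positive_ep_evidence=False))  # False
```

Under these assumptions, the old `3x` threshold lets the search continue into the failed `EP=2` proposal, while the `1.8x` threshold stops it at iter 2.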

## Run Status

- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
- GPUs were freed after stopping the run.