qwen235b Thinking Prefill Harness Test
Setup
- Workload: qwen3-235b-a22b thinking trace, prefill-only replay with min_tokens=max_tokens=1.
- Window: thinking_w20260327_1000.
- SLO: 95% pass rate, stepped TTFT 3s/6s/9s.
- Metric: best-so-far feasible request_rate_per_gpu.
- Before-harness source: actual 12-trial run .aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology.
- Harness test source: .aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427.
Result So Far
The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2.
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
Trial Details
| Variant | Iter | Config | Result |
|---|---|---|---|
| Before harness | 1 | baseline TP4/DP1/EP-off, MBT=8192 | 0.2029 req/s/gpu |
| Before harness | 2 | DP=2, MBT=4096 | runtime failure |
| Before harness | 3 | DP=2, MBT=8192 | runtime failure |
| Before harness | 4 | EP=4 | launch failure |
| Before harness | 6 | TP8/DP1/EP-off, MBT=4096 | 0.3575 req/s/gpu |
| Before harness | 10 | TP8/DP1/EP-off, MBT=3712 | 0.3794 req/s/gpu, best |
| Harness | 1 | baseline TP4/DP1/EP-off, MBT=8192 | 0.1892 req/s/gpu |
| Harness | 2 | TP8/DP1/EP-off, MBT=8192 | 0.3863 req/s/gpu, best so far |
| Harness | 3 | TP8/DP1/EP=2 | launch failure |
The harness baseline was slightly lower than the original baseline (0.1892 vs 0.2029 req/s/gpu), but iter 2 still exceeded the original 12-trial best (0.3863 vs 0.3794 req/s/gpu).
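The best-so-far rows in the first table can be read as a running maximum over feasible trials, with failed trials carrying the previous best forward. The sketch below illustrates that reading on the three harness trials listed above. The function name and the use of `None` for a failed trial are illustrative assumptions, not the harness's actual code.

```python
from typing import Optional


def best_so_far(results: list[Optional[float]]) -> list[Optional[float]]:
    """Running maximum over feasible trials.

    None marks a failed or infeasible trial (hypothetical encoding);
    such a trial leaves the incumbent unchanged.
    """
    best: Optional[float] = None
    series: list[Optional[float]] = []
    for rate in results:
        if rate is not None and (best is None or rate > best):
            best = rate
        series.append(best)
    return series


# Harness trials 1-3 from the Trial Details table (req/s/gpu);
# iter 3 (EP=2) failed at launch.
harness_trials = [0.1892, 0.3863, None]
print(best_so_far(harness_trials))  # [0.1892, 0.3863, 0.3863]
```

The output matches the Iter 1 to Iter 3 cells of the harness row.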
Convergence Judgment
- Before harness reached its best at iter 10.
- Harness reached a better result at iter 2.
- Iterations-to-best improved from 10 to 2, a 5x improvement on this run.
- The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to TP8/DP1.
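The figures in this judgment can be recomputed from the best-so-far table. The following is a minimal sketch, assuming "iterations-to-best" means the first iteration at which a run reaches its final best value. The helper name is hypothetical.

```python
def iterations_to_best(series: list[float]) -> int:
    """1-indexed iteration at which the series first reaches its maximum."""
    return series.index(max(series)) + 1


# Best-so-far values copied from the results table (req/s/gpu).
before = [0.2029] * 5 + [0.3575] * 2 + [0.3708] * 2 + [0.3794] * 3
harness = [0.1892, 0.3863, 0.3863, 0.3863]  # only the iterations that were run

before_iter = iterations_to_best(before)    # 10
harness_iter = iterations_to_best(harness)  # 2
speedup = before_iter / harness_iter        # 5.0

# Relative gain of the harness best over the original 12-trial best.
gain = (max(harness) - max(before)) / max(before)

print(before_iter, harness_iter, speedup, f"{gain:.1%}")  # 10 2 5.0 1.8%
```

On these numbers the harness best is about 1.8% above the original best, so the main gain is in iterations rather than in peak rate.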
Follow-Up Optimization
The run also exposed a remaining weakness: after reaching the strong TP8/DP1 incumbent, the LLM proposed EP=2, which failed at launch. To address that, the harness was tightened after this test:
- strong-incumbent stop threshold changed from 3x to 1.8x over baseline;
- expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.
With the new guard, the intended behavior after this iter-2 result is should_stop=true unless a same-topology runtime harness has strong direct evidence.
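To show why the threshold change matters for this run, the sketch below applies both thresholds to the iter-2 result. It assumes the rule compares the incumbent rate to the same run's baseline rate with `>=`, and it models the "strong direct evidence" exception as a single boolean. The function name, signature, and comparison are assumptions; the actual harness rule may include further conditions.

```python
def should_stop(
    incumbent: float,
    baseline: float,
    threshold: float,
    has_direct_evidence: bool = False,
) -> bool:
    """Hypothetical strong-incumbent stop rule.

    Stops once the incumbent is at least `threshold` times the baseline,
    unless there is strong direct evidence that justifies another trial.
    """
    ratio = incumbent / baseline
    return ratio >= threshold and not has_direct_evidence


baseline, incumbent = 0.1892, 0.3863  # harness iter 1 and iter 2, req/s/gpu
ratio = incumbent / baseline          # roughly 2.04x over baseline

print(f"{ratio:.2f}")                        # 2.04
print(should_stop(incumbent, baseline, 3.0))  # False: old threshold keeps exploring
print(should_stop(incumbent, baseline, 1.8))  # True: new threshold stops at iter 2
```

Under this reading, the iter-2 incumbent sits at about 2.04x the baseline: below the old 3x threshold, which is consistent with the LLM going on to propose EP=2, and above the new 1.8x threshold.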
Run Status
- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
- GPUs were freed after stopping the run.