qwen235b Thinking Prefill Harness Test

Setup

  • Workload: qwen3-235b-a22b thinking trace, prefill-only replay with min_tokens=max_tokens=1.
  • Window: thinking_w20260327_1000.
  • SLO: 95% pass rate, stepped TTFT 3s/6s/9s.
  • Metric: best-so-far feasible request_rate_per_gpu.
  • Before-harness source: actual 12-trial run .aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology.
  • Harness test source: .aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427.
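The feasibility rule implied by the SLO can be sketched as follows. This is a minimal illustration, not the tuner's actual code: the function names are hypothetical, and because this report does not say how the 3s/6s/9s steps are assigned to requests, the sketch assumes each request already carries its own TTFT limit.

```python
from typing import Iterable, Tuple

def slo_pass_rate(requests: Iterable[Tuple[float, float]]) -> float:
    """Fraction of requests whose TTFT is within that request's own limit.

    Each item is (ttft_seconds, limit_seconds). How a request is mapped to
    its 3s/6s/9s step is NOT specified in the report, so the caller supplies it.
    """
    pairs = list(requests)
    if not pairs:
        raise ValueError("no requests to evaluate")
    passed = sum(1 for ttft_s, limit_s in pairs if ttft_s <= limit_s)
    return passed / len(pairs)

def is_feasible(requests: Iterable[Tuple[float, float]],
                required_pass_rate: float = 0.95) -> bool:
    """A request rate counts as feasible when the pass rate meets the SLO."""
    return slo_pass_rate(requests) >= required_pass_rate

# Made-up sample, for illustration only: 24 requests pass, 1 misses its limit.
sample = [(2.1, 3.0)] * 24 + [(7.5, 6.0)]
print(slo_pass_rate(sample), is_feasible(sample))
```

Under this reading, a trial's reported request_rate_per_gpu is one at which `is_feasible` holds.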

Result So Far

The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

Trial Details

| Variant | Iter | Config | Result |
|---|---|---|---|
| Before harness | 1 | baseline TP4/DP1/EP-off, MBT=8192 | 0.2029 req/s/gpu |
| Before harness | 2 | DP=2, MBT=4096 | runtime failure |
| Before harness | 3 | DP=2, MBT=8192 | runtime failure |
| Before harness | 4 | EP=4 | launch failure |
| Before harness | 6 | TP8/DP1/EP-off, MBT=4096 | 0.3575 req/s/gpu |
| Before harness | 10 | TP8/DP1/EP-off, MBT=3712 | 0.3794 req/s/gpu, best |
| Harness | 1 | baseline TP4/DP1/EP-off, MBT=8192 | 0.1892 req/s/gpu |
| Harness | 2 | TP8/DP1/EP-off, MBT=8192 | 0.3863 req/s/gpu, best so far |
| Harness | 3 | TP8/DP1/EP=2 | launch failure |

The harness baseline was slightly lower than the original baseline (0.1892 vs 0.2029 req/s/gpu, about 7% lower), but iter 2 still exceeded the original 12-trial best (0.3863 vs 0.3794 req/s/gpu, about 2% higher).
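The best-so-far metric is a running maximum over feasible trial results. Here is a minimal sketch, assuming failed trials (runtime or launch failures) are recorded as `None` and never lower the incumbent. The helper name is hypothetical; the inputs are the per-trial values listed under Trial Details.

```python
from itertools import accumulate
from typing import List, Optional, Sequence

def best_so_far(results: Sequence[Optional[float]]) -> List[float]:
    """Running maximum of feasible rates; a failed trial (None) counts as 0.0."""
    rates = (r if r is not None else 0.0 for r in results)
    return list(accumulate(rates, max))

# Harness trials 1-3: baseline, TP8/DP1, then the EP=2 launch failure.
harness_trials = [0.1892, 0.3863, None]
print(best_so_far(harness_trials))   # [0.1892, 0.3863, 0.3863]

# Before-harness trials 1-4: baseline, two DP=2 failures, one EP=4 failure.
before_trials = [0.2029, None, None, None]
print(best_so_far(before_trials))    # [0.2029, 0.2029, 0.2029, 0.2029]
```

This is why the before-harness row stays at 0.2029 through the failed DP=2 and EP=4 trials.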

Convergence Judgment

  • Before harness reached its best at iter 10.
  • Harness reached a better result at iter 2.
  • Iterations-to-best improved from 10 to 2, a 5x improvement on this run.
  • The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to TP8/DP1.
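The iterations-to-best comparison above can be recomputed from the best-so-far rows of the results table. This is a small sketch with hypothetical helper names; the numbers are copied from the table.

```python
from typing import Optional, Sequence

def iterations_to_best(series: Sequence[float]) -> int:
    """1-based index of the first iteration that reaches the series' final best."""
    return list(series).index(max(series)) + 1

def first_iteration_exceeding(series: Sequence[float], target: float) -> Optional[int]:
    """1-based index of the first iteration strictly above target, else None."""
    for i, value in enumerate(series, start=1):
        if value > target:
            return i
    return None

# Best-so-far rows from the results table (n/a entries omitted).
before = [0.2029, 0.2029, 0.2029, 0.2029, 0.2029, 0.3575,
          0.3575, 0.3708, 0.3708, 0.3794, 0.3794, 0.3794]
harness = [0.1892, 0.3863, 0.3863, 0.3863]

before_iters = iterations_to_best(before)     # 10
harness_iters = iterations_to_best(harness)   # 2
print(before_iters, harness_iters, before_iters / harness_iters)  # 10 2 5.0

# The harness first beats the original 12-trial best at this iteration:
print(first_iteration_exceeding(harness, max(before)))  # 2
```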

Follow-Up Optimization

The run also exposed a remaining weakness: after reaching the strong TP8/DP1 incumbent, the LLM proposed EP=2, which failed at launch. To address that, the harness was tightened after this test:

  • strong-incumbent stop threshold changed from 3x to 1.8x over baseline;
  • expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.
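The two tightened rules can be illustrated with a short sketch. It is not the harness implementation: the function names and the `"ttft_prefill"` label are hypothetical, and "1.8x over baseline" is read here as incumbent rate divided by baseline rate being at least 1.8.

```python
def meets_strong_incumbent_threshold(incumbent_rate: float,
                                     baseline_rate: float,
                                     threshold: float = 1.8) -> bool:
    """True when the incumbent is at least `threshold` times the baseline.

    Assumption: the threshold is a plain ratio incumbent / baseline.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return incumbent_rate / baseline_rate >= threshold

def expert_parallel_allowed(bottleneck: str, has_positive_ep_evidence: bool) -> bool:
    """Guard: for a TTFT-prefill bottleneck, EP needs direct positive evidence."""
    if bottleneck == "ttft_prefill":   # hypothetical label for this workload
        return has_positive_ep_evidence
    return True

baseline, incumbent = 0.1892, 0.3863   # harness iter 1 and iter 2
print(round(incumbent / baseline, 2))                                # 2.04
print(meets_strong_incumbent_threshold(incumbent, baseline, 3.0))    # False (old)
print(meets_strong_incumbent_threshold(incumbent, baseline, 1.8))    # True (new)
print(expert_parallel_allowed("ttft_prefill", False))                # False
```

Under this reading, the iter-2 incumbent clears the new 1.8x threshold but not the old 3x one, and the iter-3 EP=2 proposal would have been blocked by the guard.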

With the new guard, the intended behavior after this iter-2 result (about 2.04x the harness baseline, above the new 1.8x threshold but below the old 3x) is should_stop=true unless a same-topology runtime harness has strong direct evidence.

Run Status

  • The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
  • GPUs were freed after stopping the run.