Record qwen235b harness convergence test

2026-04-27 18:59:25 +08:00
parent bc884f6701
commit 71902b9fc2
3 changed files with 82 additions and 4 deletions
--- a/docs/qwen235b-thinking-prefill-harness-20260427.md
+++ b/docs/qwen235b-thinking-prefill-harness-20260427.md
@@ -0,0 +1,58 @@
+# qwen235b Thinking Prefill Harness Test
+
+## Setup
+
+- Workload: `qwen3-235b-a22b` thinking trace, prefill-only replay with `min_tokens=max_tokens=1`.
+- Window: `thinking_w20260327_1000`.
+- SLO: 95% pass rate, stepped TTFT `3s/6s/9s`.
+- Metric: best-so-far feasible `request_rate_per_gpu`.
+- Before-harness source: actual 12-trial run
+  `.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`.
+- Harness test source:
+  `.aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427`.
+
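+The SLO above can be read as a per-trial feasibility check. The following is a minimal sketch. It assumes a request passes when its TTFT is within the limit for its step, and a trial is feasible when at least 95% of its requests pass. This note does not say what selects the `3s/6s/9s` step. The prompt-length tiers and boundaries below are therefore hypothetical placeholders, and so are the function names.
+
+```python
+from bisect import bisect_left
+
+# HYPOTHETICAL tiering: the note gives the TTFT steps (3s/6s/9s) but not what
+# selects a step. Prompt-length tiers and these boundaries are placeholders.
+TIER_MAX_INPUT_TOKENS = [8192, 32768]  # inclusive upper bounds of tiers 1 and 2
+TTFT_LIMITS_S = [3.0, 6.0, 9.0]        # tier 3 covers every longer prompt
+SLO_PASS_RATE = 0.95
+
+
+def ttft_limit_s(input_tokens: int) -> float:
+    """TTFT limit for one request under the assumed prompt-length tiers."""
+    return TTFT_LIMITS_S[bisect_left(TIER_MAX_INPUT_TOKENS, input_tokens)]
+
+
+def trial_is_feasible(requests: list[tuple[int, float]]) -> bool:
+    """requests: (input_tokens, ttft_seconds) pairs from one replay.
+
+    A trial is feasible when at least SLO_PASS_RATE of its requests
+    meet their tier's TTFT limit.
+    """
+    if not requests:
+        return False
+    passed = sum(1 for tokens, ttft in requests if ttft <= ttft_limit_s(tokens))
+    return passed / len(requests) >= SLO_PASS_RATE
+```
+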
+## Result So Far
+
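+The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2. Values below are best-so-far feasible `request_rate_per_gpu` in req/s/gpu; `n/a` marks iterations that were not run.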
+
+| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
+| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
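+
+Each row above is a running maximum over per-trial results, so a failed trial leaves the incumbent unchanged. Below is a minimal sketch of that reduction. It assumes a failed or infeasible trial is recorded as `None`. `best_so_far` is a hypothetical helper, not a harness function.
+
+```python
+from typing import Optional
+
+
+# HYPOTHETICAL helper for illustration; not part of the harness.
+def best_so_far(trial_rates: list[Optional[float]]) -> list[Optional[float]]:
+    """Running maximum of feasible request_rate_per_gpu.
+
+    trial_rates holds one entry per iteration: the trial's feasible rate,
+    or None when the trial failed or had no feasible rate. The curve
+    stays None until the first feasible trial.
+    """
+    curve: list[Optional[float]] = []
+    best: Optional[float] = None
+    for rate in trial_rates:
+        if rate is not None and (best is None or rate > best):
+            best = rate
+        curve.append(best)
+    return curve
+
+
+# Harness iters 1-3 from the trial details: baseline, TP8/DP1, EP=2 launch failure.
+print(best_so_far([0.1892, 0.3863, None]))  # [0.1892, 0.3863, 0.3863]
+```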
+
+## Trial Details
+
+| Variant | Iter | Config | Result |
+| --- | ---: | --- | --- |
+| Before harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.2029 req/s/gpu` |
+| Before harness | 2 | `DP=2`, `MBT=4096` | runtime failure |
+| Before harness | 3 | `DP=2`, `MBT=8192` | runtime failure |
+| Before harness | 4 | `EP=4` | launch failure |
+| Before harness | 6 | `TP8/DP1/EP-off`, `MBT=4096` | `0.3575 req/s/gpu` |
+| Before harness | 10 | `TP8/DP1/EP-off`, `MBT=3712` | `0.3794 req/s/gpu`, best |
+| Harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.1892 req/s/gpu` |
+| Harness | 2 | `TP8/DP1/EP-off`, `MBT=8192` | `0.3863 req/s/gpu`, best so far |
+| Harness | 3 | `TP8/DP1/EP=2` | launch failure |
+
+The harness baseline was about 7% lower than the original baseline (`0.1892` vs `0.2029 req/s/gpu`). Even so, harness iter 2 exceeded the original 12-trial best by about 2% (`0.3863` vs `0.3794 req/s/gpu`).
+
+## Convergence Judgment
+
+- Before harness reached its best at iter 10.
+- Harness reached a better result at iter 2.
+- Iterations-to-best improved from `10` to `2`, a `5x` improvement on this run.
+- The key behavior change is that the harness skipped the `DP=2` and `EP=4` trials that failed in the original run and moved directly from the baseline to `TP8/DP1`.
+
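+The iterations-to-best figures follow directly from the best-so-far table. The minimal sketch below recomputes them from those values; the helper names are hypothetical. The harness curve ends at iter 4 because the run was stopped, so its figure covers only the observed iterations.
+
+```python
+from typing import Optional
+
+# Best-so-far curves copied from the results table (req/s/gpu).
+BEFORE = [0.2029, 0.2029, 0.2029, 0.2029, 0.2029, 0.3575,
+          0.3575, 0.3708, 0.3708, 0.3794, 0.3794, 0.3794]
+HARNESS = [0.1892, 0.3863, 0.3863, 0.3863]  # iters 5-12 were not run
+
+
+# HYPOTHETICAL helpers for illustration; not part of the harness.
+def iters_to_best(curve: list[float]) -> int:
+    """1-based iteration at which the curve first reaches its maximum."""
+    return curve.index(max(curve)) + 1
+
+
+def iters_to_reach(curve: list[float], target: float) -> Optional[int]:
+    """1-based iteration at which the curve first meets target, else None."""
+    for i, value in enumerate(curve, start=1):
+        if value >= target:
+            return i
+    return None
+
+
+before = iters_to_best(BEFORE)                   # 10
+harness = iters_to_best(HARNESS)                 # 2
+beats = iters_to_reach(HARNESS, max(BEFORE))     # 2, since 0.3863 >= 0.3794
+print(before, harness, beats, before / harness)  # 10 2 2 5.0
+```
+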
+## Follow-Up Optimization
+
+The run also exposed a remaining weakness: after reaching the strong `TP8/DP1` incumbent, the LLM proposed `EP=2`, which failed at launch. To address that, the harness was tightened after this test:
+
+- strong-incumbent stop threshold changed from `3x` to `1.8x` over baseline;
+- expert parallelism (EP) is now explicitly guarded: it should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.
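+
+The EP guard can be sketched as a proposal filter. The `Proposal` type, the `"ttft_prefill"` bottleneck label, and the boolean evidence flag are hypothetical stand-ins for whatever the harness actually records. The rule itself follows the EP bullet: reject a proposal that turns EP on for a TTFT-prefill bottleneck unless there is direct positive EP evidence.
+
+```python
+from dataclasses import dataclass
+
+
+# HYPOTHETICAL types and labels; the harness's real proposal schema is not
+# described in this note.
+@dataclass(frozen=True)
+class Proposal:
+    tp: int
+    dp: int
+    ep: int  # 1 means expert parallelism is off
+
+
+def allow_proposal(proposal: Proposal, incumbent: Proposal,
+                   bottleneck: str, positive_ep_evidence: bool) -> bool:
+    """Block proposals that newly enable EP for a TTFT-prefill bottleneck
+    unless there is direct positive EP evidence; allow everything else."""
+    enables_ep = proposal.ep > 1 and incumbent.ep == 1
+    if enables_ep and bottleneck == "ttft_prefill" and not positive_ep_evidence:
+        return False
+    return True
+
+
+incumbent = Proposal(tp=8, dp=1, ep=1)  # TP8/DP1/EP-off, harness iter 2
+iter3 = Proposal(tp=8, dp=1, ep=2)      # TP8/DP1/EP=2, failed at launch
+print(allow_proposal(iter3, incumbent, "ttft_prefill", positive_ep_evidence=False))  # False
+```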
+
+With the new guard, the intended behavior after this iter-2 result is `should_stop=true` unless a same-topology runtime harness has strong direct evidence.
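+
+The stop rule can be checked against this run's numbers. The minimal sketch below assumes that "over baseline" means the incumbent's rate divided by the same run's iter-1 baseline. It models the same-topology evidence exception as a boolean flag; `should_stop` and its parameters are hypothetical. Iter 2 reached about `2.04x` the harness baseline. That is below the old `3x` threshold but above the new `1.8x` one.
+
+```python
+# HYPOTHETICAL stop rule for illustration; parameter names are not the harness's.
+def should_stop(baseline: float, incumbent: float, threshold: float,
+                strong_direct_evidence: bool = False) -> bool:
+    """Stop once the incumbent reaches threshold x baseline, unless a
+    same-topology runtime harness reports strong direct evidence to continue."""
+    if strong_direct_evidence:
+        return False
+    return incumbent >= threshold * baseline
+
+
+baseline, incumbent = 0.1892, 0.3863  # harness iter 1 and iter 2 (req/s/gpu)
+print(round(incumbent / baseline, 2))                   # 2.04
+print(should_stop(baseline, incumbent, threshold=3.0))  # False (old threshold)
+print(should_stop(baseline, incumbent, threshold=1.8))  # True (new threshold)
+```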
+
+## Run Status
+
+- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
+- GPUs were freed after stopping the run.