From 71902b9fc2c6f77d883f15fd2db6ff4ec042c691 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Mon, 27 Apr 2026 18:59:25 +0800 Subject: [PATCH] Record qwen235b harness convergence test --- docs/aituner-harness-summary.md | 4 +- ...n235b-thinking-prefill-harness-20260427.md | 58 +++++++++++++++++++ src/aituner/harness.py | 24 +++++++- 3 files changed, 82 insertions(+), 4 deletions(-) create mode 100644 docs/qwen235b-thinking-prefill-harness-20260427.md diff --git a/docs/aituner-harness-summary.md b/docs/aituner-harness-summary.md index b454c80..1c0e8c1 100644 --- a/docs/aituner-harness-summary.md +++ b/docs/aituner-harness-summary.md @@ -45,7 +45,7 @@ The speedup comes from reducing wasted proposal families, not from changing the - Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`. 2. Guarded stop after a strong incumbent - - If the newest trial is the incumbent and improves per-GPU throughput by at least `3x` over baseline, the harness requires direct evidence before trying runtime-only tweaks. + - If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks. - Without that guard, the LLM still proposed weak MBT trials after finding the qwen27b best config. - With the guard, it emits `should_stop=true`. @@ -79,5 +79,5 @@ Result: ## Current Risks - The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly. -- Strong-incumbent stopping is deliberately conservative for the qwen27b pattern. Workloads with narrow runtime sweet spots, such as qwen235b thinking prefill-only, may need a weaker stop threshold or a "continue local refinement" exception. +- Strong-incumbent stopping is intentionally biased toward fewer GPU trials once a large gain is already reached. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config. - Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows. diff --git a/docs/qwen235b-thinking-prefill-harness-20260427.md b/docs/qwen235b-thinking-prefill-harness-20260427.md new file mode 100644 index 0000000..1902f6d --- /dev/null +++ b/docs/qwen235b-thinking-prefill-harness-20260427.md @@ -0,0 +1,58 @@ +# qwen235b Thinking Prefill Harness Test + +## Setup + +- Workload: `qwen3-235b-a22b` thinking trace, prefill-only replay with `min_tokens=max_tokens=1`. +- Window: `thinking_w20260327_1000`. +- SLO: 95% pass rate, stepped TTFT `3s/6s/9s`. +- Metric: best-so-far feasible `request_rate_per_gpu`. +- Before-harness source: actual 12-trial run + `.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`. +- Harness test source: + `.aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427`. + +## Result So Far + +The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2. 
+
+| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
+| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
+
+## Trial Details
+
+| Variant | Iter | Config | Result |
+| --- | ---: | --- | --- |
+| Before harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.2029 req/s/gpu` |
+| Before harness | 2 | `DP=2`, `MBT=4096` | runtime failure |
+| Before harness | 3 | `DP=2`, `MBT=8192` | runtime failure |
+| Before harness | 4 | `EP=4` | launch failure |
+| Before harness | 6 | `TP8/DP1/EP-off`, `MBT=4096` | `0.3575 req/s/gpu` |
+| Before harness | 10 | `TP8/DP1/EP-off`, `MBT=3712` | `0.3794 req/s/gpu`, best |
+| Harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.1892 req/s/gpu` |
+| Harness | 2 | `TP8/DP1/EP-off`, `MBT=8192` | `0.3863 req/s/gpu`, best so far |
+| Harness | 3 | `TP8/DP1/EP=2` | launch failure |
+
+The harness baseline was slightly lower than the original baseline (`0.1892` vs `0.2029 req/s/gpu`), but iter 2 still exceeded the original 12-trial best (`0.3863` vs `0.3794 req/s/gpu`).
+
+## Convergence Judgment
+
+- Before harness reached its best at iter 10.
+- Harness reached a better result at iter 2.
+- Iterations-to-best improved from `10` to `2`, a `5x` improvement on this run.
+- The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to `TP8/DP1`.
+
+## Follow-Up Optimization
+
+The run also exposed a remaining weakness: after reaching the strong `TP8/DP1` incumbent, the LLM proposed `EP=2`, which failed at launch. To address that, the harness was tightened after this test:
+
+- strong-incumbent stop threshold changed from `3x` to `1.8x` over baseline;
+- expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.
+
+With the new guard, the intended behavior after this iter-2 result is `should_stop=true` unless a same-topology runtime harness has strong direct evidence.
+
+## Run Status
+
+- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
+- GPUs were freed after stopping the run.
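+
+## Appendix: Guard Arithmetic Sketch
+
+The snippet below is a reading aid only, not harness code; it simply re-derives the iter-2 stop decision from the numbers above under the tightened `1.8x` threshold. Variable names are illustrative.
+
+```python
+# Illustrative only: recompute the strong-incumbent gain for the 2026-04-27 run.
+baseline_rate = 0.1892   # iter 1 baseline, req/s/gpu
+incumbent_rate = 0.3863  # iter 2 TP8/DP1 incumbent, req/s/gpu
+stop_gain = 1.8          # tightened strong-incumbent threshold
+
+gain = incumbent_rate / baseline_rate  # ~2.04x
+guard_active = gain >= stop_gain       # True -> favor should_stop over runtime-only tweaks
+print(f"gain={gain:.2f}x guard_active={guard_active}")
+```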
diff --git a/src/aituner/harness.py b/src/aituner/harness.py index 5dbe239..8df9c12 100644 --- a/src/aituner/harness.py +++ b/src/aituner/harness.py @@ -156,6 +156,25 @@ def _knob_harnesses( "active_now": False, } ) + if "expert-parallel-size" in tunable or "enable-expert-parallel" in tunable: + harnesses.append( + { + "knob_family": "expert-parallel", + "use_when": [ + "Only when history or a capability profile identifies expert communication or MoE dispatch as the active bottleneck.", + ], + "procedure": [ + "Keep expert parallel disabled for pure TTFT/prefill tuning unless there is direct positive evidence for EP on this stack.", + "If EP is tested, change only EP-related knobs and treat launch/runtime failure as hard negative evidence.", + ], + "guards": [ + "Do not introduce EP as the first follow-up after a successful TP increase.", + "Do not use EP to address generic TTFT-prefill bottlenecks; TP and batching harnesses are the relevant families.", + "Do not enable EP after any launch failure involving expert-parallel knobs.", + ], + "active_now": False, + } + ) if "gpu-memory-utilization" in tunable: harnesses.append( { @@ -384,10 +403,10 @@ def _strong_incumbent_guard( return default gain = incumbent_rate / baseline_rate latest = recent_diagnostics[-1] if recent_diagnostics else {} - if state.best_trial_id == latest.get("trial_id") and gain >= 3.0: + if state.best_trial_id == latest.get("trial_id") and gain >= 1.8: return { "guard_active": True, - "reason": "incumbent_exceeds_baseline_by_3x_and_latest_trial_is_best", + "reason": "incumbent_exceeds_baseline_by_1_8x_and_latest_trial_is_best", "baseline_trial_id": baseline.get("trial_id"), "baseline_request_rate_per_gpu": baseline_rate, "incumbent_gain_vs_baseline": gain, @@ -531,6 +550,7 @@ def _proposal_rules() -> list[str]: "Pick at most one primary knob family from knob_harnesses unless the history proves a coupled change is needed.", "Use adjacent legal values around the incumbent; avoid broad exploratory jumps.", "When strong_incumbent.guard_active is true, do not propose runtime-only tweaks unless the relevant harness guard is positively satisfied by same-topology evidence.", + "Do not enable expert parallel for TTFT-prefill bottlenecks unless expert-parallel is the active harness and there is direct positive EP evidence.", "If infeasible_progress blocks the last primary knob family, do not continue that family; switch families with direct bottleneck evidence or set should_stop=true.", "If a proposed config is likely to reduce request_rate_per_gpu under the active guard, set should_stop=true instead of exploring.", "Never repeat an already tested config signature.",
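
The summary doc's risk list notes that proposal validation should reject a blocked knob family explicitly if a future LLM output ignores a fired guard. Below is a minimal sketch of such a check, outside the patch; the function name, proposal shape, and guard payload are assumptions for illustration, not the existing aituner API.

```python
# Minimal sketch (not in the patch): explicitly reject proposals from a blocked
# knob family while the strong-incumbent guard is active, instead of relying on
# the prompt-guided rules alone. Names and dict shapes are assumptions.
from typing import Any

BLOCKED_FAMILIES_UNDER_GUARD = {"expert-parallel"}


def validate_proposal(
    proposal: dict[str, Any], strong_incumbent: dict[str, Any]
) -> tuple[bool, str]:
    """Return (accepted, reason); assumes the proposal names its knob family."""
    family = proposal.get("knob_family")
    guard_on = bool(strong_incumbent.get("guard_active"))
    if guard_on and family in BLOCKED_FAMILIES_UNDER_GUARD:
        if not proposal.get("direct_positive_evidence", False):
            return False, f"{family} is blocked while the strong-incumbent guard is active"
    return True, "ok"


# The iter-3 EP=2 proposal from the qwen235b run would be rejected here.
accepted, reason = validate_proposal(
    {"knob_family": "expert-parallel", "config": {"expert-parallel-size": 2}},
    {"guard_active": True, "incumbent_gain_vs_baseline": 0.3863 / 0.1892},
)
print(accepted, reason)  # -> False expert-parallel is blocked while the strong-incumbent guard is active
```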