Record qwen235b harness convergence test

@@ -45,7 +45,7 @@ The speedup comes from reducing wasted proposal families, not from changing the
 - Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.

 2. Guarded stop after a strong incumbent

-- If the newest trial is the incumbent and improves per-GPU throughput by at least `3x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
+- If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
 - Without that guard, the LLM still proposed weak MBT trials after finding the qwen27b best config.
 - With the guard, it emits `should_stop=true`.
@@ -79,5 +79,5 @@ Result:
 ## Current Risks

 - The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly.
-- Strong-incumbent stopping is deliberately conservative for the qwen27b pattern. Workloads with narrow runtime sweet spots, such as qwen235b thinking prefill-only, may need a weaker stop threshold or a "continue local refinement" exception.
+- Strong-incumbent stopping is intentionally biased toward fewer GPU trials once a large gain is already reached. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
 - Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
docs/qwen235b-thinking-prefill-harness-20260427.md (new file, 58 lines)
@@ -0,0 +1,58 @@
# qwen235b Thinking Prefill Harness Test

## Setup

- Workload: `qwen3-235b-a22b` thinking trace, prefill-only replay with `min_tokens=max_tokens=1`.
- Window: `thinking_w20260327_1000`.
- SLO: 95% pass rate, stepped TTFT `3s/6s/9s`.
- Metric: best-so-far feasible `request_rate_per_gpu`.
- Before-harness source: actual 12-trial run `.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`.
- Harness test source: `.aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427`.

## Result So Far

The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

## Trial Details

| Variant | Iter | Config | Result |
| --- | ---: | --- | --- |
| Before harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.2029 req/s/gpu` |
| Before harness | 2 | `DP=2`, `MBT=4096` | runtime failure |
| Before harness | 3 | `DP=2`, `MBT=8192` | runtime failure |
| Before harness | 4 | `EP=4` | launch failure |
| Before harness | 6 | `TP8/DP1/EP-off`, `MBT=4096` | `0.3575 req/s/gpu` |
| Before harness | 10 | `TP8/DP1/EP-off`, `MBT=3712` | `0.3794 req/s/gpu`, best |
| Harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.1892 req/s/gpu` |
| Harness | 2 | `TP8/DP1/EP-off`, `MBT=8192` | `0.3863 req/s/gpu`, best so far |
| Harness | 3 | `TP8/DP1/EP=2` | launch failure |

The harness baseline was slightly lower than the original baseline (`0.1892` vs `0.2029 req/s/gpu`), but iter 2 still exceeded the original 12-trial best (`0.3863` vs `0.3794 req/s/gpu`).
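As a sanity check, the iter-2 result already clears the `1.8x` strong-incumbent threshold adopted in the follow-up below. This is illustrative arithmetic on the numbers above, not the harness code:

```python
# Would the 1.8x strong-incumbent guard fire on the iter-2 harness result?
baseline_rate = 0.1892   # harness iter-1 baseline, req/s/gpu
incumbent_rate = 0.3863  # harness iter-2 best, req/s/gpu

gain = incumbent_rate / baseline_rate  # ~2.04x
guard_fires = gain >= 1.8

print(f"gain = {gain:.2f}x, guard_fires = {guard_fires}")
```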
## Convergence Judgment

- Before harness reached its best at iter 10.
- Harness reached a better result at iter 2.
- Iterations-to-best improved from `10` to `2`, a `5x` improvement on this run.
- The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to `TP8/DP1`.
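The iterations-to-best numbers can be reproduced mechanically from the best-so-far rows in the table above. A minimal sketch (the helper name is ours, not the harness's):

```python
def iters_to_best(best_so_far: list[float]) -> int:
    """First iteration (1-indexed) at which the best-so-far series reaches its final best."""
    # list.index returns the first occurrence, i.e. the earliest iteration
    # that achieved the run's final best value.
    return best_so_far.index(max(best_so_far)) + 1

# Best-so-far series from the "Result So Far" table.
before = [0.2029] * 5 + [0.3575, 0.3575, 0.3708, 0.3708, 0.3794, 0.3794, 0.3794]
harness = [0.1892, 0.3863, 0.3863, 0.3863]

print(iters_to_best(before), iters_to_best(harness))  # → 10 2
```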
## Follow-Up Optimization

The run also exposed a remaining weakness: after reaching the strong `TP8/DP1` incumbent, the LLM proposed `EP=2`, which failed at launch. To address that, the harness was tightened after this test:

- strong-incumbent stop threshold changed from `3x` to `1.8x` over baseline;
- expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.

With the new guard, the intended behavior after this iter-2 result is `should_stop=true` unless a same-topology runtime harness has strong direct evidence.

## Run Status

- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
- GPUs were freed after stopping the run.
@@ -156,6 +156,25 @@ def _knob_harnesses(
                 "active_now": False,
             }
         )
+    if "expert-parallel-size" in tunable or "enable-expert-parallel" in tunable:
+        harnesses.append(
+            {
+                "knob_family": "expert-parallel",
+                "use_when": [
+                    "Only when history or a capability profile identifies expert communication or MoE dispatch as the active bottleneck.",
+                ],
+                "procedure": [
+                    "Keep expert parallel disabled for pure TTFT/prefill tuning unless there is direct positive evidence for EP on this stack.",
+                    "If EP is tested, change only EP-related knobs and treat launch/runtime failure as hard negative evidence.",
+                ],
+                "guards": [
+                    "Do not introduce EP as the first follow-up after a successful TP increase.",
+                    "Do not use EP to address generic TTFT-prefill bottlenecks; TP and batching harnesses are the relevant families.",
+                    "Do not enable EP after any launch failure involving expert-parallel knobs.",
+                ],
+                "active_now": False,
+            }
+        )
     if "gpu-memory-utilization" in tunable:
         harnesses.append(
             {
@@ -384,10 +403,10 @@ def _strong_incumbent_guard(
         return default
     gain = incumbent_rate / baseline_rate
     latest = recent_diagnostics[-1] if recent_diagnostics else {}
-    if state.best_trial_id == latest.get("trial_id") and gain >= 3.0:
+    if state.best_trial_id == latest.get("trial_id") and gain >= 1.8:
         return {
             "guard_active": True,
-            "reason": "incumbent_exceeds_baseline_by_3x_and_latest_trial_is_best",
+            "reason": "incumbent_exceeds_baseline_by_1_8x_and_latest_trial_is_best",
             "baseline_trial_id": baseline.get("trial_id"),
             "baseline_request_rate_per_gpu": baseline_rate,
             "incumbent_gain_vs_baseline": gain,
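The tightened guard can be exercised standalone. The sketch below is a simplified stand-in, not the real harness code: the `State` dataclass, the dict shapes, and taking the incumbent rate from the latest diagnostics entry are all assumptions; only the `1.8x` threshold logic mirrors the change above.

```python
from dataclasses import dataclass

@dataclass
class State:
    best_trial_id: str  # hypothetical minimal harness state

def strong_incumbent_guard(state: State, baseline: dict, recent_diagnostics: list[dict]) -> dict:
    default = {"guard_active": False}
    baseline_rate = baseline.get("request_rate_per_gpu")
    latest = recent_diagnostics[-1] if recent_diagnostics else {}
    incumbent_rate = latest.get("request_rate_per_gpu")
    if not baseline_rate or not incumbent_rate:
        return default
    gain = incumbent_rate / baseline_rate
    # Guard fires only when the newest trial IS the incumbent and the
    # gain over baseline is at least 1.8x (was 3.0x before this commit).
    if state.best_trial_id == latest.get("trial_id") and gain >= 1.8:
        return {
            "guard_active": True,
            "reason": "incumbent_exceeds_baseline_by_1_8x_and_latest_trial_is_best",
            "incumbent_gain_vs_baseline": gain,
        }
    return default

# On the qwen235b numbers: 0.3863 / 0.1892 ≈ 2.04x, so the guard fires.
out = strong_incumbent_guard(
    State(best_trial_id="t2"),
    {"trial_id": "t1", "request_rate_per_gpu": 0.1892},
    [{"trial_id": "t2", "request_rate_per_gpu": 0.3863}],
)
print(out["guard_active"])
```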
@@ -531,6 +550,7 @@ def _proposal_rules() -> list[str]:
         "Pick at most one primary knob family from knob_harnesses unless the history proves a coupled change is needed.",
         "Use adjacent legal values around the incumbent; avoid broad exploratory jumps.",
         "When strong_incumbent.guard_active is true, do not propose runtime-only tweaks unless the relevant harness guard is positively satisfied by same-topology evidence.",
+        "Do not enable expert parallel for TTFT-prefill bottlenecks unless expert-parallel is the active harness and there is direct positive EP evidence.",
         "If infeasible_progress blocks the last primary knob family, do not continue that family; switch families with direct bottleneck evidence or set should_stop=true.",
         "If a proposed config is likely to reduce request_rate_per_gpu under the active guard, set should_stop=true instead of exploring.",
         "Never repeat an already tested config signature.",
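The Current Risks section notes these rules are prompt-guided and that proposal validation should reject blocked families explicitly. A hypothetical hard check layered on top of the prompt rules could look like the sketch below; the function and field names are assumptions, not the repo's actual API:

```python
def validate_proposal(proposal: dict, guards: dict) -> tuple[bool, str]:
    """Reject proposals that a fired guard should have blocked (hypothetical)."""
    family = proposal.get("knob_family")
    # Runtime-only tweaks are blocked while the strong-incumbent guard is active.
    if guards.get("strong_incumbent", {}).get("guard_active") and proposal.get("runtime_only"):
        return False, "strong incumbent guard active: runtime-only tweak rejected"
    # Expert parallel requires direct positive EP evidence.
    if family == "expert-parallel" and not guards.get("ep_positive_evidence"):
        return False, "expert-parallel blocked without direct positive EP evidence"
    return True, "ok"

# An EP proposal with no EP evidence is rejected even if the LLM emits it.
ok, reason = validate_proposal({"knob_family": "expert-parallel"}, {})
print(ok, reason)
```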