From 71902b9fc2c6f77d883f15fd2db6ff4ec042c691 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Mon, 27 Apr 2026 18:59:25 +0800 Subject: [PATCH] Record qwen235b harness convergence test --- docs/aituner-harness-summary.md | 4 +- ...n235b-thinking-prefill-harness-20260427.md | 58 +++++++++++++++++++ src/aituner/harness.py | 24 +++++++- 3 files changed, 82 insertions(+), 4 deletions(-) create mode 100644 docs/qwen235b-thinking-prefill-harness-20260427.md diff --git a/docs/aituner-harness-summary.md b/docs/aituner-harness-summary.md index b454c80..1c0e8c1 100644 --- a/docs/aituner-harness-summary.md +++ b/docs/aituner-harness-summary.md @@ -45,7 +45,7 @@ The speedup comes from reducing wasted proposal families, not from changing the - Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`. 2. Guarded stop after a strong incumbent - - If the newest trial is the incumbent and improves per-GPU throughput by at least `3x` over baseline, the harness requires direct evidence before trying runtime-only tweaks. + - If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks. - Without that guard, the LLM still proposed weak MBT trials after finding the qwen27b best config. - With the guard, it emits `should_stop=true`. @@ -79,5 +79,5 @@ Result: ## Current Risks - The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly. -- Strong-incumbent stopping is deliberately conservative for the qwen27b pattern. Workloads with narrow runtime sweet spots, such as qwen235b thinking prefill-only, may need a weaker stop threshold or a "continue local refinement" exception. +- Strong-incumbent stopping is intentionally biased toward fewer GPU trials once a large gain is already reached. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config. - Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows. diff --git a/docs/qwen235b-thinking-prefill-harness-20260427.md b/docs/qwen235b-thinking-prefill-harness-20260427.md new file mode 100644 index 0000000..1902f6d --- /dev/null +++ b/docs/qwen235b-thinking-prefill-harness-20260427.md @@ -0,0 +1,58 @@ +# qwen235b Thinking Prefill Harness Test + +## Setup + +- Workload: `qwen3-235b-a22b` thinking trace, prefill-only replay with `min_tokens=max_tokens=1`. +- Window: `thinking_w20260327_1000`. +- SLO: 95% pass rate, stepped TTFT `3s/6s/9s`. +- Metric: best-so-far feasible `request_rate_per_gpu`. +- Before-harness source: actual 12-trial run + `.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`. +- Harness test source: + `.aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427`. + +## Result So Far + +The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2. 
+
+| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
+| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
+
+## Trial Details
+
+| Variant | Iter | Config | Result |
+| --- | ---: | --- | --- |
+| Before harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.2029 req/s/gpu` |
+| Before harness | 2 | `DP=2`, `MBT=4096` | runtime failure |
+| Before harness | 3 | `DP=2`, `MBT=8192` | runtime failure |
+| Before harness | 4 | `EP=4` | launch failure |
+| Before harness | 6 | `TP8/DP1/EP-off`, `MBT=4096` | `0.3575 req/s/gpu` |
+| Before harness | 10 | `TP8/DP1/EP-off`, `MBT=3712` | `0.3794 req/s/gpu`, best |
+| Harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.1892 req/s/gpu` |
+| Harness | 2 | `TP8/DP1/EP-off`, `MBT=8192` | `0.3863 req/s/gpu`, best so far |
+| Harness | 3 | `TP8/DP1/EP=2` | launch failure |
+
+The harness baseline was slightly lower than the original baseline (`0.1892` vs `0.2029 req/s/gpu`), but iter 2 still exceeded the original 12-trial best (`0.3863` vs `0.3794 req/s/gpu`).
+
+## Convergence Judgment
+
+- Before harness reached its best at iter 10.
+- Harness reached a better result at iter 2.
+- Iterations-to-best improved from `10` to `2`, a `5x` improvement on this run.
+- The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to `TP8/DP1`.
+
+## Follow-Up Optimization
+
+The run also exposed a remaining weakness: after reaching the strong `TP8/DP1` incumbent, the LLM proposed `EP=2`, which failed at launch. To address that, the harness was tightened after this test:
+
+- strong-incumbent stop threshold changed from `3x` to `1.8x` over baseline;
+- expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.
+
+With the new guard, the intended behavior after this iter-2 result is `should_stop=true` unless a same-topology runtime harness has strong direct evidence.
+
+## Run Status
+
+- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
+- GPUs were freed after stopping the run.
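+
+## Appendix: Guard Arithmetic Sketch
+
+The snippet below is a reading aid only, not harness code; it simply re-derives the iter-2 stop decision from the numbers above under the tightened `1.8x` threshold. Variable names are illustrative.
+
+```python
+# Illustrative only: recompute the strong-incumbent gain for the 2026-04-27 run.
+baseline_rate = 0.1892   # iter 1 baseline, req/s/gpu
+incumbent_rate = 0.3863  # iter 2 TP8/DP1 incumbent, req/s/gpu
+stop_gain = 1.8          # tightened strong-incumbent threshold
+
+gain = incumbent_rate / baseline_rate  # ~2.04x
+guard_active = gain >= stop_gain       # True -> favor should_stop over runtime-only tweaks
+print(f"gain={gain:.2f}x guard_active={guard_active}")
+```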
diff --git a/src/aituner/harness.py b/src/aituner/harness.py index 5dbe239..8df9c12 100644 --- a/src/aituner/harness.py +++ b/src/aituner/harness.py @@ -156,6 +156,25 @@ def _knob_harnesses( "active_now": False, } ) + if "expert-parallel-size" in tunable or "enable-expert-parallel" in tunable: + harnesses.append( + { + "knob_family": "expert-parallel", + "use_when": [ + "Only when history or a capability profile identifies expert communication or MoE dispatch as the active bottleneck.", + ], + "procedure": [ + "Keep expert parallel disabled for pure TTFT/prefill tuning unless there is direct positive evidence for EP on this stack.", + "If EP is tested, change only EP-related knobs and treat launch/runtime failure as hard negative evidence.", + ], + "guards": [ + "Do not introduce EP as the first follow-up after a successful TP increase.", + "Do not use EP to address generic TTFT-prefill bottlenecks; TP and batching harnesses are the relevant families.", + "Do not enable EP after any launch failure involving expert-parallel knobs.", + ], + "active_now": False, + } + ) if "gpu-memory-utilization" in tunable: harnesses.append( { @@ -384,10 +403,10 @@ def _strong_incumbent_guard( return default gain = incumbent_rate / baseline_rate latest = recent_diagnostics[-1] if recent_diagnostics else {} - if state.best_trial_id == latest.get("trial_id") and gain >= 3.0: + if state.best_trial_id == latest.get("trial_id") and gain >= 1.8: return { "guard_active": True, - "reason": "incumbent_exceeds_baseline_by_3x_and_latest_trial_is_best", + "reason": "incumbent_exceeds_baseline_by_1_8x_and_latest_trial_is_best", "baseline_trial_id": baseline.get("trial_id"), "baseline_request_rate_per_gpu": baseline_rate, "incumbent_gain_vs_baseline": gain, @@ -531,6 +550,7 @@ def _proposal_rules() -> list[str]: "Pick at most one primary knob family from knob_harnesses unless the history proves a coupled change is needed.", "Use adjacent legal values around the incumbent; avoid broad exploratory jumps.", "When strong_incumbent.guard_active is true, do not propose runtime-only tweaks unless the relevant harness guard is positively satisfied by same-topology evidence.", + "Do not enable expert parallel for TTFT-prefill bottlenecks unless expert-parallel is the active harness and there is direct positive EP evidence.", "If infeasible_progress blocks the last primary knob family, do not continue that family; switch families with direct bottleneck evidence or set should_stop=true.", "If a proposed config is likely to reduce request_rate_per_gpu under the active guard, set should_stop=true instead of exploring.", "Never repeat an already tested config signature.",
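
The summary doc's risk list notes that proposal validation should reject a blocked knob family explicitly if a future LLM output ignores a fired guard. Below is a minimal sketch of such a check, outside the patch; the function name, proposal shape, and guard payload are assumptions for illustration, not the existing aituner API.

```python
# Minimal sketch (not in the patch): explicitly reject proposals from a blocked
# knob family while the strong-incumbent guard is active, instead of relying on
# the prompt-guided rules alone. Names and dict shapes are assumptions.
from typing import Any

BLOCKED_FAMILIES_UNDER_GUARD = {"expert-parallel"}


def validate_proposal(
    proposal: dict[str, Any], strong_incumbent: dict[str, Any]
) -> tuple[bool, str]:
    """Return (accepted, reason); assumes the proposal names its knob family."""
    family = proposal.get("knob_family")
    guard_on = bool(strong_incumbent.get("guard_active"))
    if guard_on and family in BLOCKED_FAMILIES_UNDER_GUARD:
        if not proposal.get("direct_positive_evidence", False):
            return False, f"{family} is blocked while the strong-incumbent guard is active"
    return True, "ok"


# The iter-3 EP=2 proposal from the qwen235b run would be rejected here.
accepted, reason = validate_proposal(
    {"knob_family": "expert-parallel", "config": {"expert-parallel-size": 2}},
    {"guard_active": True, "incumbent_gain_vs_baseline": 0.3863 / 0.1892},
)
print(accepted, reason)  # -> False expert-parallel is blocked while the strong-incumbent guard is active
```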