Record qwen235b harness convergence test

2026-04-27 18:59:25 +08:00
parent bc884f6701
commit 71902b9fc2
3 changed files with 82 additions and 4 deletions

View File

@@ -45,7 +45,7 @@ The speedup comes from reducing wasted proposal families, not from changing the
- Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.
2. Guarded stop after a strong incumbent
- If the newest trial is the incumbent and improves per-GPU throughput by at least `3x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
- If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
- Without that guard, the LLM still proposed weak MBT trials after finding the qwen27b best config.
- With the guard, it emits `should_stop=true`.
@@ -79,5 +79,5 @@ Result:
## Current Risks
- The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly (a sketch of that rejection follows this list).
- Strong-incumbent stopping is deliberately conservative for the qwen27b pattern. Workloads with narrow runtime sweet spots, such as qwen235b thinking prefill-only, may need a weaker stop threshold or a "continue local refinement" exception.
- Strong-incumbent stopping is intentionally biased toward fewer GPU trials once a large gain is already reached. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
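A minimal sketch of the explicit rejection mentioned in the first risk above, assuming the validator can see the proposal's knob family and the set of guard-blocked families (the function and field names are illustrative, not the harness API):

```python
# Hypothetical validation-side check: reject a proposal outright when it touches
# a knob family that an active guard has blocked, instead of relying on the prompt.
def reject_blocked_family(proposal: dict, blocked_families: set[str]) -> dict | None:
    """Return a rejection record if the proposal uses a blocked family, else None."""
    family = proposal.get("knob_family")
    if family in blocked_families:
        return {
            "accepted": False,
            "reason": f"knob family '{family}' is blocked by an active guard",
        }
    return None
```

The point of such a check is that a fired guard becomes a hard gate on the validation path rather than a prompt rule the LLM is merely asked to respect.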

View File

@@ -0,0 +1,58 @@
# qwen235b Thinking Prefill Harness Test
## Setup
- Workload: `qwen3-235b-a22b` thinking trace, prefill-only replay with `min_tokens=max_tokens=1`.
- Window: `thinking_w20260327_1000`.
- SLO: 95% pass rate, stepped TTFT `3s/6s/9s`.
- Metric: best-so-far feasible `request_rate_per_gpu`.
- Before-harness source: actual 12-trial run
`.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`.
- Harness test source:
`.aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427`.
## Result So Far
The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2.
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
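The best-so-far rows above are a running maximum over feasible trials only; a minimal sketch of that bookkeeping, assuming each iteration is recorded as a `(feasible, request_rate_per_gpu)` pair (field names are illustrative):

```python
# Best-so-far feasible request_rate_per_gpu per iteration.
# A trial counts only if it met the SLO (95% pass rate under the stepped TTFT limits);
# launch/runtime failures contribute no rate and carry the previous best forward.
def best_so_far(trials: list[tuple[bool, float | None]]) -> list[float | None]:
    best: float | None = None
    curve = []
    for feasible, rate in trials:
        if feasible and rate is not None and (best is None or rate > best):
            best = rate
        curve.append(best)
    return curve

# Harness run from the table above: baseline, TP8 incumbent, then the EP launch failure.
print(best_so_far([(True, 0.1892), (True, 0.3863), (False, None)]))
# -> [0.1892, 0.3863, 0.3863]
```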
## Trial Details
| Variant | Iter | Config | Result |
| --- | ---: | --- | --- |
| Before harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.2029 req/s/gpu` |
| Before harness | 2 | `DP=2`, `MBT=4096` | runtime failure |
| Before harness | 3 | `DP=2`, `MBT=8192` | runtime failure |
| Before harness | 4 | `EP=4` | launch failure |
| Before harness | 6 | `TP8/DP1/EP-off`, `MBT=4096` | `0.3575 req/s/gpu` |
| Before harness | 10 | `TP8/DP1/EP-off`, `MBT=3712` | `0.3794 req/s/gpu`, best |
| Harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.1892 req/s/gpu` |
| Harness | 2 | `TP8/DP1/EP-off`, `MBT=8192` | `0.3863 req/s/gpu`, best so far |
| Harness | 3 | `TP8/DP1/EP=2` | launch failure |
The harness baseline was slightly lower than the original baseline (`0.1892` vs `0.2029 req/s/gpu`), but iter 2 still exceeded the original 12-trial best (`0.3863` vs `0.3794 req/s/gpu`).
## Convergence Judgment
- Before harness reached its best at iter 10.
- Harness reached a better result at iter 2.
- Iterations-to-best improved from `10` to `2`, a `5x` improvement on this run.
- The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to `TP8/DP1`.
## Follow-Up Optimization
The run also exposed a remaining weakness: after reaching the strong `TP8/DP1` incumbent, the LLM proposed `EP=2`, which failed at launch. To address that, the harness was tightened after this test:
- strong-incumbent stop threshold changed from `3x` to `1.8x` over baseline;
- expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.
With the new guard, the intended behavior after this iter-2 result is `should_stop=true` unless a same-topology runtime harness has strong direct evidence.
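A worked check of the tightened threshold against this run's numbers (values from the tables above; the gain is the simple per-GPU rate ratio, as computed in `_strong_incumbent_guard` in the diff below):

```python
# Worked check of the strong-incumbent guard against the 2026-04-27 run.
baseline_rate = 0.1892   # iter-1 baseline TP4/DP1, req/s/gpu
incumbent_rate = 0.3863  # iter-2 TP8/DP1 incumbent, req/s/gpu
gain = incumbent_rate / baseline_rate  # ~2.04x
# Under the old 3x threshold the guard stays silent; under 1.8x it fires,
# so the iter-3 EP proposal would be blocked in favor of should_stop=true.
assert 1.8 <= gain < 3.0
```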
## Run Status
- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
- GPUs were freed after stopping the run.

View File

@@ -156,6 +156,25 @@ def _knob_harnesses(
"active_now": False,
}
)
if "expert-parallel-size" in tunable or "enable-expert-parallel" in tunable:
harnesses.append(
{
"knob_family": "expert-parallel",
"use_when": [
"Only when history or a capability profile identifies expert communication or MoE dispatch as the active bottleneck.",
],
"procedure": [
"Keep expert parallel disabled for pure TTFT/prefill tuning unless there is direct positive evidence for EP on this stack.",
"If EP is tested, change only EP-related knobs and treat launch/runtime failure as hard negative evidence.",
],
"guards": [
"Do not introduce EP as the first follow-up after a successful TP increase.",
"Do not use EP to address generic TTFT-prefill bottlenecks; TP and batching harnesses are the relevant families.",
"Do not enable EP after any launch failure involving expert-parallel knobs.",
],
"active_now": False,
}
)
if "gpu-memory-utilization" in tunable:
harnesses.append(
{
@@ -384,10 +403,10 @@ def _strong_incumbent_guard(
        return default
    gain = incumbent_rate / baseline_rate
    latest = recent_diagnostics[-1] if recent_diagnostics else {}
    if state.best_trial_id == latest.get("trial_id") and gain >= 3.0:
    if state.best_trial_id == latest.get("trial_id") and gain >= 1.8:
        return {
            "guard_active": True,
            "reason": "incumbent_exceeds_baseline_by_3x_and_latest_trial_is_best",
            "reason": "incumbent_exceeds_baseline_by_1_8x_and_latest_trial_is_best",
            "baseline_trial_id": baseline.get("trial_id"),
            "baseline_request_rate_per_gpu": baseline_rate,
            "incumbent_gain_vs_baseline": gain,
@@ -531,6 +550,7 @@ def _proposal_rules() -> list[str]:
"Pick at most one primary knob family from knob_harnesses unless the history proves a coupled change is needed.",
"Use adjacent legal values around the incumbent; avoid broad exploratory jumps.",
"When strong_incumbent.guard_active is true, do not propose runtime-only tweaks unless the relevant harness guard is positively satisfied by same-topology evidence.",
"Do not enable expert parallel for TTFT-prefill bottlenecks unless expert-parallel is the active harness and there is direct positive EP evidence.",
"If infeasible_progress blocks the last primary knob family, do not continue that family; switch families with direct bottleneck evidence or set should_stop=true.",
"If a proposed config is likely to reduce request_rate_per_gpu under the active guard, set should_stop=true instead of exploring.",
"Never repeat an already tested config signature.",