Record qwen235b harness convergence test

@@ -45,7 +45,7 @@ The speedup comes from reducing wasted proposal families, not from changing the
 - Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.

 2. Guarded stop after a strong incumbent

-- If the newest trial is the incumbent and improves per-GPU throughput by at least `3x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
+- If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
 - Without that guard, the LLM still proposed weak MBT trials after finding the qwen27b best config.
 - With the guard, it emits `should_stop=true`.
@@ -79,5 +79,5 @@ Result:
 ## Current Risks

 - The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly.
-- Strong-incumbent stopping is deliberately conservative for the qwen27b pattern. Workloads with narrow runtime sweet spots, such as qwen235b thinking prefill-only, may need a weaker stop threshold or a "continue local refinement" exception.
+- Strong-incumbent stopping is intentionally biased toward fewer GPU trials once a large gain is already reached. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
 - Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
docs/qwen235b-thinking-prefill-harness-20260427.md (new file, 58 lines)
@@ -0,0 +1,58 @@
# qwen235b Thinking Prefill Harness Test

## Setup

- Workload: `qwen3-235b-a22b` thinking trace, prefill-only replay with `min_tokens=max_tokens=1`.
- Window: `thinking_w20260327_1000`.
- SLO: 95% pass rate, stepped TTFT `3s/6s/9s`.
- Metric: best-so-far feasible `request_rate_per_gpu`.
- Before-harness source: actual 12-trial run `.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology`.
- Harness test source: `.aituner/harness-qwen235b-prefill-20260427/dash0-qwen235b-prefill-thinking-harness-run1-20260427`.

## Result So Far

The harness run was stopped after establishing the convergence result and observing the next weak proposal. The useful comparison is already visible by iter 2.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Before harness, actual run1 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.2029 | 0.3575 | 0.3575 | 0.3708 | 0.3708 | 0.3794 | 0.3794 | 0.3794 |
| Harness, actual 2026-04-27 run | 0.1892 | 0.3863 | 0.3863 | 0.3863 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a |

## Trial Details

| Variant | Iter | Config | Result |
| --- | ---: | --- | --- |
| Before harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.2029 req/s/gpu` |
| Before harness | 2 | `DP=2`, `MBT=4096` | runtime failure |
| Before harness | 3 | `DP=2`, `MBT=8192` | runtime failure |
| Before harness | 4 | `EP=4` | launch failure |
| Before harness | 6 | `TP8/DP1/EP-off`, `MBT=4096` | `0.3575 req/s/gpu` |
| Before harness | 10 | `TP8/DP1/EP-off`, `MBT=3712` | `0.3794 req/s/gpu`, best |
| Harness | 1 | baseline `TP4/DP1/EP-off`, `MBT=8192` | `0.1892 req/s/gpu` |
| Harness | 2 | `TP8/DP1/EP-off`, `MBT=8192` | `0.3863 req/s/gpu`, best so far |
| Harness | 3 | `TP8/DP1/EP=2` | launch failure |

The harness baseline was slightly lower than the original baseline (`0.1892` vs `0.2029 req/s/gpu`), but iter 2 still exceeded the original 12-trial best (`0.3863` vs `0.3794 req/s/gpu`).
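As a sanity check, the iter-2 result already clears the `1.8x` strong-incumbent threshold adopted in the follow-up below. This is illustrative arithmetic on the numbers above, not the harness code:

```python
# Would the 1.8x strong-incumbent guard fire on the iter-2 harness result?
baseline_rate = 0.1892   # harness iter-1 baseline, req/s/gpu
incumbent_rate = 0.3863  # harness iter-2 best, req/s/gpu

gain = incumbent_rate / baseline_rate  # ~2.04x
guard_fires = gain >= 1.8

print(f"gain = {gain:.2f}x, guard_fires = {guard_fires}")
```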
## Convergence Judgment

- Before harness reached its best at iter 10.
- Harness reached a better result at iter 2.
- Iterations-to-best improved from `10` to `2`, a `5x` improvement on this run.
- The important behavior change is that the harness skipped the original failed DP2 and EP4 exploration and moved directly from baseline to `TP8/DP1`.
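The iterations-to-best numbers can be reproduced mechanically from the best-so-far rows in the table above. A minimal sketch (the helper name is ours, not the harness's):

```python
def iters_to_best(best_so_far: list[float]) -> int:
    """First iteration (1-indexed) at which the best-so-far series reaches its final best."""
    # list.index returns the first occurrence, i.e. the earliest iteration
    # that achieved the run's final best value.
    return best_so_far.index(max(best_so_far)) + 1

# Best-so-far series from the "Result So Far" table.
before = [0.2029] * 5 + [0.3575, 0.3575, 0.3708, 0.3708, 0.3794, 0.3794, 0.3794]
harness = [0.1892, 0.3863, 0.3863, 0.3863]

print(iters_to_best(before), iters_to_best(harness))  # → 10 2
```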
## Follow-Up Optimization

The run also exposed a remaining weakness: after reaching the strong `TP8/DP1` incumbent, the LLM proposed `EP=2`, which failed at launch. To address that, the harness was tightened after this test:

- strong-incumbent stop threshold changed from `3x` to `1.8x` over baseline;
- expert parallel is now explicitly guarded and should not be introduced for TTFT-prefill bottlenecks without direct positive EP evidence.

With the new guard, the intended behavior after this iter-2 result is `should_stop=true` unless a same-topology runtime harness has strong direct evidence.

## Run Status

- The 2026-04-27 harness run was stopped after collecting the iter-2 convergence result and the iter-3 EP failure.
- GPUs were freed after stopping the run.
@@ -156,6 +156,25 @@ def _knob_harnesses(
                 "active_now": False,
             }
         )
+    if "expert-parallel-size" in tunable or "enable-expert-parallel" in tunable:
+        harnesses.append(
+            {
+                "knob_family": "expert-parallel",
+                "use_when": [
+                    "Only when history or a capability profile identifies expert communication or MoE dispatch as the active bottleneck.",
+                ],
+                "procedure": [
+                    "Keep expert parallel disabled for pure TTFT/prefill tuning unless there is direct positive evidence for EP on this stack.",
+                    "If EP is tested, change only EP-related knobs and treat launch/runtime failure as hard negative evidence.",
+                ],
+                "guards": [
+                    "Do not introduce EP as the first follow-up after a successful TP increase.",
+                    "Do not use EP to address generic TTFT-prefill bottlenecks; TP and batching harnesses are the relevant families.",
+                    "Do not enable EP after any launch failure involving expert-parallel knobs.",
+                ],
+                "active_now": False,
+            }
+        )
     if "gpu-memory-utilization" in tunable:
         harnesses.append(
             {
@@ -384,10 +403,10 @@ def _strong_incumbent_guard(
         return default
     gain = incumbent_rate / baseline_rate
     latest = recent_diagnostics[-1] if recent_diagnostics else {}
-    if state.best_trial_id == latest.get("trial_id") and gain >= 3.0:
+    if state.best_trial_id == latest.get("trial_id") and gain >= 1.8:
         return {
             "guard_active": True,
-            "reason": "incumbent_exceeds_baseline_by_3x_and_latest_trial_is_best",
+            "reason": "incumbent_exceeds_baseline_by_1_8x_and_latest_trial_is_best",
             "baseline_trial_id": baseline.get("trial_id"),
             "baseline_request_rate_per_gpu": baseline_rate,
             "incumbent_gain_vs_baseline": gain,
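The tightened guard can be exercised standalone. The sketch below is a simplified stand-in, not the real harness code: the `State` dataclass, the dict shapes, and taking the incumbent rate from the latest diagnostics entry are all assumptions; only the `1.8x` threshold logic mirrors the change above.

```python
from dataclasses import dataclass

@dataclass
class State:
    best_trial_id: str  # hypothetical minimal harness state

def strong_incumbent_guard(state: State, baseline: dict, recent_diagnostics: list[dict]) -> dict:
    default = {"guard_active": False}
    baseline_rate = baseline.get("request_rate_per_gpu")
    latest = recent_diagnostics[-1] if recent_diagnostics else {}
    incumbent_rate = latest.get("request_rate_per_gpu")
    if not baseline_rate or not incumbent_rate:
        return default
    gain = incumbent_rate / baseline_rate
    # Guard fires only when the newest trial IS the incumbent and the
    # gain over baseline is at least 1.8x (was 3.0x before this commit).
    if state.best_trial_id == latest.get("trial_id") and gain >= 1.8:
        return {
            "guard_active": True,
            "reason": "incumbent_exceeds_baseline_by_1_8x_and_latest_trial_is_best",
            "incumbent_gain_vs_baseline": gain,
        }
    return default

# On the qwen235b numbers: 0.3863 / 0.1892 ≈ 2.04x, so the guard fires.
out = strong_incumbent_guard(
    State(best_trial_id="t2"),
    {"trial_id": "t1", "request_rate_per_gpu": 0.1892},
    [{"trial_id": "t2", "request_rate_per_gpu": 0.3863}],
)
print(out["guard_active"])
```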
@@ -531,6 +550,7 @@ def _proposal_rules() -> list[str]:
         "Pick at most one primary knob family from knob_harnesses unless the history proves a coupled change is needed.",
         "Use adjacent legal values around the incumbent; avoid broad exploratory jumps.",
         "When strong_incumbent.guard_active is true, do not propose runtime-only tweaks unless the relevant harness guard is positively satisfied by same-topology evidence.",
+        "Do not enable expert parallel for TTFT-prefill bottlenecks unless expert-parallel is the active harness and there is direct positive EP evidence.",
         "If infeasible_progress blocks the last primary knob family, do not continue that family; switch families with direct bottleneck evidence or set should_stop=true.",
         "If a proposed config is likely to reduce request_rate_per_gpu under the active guard, set should_stop=true instead of exploring.",
         "Never repeat an already tested config signature.",
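The Current Risks section notes these rules are prompt-guided and that proposal validation should reject blocked families explicitly. A hypothetical hard check layered on top of the prompt rules could look like the sketch below; the function and field names are assumptions, not the repo's actual API:

```python
def validate_proposal(proposal: dict, guards: dict) -> tuple[bool, str]:
    """Reject proposals that a fired guard should have blocked (hypothetical)."""
    family = proposal.get("knob_family")
    # Runtime-only tweaks are blocked while the strong-incumbent guard is active.
    if guards.get("strong_incumbent", {}).get("guard_active") and proposal.get("runtime_only"):
        return False, "strong incumbent guard active: runtime-only tweak rejected"
    # Expert parallel requires direct positive EP evidence.
    if family == "expert-parallel" and not guards.get("ep_positive_evidence"):
        return False, "expert-parallel blocked without direct positive EP evidence"
    return True, "ok"

# An EP proposal with no EP evidence is rejected even if the LLM emits it.
ok, reason = validate_proposal({"knob_family": "expert-parallel"}, {})
print(ok, reason)
```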