Record qwen235b harness convergence test

2026-04-27 18:59:25 +08:00
parent bc884f6701
commit 71902b9fc2
3 changed files with 82 additions and 4 deletions

@@ -45,7 +45,7 @@ The speedup comes from reducing wasted proposal families, not from changing the
- Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.
2. Guarded stop after a strong incumbent
-  - If the newest trial is the incumbent and improves per-GPU throughput by at least `3x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
+  - If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
- Without that guard, the LLM still proposed weak MBT trials after finding the qwen27b best config.
- With the guard, it emits `should_stop=true`.
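The guarded-stop rule above can be sketched as follows. This is a minimal illustration, not the harness's actual implementation; the function and field names (`per_gpu_tput`, `baseline_tput`) are hypothetical, while the `1.8x` ratio comes from the updated guard in this diff:

```python
def guarded_should_stop(trials, baseline_tput, guard_ratio=1.8):
    """Return True when the strong-incumbent guard fires.

    The guard fires when the newest trial is also the incumbent (best
    per-GPU throughput so far) and beats the baseline by guard_ratio,
    at which point runtime-only tweaks need direct evidence to proceed.
    """
    newest = trials[-1]
    incumbent = max(trials, key=lambda t: t["per_gpu_tput"])
    return (newest is incumbent
            and newest["per_gpu_tput"] >= guard_ratio * baseline_tput)

# A 2.1x incumbent found on the newest trial fires the guard;
# a 1.5x improvement does not.
fires = guarded_should_stop(
    [{"per_gpu_tput": 100.0}, {"per_gpu_tput": 210.0}], baseline_tput=100.0)
```
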
@@ -79,5 +79,5 @@ Result:
## Current Risks
- The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly.
-- Strong-incumbent stopping is deliberately conservative for the qwen27b pattern. Workloads with narrow runtime sweet spots, such as qwen235b thinking prefill-only, may need a weaker stop threshold or a "continue local refinement" exception.
+- Strong-incumbent stopping is intentionally biased toward fewer GPU trials once a large gain is already reached. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
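The first risk above suggests hard-rejecting proposals from a blocked family instead of trusting the prompt. A minimal sketch of that validation step, with hypothetical names (`family`, `blocked_families`) since the harness's data model is not shown in this diff:

```python
def validate_proposal(proposal, fired_guards):
    """Hard verifier for fired guards: reject any proposal whose family
    a fired guard has blocked, rather than relying on the LLM to comply
    with the prompt alone."""
    blocked = {fam for guard in fired_guards
               for fam in guard["blocked_families"]}
    if proposal["family"] in blocked:
        raise ValueError(
            f"proposal family {proposal['family']!r} is blocked by a fired guard")
    return proposal

# A runtime-only tweak family blocked by the strong-incumbent guard is
# rejected; an unblocked family passes through unchanged.
guards = [{"blocked_families": ["runtime_only"]}]
ok = validate_proposal({"family": "parallelism"}, guards)
```
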