Add harness early stop ablation
@@ -26,11 +26,11 @@ The harness turns each LLM proposal from open-ended config search into a bottlen
- `gpu-memory-utilization`: memory headroom after topology and batching are stable.
- Each family has `use_when`, `procedure`, `guards`, and `active_now` fields.
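
A family entry with the four fields named above might look like the following sketch. The `KnobFamily` class, the `name` field, the field types, the `tensor-parallel` entry, and all example values are assumptions for illustration; only the four field names and the `gpu-memory-utilization` family come from the text.

```python
from dataclasses import dataclass


@dataclass
class KnobFamily:
    """Hypothetical shape of one harness family entry."""
    name: str              # assumed identifier for the family
    use_when: str          # condition under which the family applies
    procedure: list[str]   # ordered probing steps (type is an assumption)
    guards: list[str]      # conditions that limit or block the family
    active_now: bool       # whether the harness activated it for this step


def active_family_names(families: list[KnobFamily]) -> list[str]:
    """Return the names of families currently marked active."""
    return [f.name for f in families if f.active_now]


# Illustrative entries only; the wording of each field is made up.
families = [
    KnobFamily(
        name="gpu-memory-utilization",
        use_when="topology and batching are stable",
        procedure=["illustrative step: adjust memory headroom by one increment"],
        guards=["illustrative guard: skip while topology is still changing"],
        active_now=False,
    ),
    KnobFamily(
        name="tensor-parallel",  # hypothetical family name
        use_when="long prompts with low cache reuse",
        procedure=["illustrative step: try the adjacent legal TP value"],
        guards=["illustrative guard: stay inside topology constraints"],
        active_now=True,
    ),
]
```

Calling `active_family_names(families)` on this list would return only the entries whose `active_now` flag is set.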
-4. Proposal discipline
+4. Proposal discipline and early stop
- The prompt requires the LLM to choose at most one primary knob family unless history proves a coupled change is needed.
- It must use adjacent legal topology choices and stay inside topology constraints.
- It receives tested config signatures, so it should not repeat already-tried configs.
-- It can return `should_stop=true` when no adjacent harness-guided probe is justified.
+- A deterministic harness stop can now emit `should_stop=true` before calling the LLM when completed validation evidence says another trial is not justified.
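
The no-repeat and one-family rules in this list can be checked mechanically. The sketch below is an assumption about how a config signature and a family count might be computed; `config_signature`, `changed_families`, `check_proposal`, the `KNOB_TO_FAMILY` mapping, and the config key names are all hypothetical and are not taken from the harness.

```python
import hashlib
import json

# Hypothetical mapping from config keys to knob families.
KNOB_TO_FAMILY = {
    "tensor_parallel_size": "topology",
    "data_parallel_size": "topology",
    "max_num_batched_tokens": "batching",
    "gpu_memory_utilization": "memory",
}


def config_signature(config: dict) -> str:
    """Stable signature: the same config yields the same string in any key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def changed_families(incumbent: dict, proposal: dict) -> set[str]:
    """Families touched by the proposal relative to the incumbent config."""
    keys = set(incumbent) | set(proposal)
    return {
        KNOB_TO_FAMILY.get(key, key)  # unknown keys count as their own family
        for key in keys
        if incumbent.get(key) != proposal.get(key)
    }


def check_proposal(incumbent: dict, proposal: dict, tested: set[str]) -> list[str]:
    """Return a list of discipline problems; an empty list means none were found."""
    problems = []
    if config_signature(proposal) in tested:
        problems.append("repeats an already-tested config")
    if len(changed_families(incumbent, proposal)) > 1:
        problems.append("changes more than one knob family")
    return problems
```

Because the text allows a coupled change when history proves it is needed, a real check would probably treat the second problem as a warning that needs justification rather than a hard rejection.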
5. Baseline-first loop
- LLM-driven `study tune` now evaluates the initial engine config first unless `--skip-baseline` is passed.
@@ -44,10 +44,10 @@ The speedup comes from reducing wasted proposal families, not from changing the
- For long-prompt, low-cache-reuse windows, the harness activates the TP harness before speculative runtime knobs.
- Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.
-2. Guarded stop after a strong incumbent
+2. Guarded stop after validation, not immediately after a strong incumbent
- If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
-- Without that guard, the LLM still proposed weak MBT trials after finding the qwen27b best config.
-- With the guard, it emits `should_stop=true`.
+- It does not stop at the first large gain. It requires post-incumbent validation trials across nearby topology/runtime families, and stops only if those trials fail to produce a feasible per-GPU improvement.
+- With the guard, `study tune` can write a `harness-stop-XXXX` proposal and exit without spending another GPU trial.
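
The stop rule described in this item could be sketched roughly as follows. The `1.8x` threshold comes from the text; the `Trial` fields, the `should_stop` function, and the `min_validation_families` parameter (how many distinct families must be probed after the incumbent) are assumptions, since the text does not say how much validation evidence is required.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record of one completed trial."""
    index: int                  # iteration number within the study
    family: str                 # primary knob family the trial changed
    per_gpu_throughput: float   # measured throughput divided by GPU count
    feasible: bool              # whether the trial met its constraints


def should_stop(
    baseline_per_gpu: float,
    trials: list[Trial],
    gain_threshold: float = 1.8,        # from the text
    min_validation_families: int = 2,   # assumption, not from the text
) -> bool:
    """Stop only after a strong incumbent has survived validation trials."""
    feasible = [t for t in trials if t.feasible]
    if not feasible:
        return False
    # max() keeps the earliest trial on ties, so a later tie is not an improvement.
    incumbent = max(feasible, key=lambda t: t.per_gpu_throughput)
    if incumbent.per_gpu_throughput < gain_threshold * baseline_per_gpu:
        return False
    # Every later trial is, by construction, infeasible or no better per GPU.
    later = [t for t in trials if t.index > incumbent.index]
    validated_families = {t.family for t in later}
    return len(validated_families) >= min_validation_families
```

With a strong incumbent as the newest trial, `later` is empty and the function returns `False`, which matches the rule that the harness does not stop at the first large gain.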
3. All-infeasible plateau detection
- When recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT, the harness blocks repeating the same primary knob family.
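
A minimal sketch of the plateau check described above, under several assumptions: the window size, the `TrialResult` fields, the requirement that all trials in the window used the same family, and the strict comparisons with no noise tolerance are illustrative choices, not the harness's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrialResult:
    """Hypothetical summary of one completed trial."""
    family: str                # primary knob family the trial changed
    feasible: bool
    sampling_threshold: float  # threshold the trial was sampled at
    pass_rate: float           # fraction of requests meeting the target
    p95_ttft_ms: float         # p95 time to first token, in milliseconds


def plateaued_family(trials: list[TrialResult], window: int = 3) -> Optional[str]:
    """Return the family to block, or None if no plateau is detected."""
    recent = trials[-window:]
    if len(recent) < window:
        return None  # not enough evidence yet
    if any(t.feasible for t in recent):
        return None  # the rule only covers all-infeasible runs
    if len({t.sampling_threshold for t in recent}) != 1:
        return None  # trials must share one sampling threshold
    if len({t.family for t in recent}) != 1:
        return None  # assumption: the window repeats one family
    first = recent[0]
    improved = any(
        t.pass_rate > first.pass_rate or t.p95_ttft_ms < first.p95_ttft_ms
        for t in recent[1:]
    )
    return None if improved else first.family
```

A production check would likely compare against a tolerance rather than using strict inequalities, because pass rate and p95 TTFT both carry measurement noise.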
@@ -78,6 +78,6 @@ Result:
## Current Risks
-- The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly.
-- Strong-incumbent stopping is intentionally biased toward fewer GPU trials once a large gain is already reached. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
+- The harness is still prompt-guided for choosing the next non-stop proposal. The deterministic stop path is hard-coded in `study tune`, but proposal-family blocking is not yet enforced by a separate validator.
+- Strong-incumbent stopping is intentionally biased toward fewer GPU trials after validation evidence accumulates. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
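
The risk list notes that proposal-family blocking is not enforced by a validator. One possible shape for such a validator is sketched below. It is not part of the commit; `validate_proposal`, `BlockedFamilyError`, and the `primary_family` key are hypothetical names, and only `should_stop` appears in the text.

```python
class BlockedFamilyError(ValueError):
    """Raised when a proposal targets a family that a guard has blocked."""


def validate_proposal(proposal: dict, blocked_families: set[str]) -> dict:
    """Reject proposals that ignore a fired guard; pass everything else through."""
    if proposal.get("should_stop", False):
        return proposal  # stop proposals change no knob, so nothing to block
    family = proposal.get("primary_family")  # hypothetical field name
    if family is None:
        raise ValueError("proposal does not name a primary knob family")
    if family in blocked_families:
        raise BlockedFamilyError(f"family {family!r} is blocked by a fired guard")
    return proposal
```

Running the check between the LLM response and trial launch would turn a prompt-level instruction into a hard rule, at the cost of needing a retry or fallback path when a proposal is rejected.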