Add harness early stop ablation
@@ -26,11 +26,11 @@ The harness turns each LLM proposal from open-ended config search into a bottlen
- `gpu-memory-utilization`: memory headroom after topology and batching are stable.
- Each family has `use_when`, `procedure`, `guards`, and `active_now` fields.

4. Proposal discipline and early stop
- The prompt requires the LLM to choose at most one primary knob family unless history proves a coupled change is needed.
- It must use adjacent legal topology choices and stay inside topology constraints.
- It receives tested config signatures, so it should not repeat already-tried configs (a filtering sketch follows this list).
- It can return `should_stop=true` when no adjacent harness-guided probe is justified.
- A deterministic harness stop can now emit `should_stop=true` before calling the LLM when completed validation evidence says another trial is not justified.

5. Baseline-first loop
- LLM-driven `study tune` now evaluates the initial engine config first unless `--skip-baseline` is passed.
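The proposal-discipline rules above can be pictured as a small filtering step in front of the LLM call. The sketch below is illustrative only; the names `tested_signatures`, `legal_topologies`, and `config_signature` are assumptions, not the aituner API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyChoice:
    tp: int
    dp: int


def config_signature(topology: TopologyChoice, runtime_flags: dict) -> str:
    """Stable signature used to recognise already-tried configs (assumed helper)."""
    flags = ",".join(f"{k}={runtime_flags[k]}" for k in sorted(runtime_flags))
    return f"TP{topology.tp}-DP{topology.dp}|{flags}"


def admissible_proposals(
    candidates: list[tuple[TopologyChoice, dict]],
    tested_signatures: set[str],
    legal_topologies: set[TopologyChoice],
) -> list[tuple[TopologyChoice, dict]]:
    """Keep only proposals that stay inside topology constraints and are not repeats."""
    kept = []
    for topology, runtime_flags in candidates:
        if topology not in legal_topologies:
            continue  # outside the legal adjacent-topology menu
        if config_signature(topology, runtime_flags) in tested_signatures:
            continue  # already tried; the prompt forbids repeating configs
        kept.append((topology, runtime_flags))
    return kept

# If nothing admissible remains, returning should_stop=true instead of
# asking the LLM for another proposal is one of the justified outcomes.
```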
@@ -44,10 +44,10 @@ The speedup comes from reducing wasted proposal families, not from changing the
- For long-prompt, low-cache-reuse windows, the harness activates the TP harness before speculative runtime knobs.
- Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.

2. Guarded stop after validation, not immediately after a strong incumbent
- If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
- Without that guard, the LLM still proposed weak MBT trials after finding the qwen27b best config.
- It does not stop at the first large gain. It requires post-incumbent validation trials across nearby topology/runtime families, and stops only if those trials fail to produce a feasible per-GPU improvement.
- With the guard, `study tune` can write a `harness-stop-XXXX` proposal and exit without spending another GPU trial.

3. All-infeasible plateau detection
- When recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT, the harness blocks repeating the same primary knob family (a plateau-detection sketch follows this list).
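As a rough illustration of the plateau rule, the sketch below checks whether the most recent all-infeasible trials at the same sampling threshold have stopped improving pass rate and p95 TTFT. It is a simplified stand-in, not the harness implementation; the `TrialResult` fields and the three-trial window are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    feasible: bool
    sampling_u: float       # sampling threshold probed for this trial
    pass_rate: float
    p95_ttft_s: float
    primary_family: str     # knob family the proposal mainly touched


def blocked_families(history: list[TrialResult], window: int = 3) -> set[str]:
    """Block repeating a primary knob family after an all-infeasible, non-improving plateau."""
    recent = history[-window:]
    if len(recent) < window:
        return set()
    if any(t.feasible for t in recent):
        return set()
    if len({round(t.sampling_u, 6) for t in recent}) != 1:
        return set()  # the plateau only counts at the same sampling threshold
    improving_pass = any(b.pass_rate > a.pass_rate for a, b in zip(recent, recent[1:]))
    improving_ttft = any(b.p95_ttft_s < a.p95_ttft_s for a, b in zip(recent, recent[1:]))
    if improving_pass or improving_ttft:
        return set()
    return {recent[-1].primary_family}
```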
@@ -78,6 +78,6 @@ Result:
## Current Risks
- The harness is still prompt-guided for choosing the next non-stop proposal. The deterministic stop path is hard-coded in `study tune`, but proposal-family blocking is not yet enforced by a separate validator.
- Strong-incumbent stopping is intentionally biased toward fewer GPU trials after validation evidence accumulates. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
@@ -0,0 +1,93 @@
# Qwen3-30B-A3B Community vLLM Harness Ablation, 2026-05-02

## Goal

Run a fresh dash0 experiment on the community vLLM latest release with the local community model:

`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`

The comparison is:

| Variant | Spec | Harness |
| --- | --- | --- |
| no-harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_noharness.json` | disabled via `llm.use_harness=false` |
| harness | `configs/examples/dash0_qwen30b_a3b_community_vllm020_harness.json` | enabled, including deterministic stop proposal |

Both specs start from the same base vLLM configuration. The base contains only serving access fields: `host`, `port`, and `served-model-name`. It does not set performance flags such as TP, DP, EP, max model length, prefix cache, chunked prefill, max-num-seqs, max-num-batched-tokens, or gpu-memory-utilization. The first trial therefore measures community vLLM defaults for this model.
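For orientation, the shared base configuration would look roughly like the following. This is an assumed shape with placeholder values, not the contents of the real spec files.

```python
# Assumed shape of the shared base vLLM server settings: serving access only,
# no performance flags, so trial 1 exercises community vLLM defaults.
base_vllm_config = {
    "host": "0.0.0.0",                      # placeholder value
    "port": 8000,                           # placeholder value
    "served-model-name": "qwen3-30b-a3b",   # placeholder value
}
# Performance knobs (TP, DP, EP, max model length, prefix cache, chunked prefill,
# max-num-seqs, max-num-batched-tokens, gpu-memory-utilization) are intentionally absent.
```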
## vLLM Install

PyPI reports `vllm==0.20.0` as the current community release (checked on 2026-05-02). The dash0 installation target is:

`/home/admin/cpfs/wjh/venvs/vllm-0.20.0`

Install log:

`/home/admin/cpfs/wjh/aituner/aituner/logs/install_vllm_0.20.0_20260502.log`
## Workload

The experiment reuses the 0-8k chat window that has already been used for qwen27b harness work:

| Field | Value |
| --- | --- |
| window | `chat_w20260311_1000` |
| source rows | 32606 |
| input filter | 0 to 8192 tokens |
| max requests per probe | 2048 |
| target pass rate | 0.95 |
| TTFT SLO | 2 s up to 4k, 4 s up to 32k, 6 s above |
| TPOT SLO | 50 ms |
| search high | 0.125 `sampling_u` |
| max probes per trial | 6 |

The `max_requests_per_probe=2048` cap keeps the fresh community-vLLM ablation practical while preserving a real trace-shaped replay, SLO scoring, and the binary-search threshold probe.
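The threshold probe can be read as a bounded binary search over `sampling_u`, scored against the pass-rate target in the table. This is a schematic under assumed names (`run_probe`, `ProbeResult`), not the aituner probe code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    pass_rate: float  # fraction of replayed requests meeting the TTFT/TPOT SLOs


def highest_feasible_sampling_u(
    run_probe: Callable[[float], ProbeResult],
    high: float = 0.125,            # search high from the workload table
    target_pass_rate: float = 0.95,
    max_probes: int = 6,
) -> float:
    """Binary-search the largest sampling_u whose probe still meets the pass-rate target."""
    low, best = 0.0, 0.0
    for _ in range(max_probes):
        mid = (low + high) / 2
        result = run_probe(mid)     # replays up to 2048 trace-shaped requests at this rate
        if result.pass_rate >= target_pass_rate:
            best, low = mid, mid    # feasible: push the threshold up
        else:
            high = mid              # infeasible: back off
    return best
```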
## Harness Update Under Test

This run tests a stricter early-stop harness:

- The harness still injects L-C-A workload features, recent trial diagnostics, the active bottleneck, legal topology candidates, tested signatures, and knob-family rules.
- A strong incumbent no longer means immediate stop. It means "validate nearby alternatives".
- Deterministic stop is allowed only after completed validation evidence says continuing is unlikely to be useful:
  - the incumbent beats baseline by a generic large-gain ratio,
  - at least two post-incumbent validation trials have run,
  - those validation trials did not produce a feasible per-GPU improvement,
  - the validation covered topology and runtime families, or accumulated at least three post-incumbent validation attempts.
- If the stop guard fires, `study tune` writes `harness-stop-XXXX` and exits without spending another GPU trial or asking the LLM for another proposal (a sketch of this rule follows the list).
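A minimal sketch of the deterministic stop rule, assuming each finished trial is summarised by feasibility, per-GPU throughput, and the knob family it probed. The names `TrialSummary` and `should_stop_deterministically` are illustrative, not the actual aituner functions, and the `1.8` default is only an example of a large-gain ratio.

```python
from dataclasses import dataclass


@dataclass
class TrialSummary:
    iteration: int
    feasible: bool
    request_rate_per_gpu: float
    family: str  # "topology" or "runtime": the primary knob family probed


def should_stop_deterministically(
    baseline: TrialSummary,
    incumbent: TrialSummary,
    post_incumbent: list[TrialSummary],
    large_gain_ratio: float = 1.8,   # example value; the rule itself is generic
) -> bool:
    """Stop only after completed validation evidence, never right at a strong incumbent."""
    if baseline.request_rate_per_gpu <= 0:
        return False
    gain = incumbent.request_rate_per_gpu / baseline.request_rate_per_gpu
    if gain < large_gain_ratio:
        return False                                  # no strong incumbent yet
    if len(post_incumbent) < 2:
        return False                                  # validation not attempted enough
    improved = any(
        t.feasible and t.request_rate_per_gpu > incumbent.request_rate_per_gpu
        for t in post_incumbent
    )
    if improved:
        return False                                  # validation found something better
    families = {t.family for t in post_incumbent}
    covered_both = {"topology", "runtime"} <= families
    return covered_both or len(post_incumbent) >= 3   # evidence considered exhausted
```

When a check like this returns true, the loop writes the `harness-stop-XXXX` proposal instead of calling the LLM or launching another GPU trial.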
This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.
## Unit Tests

Local test command:

```bash
PYTHONPATH=src python3 -m unittest tests.test_core_flow -q
```

Result: passed, 74 tests.
The added coverage checks:

| Test | Purpose |
| --- | --- |
| `test_harness_does_not_stop_immediately_after_strong_incumbent` | strong incumbent requires validation first |
| `test_harness_stop_after_post_incumbent_validation_is_exhausted` | deterministic stop after validation exhaustion |
| `test_cli_tune_uses_harness_stop_before_llm` | `study tune` can stop without calling the LLM or launching another GPU trial |
| `test_prompt_can_disable_harness_for_ablation` | no-harness prompt removes structured harness context |
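The first two stop-guard tests can be imagined along these lines, reusing the illustrative `should_stop_deterministically` sketch from the harness section. The real tests live in `tests.test_core_flow` and exercise the actual aituner code paths, so this is only a shape sketch.

```python
import unittest

# Assumes TrialSummary and should_stop_deterministically from the harness sketch
# above are in scope; both are illustrative stand-ins, not aituner internals.


class HarnessStopGuardSketch(unittest.TestCase):
    def test_no_stop_immediately_after_strong_incumbent(self):
        baseline = TrialSummary(0, True, 1.0, "baseline")
        incumbent = TrialSummary(1, True, 2.0, "topology")  # 2.0x gain, but no validation yet
        self.assertFalse(should_stop_deterministically(baseline, incumbent, post_incumbent=[]))

    def test_stop_after_validation_is_exhausted(self):
        baseline = TrialSummary(0, True, 1.0, "baseline")
        incumbent = TrialSummary(1, True, 2.0, "topology")
        validation = [
            TrialSummary(2, True, 1.9, "topology"),   # feasible but no improvement
            TrialSummary(3, False, 0.0, "runtime"),   # infeasible runtime probe
        ]
        self.assertTrue(should_stop_deterministically(baseline, incumbent, validation))


if __name__ == "__main__":
    unittest.main()
```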
## Experiment Tracking

Pending dash0 runs:

| Variant | tmux session | Log | Study root |
| --- | --- | --- | --- |
| no-harness | `qwen30b_vllm020_noharness_20260502` | `logs/qwen30b_vllm020_noharness_20260502.log` | `.aituner-community-vllm020/studies/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-noharness` |
| harness | `qwen30b_vllm020_harness_20260502` | `logs/qwen30b_vllm020_harness_20260502.log` | `.aituner-community-vllm020/studies/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-harness` |

The harness run should be judged by best-so-far `request_rate_per_gpu` per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget so the ablation exposes whether the early-stop harness saves iterations without hiding a later better point.
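The per-iteration judgement amounts to a best-so-far curve over tuning iterations. A minimal sketch, assuming each iteration is reduced to a single `request_rate_per_gpu` number with infeasible trials recorded as 0.0:

```python
from itertools import accumulate
from typing import Iterable


def best_so_far_curve(per_iteration_rate: Iterable[float]) -> list[float]:
    """Cumulative best request_rate_per_gpu per tuning iteration.

    Infeasible trials can be recorded as 0.0 so they never move the curve.
    """
    return list(accumulate(per_iteration_rate, max))

# Usage idea: compute this for the harness and no-harness studies (rates taken
# from each study root) and compare how quickly each curve flattens within the
# same trial budget.
```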
## Results

Pending. This section will be filled after the dash0 experiments finish.