Document community vllm harness ablation

This run tests a stricter early-stop harness:

- The validation covered topology and runtime families, or accumulated at least three post-incumbent validation attempts.
- If the stop guard fires, `study tune` writes `harness-stop-XXXX` and exits without spending another GPU trial or asking the LLM for another proposal.
- A single-family all-infeasible plateau is not enough to stop deterministically. It only blocks repeating that family; the LLM must either justify a different family or later satisfy the validation/convergence stop rule.
- A search-high saturation guard stops immediately when the incumbent's highest measured probe is feasible, has no SLO failures, and is within the configured binary-search resolution of `search.high`. In that case the current study cannot measure a better config without increasing the workload search range, so more config proposals only waste tuning iterations.

This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.
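
The search-high saturation guard above can be sketched as a pure predicate. This is a minimal sketch: the function name, its arguments, and the halving-per-probe resolution model are assumptions, not the project's actual API.

```python
def saturates_search_high(probe_rate, probe_feasible, slo_failures,
                          search_high, max_probes):
    """Stop when the incumbent's highest measured probe is feasible,
    has no SLO failures, and sits within one binary-search resolution
    of the configured search.high (illustrative names throughout)."""
    # Assumed resolution model: each probe halves the remaining range.
    resolution = search_high / (2 ** max_probes)
    return (probe_feasible
            and slo_failures == 0
            and search_high - probe_rate <= resolution)
```

Because the predicate only reads the harness configuration and the incumbent's highest probe, no testcase-specific threshold appears in it.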

Local test command:

```
PYTHONPATH=src python3 -m unittest tests.test_core_flow -q
```

Result: passed, 75 tests.

The added coverage checks:

| Test | Coverage |
| --- | --- |
| `test_harness_stop_after_post_incumbent_validation_is_exhausted` | deterministic stop after validation exhaustion |
| `test_cli_tune_uses_harness_stop_before_llm` | `study tune` can stop without calling the LLM or launching another GPU trial |
| `test_prompt_can_disable_harness_for_ablation` | no-harness prompt removes structured harness context |
| `test_harness_stop_when_incumbent_saturates_search_high` | deterministic stop when the incumbent saturates the configured workload search high |
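
The `test_cli_tune_uses_harness_stop_before_llm` behavior can be sketched as a unittest with a stub LLM. This is a sketch only: `tune_step` and its arguments are invented stand-ins, not the project's real interface.

```python
import unittest
from unittest import mock

def tune_step(harness_should_stop, llm_propose):
    # Mirrors the documented behavior: check the deterministic stop
    # guard first; only ask the LLM for a proposal if it does not fire.
    if harness_should_stop():
        return "harness-stop"
    return llm_propose()

class HarnessStopBeforeLLM(unittest.TestCase):
    def test_stop_guard_short_circuits_llm(self):
        llm = mock.Mock()
        self.assertEqual(tune_step(lambda: True, llm), "harness-stop")
        llm.assert_not_called()

    def test_llm_called_when_guard_passes(self):
        llm = mock.Mock(return_value="proposal")
        self.assertEqual(tune_step(lambda: False, llm), "proposal")
        llm.assert_called_once()
```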

## Experiment Tracking
Completed dash0 runs:

| Variant | tmux session | Log | Study root |
| --- | --- | --- | --- |
| no-harness | `qwen30b_vllm020_noharness_out128_scale01_20260502` | `logs/qwen30b_vllm020_noharness_out128_scale01_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-noharness` |
| harness | `qwen30b_vllm020_harness_out128_scale01_20260502` | `logs/qwen30b_vllm020_harness_out128_scale01_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-harness` |
| harness | `qwen30b_vllm020_harness_highstop_gpu4_7_20260502` | `logs/qwen30b_vllm020_harness_highstop_gpu4_7_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-harness-highstop-gpu4-7` |

The harness run should be judged by best-so-far `request_rate_per_gpu` per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget so the ablation exposes whether the early-stop harness saves iterations without hiding a later better point.
## Results

Metric: best-so-far `request_rate_per_gpu` under the bounded, time-compressed replay.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 (stop) | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |

Actual per-iteration outcomes:

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 1.0333 | 0.5167 | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |
| harness | 1.0333 | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop |
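
The best-so-far table can be derived mechanically from these actual outcomes. A sketch, under the stated assumption that `fail` and `stop` entries carry the previous best forward:

```python
def best_so_far(outcomes):
    # Running maximum over numeric outcomes; non-numeric entries
    # ("fail", "stop") carry the previous best forward unchanged.
    best, series = 0.0, []
    for outcome in outcomes:
        if isinstance(outcome, float):
            best = max(best, outcome)
        series.append(best)
    return series

no_harness = [1.0333, 0.5167] + ["fail"] * 10
harness = [1.0333] + ["stop"] * 11
```

Both rows reduce to a flat `1.0333` best-so-far series, which is why the two variants look identical in the first table despite very different per-iteration behavior.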

Interpretation:
- The best config is the default community vLLM config for this bounded replay. It reaches the configured search high: the last baseline probe at `sampling_u=0.1171875` is feasible, has pass rate `1.0`, and has no TTFT/TPOT SLO failures. With `search.high=0.125` and `max_probes=4`, this is exactly one binary-search resolution below the configured high.
- The harness stops at iter 2 without calling the LLM or launching another GPU trial. This is not a claim that the engine is globally optimal; it is a claim that the current study cannot measure an improvement until `search.high` is increased.
- No-harness spends all 12 tuning iterations anyway. Iter 2 changes to DP=2 and halves per-GPU throughput (`0.5167`). Iters 3-12 are launch failures from unguided or weakly guided proposals.
- The harness therefore reaches the best measured config in one executed GPU trial and saves 11 tuning iterations on this setup.
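
The saturation arithmetic in the first bullet can be checked directly with the values quoted there; the `search.high / 2**max_probes` resolution formula is an assumed model of the binary search, not taken from the source.

```python
# Values quoted in the interpretation above.
search_high = 0.125
max_probes = 4
last_probe = 0.1171875  # last baseline probe, sampling_u=0.1171875

# Assumed model: each probe halves the remaining search range.
resolution = search_high / 2 ** max_probes

# The probe sits exactly one binary-search resolution below the high.
assert search_high - last_probe == resolution == 0.0078125
```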

Operational note:
- The no-harness run left driver-side orphan GPU memory on GPU0/1 after repeated launch failures. An earlier pre-high-stop harness attempt left the same kind of residue on GPU2/3. The final harness run was executed on dash0 GPU4-7 via a runtime-derived spec to avoid this contamination. Its executed GPU trial used a single H20, matching the no-harness best trial's single-GPU default configuration.