Document community vllm harness ablation
@@ -59,6 +59,10 @@ The speedup comes from reducing wasted proposal families, not from changing the
- Default behavior drains in-flight requests so production-style runs remain comparable.
- Engine relaunch after an early stop is available as an opt-in for faster smoke studies, but it is not the default because relaunching can change warm-state comparability (see the config sketch after this list).
5. Search-high saturation stop
- If the incumbent's highest measured probe is feasible, has no SLO failures, and is within the configured binary-search resolution of `search.high`, the harness stops before asking the LLM for another proposal.
- This is not a model-specific threshold. It means the workload's search range, not the engine config, is currently the limiting bound on measurement (a sketch of this stop predicate follows the list).
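The two shutdown behaviors above amount to a small config surface. This is an illustrative sketch only; the class and field names are assumptions, not the harness's actual API.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopConfig:
    # Default: drain in-flight requests on early stop so warm-state
    # metrics stay comparable across production-style runs.
    drain_in_flight: bool = True
    # Opt-in: relaunch the engine after an early stop. Faster for smoke
    # studies, but the relaunch resets warm state and weakens
    # run-to-run comparability, so it is off by default.
    relaunch_engine_after_stop: bool = False
```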
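The search-high saturation stop reduces to a simple predicate over measured probes. A minimal sketch follows, assuming hypothetical `Probe` records plus a `search_high`/`resolution` pair taken from the search config; none of these names come from the harness itself.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    request_rate: float   # request rate measured at this probe
    feasible: bool        # engine launched and completed the replay
    slo_failures: int     # number of SLO violations at this rate


def saturates_search_high(probes: list[Probe],
                          search_high: float,
                          resolution: float) -> bool:
    """True when the incumbent's best clean probe pins `search.high`.

    If this holds, the harness stops before asking the LLM for another
    proposal: the workload's search range, not the engine config, is
    the binding measurement limit.
    """
    clean = [p for p in probes if p.feasible and p.slo_failures == 0]
    if not clean:
        return False
    best = max(clean, key=lambda p: p.request_rate)
    # Within one binary-search step of the configured upper bound.
    return search_high - best.request_rate <= resolution
```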
## qwen27b 0-8k Evidence
Source: `docs/qwen27b-chat-0-8k-harness-fig18.md`.
@@ -82,3 +86,20 @@ Result:
- The harness is still prompt-guided for choosing the next non-stop proposal. The deterministic stop path is hard-coded in `study tune`, but proposal-family blocking is not yet enforced by a separate validator (a sketch of such a validator follows this list).
- Strong-incumbent stopping is intentionally biased toward fewer GPU trials after validation evidence accumulates. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
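Since proposal-family blocking is not yet a separate validator, one shape it could take is a pre-trial check that rejects proposals from families already ruled out by prior evidence. This is a hypothetical sketch; the `family` tag and function name are assumptions, not existing harness code.

```python
def reject_blocked_families(proposal: dict,
                            blocked_families: set[str]) -> str | None:
    """Return a rejection reason, or None if the proposal may run.

    `proposal["family"]` is a hypothetical tag the proposer would
    attach, e.g. "tensor-parallel-sweep". A real validator would run
    before any GPU trial is scheduled, independent of the
    prompt-guided proposer.
    """
    family = proposal.get("family")
    if family in blocked_families:
        return f"proposal family {family!r} is blocked by prior evidence"
    return None
```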
## Qwen3-30B-A3B Community vLLM Evidence
Source: `docs/qwen30b-community-vllm020/harness-early-stop-ablation-20260502.md`.
Metric: best-so-far feasible `request_rate_per_gpu` on the bounded 0-8k chat replay with 128 output tokens and `replay_time_scale=0.1`.
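As a reading aid for the table below, the best-so-far column can be reproduced from per-iteration trial records. This sketch assumes simple dict records with hypothetical field names; it is not the harness's reporting code.

```python
def best_so_far_per_gpu(trials: list[dict]) -> list[float | None]:
    """Best feasible `request_rate_per_gpu` seen up to each iteration.

    Launch failures and SLO-violating iterations contribute nothing,
    so a failed iteration carries the previous best forward, which is
    why no-harness stays flat at 1.0333 through iters 3-12.
    """
    best: float | None = None
    series: list[float | None] = []
    for t in trials:  # one record per iteration, in order
        if t["launched"] and t["slo_ok"]:
            best = max(best or 0.0, t["request_rate"] / t["num_gpus"])
        series.append(best)
    return series
```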
| Variant | Iter 1 | Iter 2 | Iter 3-12 |
| --- | ---: | ---: | ---: |
| no-harness | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 (stop) | 1.0333 |
Result:
- Both variants found the same best measured config: the default community vLLM launch.
- Harness stopped at iter 2 because the incumbent saturated `search.high`; no LLM proposal or GPU trial was needed after the baseline.
- No-harness spent the full 12-iteration budget: iter 2 was worse per GPU, and iter 3-12 were launch failures.