Document community vllm harness ablation

This run tests a stricter early-stop harness:

- The validation covered topology and runtime families, or accumulated at least three post-incumbent validation attempts.
- If the stop guard fires, `study tune` writes `harness-stop-XXXX` and exits without spending another GPU trial or asking the LLM for another proposal.
- A single-family all-infeasible plateau is not enough to stop deterministically. It only blocks repeating that family; the LLM must either justify a different family or later satisfy the validation/convergence stop rule.
- A search-high saturation guard stops immediately when the incumbent's highest measured probe is feasible, has no SLO failures, and is within the configured binary-search resolution of `search.high`. In that case the current study cannot measure a better config without increasing the workload search range, so more config proposals only waste tuning iterations.

This is a generic harness rule, not a testcase-specific threshold. It does not depend on qwen27b, qwen235b, qwen30b, a fixed TP/DP value, or a hardcoded SLO number.
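
The search-high saturation guard above can be sketched as a pure predicate. This is a minimal sketch: the function name, its arguments, and the halving-per-probe resolution model are assumptions, not the project's actual API.

```python
def saturates_search_high(probe_rate, probe_feasible, slo_failures,
                          search_high, max_probes):
    """Stop when the incumbent's highest measured probe is feasible,
    has no SLO failures, and sits within one binary-search resolution
    of the configured search.high (illustrative names throughout)."""
    # Assumed resolution model: each probe halves the remaining range.
    resolution = search_high / (2 ** max_probes)
    return (probe_feasible
            and slo_failures == 0
            and search_high - probe_rate <= resolution)
```

Because the predicate only reads the harness configuration and the incumbent's highest probe, no testcase-specific threshold appears in it.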

Local test command:

```
PYTHONPATH=src python3 -m unittest tests.test_core_flow -q
```

Result: passed, 75 tests.

The added coverage checks:

| Test | Coverage |
| --- | --- |
| `test_harness_stop_after_post_incumbent_validation_is_exhausted` | deterministic stop after validation exhaustion |
| `test_cli_tune_uses_harness_stop_before_llm` | `study tune` can stop without calling the LLM or launching another GPU trial |
| `test_prompt_can_disable_harness_for_ablation` | no-harness prompt removes structured harness context |
| `test_harness_stop_when_incumbent_saturates_search_high` | deterministic stop when the incumbent saturates the configured workload search high |
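
The `test_cli_tune_uses_harness_stop_before_llm` behavior can be sketched as a unittest with a stub LLM. This is a sketch only: `tune_step` and its arguments are invented stand-ins, not the project's real interface.

```python
import unittest
from unittest import mock

def tune_step(harness_should_stop, llm_propose):
    # Mirrors the documented behavior: check the deterministic stop
    # guard first; only ask the LLM for a proposal if it does not fire.
    if harness_should_stop():
        return "harness-stop"
    return llm_propose()

class HarnessStopBeforeLLM(unittest.TestCase):
    def test_stop_guard_short_circuits_llm(self):
        llm = mock.Mock()
        self.assertEqual(tune_step(lambda: True, llm), "harness-stop")
        llm.assert_not_called()

    def test_llm_called_when_guard_passes(self):
        llm = mock.Mock(return_value="proposal")
        self.assertEqual(tune_step(lambda: False, llm), "proposal")
        llm.assert_called_once()
```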

## Experiment Tracking
Completed dash0 runs:

| Variant | tmux session | Log | Study root |
| --- | --- | --- | --- |
| no-harness | `qwen30b_vllm020_noharness_out128_scale01_20260502` | `logs/qwen30b_vllm020_noharness_out128_scale01_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-noharness` |
| harness | `qwen30b_vllm020_harness_out128_scale01_20260502` | `logs/qwen30b_vllm020_harness_out128_scale01_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-harness` |
| harness | `qwen30b_vllm020_harness_highstop_gpu4_7_20260502` | `logs/qwen30b_vllm020_harness_highstop_gpu4_7_20260502.log` | `.aituner-community-vllm020/dash0-qwen30b-a3b-community-vllm020-chat-0-8k-out128-scale01-harness-highstop-gpu4-7` |

The harness run should be judged by best-so-far `request_rate_per_gpu` per tuning iteration, plus whether it stops only after validation evidence. The no-harness run should use the same trial budget so the ablation exposes whether the early-stop harness saves iterations without hiding a later better point.
## Results

Metric: best-so-far `request_rate_per_gpu` under the bounded, time-compressed replay.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 (stop) | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 | 1.0333 |

Actual per-iteration outcomes:

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8 | Iter 9 | Iter 10 | Iter 11 | Iter 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 1.0333 | 0.5167 | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail |
| harness | 1.0333 | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop | stop |
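
The best-so-far table can be derived mechanically from these actual outcomes. A sketch, under the stated assumption that `fail` and `stop` entries carry the previous best forward:

```python
def best_so_far(outcomes):
    # Running maximum over numeric outcomes; non-numeric entries
    # ("fail", "stop") carry the previous best forward unchanged.
    best, series = 0.0, []
    for outcome in outcomes:
        if isinstance(outcome, float):
            best = max(best, outcome)
        series.append(best)
    return series

no_harness = [1.0333, 0.5167] + ["fail"] * 10
harness = [1.0333] + ["stop"] * 11
```

Both rows reduce to a flat `1.0333` best-so-far series, which is why the two variants look identical in the first table despite very different per-iteration behavior.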

Interpretation:
- The best config is the default community vLLM config for this bounded replay. It reaches the configured search high: the last baseline probe at `sampling_u=0.1171875` is feasible, has pass rate `1.0`, and has no TTFT/TPOT SLO failures. With `search.high=0.125` and `max_probes=4`, this is exactly one binary-search resolution below the configured high.
- The harness stops at iter 2 without calling the LLM or launching another GPU trial. This is not a claim that the engine is globally optimal; it is a claim that the current study cannot measure an improvement until `search.high` is increased.
- No-harness spends all 12 tuning iterations anyway. Iter 2 changes to DP=2 and halves per-GPU throughput (`0.5167`). Iters 3-12 are launch failures from unguided or weakly guided proposals.
- The harness therefore reaches the best measured config in one executed GPU trial and saves 11 tuning iterations on this setup.
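
The saturation arithmetic in the first bullet can be checked directly with the values quoted there; the `search.high / 2**max_probes` resolution formula is an assumed model of the binary search, not taken from the source.

```python
# Values quoted in the interpretation above.
search_high = 0.125
max_probes = 4
last_probe = 0.1171875  # last baseline probe, sampling_u=0.1171875

# Assumed model: each probe halves the remaining search range.
resolution = search_high / 2 ** max_probes

# The probe sits exactly one binary-search resolution below the high.
assert search_high - last_probe == resolution == 0.0078125
```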

Operational note:
- The no-harness run left driver-side orphan GPU memory on GPU0/1 after repeated launch failures. An earlier pre-high-stop harness attempt left the same kind of residue on GPU2/3. The final harness run was executed on dash0 GPU4-7 via a runtime-derived spec to avoid this contamination. Its executed GPU trial used a single H20, matching the no-harness best trial's single-GPU default configuration.