Document high search rerun
@@ -63,6 +63,10 @@ The speedup comes from reducing wasted proposal families, not from changing the
- If the incumbent's highest measured probe is feasible, has no SLO failures, and is within the configured binary-search resolution of `search.high`, the harness stops before asking the LLM for another proposal.
- This is not a model-specific threshold; it indicates that the workload search range, not the engine config, is currently the limiting measurement bound.
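The early-stop condition above can be sketched as a small predicate. This is an illustrative sketch only: the field names (`rate`, `feasible`, `slo_failures`) and the function signature are assumptions, not the harness's actual API.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    rate: float          # measured request_rate_per_gpu for this probe
    feasible: bool       # launch succeeded and served the replay
    slo_failures: int    # number of SLO violations observed

def should_stop_early(incumbent_best: Probe, search_high: float,
                      resolution: float) -> bool:
    """Stop before asking the LLM for another proposal when the
    incumbent's best probe already saturates the workload range."""
    return (incumbent_best.feasible
            and incumbent_best.slo_failures == 0
            and search_high - incumbent_best.rate <= resolution)

# A probe within one resolution step of search.high triggers the stop.
print(should_stop_early(Probe(0.124, True, 0), 0.125, 0.002))  # True
```

The key design point is that the check compares against `search.high`, not any engine limit, which is why a saturated result flags the search range as the bottleneck.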
6. Deterministic first probes
- After the baseline run reports a latency bottleneck, the harness can propose the adjacent legal TP increase before asking the LLM.
- After a TP incumbent improves per-GPU throughput, the harness keeps that topology and applies a same-topology runtime seed before trying DP/EP or broad runtime changes.
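The two deterministic first-probe rules can be sketched as a dispatch that fires before any LLM call. The outcome labels, config fields, and `LEGAL_TP` ladder below are assumptions for illustration, not the harness's real schema.

```python
LEGAL_TP = [1, 2, 4, 8]  # assumed legal tensor-parallel sizes

def deterministic_proposal(incumbent: dict, last_outcome: str):
    """Return a config proposal without consulting the LLM, or None
    to fall through to an LLM proposal."""
    if last_outcome == "latency_bottleneck":
        # Rule 1: propose the adjacent legal TP increase.
        idx = LEGAL_TP.index(incumbent["tp"])
        if idx + 1 < len(LEGAL_TP):
            return {**incumbent, "tp": LEGAL_TP[idx + 1]}
    if last_outcome == "tp_improved_per_gpu_throughput":
        # Rule 2: keep the incumbent topology and seed a same-topology
        # runtime tweak before trying DP/EP or broad runtime changes.
        return {**incumbent, "runtime_seed": True}
    return None

print(deterministic_proposal({"tp": 2}, "latency_bottleneck"))
# {'tp': 2, 'tp': 4} collapses to {'tp': 4}
```

Note that Rule 1 returns `None` at the top of the TP ladder, so the LLM is still consulted when no adjacent legal increase exists.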
## qwen27b 0-8k Evidence
Source: `docs/qwen27b-chat-0-8k-harness-fig18.md`.
@@ -93,6 +97,8 @@ Source: `docs/qwen30b-community-vllm020/harness-early-stop-ablation-20260502.md`
Metric: best-so-far feasible `request_rate_per_gpu` on the bounded 0-8k chat replay with 128 output tokens and `replay_time_scale=0.1`.
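The bounded rate search this metric comes from can be sketched as a bisection whose answer can never exceed `search.high`. This is a minimal sketch under assumptions: `is_feasible` stands in for a real replay trial, and the `low` and `resolution` defaults are illustrative.

```python
def max_feasible_rate(is_feasible, low=0.0, high=0.125, resolution=0.002):
    """Bisect for the highest feasible request_rate_per_gpu.
    Because the result is clamped by `high`, a too-small search.high
    caps the measurement regardless of engine capacity."""
    best = None
    while high - low > resolution:
        mid = (low + high) / 2
        if is_feasible(mid):
            best, low = mid, mid   # feasible: search higher
        else:
            high = mid             # infeasible: search lower
    return best

# With a true capacity far above the range, the search saturates
# just under search.high instead of measuring real capacity.
print(max_feasible_rate(lambda r: r <= 3.35))  # just under 0.125
```

This is the mechanism behind the measurement-ceiling result below: when every probe in the range is feasible, the reported best converges to `search.high` minus at most one resolution step.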
Initial `search.high=0.125` result:
| Variant | Iter 1 | Iter 2 | Iter 3-12 |
| --- | ---: | ---: | ---: |
| no-harness | 1.0333 | 1.0333 | 1.0333 |
@@ -103,3 +109,17 @@ Result:
- Both variants found the same best measured config: the default community vLLM launch.
- The harness stopped at iter 2 because the incumbent saturated `search.high`; no LLM proposal or GPU trial was needed after the baseline.
- The no-harness variant spent the full 12-iteration budget: iter 2 was worse per GPU, and iters 3-12 were launch failures.
- This was a measurement-ceiling result, not proof of global optimality.
High `search.high=1.0` rerun:
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8-12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 2.2000 | 3.2583 | 3.2583 | 3.2583 | 3.2583 | 3.3000 | 3.3500 | 3.3500 |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | 3.3000 (stop) | 3.3000 | 3.3000 | 3.3000 |
Result:
- Raising `search.high` showed that the default vLLM launch was not actually optimal; the prior run was capped by the workload range.
- The harness reached the same TP2/runtime config family in 4 iterations instead of 7 by making its deterministic first TP proposal and same-topology runtime proposals.
- The single-run best value differs by about 1.5% (`3.3000` vs `3.3500`) for the same config family, so this should be interpreted as faster convergence to the same region, not an exact single-run throughput win.
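The ~1.5% figure quoted above checks out with a one-line relative-difference computation over the two table bests:

```python
# Best single-run values from the search.high=1.0 table above.
best_no_harness = 3.3500   # no-harness, iters 7-12
best_harness = 3.3000      # harness-guided-v2 at its stop point

# Relative gap of the no-harness best over the harness best.
gap = (best_no_harness - best_harness) / best_harness
print(f"{gap:.1%}")  # 1.5%
```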