# AITuner Harness Summary

## What The Harness Adds

The harness turns each LLM proposal from open-ended config search into a bottleneck-directed decision.

1. Workload profile
   - Extracts L-C-A features from the trace window:
     - L: prompt length percentiles and tail ratio.
     - C: prefix/cache reuse estimates from `hash_ids` when available.
     - A: request rate, burst ratio, and interarrival variation.
   - These features are injected into the prompt as a structured `Harnesses` section.

2. Trial diagnostics
   - Reads recent trial result JSON.
   - Summarizes feasible probes, all-infeasible probes, pass rates, request rates, latency percentiles, and failed SLO reason counts.
   - Classifies the active bottleneck as `ttft_prefill`, `decode_tpot`, `admission_or_queueing`, `launch_or_memory`, or unknown.

3. Knob-family harnesses
   - Maps bottlenecks to a small number of plausible knob families.
   - Current harness families:
     - `tensor-parallel-size`: long-prompt TTFT/prefill bottlenecks.
     - `max-num-batched-tokens`: prefill batching or fragmentation, with trust-region guards.
     - `max-num-seqs`: cache-heavy or admission-limited workloads.
     - `enable-chunked-prefill`: long-tail prompt blocking.
     - `gpu-memory-utilization`: memory headroom after topology and batching are stable.
   - Each family has `use_when`, `procedure`, `guards`, and `active_now` fields.

4. Proposal discipline and early stop
   - The prompt requires the LLM to choose at most one primary knob family unless history proves a coupled change is needed.
   - It must use adjacent legal topology choices and stay inside topology constraints.
   - It receives tested config signatures, so it should not repeat already-tried configs.
   - A deterministic harness stop can now emit `should_stop=true` before calling the LLM when completed validation evidence says another trial is not justified.

5. Baseline-first loop
   - LLM-driven `study tune` now evaluates the initial engine config first unless `--skip-baseline` is passed.
   - This aligns the loop with evaluate-then-search: the first LLM proposal sees measured bottleneck evidence rather than guessing from static config.
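
To make the workload profile in item 1 concrete, here is a minimal sketch of how L-C-A features could be computed from a trace window. Everything specific in it is an assumption for illustration, not the harness's actual schema or formulas: the record fields `prompt_len` and `arrival_s`, the function name `lca_profile`, and the definitions of tail ratio (p99 over p50), cache reuse (fraction of `hash_ids` already seen earlier in the window), and burst ratio (peak one-second request count over the mean rate).

```python
import math
import statistics
from collections import Counter


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def lca_profile(requests):
    """Summarize a non-empty trace window as L-C-A features.

    Each request is a dict with hypothetical keys 'prompt_len' (tokens),
    'arrival_s' (seconds) and, optionally, 'hash_ids' (prefix block hashes).
    """
    ordered = sorted(requests, key=lambda r: r["arrival_s"])
    lengths = [r["prompt_len"] for r in ordered]
    arrivals = [r["arrival_s"] for r in ordered]

    # L: prompt length percentiles and tail ratio (assumed here as p99 / p50).
    p50, p95, p99 = (percentile(lengths, q) for q in (50, 95, 99))
    length = {"p50": p50, "p95": p95, "p99": p99,
              "tail_ratio": p99 / p50 if p50 else None}

    # C: share of hash IDs that were already seen earlier in the window.
    seen, reused, total = set(), 0, 0
    for r in ordered:
        for h in r.get("hash_ids") or []:
            total += 1
            reused += h in seen
            seen.add(h)
    cache = {"reuse_ratio": reused / total if total else None}

    # A: mean request rate, peak-second burst ratio, interarrival variation.
    duration = arrivals[-1] - arrivals[0]
    rate = len(arrivals) / duration if duration > 0 else None
    per_second = Counter(int(t - arrivals[0]) for t in arrivals)
    peak = max(per_second.values())
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_gap = statistics.fmean(gaps) if gaps else 0.0
    arrival = {
        "request_rate": rate,
        "burst_ratio": peak / rate if rate else None,
        "interarrival_cv": statistics.pstdev(gaps) / mean_gap if mean_gap > 0 else None,
    }
    return {"L": length, "C": cache, "A": arrival}


# Made-up trace window, only to show the call shape.
window = [
    {"prompt_len": 100, "arrival_s": 0.0, "hash_ids": [1, 2]},
    {"prompt_len": 200, "arrival_s": 1.0, "hash_ids": [1, 3]},
    {"prompt_len": 4000, "arrival_s": 2.0, "hash_ids": [1, 2, 4]},
    {"prompt_len": 300, "arrival_s": 4.0},
]
print(lca_profile(window))
```

A nested dict like this maps naturally onto a structured prompt section; the `None` values mark features that cannot be estimated, for example cache reuse when `hash_ids` is absent from every request.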

## What Accelerates Convergence

The speedup comes from reducing wasted proposal families, not from changing the benchmark metric.

1. Topology-before-runtime on prefill bottlenecks
   - For long-prompt, low-cache-reuse windows, the harness activates the TP harness before speculative runtime knobs.
   - Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.

2. Guarded stop after validation, not immediately after a strong incumbent
   - If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
   - It does not stop at the first large gain. It requires post-incumbent validation trials across nearby topology/runtime families, and stops only if those trials fail to produce a feasible per-GPU improvement.
   - With the guard, `study tune` can write a `harness-stop-XXXX` proposal and exit without spending another GPU trial.

3. All-infeasible plateau detection
   - When recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT, the harness blocks repeating the same primary knob family.
   - This prevents continuing a direction such as DP-only scale-out after DP4 and DP8 plateau.
   - Plateau alone does not trigger deterministic early stop; it forces either a different justified family or a later validation/convergence stop.

4. Cleaner early-stop handling
   - Early-stopped probes no longer leave in-flight requests polluting the next probe.
   - Default behavior drains in-flight requests for comparable production runs.
   - Engine relaunch after early stop is available as opt-in for faster smoke studies, but it is not the default because it can change warm-state comparability.

5. Search-high saturation stop
   - If the incumbent's highest measured probe is feasible and is within the configured binary-search resolution of `search.high`, the harness stops before asking the LLM for another proposal. Individual request failures can be present when the aggregate probe still meets the configured pass-rate SLO.
   - This is not a model-specific threshold. It means the workload search range, not the engine config, is currently the limiting measurement bound.

6. Deterministic first probes
   - After a baseline latency bottleneck, the harness can propose the adjacent legal TP increase before asking the LLM.
   - After a TP incumbent improves per-GPU throughput, the harness keeps that topology and applies a same-topology runtime seed before trying DP/EP or broad runtime changes.
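
The two deterministic stop rules above (items 2 and 5) can be sketched as a single pre-LLM check. This is an illustration under stated assumptions, not the logic in `study tune`: the `Trial` fields, the `stop_reason` function, the reason strings, and the `min_validation_families` threshold are all hypothetical. Only the `1.8x` gain and the "within binary-search resolution of `search.high`" condition come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    family: str                   # primary knob family changed in this trial
    feasible: bool                # at least one probe met the SLO
    rate_per_gpu: float           # best feasible request_rate_per_gpu, 0.0 if none
    highest_probe_rate: float     # highest request rate that was probed
    highest_probe_feasible: bool  # whether that highest probe met the SLO


def best_feasible_index(trials):
    """Index of the incumbent: the best feasible trial, earliest on ties."""
    feasible = [i for i, t in enumerate(trials) if t.feasible]
    if not feasible:
        return None
    return max(feasible, key=lambda i: trials[i].rate_per_gpu)


def stop_reason(trials, search_high, resolution,
                gain=1.8, min_validation_families=2) -> Optional[str]:
    """Return a stop reason, or None if another proposal is still justified.

    trials is chronological and trials[0] is the baseline.
    """
    idx = best_feasible_index(trials)
    if idx is None:
        return None  # nothing feasible yet, so keep searching
    incumbent, baseline = trials[idx], trials[0]

    # Search-high saturation: the workload range is the limiting bound.
    if (incumbent.highest_probe_feasible
            and search_high - incumbent.highest_probe_rate <= resolution):
        return "search_high_saturated"

    # Guarded strong-incumbent stop: a large gain alone is not enough.
    # Later trials must have covered several families; because the incumbent
    # is the best feasible trial, none of them improved on it.
    strong = (idx > 0 and baseline.feasible
              and incumbent.rate_per_gpu >= gain * baseline.rate_per_gpu)
    validated = {t.family for t in trials[idx + 1:]}
    if strong and len(validated) >= min_validation_families:
        return "validated_strong_incumbent"
    return None
```

With made-up numbers, a baseline at `1.0` followed by a TP trial at `2.0` does not stop on its own; the check returns `"validated_strong_incumbent"` only after later trials from two other families fail to beat `2.0`. Plateau detection is deliberately absent from this sketch, since the text says a plateau blocks a family rather than triggering a stop.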

## qwen27b 0-8k Evidence

Source: `docs/qwen27b-chat-0-8k-harness-fig18.md`.

Metric: best-so-far feasible `request_rate_per_gpu`.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
| --- | ---: | ---: | ---: | ---: | ---: |
| Before harness | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 |
| After harness strict replay | 0.0350 | 0.2025 | 0.2025 stop | 0.2025 | 0.2025 |

Result:

- Before harness reached the best value at iter 4.
- After harness reached the same value at iter 2 and stopped at iter 3.
- Iterations-to-best improved from `4` to `2`, a `2x` convergence speedup on this case.
- The harness also avoided eight post-best infeasible runtime-only probes.
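
The iterations-to-best figures can be reproduced from the best-so-far series in the table. The helper name `iterations_to_best` is illustrative, and the sketch treats the `Iter 5-12` column as eight repeated values.

```python
def iterations_to_best(best_so_far):
    """1-based iteration at which a best-so-far series first hits its final best."""
    return best_so_far.index(max(best_so_far)) + 1


# Best-so-far feasible request_rate_per_gpu, iterations 1-12, from the table.
before = [0.0350, 0.0617, 0.0617, 0.2025] + [0.2025] * 8
after = [0.0350, 0.2025, 0.2025, 0.2025] + [0.2025] * 8

speedup = iterations_to_best(before) / iterations_to_best(after)
print(iterations_to_best(before), iterations_to_best(after), speedup)  # 4 2 2.0
```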

## Current Risks

- The harness is still prompt-guided for choosing the next non-stop proposal. The deterministic stop path is hard-coded in `study tune`, but proposal-family blocking is not yet enforced by a separate validator.
- Strong-incumbent stopping is intentionally biased toward fewer GPU trials after validation evidence accumulates. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.

## Qwen3-30B-A3B Community vLLM Evidence

Source: `docs/qwen30b-community-vllm020/harness-early-stop-ablation-20260502.md`.

Metric: best-so-far feasible `request_rate_per_gpu` on the bounded 0-8k chat replay with 128 output tokens and `replay_time_scale=0.1`.

Initial `search.high=0.125` result:

| Variant | Iter 1 | Iter 2 | Iter 3-12 |
| --- | ---: | ---: | ---: |
| no-harness | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 stop | 1.0333 |

Result:

- Both variants found the same best measured config: the default community vLLM launch.
- Harness stopped at iter 2 because the incumbent saturated `search.high`; no LLM proposal or GPU trial was needed after baseline.
- No-harness spent the full 12-iteration budget: iter 2 was worse per GPU, and iter 3-12 were launch failures.
- This was a measurement-ceiling result, not proof of global optimality.

High `search.high=1.0` rerun:

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8-12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness | 2.2000 | 3.2583 | 3.2583 | 3.2583 | 3.2583 | 3.3000 | 3.3500 | 3.3500 |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | 3.3000 stop | 3.3000 | 3.3000 | 3.3000 |

Result:

- Raising `search.high` showed that default vLLM was not actually optimal; the prior run was capped by workload range.
- Harness reached the same TP2/runtime config family in 4 iterations instead of 7 by making deterministic first TP and same-topology runtime proposals.
- The single-run best value differs by about 1.5% (`3.3000` vs `3.3500`) for the same config family, so this should be interpreted as faster convergence to the same region, not an exact single-run throughput win.
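
For reference, the "about 1.5%" figure is the gap between the two single-run bests relative to the higher one. A short check, with variable names chosen only for this example:

```python
# Single-run best request_rate_per_gpu values from the table above.
no_harness_best = 3.3500
harness_best = 3.3000

relative_gap = (no_harness_best - harness_best) / no_harness_best
print(f"{relative_gap:.1%}")  # 1.5%
```

Dividing by the lower value instead gives the same figure after rounding, so the choice of reference run does not change the conclusion.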