AITuner Harness Summary
What The Harness Adds
The harness turns each LLM proposal from open-ended config search into a bottleneck-directed decision.
- Workload profile
  - Extracts L-C-A features from the trace window:
    - L: prompt length percentiles and tail ratio.
    - C: prefix/cache reuse estimates from `hash_ids` when available.
    - A: request rate, burst ratio, and interarrival variation.
  - These features are injected into the prompt as a structured `Harnesses` section.
- Trial diagnostics
  - Reads recent trial result JSON.
  - Summarizes feasible probes, all-infeasible probes, pass rates, request rates, latency percentiles, and failed SLO reason counts.
  - Classifies the active bottleneck as `ttft_prefill`, `decode_tpot`, `admission_or_queueing`, `launch_or_memory`, or unknown.
- Knob-family harnesses
  - Maps bottlenecks to a small number of plausible knob families.
  - Current harness families:
    - `tensor-parallel-size`: long-prompt TTFT/prefill bottlenecks.
    - `max-num-batched-tokens`: prefill batching or fragmentation, with trust-region guards.
    - `max-num-seqs`: cache-heavy or admission-limited workloads.
    - `enable-chunked-prefill`: long-tail prompt blocking.
    - `gpu-memory-utilization`: memory headroom after topology and batching are stable.
  - Each family has `use_when`, `procedure`, `guards`, and `active_now` fields.
- Proposal discipline and early stop
  - The prompt requires the LLM to choose at most one primary knob family unless history proves a coupled change is needed.
  - It must use adjacent legal topology choices and stay inside topology constraints.
  - It receives tested config signatures, so it should not repeat already-tried configs.
  - A deterministic harness stop can now emit `should_stop=true` before calling the LLM when completed validation evidence says another trial is not justified.
- Baseline-first loop
  - LLM-driven `study tune` now evaluates the initial engine config first unless `--skip-baseline` is passed.
  - This aligns the loop with evaluate-then-search: the first LLM proposal sees measured bottleneck evidence rather than guessing from static config.
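To make the workload-profile item in the list above concrete, here is a minimal Python sketch of L-C-A feature extraction. It is not the harness's actual code. The `Request` type, the function names, and the specific definitions (tail ratio as p95/p50, reuse as the share of `hash_ids` blocks already seen earlier in the window, burst ratio as peak one-second rate over mean rate, interarrival variation as the coefficient of variation) are all assumptions chosen for illustration.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:  # hypothetical trace record, not the harness's schema
    arrival_s: float                          # arrival time, seconds
    prompt_tokens: int                        # prompt length, tokens
    hash_ids: Optional[Sequence[int]] = None  # prefix block hashes, if present


def percentile(sorted_vals, q):
    """Nearest-rank percentile of a non-empty ascending list, q in (0, 100]."""
    return sorted_vals[max(0, math.ceil(q / 100 * len(sorted_vals)) - 1)]


def lca_features(requests, burst_window_s=1.0):
    """Compute illustrative L-C-A features for one trace window."""
    if len(requests) < 2:
        raise ValueError("need at least two requests")
    reqs = sorted(requests, key=lambda r: r.arrival_s)
    t0 = reqs[0].arrival_s
    duration = reqs[-1].arrival_s - t0
    if duration <= 0:
        raise ValueError("window must span a positive duration")

    # L: prompt length percentiles and tail ratio (assumed: p95 / p50).
    lengths = sorted(r.prompt_tokens for r in reqs)
    p50, p95 = percentile(lengths, 50), percentile(lengths, 95)

    # C: share of hash blocks that an earlier request already contained.
    seen, hits, total = set(), 0, 0
    for r in reqs:
        blocks = r.hash_ids or ()
        total += len(blocks)
        hits += sum(1 for h in blocks if h in seen)
        seen.update(blocks)
    reuse = hits / total if total else None  # None when hash_ids are absent

    # A: mean rate, peak-window rate over mean rate, interarrival CV.
    buckets = {}
    for r in reqs:
        key = int((r.arrival_s - t0) // burst_window_s)
        buckets[key] = buckets.get(key, 0) + 1
    rate = len(reqs) / duration
    peak_rate = max(buckets.values()) / burst_window_s
    gaps = [b.arrival_s - a.arrival_s for a, b in zip(reqs, reqs[1:])]
    cv = statistics.pstdev(gaps) / statistics.fmean(gaps)

    return {
        "L": {"p50": p50, "p95": p95, "tail_ratio": p95 / p50 if p50 else None},
        "C": {"reuse_fraction": reuse},
        "A": {"request_rate": rate, "burst_ratio": peak_rate / rate,
              "interarrival_cv": cv},
    }
```

A dictionary of this shape could then be serialized into the structured prompt section. The real harness may use different statistics or field names.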
What Accelerates Convergence
The speedup comes from reducing wasted proposal families, not from changing the benchmark metric.
- Topology-before-runtime on prefill bottlenecks
  - For long-prompt, low-cache-reuse windows, the harness activates the TP harness before speculative runtime knobs.
  - Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.
- Guarded stop after validation, not immediately after a strong incumbent
  - If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
  - It does not stop at the first large gain. It requires post-incumbent validation trials across nearby topology/runtime families, and stops only if those trials fail to produce a feasible per-GPU improvement.
  - With the guard, `study tune` can write a `harness-stop-XXXX` proposal and exit without spending another GPU trial.
- All-infeasible plateau detection
  - When recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT, the harness blocks repeating the same primary knob family.
  - This prevents continuing a direction such as DP-only scale-out after DP4 and DP8 plateau.
  - Plateau alone does not trigger deterministic early stop; it forces either a different justified family or a later validation/convergence stop.
- Cleaner early-stop handling
  - Early-stopped probes no longer leave in-flight requests polluting the next probe.
  - Default behavior drains in-flight requests for comparable production runs.
  - Engine relaunch after early stop is available as opt-in for faster smoke studies, but it is not the default because it can change warm-state comparability.
- Search-high saturation stop
  - If the incumbent's highest measured probe is feasible, has no SLO failures, and is within the configured binary-search resolution of `search.high`, the harness stops before asking the LLM for another proposal.
  - This is not a model-specific threshold. It means the workload search range, not the engine config, is currently the limiting measurement bound.
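The search-high saturation rule in the list above reduces to a three-part predicate. The sketch below is illustrative only: the `Probe` record, the function name, and the reading of "within resolution" as an absolute difference in request rate are assumptions, not the harness's confirmed implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Probe:  # hypothetical per-probe summary for the incumbent config
    request_rate: float  # offered request rate for this probe
    feasible: bool       # True if the probe met the SLO pass criteria
    slo_failures: int    # count of requests that violated an SLO


def search_high_saturated(probes: Sequence[Probe],
                          search_high: float,
                          resolution: float) -> bool:
    """Return True when the incumbent's highest probe saturates the range.

    All three conditions from the text must hold for the highest-rate probe:
    it is feasible, it has no SLO failures, and it lies within `resolution`
    of `search_high` (assumed here to be an absolute distance).
    """
    if not probes:
        return False
    top = max(probes, key=lambda p: p.request_rate)
    near_bound = abs(search_high - top.request_rate) <= resolution
    return top.feasible and top.slo_failures == 0 and near_bound
```

When this predicate holds, raising `search.high` is the only way to learn more, which is why another engine-config proposal would not be informative.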
qwen27b 0-8k Evidence
Source: docs/qwen27b-chat-0-8k-harness-fig18.md.
Metric: best-so-far feasible request_rate_per_gpu.
| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
|---|---|---|---|---|---|
| Before harness | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 |
| After harness strict replay | 0.0350 | 0.2025 | 0.2025 stop | 0.2025 | 0.2025 |
Result:
- Before harness reached the best value at iter 4.
- After harness reached the same value at iter 2 and stopped at iter 3.
- Iterations-to-best improved from `4` to `2`, a `2x` convergence speedup on this case.
- The harness also avoided eight post-best infeasible runtime-only probes.
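The iterations-to-best figures can be checked directly from the table. The short Python sketch below uses the table's best-so-far values, with the "Iter 5-12" column expanded to eight entries. The helper name `iterations_to_best` is made up for this example.

```python
def iterations_to_best(best_so_far):
    """Return the 1-based iteration at which the final best value first appears."""
    best = max(best_so_far)
    return best_so_far.index(best) + 1


# Best-so-far feasible request_rate_per_gpu, iterations 1 through 12.
before_harness = [0.0350, 0.0617, 0.0617, 0.2025] + [0.2025] * 8
after_harness = [0.0350, 0.2025, 0.2025, 0.2025] + [0.2025] * 8

before_iters = iterations_to_best(before_harness)  # 4
after_iters = iterations_to_best(after_harness)    # 2
speedup = before_iters / after_iters               # 2.0

# Iterations after the best value was first reached in the original run.
post_best_probes = len(before_harness) - before_iters  # 8
```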
Current Risks
- The harness is still prompt-guided for choosing the next non-stop proposal. The deterministic stop path is hard-coded in `study tune`, but proposal-family blocking is not yet enforced by a separate validator.
- Strong-incumbent stopping is intentionally biased toward fewer GPU trials after validation evidence accumulates. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
Qwen3-30B-A3B Community vLLM Evidence
Source: docs/qwen30b-community-vllm020/harness-early-stop-ablation-20260502.md.
Metric: best-so-far feasible request_rate_per_gpu on the bounded 0-8k chat replay with 128 output tokens and replay_time_scale=0.1.
| Variant | Iter 1 | Iter 2 | Iter 3-12 |
|---|---|---|---|
| no-harness | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 stop | 1.0333 |
Result:
- Both variants found the same best measured config: the default community vLLM launch.
- Harness stopped at iter 2 because the incumbent saturated `search.high`; no LLM proposal or GPU trial was needed after baseline.
- No-harness spent the full 12-iteration budget: iter 2 was worse per GPU, and iter 3-12 were launch failures.