# AITuner Harness Summary

## What The Harness Adds

The harness turns each LLM proposal from open-ended config search into a bottleneck-directed decision.

1. Workload profile
   - Extracts L-C-A features from the trace window:
     - L: prompt length percentiles and tail ratio.
     - C: prefix/cache reuse estimates from `hash_ids` when available.
     - A: request rate, burst ratio, and interarrival variation.
   - These features are injected into the prompt as a structured `Harnesses` section.

2. Trial diagnostics
   - Reads recent trial result JSON.
   - Summarizes feasible probes, all-infeasible probes, pass rates, request rates, latency percentiles, and failed SLO reason counts.
   - Classifies the active bottleneck as `ttft_prefill`, `decode_tpot`, `admission_or_queueing`, `launch_or_memory`, or unknown.

3. Knob-family harnesses
   - Maps bottlenecks to a small number of plausible knob families.
   - Current harness families:
     - `tensor-parallel-size`: long-prompt TTFT/prefill bottlenecks.
     - `max-num-batched-tokens`: prefill batching or fragmentation, with trust-region guards.
     - `max-num-seqs`: cache-heavy or admission-limited workloads.
     - `enable-chunked-prefill`: long-tail prompt blocking.
     - `gpu-memory-utilization`: memory headroom after topology and batching are stable.
   - Each family has `use_when`, `procedure`, `guards`, and `active_now` fields.

4. Proposal discipline and early stop
   - The prompt requires the LLM to choose at most one primary knob family unless history proves a coupled change is needed.
   - It must use adjacent legal topology choices and stay inside topology constraints.
   - It receives tested config signatures, so it should not repeat already-tried configs.
   - A deterministic harness stop can now emit `should_stop=true` before calling the LLM when completed validation evidence says another trial is not justified.

5. Baseline-first loop
   - LLM-driven `study tune` now evaluates the initial engine config first unless `--skip-baseline` is passed.
   - This aligns the loop with evaluate-then-search: the first LLM proposal sees measured bottleneck evidence rather than guessing from static config.

## What Accelerates Convergence

The speedup comes from reducing wasted proposal families, not from changing the benchmark metric.

1. Topology-before-runtime on prefill bottlenecks
   - For long-prompt, low-cache-reuse windows, the harness activates the TP harness before speculative runtime knobs.
   - Example: qwen27b 0-8k chat reached `TP=2, DP=1` at iter 2 under harness replay, while the original run spent iter 2 on `DP=2` and iter 3 on `DP=4`.

2. Guarded stop after validation, not immediately after a strong incumbent
   - If the newest trial is the incumbent and improves per-GPU throughput by at least `1.8x` over baseline, the harness requires direct evidence before trying runtime-only tweaks.
   - It does not stop at the first large gain. It requires post-incumbent validation trials across nearby topology/runtime families, and stops only if those trials fail to produce a feasible per-GPU improvement.
   - With the guard, `study tune` can write a `harness-stop-XXXX` proposal and exit without spending another GPU trial.

3. All-infeasible plateau detection
   - When recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT, the harness blocks repeating the same primary knob family.
   - This prevents continuing a direction such as DP-only scale-out after DP4 and DP8 plateau.
   - Plateau alone does not trigger deterministic early stop; it forces either a different justified family or a later validation/convergence stop.

4. Cleaner early-stop handling
   - Early-stopped probes no longer leave in-flight requests polluting the next probe.
   - Default behavior drains in-flight requests for comparable production runs.
   - Engine relaunch after early stop is available as opt-in for faster smoke studies, but it is not the default because it can change warm-state comparability.

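
The guarded stop and the plateau block described above can be sketched roughly as follows. This is a minimal illustration, not the code in `study tune`: the `Trial` record, its field names, the `min_families`, `window`, and `eps` parameters, and the `"data-parallel-size"` family label are all assumptions made for the example. It also assumes every trial in the history was measured at the same sampling threshold. Only the `1.8x` gain ratio comes from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

STRONG_GAIN = 1.8  # per-GPU gain over baseline named in the text


@dataclass(frozen=True)
class Trial:
    """Hypothetical trial record; field names are illustrative only."""
    family: str          # primary knob family changed by this trial
    feasible: bool       # True if at least one probe met the SLOs
    rate_per_gpu: float  # best feasible request_rate_per_gpu, 0.0 if none
    pass_rate: float     # fraction of requests meeting the SLOs
    p95_ttft_s: float    # p95 time-to-first-token, in seconds


def should_stop(baseline_rate: float, trials: Sequence[Trial],
                min_families: int = 2) -> bool:
    """Stop only after a strong incumbent survives validation trials."""
    feasible = [(i, t) for i, t in enumerate(trials) if t.feasible]
    if baseline_rate <= 0 or not feasible:
        return False
    # Incumbent: earliest trial holding the best feasible per-GPU rate.
    best_idx, incumbent = max(
        feasible, key=lambda it: (it[1].rate_per_gpu, -it[0]))
    if incumbent.rate_per_gpu < STRONG_GAIN * baseline_rate:
        return False
    # Trials after the incumbent act as validation. None of them beat it
    # (otherwise they would be the incumbent), so the remaining check is
    # that they covered enough distinct knob families (an assumed rule).
    validated = {t.family for t in trials[best_idx + 1:]}
    return len(validated) >= min_families


def blocked_family(trials: Sequence[Trial], window: int = 2,
                   eps: float = 0.01) -> Optional[str]:
    """Return the knob family to block if recent infeasible trials plateaued."""
    recent = trials[-window:]
    if len(recent) < window or any(t.feasible for t in recent):
        return None
    if len({t.family for t in recent}) != 1:
        return None
    first, last = recent[0], recent[-1]
    improved = (last.pass_rate > first.pass_rate + eps
                or last.p95_ttft_s < first.p95_ttft_s * (1 - eps))
    return None if improved else last.family


# Illustrative numbers only, not measurements from any run.
history = [
    Trial("tensor-parallel-size", True, 2.0, 1.0, 0.8),  # strong incumbent
    Trial("max-num-seqs", False, 0.0, 0.6, 3.0),          # validation trial
]
early = should_stop(1.0, history)   # one family validated so far
history.append(Trial("max-num-batched-tokens", True, 1.5, 1.0, 0.9))
late = should_stop(1.0, history)    # two families validated, none better
```

Keeping the two checks separate mirrors the text: a plateau only blocks a knob family, while the stop decision depends on validation evidence gathered after the incumbent.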
## qwen27b 0-8k Evidence

Source: `docs/qwen27b-chat-0-8k-harness-fig18.md`.

Metric: best-so-far feasible `request_rate_per_gpu`.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
| --- | ---: | ---: | ---: | ---: | ---: |
| Before harness | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 |
| After harness strict replay | 0.0350 | 0.2025 | 0.2025 stop | 0.2025 | 0.2025 |

Result:

- Before harness reached the best value at iter 4.
- After harness reached the same value at iter 2 and stopped at iter 3.
- Iterations-to-best improved from `4` to `2`, a `2x` convergence speedup on this case.
- The harness also avoided eight post-best infeasible runtime-only probes.

## Current Risks

- The harness is still prompt-guided for choosing the next non-stop proposal. The deterministic stop path is hard-coded in `study tune`, but proposal-family blocking is not yet enforced by a separate validator.
- Strong-incumbent stopping is intentionally biased toward fewer GPU trials after validation evidence accumulates. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
- Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
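
For reference, the iterations-to-best figures in the qwen27b 0-8k evidence table can be recomputed from the best-so-far series with a few lines of Python. The helper name `iterations_to_best` is an assumption for this sketch. The values are copied from the table, with the "Iter 5-12" column expanded to eight equal entries.

```python
from typing import Sequence


def iterations_to_best(best_so_far: Sequence[float]) -> int:
    """Return the 1-based iteration where the final best value first appears."""
    series = list(best_so_far)
    if not series:
        raise ValueError("best_so_far must not be empty")
    return series.index(max(series)) + 1


# Best-so-far feasible request_rate_per_gpu, iterations 1 through 12.
before = [0.0350, 0.0617, 0.0617, 0.2025] + [0.2025] * 8
after = [0.0350, 0.2025, 0.2025, 0.2025] + [0.2025] * 8

before_iters = iterations_to_best(before)  # 4
after_iters = iterations_to_best(after)    # 2
speedup = before_iters / after_iters       # 2.0
```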