AITuner Harness Summary

What The Harness Adds

The harness turns each LLM proposal from open-ended config search into a bottleneck-directed decision.

Workload profile
- Extracts L-C-A features from the trace window:
  - L: prompt length percentiles and tail ratio.
  - C: prefix/cache reuse estimates from hash_ids when available.
  - A: request rate, burst ratio, and interarrival variation.
- These features are injected into the prompt as a structured Harnesses section.
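
The L-C-A extraction above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the request schema (`arrival_s`, `prompt_tokens`, `hash_ids`), the function names, and the exact definitions of tail ratio (p95/p50), reuse estimate (share of hash blocks already seen in the window), and burst ratio (peak bucket count over mean bucket count) are all assumptions.

```python
import math
import statistics
from collections import Counter


def nearest_rank(sorted_vals, q):
    """Nearest-rank percentile of a pre-sorted, non-empty list; q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def profile_window(requests, bucket_s=1.0):
    """Summarize one trace window as L-C-A features (hypothetical schema)."""
    if not requests:
        raise ValueError("empty trace window")
    ordered = sorted(requests, key=lambda r: r["arrival_s"])

    # L: prompt length percentiles and tail ratio (assumed to be p95 / p50).
    lengths = sorted(r["prompt_tokens"] for r in ordered)
    p50, p95 = nearest_rank(lengths, 50), nearest_rank(lengths, 95)

    # C: share of hash blocks that already appeared earlier in the window.
    seen, total, reused = set(), 0, 0
    for r in ordered:
        for h in r.get("hash_ids") or []:
            total += 1
            reused += h in seen
            seen.add(h)

    # A: request rate, burst ratio, and interarrival coefficient of variation.
    arrivals = [r["arrival_s"] for r in ordered]
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    duration = arrivals[-1] - arrivals[0]
    mean_gap = statistics.fmean(gaps) if gaps else 0.0
    buckets = Counter(int((t - arrivals[0]) // bucket_s) for t in arrivals)
    n_buckets = int(duration // bucket_s) + 1
    return {
        "L": {"p50": p50, "p95": p95, "tail_ratio": p95 / p50 if p50 else None},
        "C": {"reuse_estimate": reused / total if total else None},
        "A": {
            "request_rate": len(gaps) / duration if duration > 0 else None,
            "burst_ratio": max(buckets.values()) / (len(arrivals) / n_buckets),
            "interarrival_cv": statistics.pstdev(gaps) / mean_gap if mean_gap > 0 else None,
        },
    }
```

Returning `None` when `hash_ids` is absent keeps "no cache evidence" distinct from "zero reuse", which matches the "when available" qualifier above.
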
Trial diagnostics
- Reads recent trial result JSON.
- Summarizes feasible probes, all-infeasible probes, pass rates, request rates, latency percentiles, and failed SLO reason counts.
- Classifies the active bottleneck as ttft_prefill, decode_tpot, admission_or_queueing, launch_or_memory, or unknown.
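
One way to picture the classification step is a vote over failed SLO reason counts. The sketch below is an assumption about how such a rule could look: the reason keys (`ttft`, `tpot`, `queue_wait`, `launch_error`, `oom`) and the `min_share` threshold are hypothetical, and only the five bottleneck labels come from this document.

```python
from collections import Counter

# Hypothetical mapping from failed-SLO reason keys to the bottleneck labels.
REASON_TO_BOTTLENECK = {
    "ttft": "ttft_prefill",
    "tpot": "decode_tpot",
    "queue_wait": "admission_or_queueing",
    "launch_error": "launch_or_memory",
    "oom": "launch_or_memory",
}


def classify_bottleneck(failed_reason_counts, min_share=0.5):
    """Pick the label that explains at least min_share of failures, else 'unknown'."""
    votes = Counter()
    for reason, count in failed_reason_counts.items():
        label = REASON_TO_BOTTLENECK.get(reason)
        if label is not None:
            votes[label] += count
    total = sum(failed_reason_counts.values())
    if not votes or total == 0:
        return "unknown"
    label, count = votes.most_common(1)[0]
    return label if count / total >= min_share else "unknown"
```

Falling back to `unknown` when no reason dominates avoids steering the next proposal toward a knob family on weak evidence.
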
Knob-family harnesses
- Maps bottlenecks to a small number of plausible knob families.
- Current harness families:
  - tensor-parallel-size: long-prompt TTFT/prefill bottlenecks.
  - max-num-batched-tokens: prefill batching or fragmentation, with trust-region guards.
  - max-num-seqs: cache-heavy or admission-limited workloads.
  - enable-chunked-prefill: long-tail prompt blocking.
  - gpu-memory-utilization: memory headroom after topology and batching are stable.
- Each family has use_when, procedure, guards, and active_now fields.
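
As an illustration of the four fields, a harness family could be held in a small record and rendered into the prompt. Only the field names and knob names come from this document; the `KnobHarness` class, the `render_harnesses` helper, the output layout, and the example field values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KnobHarness:
    knob: str                      # e.g. "tensor-parallel-size"
    use_when: str                  # bottleneck condition that makes the knob plausible
    procedure: str                 # how to move the knob
    guards: list = field(default_factory=list)  # conditions that veto the move
    active_now: bool = False       # set from the current bottleneck classification


def render_harnesses(harnesses):
    """Render harness families as plain prompt text, active families first."""
    lines = []
    for h in sorted(harnesses, key=lambda h: not h.active_now):
        state = "ACTIVE" if h.active_now else "inactive"
        lines.append(f"[{state}] {h.knob}")
        lines.append(f"  use_when: {h.use_when}")
        lines.append(f"  procedure: {h.procedure}")
        for guard in h.guards:
            lines.append(f"  guard: {guard}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_harnesses([
        KnobHarness("gpu-memory-utilization",
                    "memory headroom after topology and batching are stable",
                    "adjust in small steps"),
        KnobHarness("tensor-parallel-size",
                    "long-prompt TTFT/prefill bottleneck",
                    "move to an adjacent legal value",
                    guards=["stay inside topology constraints"],
                    active_now=True),
    ]))
```
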
Proposal discipline
- The prompt requires the LLM to choose at most one primary knob family unless history proves a coupled change is needed.
- It must use adjacent legal topology choices and stay inside topology constraints.
- It receives tested config signatures, so it should not repeat already-tried configs.
- It can return should_stop=true when no adjacent harness-guided probe is justified.
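
The proposal rules above can also be checked mechanically after the LLM responds. The sketch below assumes a proposal is a dict with `should_stop` and `config` keys and treats each changed knob as its own family; both are simplifications, and the exception for history-proven coupled changes is not modeled.

```python
import json


def config_signature(config):
    """Canonical string for a config so that key order does not matter."""
    return json.dumps(config, sort_keys=True, separators=(",", ":"))


def check_proposal(proposal, incumbent, tested_signatures):
    """Return a list of discipline violations; an empty list means the proposal passes."""
    if proposal.get("should_stop"):
        return []
    config = proposal["config"]
    problems = []
    if config_signature(config) in tested_signatures:
        problems.append("repeats an already-tested config")
    changed = sorted(
        k for k in set(config) | set(incumbent) if config.get(k) != incumbent.get(k)
    )
    if len(changed) > 1:
        problems.append(f"changes more than one knob: {changed}")
    return problems
```

For example, with an incumbent of `{"tensor-parallel-size": 1, "max-num-seqs": 64}`, a proposal that changes only `tensor-parallel-size` to 2 passes, while one that also changes `max-num-seqs` is flagged.
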
Baseline-first loop
- The LLM-driven study tune loop now evaluates the initial engine config first unless --skip-baseline is passed.
- This aligns the loop with evaluate-then-search: the first LLM proposal sees measured bottleneck evidence rather than guessing from static config.
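
The evaluate-then-search order can be sketched as a loop that measures the baseline before asking for any proposal. The `tune` function and its `evaluate` and `propose` callables are hypothetical stand-ins; only the baseline-first behavior, the skip option, and should_stop come from this document.

```python
def tune(initial_config, evaluate, propose, max_iters=12, skip_baseline=False):
    """Run a baseline-first tuning loop and return the (config, result) history."""
    history = []
    if not skip_baseline:
        # Measure the initial engine config so the first proposal sees real evidence.
        history.append((initial_config, evaluate(initial_config)))
    while len(history) < max_iters:
        proposal = propose(history)
        if proposal.get("should_stop"):
            break
        config = proposal["config"]
        history.append((config, evaluate(config)))
    return history
```
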

What Accelerates Convergence

The speedup comes from reducing wasted proposal families, not from changing the benchmark metric.

Topology-before-runtime on prefill bottlenecks
- For long-prompt, low-cache-reuse windows, the harness activates the tensor-parallel-size (TP) harness before speculative runtime knobs.
- Example: qwen27b 0-8k chat reached TP=2, DP=1 at iter 2 under harness replay, while the original run spent iter 2 on DP=2 and iter 3 on DP=4.
Guarded stop after a strong incumbent
- If the newest trial is the incumbent and improves per-GPU throughput by at least 1.8x over baseline, the harness requires direct evidence before trying runtime-only tweaks.
- Without that guard, the LLM still proposed weak max-num-batched-tokens (MBT) trials after finding the best qwen27b config.
- With the guard, it emits should_stop=true.
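
The strong-incumbent condition reduces to two comparisons. In this sketch the trial schema (`feasible`, `rate_per_gpu`) and the function name are assumptions; the 1.8x threshold is the one stated above, and the first trial is taken to be the baseline.

```python
def strong_incumbent_guard(trials, min_gain=1.8):
    """True when the newest trial is the incumbent and beats baseline by min_gain.

    trials: dicts in iteration order with 'feasible' and 'rate_per_gpu';
    trials[0] is assumed to be the baseline evaluation.
    """
    if len(trials) < 2:
        return False
    baseline, newest = trials[0], trials[-1]
    if not (baseline["feasible"] and newest["feasible"]) or baseline["rate_per_gpu"] <= 0:
        return False
    best = max(t["rate_per_gpu"] for t in trials if t["feasible"])
    is_incumbent = newest["rate_per_gpu"] >= best
    return is_incumbent and newest["rate_per_gpu"] >= min_gain * baseline["rate_per_gpu"]
```

When the guard returns True, runtime-only tweaks would need direct evidence before being tried. With the rates reported in the qwen27b evidence in this document, a newest trial at 0.2025 against a 0.0350 baseline fires the guard, while 0.0617 (about 1.76x) does not.
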
All-infeasible plateau detection
- When recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT, the harness blocks repeating the same primary knob family.
- This prevents continuing a direction such as DP-only scale-out after DP4 and DP8 plateau.
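
A minimal sketch of the plateau rule, assuming each trial records `feasible`, `sampling_threshold`, `primary_family`, `pass_rate`, and `p95_ttft_s`. The field names, the two-trial window, and both improvement thresholds are hypothetical choices for illustration.

```python
def blocked_family(trials, window=2, min_pass_gain=0.02, min_ttft_gain=0.02):
    """Return the knob family to block, or None if no plateau is detected."""
    if window < 2 or len(trials) < window:
        return None
    recent = trials[-window:]
    if any(t["feasible"] for t in recent):
        return None  # the rule only covers all-infeasible probes
    if len({t["sampling_threshold"] for t in recent}) != 1:
        return None  # pass rates are only comparable at the same threshold
    if len({t["primary_family"] for t in recent}) != 1:
        return None  # assumed: the plateau must belong to a single family
    first, last = recent[0], recent[-1]
    if first["p95_ttft_s"] <= 0:
        return None
    pass_gain = last["pass_rate"] - first["pass_rate"]  # absolute gain
    ttft_gain = (first["p95_ttft_s"] - last["p95_ttft_s"]) / first["p95_ttft_s"]  # relative gain
    if pass_gain < min_pass_gain and ttft_gain < min_ttft_gain:
        return last["primary_family"]
    return None
```
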
Cleaner early-stop handling
- Early-stopped probes no longer leave in-flight requests polluting the next probe.
- Default behavior drains in-flight requests for comparable production runs.
- Engine relaunch after early stop is available as opt-in for faster smoke studies, but it is not the default because it can change warm-state comparability.

qwen27b 0-8k Evidence

Source: docs/qwen27b-chat-0-8k-harness-fig18.md.

Metric: best-so-far feasible request_rate_per_gpu.

Variant	Iter 1	Iter 2	Iter 3	Iter 4	Iter 5-12
Before harness	0.0350	0.0617	0.0617	0.2025	0.2025
After harness strict replay	0.0350	0.2025	0.2025 stop	0.2025	0.2025

Result:

- Before the harness, the run reached its best value at iter 4.
- After the harness, the strict replay reached the same value at iter 2 and stopped at iter 3.
- Iterations-to-best improved from 4 to 2, a 2x convergence speedup on this case.
- The harness also avoided eight post-best infeasible runtime-only probes.
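
The iterations-to-best figures can be recomputed from the table rows. The helper name is illustrative; the series are the best-so-far values from the table, with the Iter 5-12 column expanded to eight entries.

```python
def iterations_to_best(best_so_far):
    """1-based index of the first iteration that reaches the final best value."""
    return best_so_far.index(max(best_so_far)) + 1


before = [0.0350, 0.0617, 0.0617, 0.2025] + [0.2025] * 8
after = [0.0350, 0.2025, 0.2025, 0.2025] + [0.2025] * 8

print(iterations_to_best(before))  # 4
print(iterations_to_best(after))   # 2
print(iterations_to_best(before) / iterations_to_best(after))  # 2.0
```
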

Current Risks

The harness is prompt-guided, not a hard verifier for every rule. If future LLM outputs ignore a fired guard, proposal validation should reject the blocked family explicitly.
Strong-incumbent stopping is intentionally biased toward fewer GPU trials once a large gain is already reached. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.
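
The first risk suggests rejecting blocked families in validation rather than relying on the prompt alone. A minimal sketch of such a hard check follows, assuming the proposal carries a `primary_family` field; that field, the function name, and the exception type are hypothetical.

```python
class BlockedFamilyError(ValueError):
    """Raised when a proposal uses a knob family that a fired guard has blocked."""


def validate_proposal(proposal, blocked_families):
    """Return the proposal unchanged, or raise if it ignores a fired guard."""
    if proposal.get("should_stop"):
        return proposal
    family = proposal.get("primary_family")
    if family in blocked_families:
        raise BlockedFamilyError(f"knob family {family!r} is blocked by a fired guard")
    return proposal
```

Raising instead of silently dropping the proposal leaves a record that the LLM ignored a guard, which helps when auditing the proposal path later.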
