AITuner Harness Summary

What The Harness Adds

The harness turns each LLM proposal from open-ended config search into a bottleneck-directed decision.

  1. Workload profile

    • Extracts L-C-A features from the trace window:
      • L: prompt length percentiles and tail ratio.
      • C: prefix/cache reuse estimates from hash_ids when available.
      • A: request rate, burst ratio, and interarrival variation.
    • These features are injected into the prompt as a structured Harnesses section; a feature-extraction sketch follows this list.
  2. Trial diagnostics

    • Reads recent trial result JSON.
    • Summarizes feasible probes, all-infeasible probes, pass rates, request rates, latency percentiles, and failed SLO reason counts.
    • Classifies the active bottleneck as ttft_prefill, decode_tpot, admission_or_queueing, launch_or_memory, or unknown.
  3. Knob-family harnesses

    • Maps bottlenecks to a small number of plausible knob families.
    • Current harness families:
      • tensor-parallel-size: long-prompt TTFT/prefill bottlenecks.
      • max-num-batched-tokens: prefill batching or fragmentation, with trust-region guards.
      • max-num-seqs: cache-heavy or admission-limited workloads.
      • enable-chunked-prefill: long-tail prompt blocking.
      • gpu-memory-utilization: memory headroom after topology and batching are stable.
    • Each family has use_when, procedure, guards, and active_now fields; a sketch of this structure follows this list.
  4. Proposal discipline and early stop

    • The prompt requires the LLM to choose at most one primary knob family unless history proves a coupled change is needed.
    • It must use adjacent legal topology choices and stay inside topology constraints.
    • It receives tested config signatures, so it should not repeat already-tried configs.
    • A deterministic harness stop can now emit should_stop=true before calling the LLM when completed validation evidence says another trial is not justified.
  5. Baseline-first loop

    • LLM-driven study tune now evaluates the initial engine config first unless --skip-baseline is passed.
    • This aligns the loop with evaluate-then-search: the first LLM proposal sees measured bottleneck evidence rather than guessing from static config.
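
To make item 1 concrete, here is a minimal sketch of what L-C-A feature extraction over a trace window could look like. The record fields (prompt_len, hash_ids, arrival_s), helper names, and formulas are illustrative assumptions, not the actual AITuner implementation.

```python
# Hypothetical sketch of L-C-A feature extraction over a trace window.
# The record fields (prompt_len, hash_ids, arrival_s), helper names, and exact
# formulas are illustrative assumptions, not the actual AITuner implementation.
import statistics
from collections import Counter


def percentile(values, q):
    """Nearest-rank percentile over a non-empty list."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, round(q / 100 * (len(ordered) - 1))))
    return ordered[idx]


def lca_features(window):
    """Summarize a non-empty trace window into L (length), C (cache), A (arrival) features."""
    prompt_lens = [r["prompt_len"] for r in window]
    arrivals = sorted(r["arrival_s"] for r in window)
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])] or [0.0]

    # L: prompt length percentiles and tail ratio.
    p50, p95 = percentile(prompt_lens, 50), percentile(prompt_lens, 95)

    # C: prefix/cache reuse estimate from hash_ids when available.
    hashes = [h for r in window for h in r.get("hash_ids", [])]
    reuse = 1 - len(set(hashes)) / len(hashes) if hashes else None

    # A: request rate, burst ratio (peak 1 s bucket vs mean bucket), interarrival CV.
    span = max(arrivals[-1] - arrivals[0], 1e-9)
    buckets = Counter(int(t) for t in arrivals)
    mean_gap = statistics.mean(gaps)
    return {
        "prompt_len_p50": p50,
        "prompt_len_p95": p95,
        "prompt_tail_ratio": p95 / max(p50, 1),
        "prefix_reuse": reuse,
        "request_rate": len(window) / span,
        "burst_ratio": max(buckets.values()) / (len(window) / len(buckets)),
        "interarrival_cv": statistics.pstdev(gaps) / mean_gap if mean_gap > 0 else 0.0,
    }
```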
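
The following sketch shows how the trial-diagnostics bottleneck labels and the knob-family fields from items 2-3 could fit together. The family names, bottleneck labels, and the use_when/procedure/guards/active_now fields come from the summary above; the data layout, procedure wording, and function name are assumptions rather than the actual AITuner code.

```python
# Hypothetical sketch of the knob-family harness structure; family names,
# bottleneck labels, and field names match the summary above, but the layout,
# procedure wording, and function names are illustrative assumptions.
HARNESS_FAMILIES = {
    "tensor-parallel-size": {
        "use_when": ["ttft_prefill"],            # long-prompt TTFT/prefill bottlenecks
        "procedure": "move to the adjacent legal TP value, keeping other topology fixed",
        "guards": ["stay inside topology constraints", "do not repeat tested signatures"],
    },
    "max-num-batched-tokens": {
        "use_when": ["ttft_prefill"],            # prefill batching or fragmentation
        "procedure": "step the batched-token budget inside the trust region",
        "guards": ["trust-region step limit"],
    },
    "max-num-seqs": {
        "use_when": ["admission_or_queueing"],   # cache-heavy or admission-limited workloads
        "procedure": "adjust the admission limit by one step",
        "guards": ["keep memory headroom"],
    },
    "enable-chunked-prefill": {
        "use_when": ["ttft_prefill"],            # long-tail prompt blocking
        "procedure": "enable when long prompts block shorter ones",
        "guards": [],
    },
    "gpu-memory-utilization": {
        "use_when": ["launch_or_memory"],        # only after topology and batching are stable
        "procedure": "take a small step toward more headroom",
        "guards": ["topology and batching already stable"],
    },
}


def activate_families(bottleneck: str) -> dict:
    """Mark active_now on families whose use_when matches the classified bottleneck."""
    return {
        name: {**family, "active_now": bottleneck in family["use_when"]}
        for name, family in HARNESS_FAMILIES.items()
    }


# Example: a ttft_prefill bottleneck activates the TP, batched-tokens, and
# chunked-prefill families; the remaining families stay inactive in the prompt.
active = activate_families("ttft_prefill")
```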

What Accelerates Convergence

The speedup comes from reducing wasted proposal families, not from changing the benchmark metric.

  1. Topology-before-runtime on prefill bottlenecks

    • For long-prompt, low-cache-reuse windows, the harness activates the TP harness before speculative runtime knobs.
    • Example: qwen27b 0-8k chat reached TP=2, DP=1 at iter 2 under harness replay, while the original run spent iter 2 on DP=2 and iter 3 on DP=4.
  2. Guarded stop after validation, not immediately after a strong incumbent

    • If the newest trial is the incumbent and improves per-GPU throughput by at least 1.8x over baseline, the harness requires direct evidence before trying runtime-only tweaks.
    • It does not stop at the first large gain. It requires post-incumbent validation trials across nearby topology/runtime families, and stops only if those trials fail to produce a feasible per-GPU improvement.
    • With the guard, study tune can write a harness-stop-XXXX proposal and exit without spending another GPU trial.
  3. All-infeasible plateau detection

    • When recent all-infeasible trials at the same sampling threshold stop improving pass rate and p95 TTFT, the harness blocks repeating the same primary knob family.
    • This prevents continuing a direction such as DP-only scale-out after DP4 and DP8 plateau.
    • Plateau alone does not trigger deterministic early stop; it forces either a different justified family or a later validation/convergence stop (see the sketch after this list).
  4. Cleaner early-stop handling

    • Early-stopped probes no longer leave in-flight requests polluting the next probe.
    • The default behavior drains in-flight requests so that consecutive probes stay comparable to production runs.
    • Engine relaunch after early stop is available as opt-in for faster smoke studies, but it is not the default because it can change warm-state comparability.
  5. Search-high saturation stop

    • If the incumbent's highest measured probe is feasible and is within the configured binary-search resolution of search.high, the harness stops before asking the LLM for another proposal (see the sketch after this list). Individual request failures can be present when the aggregate probe still meets the configured pass-rate SLO.
    • This is not a model-specific threshold. It means the workload search range, not the engine config, is currently the limiting measurement bound.
  6. Deterministic first probes

    • After a baseline latency bottleneck, the harness can propose the adjacent legal TP increase before asking the LLM.
    • After a TP incumbent improves per-GPU throughput, the harness keeps that topology and applies a same-topology runtime seed before trying DP/EP or broad runtime changes.
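
A minimal sketch of two of the guard rules above: the search-high saturation stop (item 5) and the all-infeasible plateau block (item 3). The function names, trial-record fields, and tolerance are illustrative assumptions rather than the actual AITuner code.

```python
# Hypothetical sketch of two guard rules from the list above; function names,
# trial-record fields, and the tolerance are illustrative assumptions.
def saturation_stop(best_feasible_probe: float,
                    search_high: float,
                    resolution: float) -> bool:
    """Stop before the next LLM proposal when the incumbent's highest feasible
    probe is within the binary-search resolution of search.high: the workload
    search range, not the engine config, is the limiting measurement bound."""
    return best_feasible_probe >= search_high - resolution


def plateau_blocks_family(recent: list[dict], family: str, eps: float = 1e-3) -> bool:
    """Block repeating the same primary knob family when recent all-infeasible
    trials for that family stop improving pass rate and p95 TTFT (the real check
    also requires the same sampling threshold). This never stops the study by
    itself; it forces a different justified family or a later validation stop."""
    same = [t for t in recent if t["family"] == family and t["all_infeasible"]]
    if len(same) < 2:
        return False
    prev, last = same[-2], same[-1]
    no_pass_gain = last["pass_rate"] <= prev["pass_rate"] + eps
    no_ttft_gain = last["p95_ttft_ms"] >= prev["p95_ttft_ms"] * (1 - eps)
    return no_pass_gain and no_ttft_gain
```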

qwen27b 0-8k Evidence

Source: docs/qwen27b-chat-0-8k-harness-fig18.md.

Metric: best-so-far feasible request_rate_per_gpu.

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5-12 |
| --- | --- | --- | --- | --- | --- |
| Before harness | 0.0350 | 0.0617 | 0.0617 | 0.2025 | 0.2025 |
| After harness (strict replay) | 0.0350 | 0.2025 | 0.2025 (stop) | 0.2025 | 0.2025 |

Result:

  • Before harness reached the best value at iter 4.
  • After harness reached the same value at iter 2 and stopped at iter 3.
  • Iterations-to-best improved from 4 to 2, a 2x convergence speedup on this case.
  • The harness also avoided eight post-best infeasible runtime-only probes.

Current Risks

  • The harness is still prompt-guided for choosing the next non-stop proposal. The deterministic stop path is hard-coded in study tune, but proposal-family blocking is not yet enforced by a separate validator.
  • Strong-incumbent stopping is intentionally biased toward fewer GPU trials after validation evidence accumulates. Workloads with very narrow runtime sweet spots may still need a "continue local refinement" exception when the user wants absolute best throughput rather than fastest convergence to a good config.
  • Full fresh reruns on large models are expensive. Strict replay is useful for measuring proposal-path improvements when the proposed configs already exist in prior measured runs, but publication-quality claims still need fresh no-relaunch runs when time allows.

Qwen3-30B-A3B Community vLLM Evidence

Source: docs/qwen30b-community-vllm020/harness-early-stop-ablation-20260502.md.

Metric: best-so-far feasible request_rate_per_gpu on the bounded 0-8k chat replay with 128 output tokens and replay_time_scale=0.1.

Initial search.high=0.125 result:

| Variant | Iter 1 | Iter 2 | Iter 3-12 |
| --- | --- | --- | --- |
| no-harness | 1.0333 | 1.0333 | 1.0333 |
| harness | 1.0333 | 1.0333 (stop) | 1.0333 |

Result:

  • Both variants found the same best measured config: the default community vLLM launch.
  • Harness stopped at iter 2 because the incumbent saturated search.high; no LLM proposal or GPU trial was needed after baseline.
  • No-harness spent the full 12-iteration budget: iter 2 was worse per GPU, and iter 3-12 were launch failures.
  • This was a measurement-ceiling result, not proof of global optimality.

High search.high=1.0 rerun:

| Variant | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7 | Iter 8-12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness | 2.2000 | 3.2583 | 3.2583 | 3.2583 | 3.2583 | 3.3000 | 3.3500 | 3.3500 |
| harness-guided-v2 | 2.3833 | 3.2583 | 3.2833 | 3.3000 | 3.3000 (stop) | 3.3000 | 3.3000 | 3.3000 |

Result:

  • Raising search.high showed that default vLLM was not actually optimal; the prior run was capped by workload range.
  • Harness reached the same TP2/runtime config family in 4 iterations instead of 7 by making deterministic first TP and same-topology runtime proposals.
  • The single-run best value differs by about 1.5% (3.3000 vs 3.3500) for the same config family, so this should be interpreted as faster convergence to the same region, not an exact single-run throughput win.