The harness defined a gpu-memory-utilization family but hard-coded active_now=False and never generated a candidate for it, and only ever *lowered* max-num-seqs for decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873 (+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real coverage gap, not bad luck.

Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology has moved off the baseline, so a baseline latency bottleneck still gets a TP change):

- Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block a premature Stop-B until it is tried; the incumbent guard keeps the step only if per-GPU rate improves and the engine launches, and the tested signature terminates the climb (so 0.96 OOM/regression backs off to 0.94 automatically).
- Let max-num-seqs *rise* for decode_tpot (not only fall) to exploit decode parallelism.
- Activate the gpu-memory-utilization harness family for decode_tpot/admission.

Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass. Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
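The gpu-memory-utilization climb described in this commit message can be illustrated with a small standalone sketch. It is not the code in _runtime_candidate_actions: the function name, the `evaluate` callback, and the rates in the usage example are hypothetical, and only the +0.02 step, the 0.97 ceiling, and the keep-only-if-it-improves-and-launches rule come from the message.

```python
def climb_gpu_memory_utilization(start, incumbent_rate, evaluate,
                                 step=0.02, ceiling=0.97):
    """Hill-climb sketch (hypothetical helper, not the harness code).

    evaluate(value) returns the per-GPU request rate for that setting,
    or None when the engine fails to launch (for example on OOM).
    """
    best_value, best_rate = start, incumbent_rate
    tested = {round(start, 2)}  # "tested signatures": never retry a value
    while True:
        candidate = round(min(best_value + step, ceiling), 2)
        if candidate in tested:
            break  # ceiling reached, nothing new to try
        tested.add(candidate)
        rate = evaluate(candidate)
        # Incumbent guard: keep the step only if the engine launched
        # and the per-GPU rate strictly improved.
        if rate is None or rate <= best_rate:
            break
        best_value, best_rate = candidate, rate
    return best_value, best_rate


# Made-up rates: 0.96 fails to launch, so the climb settles on 0.94.
rates = {0.90: 1.00, 0.92: 1.10, 0.94: 1.20, 0.96: None}
result = climb_gpu_memory_utilization(0.90, 1.00, rates.get)
```

With these illustrative rates the climb keeps 0.92 and 0.94, rejects 0.96, and returns the 0.94 setting as the incumbent.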
AITuner
AITuner is a small study orchestrator for OpenAI-compatible serving engines. It replays trace windows, searches for the highest feasible offered load under configured SLOs, and records enough trial context for LLM- or harness-guided configuration proposals.
Status
This repository is research tooling. Treat reported experiment numbers as valid
only when the matching study spec, trial artifacts, probe history, and
probe_details.jsonl files are available for audit.
Install
python3 -m pip install -e .
Test
The test suite uses the Python standard library unittest runner:
PYTHONPATH=src python3 -m unittest discover -s tests -v
If the package is installed in editable mode, PYTHONPATH=src is optional.
Basic Workflow
Initialize a study:
aituner study init --spec configs/examples/study.example.json
Run a local tuning loop:
aituner study tune --spec configs/examples/study.example.json --max-trials 2
Run a compare:
aituner compare run --spec configs/examples/compare.example.json
Remote experiment notes for this checkout live in AGENTS.md. The default
remote host is dash0, and code should be synchronized through Git before
remote runs.
Experiment Integrity
- Fixed-length replay requests are scored only when completion token usage is verifiable and matches the trace expectation.
- Each trial writes aggregate probe history and per-request probe details.
- request_rate_per_gpu is the primary cross-topology metric: best_feasible_request_rate / (tensor_parallel_size * data_parallel_size).
- Compare reports include failed and no-feasible window counts; do not interpret mean request rates without those counts.
- Bounded replays using max_requests_per_probe, completion_tokens_override, or replay_time_scale are convergence tests for that bounded workload, not production benchmarks.
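As a rough illustration of the per-GPU metric and of why the failure counts belong next to any mean, here is a minimal sketch. The formula is the one stated in the list above; the window record shape (`status`, `request_rate_per_gpu`) and the status strings are assumptions made for this example, not AITuner's actual report schema.

```python
def request_rate_per_gpu(best_feasible_request_rate,
                         tensor_parallel_size, data_parallel_size):
    """best_feasible_request_rate / (tensor_parallel_size * data_parallel_size)."""
    gpus = tensor_parallel_size * data_parallel_size
    if gpus <= 0:
        raise ValueError("topology must use at least one GPU")
    return best_feasible_request_rate / gpus


def summarize_windows(windows):
    """Report the mean together with the counts needed to interpret it.

    Each window is a dict with hypothetical keys:
      status: "ok", "failed", or "no_feasible"
      request_rate_per_gpu: float (only meaningful when status == "ok")
    """
    feasible = [w["request_rate_per_gpu"] for w in windows if w["status"] == "ok"]
    return {
        "mean_request_rate_per_gpu": sum(feasible) / len(feasible) if feasible else None,
        "feasible_windows": len(feasible),
        "failed_windows": sum(1 for w in windows if w["status"] == "failed"),
        "no_feasible_windows": sum(1 for w in windows if w["status"] == "no_feasible"),
    }


# Illustrative numbers only: two topologies, both using 4 GPUs.
windows = [
    {"status": "ok", "request_rate_per_gpu": request_rate_per_gpu(2.0, 4, 1)},
    {"status": "ok", "request_rate_per_gpu": request_rate_per_gpu(3.0, 2, 2)},
    {"status": "failed"},
    {"status": "no_feasible"},
]
summary = summarize_windows(windows)
```

In this made-up case the mean covers only two of four windows, which is the situation the note about failed and no-feasible counts warns against reading in isolation.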
Configuration Notes
Example specs that use llm.endpoint.provider=codex resolve the endpoint from
the local Codex configuration unless llm.endpoint.base_url or
AITUNER_CODEX_BASE_URL is set. Public, reproducible examples should prefer an
explicit endpoint or omit the LLM endpoint and use proposal files.
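The endpoint fallback described above might be pictured as follows. This is a sketch under stated assumptions, not AITuner's resolver: the function name and the `local_codex_base_url` parameter are hypothetical, the spec is assumed to be nested JSON matching the dotted paths, and the relative priority of llm.endpoint.base_url over AITUNER_CODEX_BASE_URL is a guess, since the notes only say that either one overrides the local Codex configuration.

```python
import os


def resolve_codex_base_url(spec, env=None, local_codex_base_url=None):
    """Pick a base URL for a provider=codex endpoint (illustrative only).

    Assumed order: explicit spec field, then environment variable,
    then whatever the local Codex configuration provides.
    """
    env = os.environ if env is None else env
    endpoint = spec.get("llm", {}).get("endpoint", {})
    explicit = endpoint.get("base_url")
    if endpoint.get("provider") != "codex":
        return explicit  # other providers: only the explicit value applies
    if explicit:
        return explicit
    from_env = env.get("AITUNER_CODEX_BASE_URL")
    if from_env:
        return from_env
    return local_codex_base_url  # machine-specific, so not reproducible


# Hypothetical URLs, used only to show the fallback order.
spec = {"llm": {"endpoint": {"provider": "codex"}}}
from_local = resolve_codex_base_url(spec, env={}, local_codex_base_url="http://localhost:9000/v1")
from_env = resolve_codex_base_url(spec, env={"AITUNER_CODEX_BASE_URL": "http://localhost:8000/v1"})
```

The last fallback depends on the machine running the study, which is why public examples are better served by an explicit endpoint or by proposal files.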