Gahow Wang 77af4ded2a Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

AITuner

AITuner is a small study orchestrator for OpenAI-compatible serving engines. It replays trace windows, searches for the highest feasible offered load under configured SLOs, and records enough trial context for LLM- or harness-guided configuration proposals.
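The search for the highest feasible offered load can be pictured as a bisection over request rate. The sketch below is illustrative only and is not taken from the AITuner source: the function name, the `is_feasible` callback, and the assumption that feasibility is monotone in load (any rate below a feasible rate is also feasible) are all hypothetical.

```python
from typing import Callable, Optional


def highest_feasible_rate(
    is_feasible: Callable[[float], bool],
    low: float,
    high: float,
    tolerance: float = 0.01,
) -> Optional[float]:
    """Bisect for the largest rate in [low, high] that still meets the SLOs.

    Hypothetical helper: `is_feasible(rate)` stands in for one replay probe
    and is assumed to be monotone (feasible at r implies feasible below r).
    Returns None when even `low` violates the SLOs.
    """
    if not is_feasible(low):
        return None
    if is_feasible(high):
        # The ceiling itself is feasible, so the true peak may be higher.
        return high
    best = low
    while high - low > tolerance:
        mid = (low + high) / 2
        if is_feasible(mid):
            best = mid
            low = mid
        else:
            high = mid
    return best


if __name__ == "__main__":
    # Toy probe: SLOs hold up to 7.3 requests/s.
    print(highest_feasible_rate(lambda rate: rate <= 7.3, 1.0, 16.0))
```

When the upper bound is itself feasible, the returned value is only a lower bound on the true peak, so a result that sits at the ceiling should be read with that in mind.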

Status

This repository is research tooling. Treat reported experiment numbers as valid only when the matching study spec, trial artifacts, probe history, and probe_details.jsonl files are available for audit.

Install

python3 -m pip install -e .

Test

The test suite uses the Python standard library unittest runner:

PYTHONPATH=src python3 -m unittest discover -s tests -v

If the package is installed in editable mode, PYTHONPATH=src is optional.

Basic Workflow

Initialize a study:

aituner study init --spec configs/examples/study.example.json

Run a local tuning loop:

aituner study tune --spec configs/examples/study.example.json --max-trials 2

Run a comparison:

aituner compare run --spec configs/examples/compare.example.json

Remote experiment notes for this checkout live in AGENTS.md. The default remote host is dash0, and code should be synchronized through Git before remote runs.

Experiment Integrity

  • Fixed-length replay requests are scored only when completion token usage is verifiable and matches the trace expectation.
  • Each trial writes aggregate probe history and per-request probe details.
  • request_rate_per_gpu is the primary cross-topology metric: best_feasible_request_rate / (tensor_parallel_size * data_parallel_size).
  • Compare reports include failed and no-feasible window counts; do not interpret mean request rates without those counts.
  • Bounded replays using max_requests_per_probe, completion_tokens_override, or replay_time_scale are convergence tests for that bounded workload, not production benchmarks.
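Two of the rules above reduce to small computations. The sketch below is a minimal illustration, not the repository's implementation: the function names and the representation of token usage as an optional integer are assumptions, and only the per-GPU formula is taken directly from this README.

```python
from typing import Optional


def is_scorable(completion_tokens: Optional[int], expected_tokens: int) -> bool:
    """Return True only when token usage is present and matches the trace.

    Hypothetical helper: `completion_tokens` is None when the engine
    response carried no verifiable usage information.
    """
    return completion_tokens is not None and completion_tokens == expected_tokens


def request_rate_per_gpu(
    best_feasible_request_rate: float,
    tensor_parallel_size: int,
    data_parallel_size: int,
) -> float:
    """Normalize the best feasible rate by the GPU count (TP * DP)."""
    gpu_count = tensor_parallel_size * data_parallel_size
    if gpu_count <= 0:
        raise ValueError("tensor_parallel_size * data_parallel_size must be positive")
    return best_feasible_request_rate / gpu_count


if __name__ == "__main__":
    # A TP2-DP2 topology sustaining 8 requests/s uses 4 GPUs.
    print(request_rate_per_gpu(8.0, 2, 2))  # 2.0
    print(is_scorable(None, 128))           # False: usage not verifiable
```

Dividing by the GPU count is what makes a TP1 result comparable with a TP2-DP2 or TP4 result.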

Configuration Notes

Example specs that use llm.endpoint.provider=codex resolve the endpoint from the local Codex configuration unless llm.endpoint.base_url or AITUNER_CODEX_BASE_URL is set. Public, reproducible examples should prefer an explicit endpoint or omit the LLM endpoint and use proposal files.
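The override order can be sketched as follows. This is a hedged illustration and not the AITuner code: the function name is hypothetical, the nested spec layout is inferred from the dotted key llm.endpoint.base_url, and the choice to check the spec before the environment variable is an assumption, since this README does not state which of the two wins.

```python
import os
from typing import Any, Mapping, Optional


def resolve_codex_base_url(
    spec: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return an explicit base URL, or None to signal the Codex-config fallback.

    Assumed precedence (not confirmed by the source):
    spec llm.endpoint.base_url, then AITUNER_CODEX_BASE_URL.
    """
    env = os.environ if env is None else env
    endpoint = spec.get("llm", {}).get("endpoint", {})
    explicit = endpoint.get("base_url")
    if explicit:
        return explicit
    from_env = env.get("AITUNER_CODEX_BASE_URL")
    if from_env:
        return from_env
    # Neither override is set: the caller would read the local Codex configuration.
    return None


if __name__ == "__main__":
    spec = {"llm": {"endpoint": {"provider": "codex"}}}
    print(resolve_codex_base_url(spec, env={}))  # None
```

A None result here marks the case a public example should avoid, because the resolved endpoint then depends on the local machine.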
