gahow/aituner

Gahow Wang adc4351e5d Report latency stats for infeasible baseline

2026-05-08 11:10:34 +08:00

Qwen27B Chat 0-8k TPOT 40ms Baseline Infeasible Run

Date: 2026-05-07

Goal

Re-run the internal vLLM + Qwen3.5-27B chat 0-8k tuning comparison after adding a study-level guard:

- if the automatic baseline trial has no feasible probe,
- and the lowest sampled request rate still fails the SLO target pass rate,
- then AITuner stops the whole study and reports that the SLO is too tight for the current setup.

This prevents spending the remaining tuning budget on LLM or harness proposals when the baseline itself demonstrates that the workload/SLO is infeasible at the search floor.

Implementation

Commit: f212673 Stop tuning when baseline is infeasible

Changed behavior:

- `study tune` now persists `tuning_stop_reason` and `tuning_stop_diagnosis` in `state.json`.
- `study tune` also persists `tuning_stop_details`, including the lowest sampled probe's TTFT/TPOT mean, p50, p95, and p99.
- After the automatic baseline trial is ingested, AITuner checks the worker result:
  - `status == completed`
  - `best_request_rate is None`
  - at least one probe exists
  - all probes are infeasible
- If all four conditions hold, AITuner stops before asking the LLM or harness for any proposal.
- Re-running the same study respects the persisted stop state and does not resume tuning.
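
The four conditions above can be sketched as a single predicate. This is a minimal illustration, not AITuner's actual code: the `Probe` and `WorkerResult` classes and the `stop_details` helper are hypothetical names, and only the field names `status`, `best_request_rate`, and `tuning_stop_reason` come from this report.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Probe:
    # Hypothetical probe record; field names are assumptions for illustration.
    request_rate: float
    pass_rate: float
    feasible: bool


@dataclass
class WorkerResult:
    # Hypothetical container for the baseline trial's worker result.
    status: str
    best_request_rate: Optional[float]
    probes: List[Probe] = field(default_factory=list)


def baseline_is_all_infeasible(result: WorkerResult) -> bool:
    """Return True when all four stop conditions from the report hold."""
    return (
        result.status == "completed"
        and result.best_request_rate is None
        and len(result.probes) > 0
        and all(not p.feasible for p in result.probes)
    )


def stop_details(result: WorkerResult) -> dict:
    """Summarize the lowest sampled probe for a persisted stop record."""
    lowest = min(result.probes, key=lambda p: p.request_rate)
    return {
        "tuning_stop_reason": "baseline_all_infeasible",
        "lowest_sampled_request_rate": lowest.request_rate,
        "lowest_probe_pass_rate": lowest.pass_rate,
    }
```

Requiring at least one probe keeps a trial that produced no measurements from being misread as proof of infeasibility.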

Validation:

python3 -m compileall -q src tests
PYTHONPATH=src python3 -m unittest tests.test_core_flow

Both checks passed locally and on dash0.

Setup

Host: dash0

Remote repo: /home/admin/cpfs/wjh/aituner/aituner

Base spec: configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json

Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal

Workload: chat, 0-8k input window

SLO:

TTFT: existing step rule from the base spec
TPOT: fixed 40ms
target pass rate: 0.95

Search:

Direct AITuner command: python3 -m aituner.cli study tune ... --max-trials 12
No manual proposal/state edits during either run.
Both variants used CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7 (seven GPUs, skipping GPU 3).
The two specs were verified equal after normalizing only study_id and llm.use_harness.
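
The spec equality check can be reproduced with a small script. The sketch below assumes each spec is a JSON object with a top-level `study_id` key and a nested `llm.use_harness` key, which is how the field names above read; the helper names `normalize_spec` and `specs_equivalent` are hypothetical.

```python
import copy
import json


def normalize_spec(spec: dict) -> dict:
    """Return a copy of the spec without the two fields allowed to differ."""
    out = copy.deepcopy(spec)
    out.pop("study_id", None)
    llm = out.get("llm")
    if isinstance(llm, dict):
        llm.pop("use_harness", None)
    return out


def specs_equivalent(a: dict, b: dict) -> bool:
    """Compare two specs after dropping study_id and llm.use_harness."""
    return normalize_spec(a) == normalize_spec(b)


def load_spec(path: str) -> dict:
    """Load a spec file from disk (assumes plain JSON)."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

Calling `specs_equivalent(load_spec(no_harness_path), load_spec(harness_path))` on the two spec files listed under Specs would be expected to return `True` if the only differences are those two fields.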

Specs:

no-harness: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot40-gpu3skip-12iter-noharness-20260507.json
harness: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot40-gpu3skip-12iter-harness-20260507.json

Commands

No harness:

PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot40-gpu3skip-12iter-noharness-20260507.json \
  --store-root .aituner-tight \
  --max-trials 12

Harness:

PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot40-gpu3skip-12iter-harness-20260507.json \
  --store-root .aituner-tight \
  --max-trials 12

Results

Both runs stopped after the baseline trial. No LLM or harness proposal was evaluated because the baseline had no feasible probe.

| Variant | Trials executed | Best request rate | Best request rate / GPU | Stop reason |
| --- | --- | --- | --- | --- |
| no-harness | 1 | - | - | `baseline_all_infeasible` |
| harness | 1 | - | - | `baseline_all_infeasible` |

Baseline probe curve:

| sampling_u | request rate | pass rate | feasible | early stop reason |
| --- | --- | --- | --- | --- |
| 0.03125 | 0.895 | 0.000000 | false | `slo_pass_rate_unrecoverable` |
| 0.015625 | 0.483333 | 0.137931 | false | `slo_pass_rate_unrecoverable` |
| 0.0078125 | 0.246667 | 0.236486 | false | `slo_pass_rate_unrecoverable` |
| 0.00390625 | 0.123333 | 0.189189 | false | `slo_pass_rate_unrecoverable` |
| 0.001953125 | 0.065000 | 0.205128 | false | `slo_pass_rate_unrecoverable` |
| 0.0009765625 | 0.035000 | 0.142857 | false | `slo_pass_rate_unrecoverable` |

Lowest request rate latency summary:

| Variant | request rate | pass rate | TTFT mean | TTFT p50 | TTFT p95 | TTFT p99 | TPOT mean | TPOT p50 | TPOT p95 | TPOT p99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness | 0.035000 | 0.142857 | 1288.953ms | 446.586ms | 3011.814ms | 3011.814ms | 12.661ms | 13.141ms | 15.097ms | 15.097ms |
| harness | 0.035000 | 0.142857 | 1268.090ms | 445.274ms | 2889.080ms | 2889.080ms | 12.658ms | 13.170ms | 15.102ms | 15.102ms |

This shows that the TPOT threshold of 40ms is not the binding constraint at the lowest sampled rate. The observed TPOT p99 is about 15.1ms; failures are driven by TTFT and by the unrecoverable-pass-rate early stop after too many requests have already failed or been skipped.
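
To illustrate what an unrecoverable pass rate can mean, the sketch below uses one plausible rule: a probe is unrecoverable once the pass rate would miss the target even if every remaining request passed. The rule AITuner actually applies for `slo_pass_rate_unrecoverable` is not shown in this report, so the function name, its parameters, and the rule itself are assumptions.

```python
def pass_rate_unrecoverable(
    total_requests: int,
    failed_or_skipped: int,
    target_pass_rate: float,
) -> bool:
    """Assumed rule: True when the best achievable pass rate is below target.

    The best case is that every request not yet failed or skipped passes,
    so the upper bound is (total - failed_or_skipped) / total.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    best_case = (total_requests - failed_or_skipped) / total_requests
    return best_case < target_pass_rate


# With a 0.95 target and 100 planned requests, the sixth failure makes the
# probe unrecoverable under this assumed rule.
print(pass_rate_unrecoverable(100, 5, 0.95))  # False
print(pass_rate_unrecoverable(100, 6, 0.95))  # True
```

Under a rule of this shape, a 95% target leaves very little slack, so a handful of slow first tokens early in a probe is enough to end it.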

Final diagnosis written by AITuner:

Baseline configuration has no feasible probe under the current SLO. Stopping tuning because even the lowest sampled request rate did not meet the target pass rate. lowest_sampled_request_rate=0.035 lowest_sampling_u=0.000976562 lowest_probe_pass_rate=0.142857 early_stop_reason=slo_pass_rate_unrecoverable

Interpretation

This run does not measure harness acceleration. It shows that this SLO setup (TPOT 40ms plus the base spec's TTFT step rule) is infeasible for the current baseline and search floor: even at an aggregate request rate of 0.035, only 14.29% of requests pass the SLO, far below the 95% target.

The correct behavior is to stop the study early and report SLO infeasibility instead of spending the remaining 11 trial slots. The harness cannot accelerate convergence when there is no feasible baseline point and therefore no incumbent for guided tuning.

For a Fig. 18-style convergence comparison, the next setup must first have at least one feasible baseline or feasible low-rate point under the same metric definitions.
