qwen27b-chat-0-8k TPOT25 16-Iter Harness Compare
Goal
Rerun the internal vLLM Qwen3.5-27B chat 0-8k tuning comparison under a stricter TPOT SLO:
- no-harness: 16 tuning iterations;
- harness: 16 tuning iterations, with permission to stop early if the harness convergence guard decides no further GPU trial is needed.
Both variants must be launched directly through AITuner. No state seeding, manual replay, or historical-result injection is allowed.
Setup
- Host: dash0.
- Hardware: 8 NVIDIA H20 GPUs.
- Engine: internal vLLM at /usr/local/bin/vllm.
- Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal.
- Served model name: qwen35-27b-aituner.
- Workload window: chat_w20260311_1000.
- Trace path source: /home/admin/cpfs/wjh/aituner/aituner/trace_windows/windows.json.
- Request mode: chat.
- Input bucket: 0 <= input_length <= 8192.
- Replay scale: 1.0.
- Max concurrency: 32.
- Max requests per probe: unset, so each probe uses the full selected trace subset for its sampling_u threshold.
- Restart engine after early stop: true for both variants. This is needed under TPOT25 because very slow infeasible probes can leave live HTTP requests in the engine after the SLO is already unrecoverable. Restarting keeps the next binary-search probe from being contaminated by previous in-flight work.
- Search field: sampling_u.
- Search range: low=0.0, high=0.0625.
- Search probes: max_probes=6, tolerance=0.001.
- Sampling seed: 20260325.
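The search range, probe count, and tolerance listed above fit together arithmetically. The sketch below is only an illustration and assumes plain bisection, where each probe halves the remaining sampling_u interval; AITuner's actual search logic is not shown in this document and may differ. The function names are hypothetical.

```python
import math

# Values copied from the Setup list.
LOW, HIGH = 0.0, 0.0625
MAX_PROBES = 6
TOLERANCE = 0.001


def interval_width_after(probes: int, low: float = LOW, high: float = HIGH) -> float:
    """Width of the search interval after `probes` halvings (plain bisection assumed)."""
    return (high - low) / (2 ** probes)


def probes_needed(low: float, high: float, tolerance: float) -> int:
    """Smallest number of halvings that brings the interval width to <= tolerance."""
    return max(0, math.ceil(math.log2((high - low) / tolerance)))


if __name__ == "__main__":
    for k in range(1, MAX_PROBES + 1):
        print(f"after probe {k}: width = {interval_width_after(k):.10f}")
    print("probes needed:", probes_needed(LOW, HIGH, TOLERANCE))
```

Under that assumption, five probes leave a width of 0.001953125, which is still above the tolerance, while six probes leave 0.0009765625, which is below it. So max_probes=6 is the smallest probe budget that can resolve sampling_u to within 0.001 on this range.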
SLO
- Target pass rate: 0.95.
- TTFT rule:

| Input tokens | TTFT threshold |
|---|---|
| <=4096 | 2000 ms |
| <=32768 | 4000 ms |
| otherwise | 6000 ms |

- TPOT rule: fixed <=25 ms.
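The SLO rules above can be written as a small check. This is a minimal sketch, not AITuner's implementation. It assumes that a request passes only if it meets both the TTFT and the TPOT rule, that the TTFT comparison is inclusive, and that a probe is feasible when the passing fraction reaches the target; all function names are hypothetical.

```python
TARGET_PASS_RATE = 0.95
TPOT_LIMIT_MS = 25.0


def ttft_threshold_ms(input_tokens: int) -> float:
    """TTFT threshold from the table, keyed by input length."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(input_tokens: int, ttft_ms: float, tpot_ms: float) -> bool:
    """Assumption: a request must satisfy both the TTFT and the TPOT rule."""
    return ttft_ms <= ttft_threshold_ms(input_tokens) and tpot_ms <= TPOT_LIMIT_MS


def probe_feasible(requests: list[tuple[int, float, float]]) -> bool:
    """Assumption: a probe is feasible when the pass fraction reaches the target.

    Each entry is (input_tokens, ttft_ms, tpot_ms).
    """
    if not requests:
        return False
    passed = sum(request_passes(*r) for r in requests)
    return passed / len(requests) >= TARGET_PASS_RATE
```

Because this study restricts inputs to 0 <= input_length <= 8192, only the first two TTFT tiers (2000 ms and 4000 ms) can apply here.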
Specs
Remote generated specs:
- no-harness: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json
- harness: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json
The two specs were generated from
configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json. After normalizing
study_id and llm.use_harness, the JSON payloads compare equal. Therefore the
only tuning-behavior difference between the formal comparison runs is whether
the harness is enabled.
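One way to perform this kind of normalized comparison is sketched below. The key layout is an assumption inferred from the dotted names: study_id is treated as a top-level key and use_harness as a key inside an llm object. The real spec schema may differ, and the helper names are hypothetical.

```python
import copy
import json


def normalize(spec: dict) -> dict:
    """Return a copy of the spec without the fields expected to differ.

    Assumed layout: top-level "study_id" and nested "llm" -> "use_harness".
    """
    out = copy.deepcopy(spec)
    out.pop("study_id", None)
    llm = out.get("llm")
    if isinstance(llm, dict):
        llm.pop("use_harness", None)
    return out


def specs_equal_after_normalizing(path_a: str, path_b: str) -> bool:
    """Load two JSON spec files and compare them after normalization."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return normalize(json.load(fa)) == normalize(json.load(fb))
```

Comparing parsed dictionaries rather than raw file text makes the check insensitive to key order and whitespace in the generated JSON.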
Commands
No-harness:
PYTHONPATH=src python3 -m aituner.cli study tune \
--spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json \
--store-root .aituner-tight \
--max-trials 16
Harness:
PYTHONPATH=src python3 -m aituner.cli study tune \
--spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json \
--store-root .aituner-tight \
--max-trials 16
Run Log
- 2026-05-06 12:37 CST: generated both remote specs and verified that the only normalized difference is llm.use_harness.
- 2026-05-06 12:37 CST: started no-harness in tmux session qwen27b_tpot25_noharness_16iter_20260506.
- 2026-05-06 21:06 CST: stopped the initial no-harness pre-run before using it for comparison. It used restart_engine_after_early_stop=false; the first TP1 baseline probe already recorded slo_pass_rate_unrecoverable, but unfinished requests remained live in vLLM and would contaminate the next probe.
- 2026-05-06 21:07 CST: generated the formal clean specs with restart_engine_after_early_stop=true for both variants and verified the normalized diff is still only llm.use_harness.
- 2026-05-06 21:09 CST: started formal no-harness run in tmux session qwen27b_tpot25_restart_noharness_16iter_20260506.
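The slo_pass_rate_unrecoverable condition mentioned in the log can be illustrated with a short check. This is one natural reading of the name, not the guard's confirmed logic: a probe is treated as unrecoverable once so many requests have failed that even a perfect remainder cannot reach the 0.95 target. The function name and signature are hypothetical.

```python
def pass_rate_unrecoverable(total: int, failed: int, target: float = 0.95) -> bool:
    """True once the best achievable pass rate falls below the target.

    `total` is the number of requests in the probe and `failed` the number
    that have already violated the SLO. Requests still in flight are
    counted optimistically as future passes.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    best_possible_rate = (total - failed) / total
    return best_possible_rate < target
```

For a probe of 100 requests, 5 failures still allow a 0.95 pass rate, while the 6th failure makes the target unreachable. Any requests that are still running at that point produce no further information for the probe, which is why leaving them live in the engine only risks contaminating the next one.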
Results
Pending.