
Gahow Wang 8b4116fad0 Add reference paper and qwen27b tpot25 16-iter notes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 14:02:30 +08:00


qwen27b-chat-0-8k TPOT25 16-Iter Harness Compare

Goal

Rerun the internal vLLM Qwen3.5-27B chat 0-8k tuning comparison under a stricter TPOT SLO:

no-harness: 16 tuning iterations;
harness: 16 tuning iterations, with permission to stop early if the harness convergence guard decides no further GPU trial is needed.

Both variants must be launched directly through AITuner. No state seeding, manual replay, or historical-result injection is allowed.

Setup

Host: dash0.
Hardware: 8 NVIDIA H20 GPUs.
Engine: internal vLLM at /usr/local/bin/vllm.
Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal.
Served model name: qwen35-27b-aituner.
Workload window: chat_w20260311_1000.
Trace path source: /home/admin/cpfs/wjh/aituner/aituner/trace_windows/windows.json.
Request mode: chat.
Input bucket: 0 <= input_length <= 8192.
Replay scale: 1.0.
Max concurrency: 32.
Max requests per probe: unset, so each probe uses the full selected trace subset for its sampling_u threshold.
Restart engine after early stop: true for both variants. Under the 25 ms TPOT SLO, a very slow infeasible probe can be stopped early once its pass rate is unrecoverable while its HTTP requests are still in flight inside the engine. Restarting the engine keeps the next binary-search probe from being contaminated by that leftover work.
Search field: sampling_u.
Search range: low=0.0, high=0.0625.
Search probes: max_probes=6, tolerance=0.001.
Sampling seed: 20260325.
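
The search settings above are consistent with each other: halving the range [0.0, 0.0625] six times leaves an interval of 0.0625 / 64 ≈ 0.00098, just under the 0.001 tolerance. The sketch below illustrates that arithmetic with a plain bisection. It is not AITuner's implementation; the `probe` callback, the function name, and the assumption that feasibility is monotone in sampling_u (every value at or below some cutoff passes) are all hypothetical.

```python
from typing import Callable, Tuple


def search_max_feasible(
    probe: Callable[[float], bool],
    low: float = 0.0,
    high: float = 0.0625,
    max_probes: int = 6,
    tolerance: float = 0.001,
) -> Tuple[float, int]:
    """Bisect for the largest sampling_u whose probe meets the SLO.

    Assumption: `probe(u)` returns True when the SLO is met at load level u,
    and low is treated as trivially feasible. Returns (best_u, probes_used).
    """
    best = low
    probes = 0
    while probes < max_probes and (high - low) > tolerance:
        mid = (low + high) / 2.0
        probes += 1
        if probe(mid):
            best = mid   # feasible: move the lower bound up
            low = mid
        else:
            high = mid   # infeasible: move the upper bound down
    return best, probes


# Hypothetical probe: pretend every sampling_u up to 0.02 is feasible.
best_u, used = search_max_feasible(lambda u: u <= 0.02)
print(best_u, used)
```

With this toy probe the search spends all six probes and stops with an interval narrower than the tolerance, so in this configuration the two stopping conditions trigger at the same step.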

SLO

Target pass rate: 0.95.
TTFT rule:

| Input tokens | TTFT threshold |
| --- | --- |
| `<=4096` | `2000 ms` |
| `<=32768` | `4000 ms` |
| otherwise | `6000 ms` |

TPOT rule: fixed <=25 ms.
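
The SLO above can be written down as a small check. This is a minimal sketch rather than AITuner's code: the `RequestResult` record and the function names are hypothetical. It also assumes that a request passes only if it meets both the TTFT and the TPOT rule, and that the TTFT thresholds are inclusive like the TPOT bound.

```python
from typing import Iterable, NamedTuple

TPOT_LIMIT_MS = 25.0      # TPOT rule: fixed <=25 ms
TARGET_PASS_RATE = 0.95   # target pass rate


class RequestResult(NamedTuple):
    """Hypothetical per-request measurement record."""
    input_tokens: int
    ttft_ms: float
    tpot_ms: float


def ttft_threshold_ms(input_tokens: int) -> float:
    """TTFT threshold for a request, following the TTFT rule table."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(r: RequestResult) -> bool:
    """Assumption: a request must satisfy both the TTFT and TPOT rules."""
    return (r.ttft_ms <= ttft_threshold_ms(r.input_tokens)
            and r.tpot_ms <= TPOT_LIMIT_MS)


def pass_rate(results: Iterable[RequestResult]) -> float:
    """Fraction of requests that pass; 0.0 for an empty probe."""
    items = list(results)
    if not items:
        return 0.0
    return sum(request_passes(r) for r in items) / len(items)


def slo_met(results: Iterable[RequestResult]) -> bool:
    return pass_rate(results) >= TARGET_PASS_RATE
```

Because the input bucket for this study is capped at 8192 tokens, only the first two rows of the TTFT table can apply to requests in this run.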

Specs

Remote generated specs:

no-harness: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json
harness: .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json

The two specs were generated from configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json. After normalizing study_id and llm.use_harness, the JSON payloads compare equal. Therefore the only tuning-behavior difference between the formal comparison runs is whether the harness is enabled.
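
One way to reproduce the normalized comparison is sketched below. It assumes that study_id is a top-level key and that use_harness sits under a top-level "llm" object, as the dotted name suggests. The helper names are hypothetical and this is not necessarily how the check was originally run.

```python
import copy
import json


def normalize_spec(spec: dict) -> dict:
    """Return a copy of the spec without the fields expected to differ."""
    out = copy.deepcopy(spec)
    out.pop("study_id", None)            # assumed top-level key
    llm = out.get("llm")
    if isinstance(llm, dict):
        llm.pop("use_harness", None)     # assumed nested under "llm"
    return out


def specs_equal_after_normalizing(a: dict, b: dict) -> bool:
    return normalize_spec(a) == normalize_spec(b)


def load_spec(path: str) -> dict:
    """Load a spec JSON file from disk."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Toy specs standing in for the two generated files.
noharness = {"study_id": "a", "llm": {"use_harness": False}, "search": {"max_probes": 6}}
harness = {"study_id": "b", "llm": {"use_harness": True}, "search": {"max_probes": 6}}
print(specs_equal_after_normalizing(noharness, harness))
```

To run it on the real files, pass the two spec paths listed above through `load_spec` and compare the results with `specs_equal_after_normalizing`.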

Commands

No-harness:

PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json \
  --store-root .aituner-tight \
  --max-trials 16

Harness:

PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json \
  --store-root .aituner-tight \
  --max-trials 16

Run Log

2026-05-06 12:37 CST: generated both remote specs and verified that the only normalized difference is llm.use_harness.
2026-05-06 12:37 CST: started no-harness in tmux session qwen27b_tpot25_noharness_16iter_20260506.
2026-05-06 21:06 CST: stopped the initial no-harness run and discarded it as a pre-run; it is not used in the comparison. It used restart_engine_after_early_stop=false: the first TP1 baseline probe had already recorded slo_pass_rate_unrecoverable, but its unfinished requests remained live in vLLM and would have contaminated the next probe.
2026-05-06 21:07 CST: generated the formal clean specs with restart_engine_after_early_stop=true for both variants and verified the normalized diff is still only llm.use_harness.
2026-05-06 21:09 CST: started formal no-harness run in tmux session qwen27b_tpot25_restart_noharness_16iter_20260506.
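
The slo_pass_rate_unrecoverable outcome in the log can be understood through a simple bound: once enough requests have failed, the pass rate cannot reach the target even if every remaining request passes. The sketch below shows that arithmetic only. The function name and the exact rule are assumptions, not AITuner's actual early-stop logic.

```python
def pass_rate_unrecoverable(total_requests: int,
                            failed_so_far: int,
                            target: float = 0.95) -> bool:
    """True if the target pass rate is out of reach for this probe.

    Assumption: the best case is that every request that has not yet
    failed ends up passing, so the best achievable pass rate is
    (total_requests - failed_so_far) / total_requests.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    best_possible = (total_requests - failed_so_far) / total_requests
    return best_possible < target


# With 100 requests and a 0.95 target, the 6th failure makes the probe unrecoverable.
print(pass_rate_unrecoverable(100, 5))
print(pass_rate_unrecoverable(100, 6))
```

Under this bound a probe can be declared infeasible while many of its requests are still in flight, which is the situation that motivated restart_engine_after_early_stop=true.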

Results

Pending.
