# qwen27b-chat-0-8k TPOT25 16-Iter Harness Compare

## Goal

Rerun the internal vLLM Qwen3.5-27B chat 0-8k tuning comparison under a stricter TPOT SLO:

- no-harness: 16 tuning iterations;
- harness: 16 tuning iterations, with permission to stop early if the harness convergence guard decides no further GPU trial is needed.

Both variants must be launched directly through AITuner. No state seeding, manual replay, or historical-result injection is allowed.

## Setup

- Host: `dash0`.
- Hardware: 8 NVIDIA H20 GPUs.
- Engine: internal vLLM at `/usr/local/bin/vllm`.
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
- Served model name: `qwen35-27b-aituner`.
- Workload window: `chat_w20260311_1000`.
- Trace path source: `/home/admin/cpfs/wjh/aituner/aituner/trace_windows/windows.json`.
- Request mode: `chat`.
- Input bucket: `0 <= input_length <= 8192`.
- Replay scale: `1.0`.
- Max concurrency: `32`.
- Max requests per probe: unset, so each probe uses the full selected trace subset for its `sampling_u` threshold.
- Restart engine after early stop: `true` for both variants. This is needed under TPOT25 because very slow infeasible probes can leave live HTTP requests in the engine after the SLO is already unrecoverable. Restarting keeps the next binary-search probe from being contaminated by previous in-flight work.
- Search field: `sampling_u`.
- Search range: `low=0.0`, `high=0.0625`.
- Search probes: `max_probes=6`, `tolerance=0.001`.
- Sampling seed: `20260325`.

## SLO

- Target pass rate: `0.95`.
- TTFT rule:

  | Input tokens | TTFT threshold |
  | ---: | ---: |
  | `<=4096` | `2000 ms` |
  | `<=32768` | `4000 ms` |
  | otherwise | `6000 ms` |

- TPOT rule: fixed `<=25 ms`.

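
The SLO rules above can be sketched in a few lines of Python. This is only an illustration of the thresholds listed in this document, not AITuner's actual implementation: the `RequestResult` type and all function names are hypothetical, and it assumes that a request passes only when both its TTFT and its TPOT are within limits. The `is_unrecoverable` helper is likewise an assumed reading of what `slo_pass_rate_unrecoverable` means.

```python
from dataclasses import dataclass

TARGET_PASS_RATE = 0.95  # from the SLO section
TPOT_LIMIT_MS = 25.0     # fixed TPOT rule


@dataclass
class RequestResult:
    """Hypothetical per-request measurement; field names are assumptions."""
    input_tokens: int
    ttft_ms: float
    tpot_ms: float


def ttft_threshold_ms(input_tokens: int) -> float:
    """TTFT threshold tiers from the SLO table."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(r: RequestResult) -> bool:
    """Assumption: a request passes only if TTFT and TPOT are both in bounds."""
    return (r.ttft_ms <= ttft_threshold_ms(r.input_tokens)
            and r.tpot_ms <= TPOT_LIMIT_MS)


def slo_pass_rate(results: list[RequestResult]) -> float:
    """Fraction of requests that pass; an empty probe counts as 0.0."""
    if not results:
        return 0.0
    return sum(request_passes(r) for r in results) / len(results)


def probe_is_feasible(results: list[RequestResult]) -> bool:
    """A probe is feasible when the pass rate reaches the target."""
    return slo_pass_rate(results) >= TARGET_PASS_RATE


def is_unrecoverable(failed_so_far: int, total_requests: int) -> bool:
    """Assumed early-stop test: even if every remaining request passes,
    the final pass rate would still fall below the target."""
    best_case = (total_requests - failed_so_far) / total_requests
    return best_case < TARGET_PASS_RATE
```

Because the input bucket is capped at 8192 tokens, only the first two TTFT tiers can apply in this study: requests with more than 4096 input tokens fall under the `4000 ms` threshold.
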
## Specs

Remote generated specs:

- no-harness: `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json`
- harness: `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json`

The two specs were generated from `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`. After normalizing `study_id` and `llm.use_harness`, the JSON payloads compare equal. Therefore the only tuning-behavior difference between the formal comparison runs is whether the harness is enabled.

## Commands

No-harness:

```bash
PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json \
  --store-root .aituner-tight \
  --max-trials 16
```

Harness:

```bash
PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json \
  --store-root .aituner-tight \
  --max-trials 16
```

## Run Log

- 2026-05-06 12:37 CST: generated both remote specs and verified that the only normalized difference is `llm.use_harness`.
- 2026-05-06 12:37 CST: started no-harness in tmux session `qwen27b_tpot25_noharness_16iter_20260506`.
- 2026-05-06 21:06 CST: stopped the initial no-harness pre-run before using it for comparison. It used `restart_engine_after_early_stop=false`; the first TP1 baseline probe already recorded `slo_pass_rate_unrecoverable`, but unfinished requests remained live in vLLM and would contaminate the next probe.
- 2026-05-06 21:07 CST: generated the formal clean specs with `restart_engine_after_early_stop=true` for both variants and verified the normalized diff is still only `llm.use_harness`.
- 2026-05-06 21:09 CST: started formal no-harness run in tmux session `qwen27b_tpot25_restart_noharness_16iter_20260506`.

## Results

Pending.