diff --git a/docs/qwen27b-chat-0-8k-tpot25-16iter-20260506.md b/docs/qwen27b-chat-0-8k-tpot25-16iter-20260506.md
new file mode 100644
index 0000000..c2a5af7
--- /dev/null
+++ b/docs/qwen27b-chat-0-8k-tpot25-16iter-20260506.md
@@ -0,0 +1,106 @@
+# qwen27b-chat-0-8k TPOT25 16-Iter Harness Compare
+
+## Goal
+
+Rerun the internal vLLM Qwen3.5-27B chat 0-8k tuning comparison under a stricter
+TPOT SLO:
+
+- no-harness: 16 tuning iterations;
+- harness: 16 tuning iterations, with permission to stop early if the harness
+  convergence guard decides no further GPU trial is needed.
+
+Both variants must be launched directly through AITuner. No state seeding,
+manual replay, or historical-result injection is allowed.
+
+## Setup
+
+- Host: `dash0`.
+- Hardware: 8 NVIDIA H20 GPUs.
+- Engine: internal vLLM at `/usr/local/bin/vllm`.
+- Model:
+  `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
+- Served model name: `qwen35-27b-aituner`.
+- Workload window: `chat_w20260311_1000`.
+- Trace path source: `/home/admin/cpfs/wjh/aituner/aituner/trace_windows/windows.json`.
+- Request mode: `chat`.
+- Input bucket: `0 <= input_length <= 8192`.
+- Replay scale: `1.0`.
+- Max concurrency: `32`.
+- Max requests per probe: unset, so each probe uses the full selected trace
+  subset for its `sampling_u` threshold.
+- Restart engine after early stop: `true` for both variants. This is needed
+  under TPOT25 because very slow infeasible probes can leave live HTTP requests
+  in the engine after the SLO is already unrecoverable. Restarting keeps the
+  next binary-search probe from being contaminated by previous in-flight work.
+- Search field: `sampling_u`.
+- Search range: `low=0.0`, `high=0.0625`.
+- Search probes: `max_probes=6`, `tolerance=0.001`.
+- Sampling seed: `20260325`.
+
+## SLO
+
+- Target pass rate: `0.95`.
+- TTFT rule:
+
+| Input tokens | TTFT threshold |
+| ---: | ---: |
+| `<=4096` | `2000 ms` |
+| `<=32768` | `4000 ms` |
+| otherwise | `6000 ms` |
+
+- TPOT rule: fixed `<=25 ms`.
+
+## Specs
+
+Remote generated specs:
+
+- no-harness:
+  `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json`
+- harness:
+  `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json`
+
+The two specs were generated from
+`configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`. After normalizing
+`study_id` and `llm.use_harness`, the JSON payloads compare equal. Therefore the
+only tuning-behavior difference between the formal comparison runs is whether
+the harness is enabled.
+
+## Commands
+
+No-harness:
+
+```bash
+PYTHONPATH=src python3 -m aituner.cli study tune \
+  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json \
+  --store-root .aituner-tight \
+  --max-trials 16
+```
+
+Harness:
+
+```bash
+PYTHONPATH=src python3 -m aituner.cli study tune \
+  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json \
+  --store-root .aituner-tight \
+  --max-trials 16
+```
+
+## Run Log
+
+- 2026-05-06 12:37 CST: generated both remote specs and verified that the only
+  normalized difference is `llm.use_harness`.
+- 2026-05-06 12:37 CST: started no-harness in tmux session
+  `qwen27b_tpot25_noharness_16iter_20260506`.
+- 2026-05-06 21:06 CST: stopped the initial no-harness pre-run before using it
+  for comparison. It used `restart_engine_after_early_stop=false`; the first
+  TP1 baseline probe already recorded `slo_pass_rate_unrecoverable`, but
+  unfinished requests remained live in vLLM and would contaminate the next probe.
+- 2026-05-06 21:07 CST: generated the formal clean specs with
+  `restart_engine_after_early_stop=true` for both variants and verified the
+  normalized diff is still only `llm.use_harness`.
+- 2026-05-06 21:09 CST: started formal no-harness run in tmux session
+  `qwen27b_tpot25_restart_noharness_16iter_20260506`.
+
+## Results
+
+Pending.
diff --git a/paper.pdf b/paper.pdf
new file mode 100644
index 0000000..a25fbe6
Binary files /dev/null and b/paper.pdf differ
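
Note on the SLO section of the added document: the tiered TTFT thresholds, the fixed TPOT limit, and the 0.95 target pass rate can be sketched as a small check. This is an illustrative sketch only, not AITuner's implementation. The `RequestResult` type, its field names, and all function names are hypothetical. It also assumes that a request passes only when it meets both the TTFT and the TPOT limit, that thresholds are inclusive, and that the pass rate is the plain fraction of passing requests; none of this is stated in the document.

```python
from dataclasses import dataclass

TPOT_LIMIT_MS = 25.0      # "TPOT rule: fixed <=25 ms"
TARGET_PASS_RATE = 0.95   # "Target pass rate: 0.95"


@dataclass
class RequestResult:
    """Hypothetical per-request measurement; field names are illustrative."""
    input_tokens: int
    ttft_ms: float
    tpot_ms: float


def ttft_threshold_ms(input_tokens: int) -> float:
    """Return the TTFT threshold from the SLO table for a given input length."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(result: RequestResult) -> bool:
    """Assumption: a request passes only if both TTFT and TPOT are within limits."""
    ttft_ok = result.ttft_ms <= ttft_threshold_ms(result.input_tokens)
    tpot_ok = result.tpot_ms <= TPOT_LIMIT_MS
    return ttft_ok and tpot_ok


def slo_pass_rate(results: list[RequestResult]) -> float:
    """Assumption: pass rate is the fraction of requests that pass."""
    if not results:
        return 0.0
    return sum(request_passes(r) for r in results) / len(results)


def meets_slo(results: list[RequestResult]) -> bool:
    """Compare the observed pass rate against the 0.95 target."""
    return slo_pass_rate(results) >= TARGET_PASS_RATE
```

For the `0 <= input_length <= 8192` bucket used in this run, only the first two table rows can apply, so under these assumptions every request is held to either 2000 ms or 4000 ms TTFT plus the 25 ms TPOT limit.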