Add reference paper and qwen27b tpot25 16-iter notes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:30 +08:00
parent 27d1c8fa92
commit 8b4116fad0
2 changed files with 106 additions and 0 deletions
--- a/docs/qwen27b-chat-0-8k-tpot25-16iter-20260506.md
+++ b/docs/qwen27b-chat-0-8k-tpot25-16iter-20260506.md
@@ -0,0 +1,106 @@
+# qwen27b-chat-0-8k TPOT25 16-Iter Harness Compare
+
+## Goal
+
+Rerun the internal vLLM Qwen3.5-27B chat 0-8k tuning comparison under a stricter
+TPOT SLO (fixed `<=25 ms`):
+
+- no-harness: 16 tuning iterations;
+- harness: 16 tuning iterations, with permission to stop early if the harness
+  convergence guard decides no further GPU trial is needed.
+
+Both variants must be launched directly through AITuner. No state seeding,
+manual replay, or historical-result injection is allowed.
+
+## Setup
+
+- Host: `dash0`.
+- Hardware: 8 NVIDIA H20 GPUs.
+- Engine: internal vLLM at `/usr/local/bin/vllm`.
+- Model:
+  `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
+- Served model name: `qwen35-27b-aituner`.
+- Workload window: `chat_w20260311_1000`.
+- Trace path source: `/home/admin/cpfs/wjh/aituner/aituner/trace_windows/windows.json`.
+- Request mode: `chat`.
+- Input bucket: `0 <= input_length <= 8192`.
+- Replay scale: `1.0`.
+- Max concurrency: `32`.
+- Max requests per probe: unset, so each probe uses the full selected trace
+  subset for its `sampling_u` threshold.
+- Restart engine after early stop: `true` for both variants. This is needed
+  under TPOT25 because very slow infeasible probes can leave live HTTP requests
+  in the engine after the SLO is already unrecoverable. Restarting keeps the
+  next binary-search probe from being contaminated by previous in-flight work.
+- Search field: `sampling_u`.
+- Search range: `low=0.0`, `high=0.0625`.
+- Search probes: `max_probes=6`, `tolerance=0.001`.
+- Sampling seed: `20260325`.
+
+## SLO
+
+- Target pass rate: `0.95`.
+- TTFT rule:
+
+| Input tokens | TTFT threshold |
+| ---: | ---: |
+| `<=4096` | `2000 ms` |
+| `<=32768` | `4000 ms` |
+| otherwise | `6000 ms` |
+
+- TPOT rule: fixed `<=25 ms`.
+
+## Specs
+
+Remote generated specs:
+
+- no-harness:
+  `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json`
+- harness:
+  `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json`
+
+The two specs were generated from
+`configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`. After normalizing
+`study_id` and `llm.use_harness`, the JSON payloads compare equal. Therefore the
+only tuning-behavior difference between the formal comparison runs is whether
+the harness is enabled.
+
+## Commands
+
+No-harness:
+
+```bash
+PYTHONPATH=src python3 -m aituner.cli study tune \
+  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json \
+  --store-root .aituner-tight \
+  --max-trials 16
+```
+
+Harness:
+
+```bash
+PYTHONPATH=src python3 -m aituner.cli study tune \
+  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json \
+  --store-root .aituner-tight \
+  --max-trials 16
+```
+
+## Run Log
+
+- 2026-05-06 12:37 CST: generated both remote specs and verified that, apart
+  from `study_id`, they differ only in `llm.use_harness`.
+- 2026-05-06 12:37 CST: started no-harness in tmux session
+  `qwen27b_tpot25_noharness_16iter_20260506`.
+- 2026-05-06 21:06 CST: stopped the initial no-harness pre-run and excluded it
+  from the comparison. It used `restart_engine_after_early_stop=false`; the first
+  TP1 baseline probe had already recorded `slo_pass_rate_unrecoverable`, but
+  unfinished requests remained live in vLLM and would contaminate the next probe.
+- 2026-05-06 21:07 CST: generated the formal clean specs with
+  `restart_engine_after_early_stop=true` for both variants and verified that,
+  apart from `study_id`, they still differ only in `llm.use_harness`.
+- 2026-05-06 21:09 CST: started formal no-harness run in tmux session
+  `qwen27b_tpot25_restart_noharness_16iter_20260506`.
+
+## Results
+
+Pending.
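
The normalized-spec check described under `## Specs` and repeated in the run log could look roughly like the sketch below. It is a minimal illustration, not AITuner's own code: the helper names (`normalized`, `specs_match`, `load_spec`) are hypothetical, and it assumes each spec is a JSON object with a top-level `study_id` and a nested `llm.use_harness` key, inferred only from the field names the note mentions. The sample payload in the demo is likewise made up.

```python
import copy
import json


def normalized(spec: dict) -> dict:
    """Return a copy of `spec` with the two expected-to-differ fields blanked.

    Assumption: `study_id` is a top-level key and `use_harness` lives under
    `llm`. The real spec layout may differ.
    """
    out = copy.deepcopy(spec)  # never mutate the caller's spec
    out["study_id"] = None
    out.setdefault("llm", {})["use_harness"] = None
    return out


def specs_match(spec_a: dict, spec_b: dict) -> bool:
    """True when the specs are equal once study_id and llm.use_harness are ignored."""
    return normalized(spec_a) == normalized(spec_b)


def load_spec(path: str) -> dict:
    """Read one spec JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    # Made-up payloads; only the field names mentioned in the note are reused.
    no_harness = {
        "study_id": "tpot25-noharness",
        "llm": {"use_harness": False},
        "restart_engine_after_early_stop": True,
    }
    harness = {
        "study_id": "tpot25-harness",
        "llm": {"use_harness": True},
        "restart_engine_after_early_stop": True,
    }
    print("specs match after normalization:", specs_match(no_harness, harness))
```

For the real files, the same check would be `specs_match(load_spec(path_a), load_spec(path_b))` with the two spec paths listed under `## Specs`. Comparing parsed dictionaries rather than raw text keeps the result independent of key order and whitespace in the generated JSON.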