Add reference paper and qwen27b tpot25 16-iter notes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:30 +08:00
parent 27d1c8fa92
commit 8b4116fad0
2 changed files with 106 additions and 0 deletions
--- a/docs/qwen27b-chat-0-8k-tpot25-16iter-20260506.md
+++ b/docs/qwen27b-chat-0-8k-tpot25-16iter-20260506.md
@@ -0,0 +1,106 @@
 # qwen27b-chat-0-8k TPOT25 16-Iter Harness Compare
 ## Goal
 Rerun the internal vLLM Qwen3.5-27B chat 0-8k tuning comparison under a stricter
 TPOT SLO:
 - no-harness: 16 tuning iterations;
 - harness: 16 tuning iterations, with permission to stop early if the harness
  convergence guard decides no further GPU trial is needed.
 Both variants must be launched directly through AITuner. No state seeding,
 manual replay, or historical-result injection is allowed.
 ## Setup
 - Host: `dash0`.
 - Hardware: 8 NVIDIA H20 GPUs.
 - Engine: internal vLLM at `/usr/local/bin/vllm`.
 - Model:
  `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
 - Served model name: `qwen35-27b-aituner`.
 - Workload window: `chat_w20260311_1000`.
 - Trace path source: `/home/admin/cpfs/wjh/aituner/aituner/trace_windows/windows.json`.
 - Request mode: `chat`.
 - Input bucket: `0 <= input_length <= 8192`.
 - Replay scale: `1.0`.
 - Max concurrency: `32`.
 - Max requests per probe: unset, so each probe uses the full selected trace
  subset for its `sampling_u` threshold.
 - Restart engine after early stop: `true` for both variants. This is needed
  under TPOT25 because very slow infeasible probes can leave live HTTP requests
  in the engine after the SLO is already unrecoverable. Restarting keeps the
  next binary-search probe from being contaminated by previous in-flight work.
 - Search field: `sampling_u`.
 - Search range: `low=0.0`, `high=0.0625`.
 - Search probes: `max_probes=6`, `tolerance=0.001`.
 - Sampling seed: `20260325`.
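The search settings above can be read as a bounded bisection over `sampling_u`. The sketch below is only an illustration under stated assumptions: it assumes feasibility is monotone (a lower `sampling_u` replays fewer requests and is easier to serve), that every probe is a plain midpoint, and that `is_feasible` is a hypothetical callback standing in for one full replay probe. AITuner's actual probe order may differ. Under those assumptions, six halvings of the `0.0625` range leave an interval of `0.0625 / 2**6 = 0.0009765625`, just under the `0.001` tolerance.

```python
def bisect_max_feasible(is_feasible, low=0.0, high=0.0625,
                        max_probes=6, tolerance=0.001):
    """Bisect for the largest feasible value in [low, high].

    is_feasible is a hypothetical callback: it takes a sampling_u value
    and returns True when the probe at that load meets the SLO.
    Returns (best_feasible_or_None, probed_values, (low, high)).
    """
    best = None
    probes = []
    for _ in range(max_probes):
        # Stop once the bracket is already narrower than the tolerance.
        if high - low <= tolerance:
            break
        mid = (low + high) / 2.0
        probes.append(mid)
        if is_feasible(mid):
            best = mid   # feasible: try a higher load next
            low = mid
        else:
            high = mid   # infeasible: back off
    return best, probes, (low, high)


if __name__ == "__main__":
    # Toy feasibility rule for illustration only.
    best, probes, bracket = bisect_max_feasible(lambda u: u <= 0.02)
    print(best, len(probes), bracket)
```

With these defaults the probe budget and the tolerance are consistent with each other: the sixth probe is the first one that brings the bracket below the tolerance.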
 ## SLO
 - Target pass rate: `0.95`.
 - TTFT rule:
 | Input tokens | TTFT threshold |
 | ---: | ---: |
 | `<=4096` | `2000 ms` |
 | `<=32768` | `4000 ms` |
 | otherwise | `6000 ms` |
 - TPOT rule: fixed `<=25 ms`.
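As a rough illustration of how the SLO above could be evaluated per request, the following sketch applies the TTFT table and the fixed TPOT limit, then compares the pass rate with the target. It assumes a request passes only when both the TTFT and TPOT conditions hold, that the thresholds are inclusive, and that the table rows are checked top to bottom. The function names and the `(input_tokens, ttft_ms, tpot_ms)` tuple layout are hypothetical and not taken from AITuner.

```python
TPOT_LIMIT_MS = 25.0
TARGET_PASS_RATE = 0.95


def ttft_threshold_ms(input_tokens):
    """TTFT threshold from the table, checked top to bottom."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(input_tokens, ttft_ms, tpot_ms):
    """Assumed rule: both TTFT and TPOT must be within their limits."""
    return (ttft_ms <= ttft_threshold_ms(input_tokens)
            and tpot_ms <= TPOT_LIMIT_MS)


def slo_met(requests):
    """requests: iterable of (input_tokens, ttft_ms, tpot_ms) tuples."""
    requests = list(requests)
    if not requests:
        raise ValueError("no requests to evaluate")
    passed = sum(1 for r in requests if request_passes(*r))
    pass_rate = passed / len(requests)
    return pass_rate >= TARGET_PASS_RATE, pass_rate


if __name__ == "__main__":
    sample = [(1000, 1500.0, 20.0)] * 19 + [(6000, 3500.0, 30.0)]
    print(slo_met(sample))
```

Because the input bucket for this run is capped at 8192 tokens, only the first two table rows can apply here.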
 ## Specs
 Remote generated specs:
 - no-harness:
  `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json`
 - harness:
  `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json`
 The two specs were generated from
 `configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`. After normalizing
 `study_id` and `llm.use_harness`, the JSON payloads compare equal. Therefore the
 only tuning-behavior difference between the formal comparison runs is whether
 the harness is enabled.
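The equality check described above could be reproduced along these lines. This is a minimal sketch, assuming `study_id` is a top-level key and `use_harness` sits inside a top-level `llm` object; the helper names are hypothetical and the real spec layout may differ.

```python
import copy


def normalize_spec(spec):
    """Return a copy of the spec without the fields expected to differ."""
    normalized = copy.deepcopy(spec)
    normalized.pop("study_id", None)
    llm = normalized.get("llm")
    if isinstance(llm, dict):
        llm.pop("use_harness", None)
    return normalized


def specs_equivalent(spec_a, spec_b):
    """True when the specs differ only in study_id and llm.use_harness."""
    return normalize_spec(spec_a) == normalize_spec(spec_b)


if __name__ == "__main__":
    # Toy payloads; real specs would be read with json.load().
    a = {"study_id": "noharness", "llm": {"use_harness": False}, "x": 1}
    b = {"study_id": "harness", "llm": {"use_harness": True}, "x": 1}
    print(specs_equivalent(a, b))
```

Comparing parsed dictionaries rather than raw text keeps the check insensitive to key order and whitespace in the generated JSON files.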
 ## Commands
 No-harness:
 ```bash
 PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json \
  --store-root .aituner-tight \
  --max-trials 16
 ```
 Harness:
 ```bash
 PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json \
  --store-root .aituner-tight \
  --max-trials 16
 ```
 ## Run Log
 - 2026-05-06 12:37 CST: generated both remote specs and verified that, after
  normalizing `study_id`, the only remaining difference is `llm.use_harness`.
 - 2026-05-06 12:37 CST: started no-harness in tmux session
  `qwen27b_tpot25_noharness_16iter_20260506`.
 - 2026-05-06 21:06 CST: stopped the initial no-harness pre-run and excluded it
  from the comparison. It used `restart_engine_after_early_stop=false`; the
  first TP1 baseline probe had already recorded `slo_pass_rate_unrecoverable`,
  but unfinished requests remained live in vLLM and would have contaminated the
  next probe.
 - 2026-05-06 21:07 CST: generated the formal clean specs with
  `restart_engine_after_early_stop=true` for both variants and verified the
  normalized diff is still only `llm.use_harness`.
 - 2026-05-06 21:09 CST: started formal no-harness run in tmux session
  `qwen27b_tpot25_restart_noharness_16iter_20260506`.
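One plausible reading of the `slo_pass_rate_unrecoverable` condition in the log is that a probe can stop early once the target pass rate is out of reach even if every remaining request passes. The sketch below illustrates that reading only; the function name, its arguments, and the exact rule are assumptions and are not taken from AITuner.

```python
def pass_rate_unrecoverable(failed_so_far, total_requests,
                            target_pass_rate=0.95):
    """True when the best achievable pass rate is already below target.

    Best case: every request that has not failed yet ends up passing.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    best_case = (total_requests - failed_so_far) / total_requests
    return best_case < target_pass_rate


if __name__ == "__main__":
    # With 100 requests, 5 failures are still recoverable, 6 are not.
    print(pass_rate_unrecoverable(5, 100), pass_rate_unrecoverable(6, 100))
```

Stopping at this point saves probe time, but requests already in flight keep occupying the engine, which is why the formal runs restart it before the next probe.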
 ## Results
 Pending.
--- a/paper.pdf
+++ b/paper.pdf