Add reference paper and qwen27b tpot25 16-iter notes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:30 +08:00
parent 27d1c8fa92
commit 8b4116fad0
2 changed files with 106 additions and 0 deletions

@@ -0,0 +1,106 @@
# qwen27b-chat-0-8k TPOT25 16-Iter Harness Compare
## Goal
Rerun the internal vLLM Qwen3.5-27B chat 0-8k tuning comparison under a stricter
TPOT SLO:
- no-harness: 16 tuning iterations;
- harness: 16 tuning iterations, with permission to stop early if the harness
convergence guard decides no further GPU trial is needed.
Both variants must be launched directly through AITuner. No state seeding,
manual replay, or historical-result injection is allowed.
## Setup
- Host: `dash0`.
- Hardware: 8 NVIDIA H20 GPUs.
- Engine: internal vLLM at `/usr/local/bin/vllm`.
- Model:
`/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
- Served model name: `qwen35-27b-aituner`.
- Workload window: `chat_w20260311_1000`.
- Trace path source: `/home/admin/cpfs/wjh/aituner/aituner/trace_windows/windows.json`.
- Request mode: `chat`.
- Input bucket: `0 <= input_length <= 8192`.
- Replay scale: `1.0`.
- Max concurrency: `32`.
- Max requests per probe: unset, so each probe uses the full selected trace
subset for its `sampling_u` threshold.
- Restart engine after early stop: `true` for both variants. Under TPOT25 this
is required because a very slow, infeasible probe can be stopped early once
its SLO is unrecoverable while its HTTP requests are still in flight in the
engine. Restarting keeps that leftover work from contaminating the next
binary-search probe.
- Search field: `sampling_u`.
- Search range: `low=0.0`, `high=0.0625`.
- Search probes: `max_probes=6`, `tolerance=0.001`.
- Sampling seed: `20260325`.
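The search settings above can be sanity-checked with a minimal bisection sketch. It assumes the search halves the bracket on every probe and stops at whichever limit is reached first. The `bisect_sampling_u` helper and the `is_feasible` callback are hypothetical names for illustration, not AITuner's actual API.

```python
from typing import Callable, Tuple


def bisect_sampling_u(
    is_feasible: Callable[[float], bool],
    low: float = 0.0,
    high: float = 0.0625,
    max_probes: int = 6,
    tolerance: float = 0.001,
) -> Tuple[float, float, int]:
    """Narrow [low, high] around the largest feasible sampling_u.

    Hypothetical sketch: assumes feasibility is monotone (feasible below
    some threshold, infeasible above it) and that each probe tests the
    midpoint of the current bracket.
    """
    probes = 0
    while probes < max_probes and (high - low) > tolerance:
        mid = (low + high) / 2.0
        if is_feasible(mid):
            low = mid   # SLO met: the sustainable load is at least mid
        else:
            high = mid  # SLO missed: the sustainable load is below mid
        probes += 1
    return low, high, probes


# Toy feasibility rule standing in for a real GPU probe.
low, high, probes = bisect_sampling_u(lambda u: u <= 0.02)
print(low, high, probes)
```

Under these assumptions the bracket shrinks to `0.0625 / 2**6 ≈ 0.00098`, so six halvings are just enough to get below the `0.001` tolerance.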
## SLO
- Target pass rate: `0.95`.
- TTFT rule:
| Input tokens | TTFT threshold |
| ---: | ---: |
| `<=4096` | `2000 ms` |
| `<=32768` | `4000 ms` |
| otherwise | `6000 ms` |
- TPOT rule: fixed `<=25 ms`.
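The SLO above can be read as a per-request check followed by a pass-rate check. The sketch below is one possible reading, not the AITuner implementation: it assumes a request passes only when both the TTFT and TPOT rules hold, and that the TTFT thresholds are inclusive. All function names are hypothetical.

```python
from typing import Iterable, Tuple

TARGET_PASS_RATE = 0.95
TPOT_LIMIT_MS = 25.0


def ttft_limit_ms(input_tokens: int) -> float:
    """Return the TTFT threshold for a request, following the table above."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(input_tokens: int, ttft_ms: float, tpot_ms: float) -> bool:
    """Assumption: a request must satisfy both the TTFT and the TPOT rule."""
    return ttft_ms <= ttft_limit_ms(input_tokens) and tpot_ms <= TPOT_LIMIT_MS


def probe_is_feasible(requests: Iterable[Tuple[int, float, float]]) -> bool:
    """requests holds (input_tokens, ttft_ms, tpot_ms) tuples for one probe."""
    results = [request_passes(*r) for r in requests]
    if not results:
        return False  # an empty probe is treated as infeasible in this sketch
    return sum(results) / len(results) >= TARGET_PASS_RATE
```

For the `0 <= input_length <= 8192` bucket used here, only the first two TTFT rows can apply.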
## Specs
Remote generated specs:
- no-harness:
`.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json`
- harness:
`.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json`
The two specs were generated from
`configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`. After normalizing
`study_id` and `llm.use_harness`, the JSON payloads compare equal. Therefore the
only tuning-behavior difference between the formal comparison runs is whether
the harness is enabled.
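The equality check described above could be reproduced with a short script such as the following. This is a sketch only: it assumes `study_id` is a top-level key and that `llm.use_harness` denotes a `use_harness` key inside a top-level `llm` object. The helper names are hypothetical.

```python
import copy
import json
from pathlib import Path


def normalize_spec(spec: dict) -> dict:
    """Return a copy of the spec without the fields expected to differ."""
    out = copy.deepcopy(spec)
    out.pop("study_id", None)  # assumed top-level key
    llm = out.get("llm")
    if isinstance(llm, dict):
        llm.pop("use_harness", None)  # assumed nested under "llm"
    return out


def specs_match(path_a: str, path_b: str) -> bool:
    """Compare two spec files after normalization."""
    spec_a = json.loads(Path(path_a).read_text())
    spec_b = json.loads(Path(path_b).read_text())
    return normalize_spec(spec_a) == normalize_spec(spec_b)
```

Calling `specs_match` on the two spec paths listed above should return `True` if the normalization claim holds.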
## Commands
No-harness:
```bash
PYTHONPATH=src python3 -m aituner.cli study tune \
--spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json \
--store-root .aituner-tight \
--max-trials 16
```
Harness:
```bash
PYTHONPATH=src python3 -m aituner.cli study tune \
--spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json \
--store-root .aituner-tight \
--max-trials 16
```
## Run Log
- 2026-05-06 12:37 CST: generated both remote specs and verified that, after
normalizing `study_id`, the only difference is `llm.use_harness`.
- 2026-05-06 12:37 CST: started no-harness in tmux session
`qwen27b_tpot25_noharness_16iter_20260506`.
- 2026-05-06 21:06 CST: stopped the initial no-harness pre-run before using it
for comparison. It used `restart_engine_after_early_stop=false`; the first
TP1 baseline probe already recorded `slo_pass_rate_unrecoverable`, but
unfinished requests remained live in vLLM and would contaminate the next probe.
- 2026-05-06 21:07 CST: generated the formal clean specs with
`restart_engine_after_early_stop=true` for both variants and verified that,
after normalizing `study_id`, the only difference is still `llm.use_harness`.
- 2026-05-06 21:09 CST: started formal no-harness run in tmux session
`qwen27b_tpot25_restart_noharness_16iter_20260506`.
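One plausible reading of the `slo_pass_rate_unrecoverable` status in the log above is that enough requests have already failed that the target pass rate cannot be reached even if every unfinished request passes. The sketch below illustrates that reading only. The function name and arguments are hypothetical and do not come from AITuner.

```python
def pass_rate_unrecoverable(total: int, failed: int, target: float = 0.95) -> bool:
    """Return True when the best achievable pass rate is below the target.

    Assumption: all requests that have not failed yet (finished or still
    in flight) are counted as passing in the best case.
    """
    if total <= 0:
        return False  # nothing to evaluate yet
    best_case_pass_rate = (total - failed) / total
    return best_case_pass_rate < target


# With 100 requests and a 0.95 target, the sixth failure makes the probe
# unrecoverable even though other requests may still be running.
print(pass_rate_unrecoverable(total=100, failed=6))
```

Under this reading a probe can be declared infeasible while requests are still running, which is why the engine restart matters for the next probe.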
## Results
Pending.