Add reference paper and qwen27b tpot25 16-iter notes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs/qwen27b-chat-0-8k-tpot25-16iter-20260506.md
# qwen27b-chat-0-8k TPOT25 16-Iter Harness Compare

## Goal

Rerun the internal vLLM Qwen3.5-27B chat 0-8k tuning comparison under a stricter
TPOT SLO:

- no-harness: 16 tuning iterations;
- harness: 16 tuning iterations, with permission to stop early if the harness
  convergence guard decides no further GPU trial is needed.

Both variants must be launched directly through AITuner. No state seeding,
manual replay, or historical-result injection is allowed.

## Setup

- Host: `dash0`.
- Hardware: 8 NVIDIA H20 GPUs.
- Engine: internal vLLM at `/usr/local/bin/vllm`.
- Model:
  `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`.
- Served model name: `qwen35-27b-aituner`.
- Workload window: `chat_w20260311_1000`.
- Trace path source: `/home/admin/cpfs/wjh/aituner/aituner/trace_windows/windows.json`.
- Request mode: `chat`.
- Input bucket: `0 <= input_length <= 8192`.
- Replay scale: `1.0`.
- Max concurrency: `32`.
- Max requests per probe: unset, so each probe uses the full selected trace
  subset for its `sampling_u` threshold.
- Restart engine after early stop: `true` for both variants. This is needed
  under TPOT25 because very slow infeasible probes can leave live HTTP requests
  in the engine after the SLO is already unrecoverable. Restarting keeps the
  next binary-search probe from being contaminated by previous in-flight work.
- Search field: `sampling_u`.
- Search range: `low=0.0`, `high=0.0625`.
- Search probes: `max_probes=6`, `tolerance=0.001`.
- Sampling seed: `20260325`.
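
The search settings above can be read as a plain bisection over `sampling_u`. The sketch below is illustrative only: the function name `bisect_sampling_u` and the `probe_is_feasible` callback are hypothetical, and it assumes that a feasible probe raises `low`, an infeasible probe lowers `high`, and the search stops at `max_probes` or once the interval is no wider than `tolerance`. AITuner's actual search loop may differ.

```python
from typing import Callable, List, Tuple


def bisect_sampling_u(
    probe_is_feasible: Callable[[float], bool],  # hypothetical: runs one probe, True if the SLO is met
    low: float = 0.0,
    high: float = 0.0625,
    max_probes: int = 6,
    tolerance: float = 0.001,
) -> Tuple[float, List[float]]:
    """Return (largest sampling_u seen feasible, list of probed values)."""
    best = low
    probed: List[float] = []
    for _ in range(max_probes):
        # Stop once the bracket is already narrower than the tolerance.
        if high - low <= tolerance:
            break
        mid = (low + high) / 2.0
        probed.append(mid)
        if probe_is_feasible(mid):
            best = mid
            low = mid   # feasible: try a higher load
        else:
            high = mid  # infeasible: back off
    return best, probed


if __name__ == "__main__":
    # Toy stand-in for a real probe: pretend loads up to 0.02 are feasible.
    best, probed = bisect_sampling_u(lambda u: u <= 0.02)
    print(best, probed)
```

With these defaults the bracket starts at width 0.0625 and halves on every probe, so six probes narrow it to 0.0625 / 64 ≈ 0.00098, just under the `0.001` tolerance. Under the assumptions of this sketch, the two limits are therefore reached at essentially the same time.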

## SLO

- Target pass rate: `0.95`.
- TTFT rule:

| Input tokens | TTFT threshold |
| ---: | ---: |
| `<=4096` | `2000 ms` |
| `<=32768` | `4000 ms` |
| otherwise | `6000 ms` |

- TPOT rule: fixed `<=25 ms`.
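
As a rough illustration of how these rules combine, the sketch below evaluates one probe's results. All names (`RequestResult`, `request_passes`, `pass_rate`, `is_unrecoverable`) are hypothetical. It assumes that a request passes only when both its TTFT and TPOT limits hold, that the pass rate is the fraction of passing requests, and that a probe becomes unrecoverable once the failures already seen make the target pass rate unreachable. The real AITuner accounting may differ.

```python
from dataclasses import dataclass
from typing import Sequence

TPOT_LIMIT_MS = 25.0      # fixed TPOT rule from this section
TARGET_PASS_RATE = 0.95   # target pass rate from this section


@dataclass
class RequestResult:
    """Hypothetical per-request measurement record."""
    input_tokens: int
    ttft_ms: float
    tpot_ms: float


def ttft_threshold_ms(input_tokens: int) -> float:
    """TTFT threshold by input length, following the table in this section."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(r: RequestResult) -> bool:
    # Assumption: both latency rules must hold for the request to count as a pass.
    return r.ttft_ms <= ttft_threshold_ms(r.input_tokens) and r.tpot_ms <= TPOT_LIMIT_MS


def pass_rate(results: Sequence[RequestResult]) -> float:
    if not results:
        return 0.0
    return sum(request_passes(r) for r in results) / len(results)


def is_unrecoverable(failed_so_far: int, total_requests: int) -> bool:
    # Even if every remaining request passed, the best achievable rate is
    # (total - failed) / total; below the target, the probe cannot recover.
    return (total_requests - failed_so_far) / total_requests < TARGET_PASS_RATE


if __name__ == "__main__":
    ok = RequestResult(input_tokens=1024, ttft_ms=800.0, tpot_ms=20.0)
    slow = RequestResult(input_tokens=1024, ttft_ms=800.0, tpot_ms=31.0)
    print(pass_rate([ok] * 19 + [slow]), is_unrecoverable(2, 20))
```

Under these assumptions, a probe with 20 requests tolerates exactly one failure. A second failure caps the achievable pass rate at 0.90, which is the kind of state in which stopping the probe early saves GPU time.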

## Specs

Remote generated specs:

- no-harness:
  `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json`
- harness:
  `.aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json`

The two specs were generated from
`configs/examples/dash0_qwen27b_tight_slo_run4_0_8k.json`. After normalizing
`study_id` and `llm.use_harness`, the JSON payloads compare equal. Therefore the
only tuning-behavior difference between the formal comparison runs is whether
the harness is enabled.
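
One way to perform such a normalized comparison is sketched below. It is not the script used for this run. It assumes that `study_id` is a top-level key and that `use_harness` sits under a top-level `llm` object, as the dotted name suggests, and the helper names are hypothetical.

```python
import copy
import json
from typing import Any, Dict


def load_spec(path: str) -> Dict[str, Any]:
    """Load a spec file as a plain dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def normalize_spec(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with the fields that are expected to differ removed."""
    out = copy.deepcopy(spec)
    out.pop("study_id", None)          # assumed top-level key
    llm = out.get("llm")
    if isinstance(llm, dict):
        llm.pop("use_harness", None)   # assumed location of llm.use_harness
    return out


def specs_equivalent(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    # Dict equality is order-insensitive for keys, so this compares payloads,
    # not the textual layout of the JSON files.
    return normalize_spec(a) == normalize_spec(b)


if __name__ == "__main__":
    # Toy in-memory specs; real ones would come from load_spec(<path>).
    no_harness = {"study_id": "a", "llm": {"use_harness": False}, "slo": {"tpot_ms": 25}}
    harness = {"study_id": "b", "llm": {"use_harness": True}, "slo": {"tpot_ms": 25}}
    print(specs_equivalent(no_harness, harness))
```

Comparing parsed dictionaries rather than raw file text keeps key ordering and whitespace from producing spurious differences.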

## Commands

No-harness:

```bash
PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-noharness.json \
  --store-root .aituner-tight \
  --max-trials 16
```

Harness:

```bash
PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec .aituner-tight/specs/dash0-qwen27b-chat-0-8k-tpot25-restart-16iter-harness.json \
  --store-root .aituner-tight \
  --max-trials 16
```

## Run Log

- 2026-05-06 12:37 CST: generated both remote specs and verified that the only
  normalized difference is `llm.use_harness`.
- 2026-05-06 12:37 CST: started no-harness in tmux session
  `qwen27b_tpot25_noharness_16iter_20260506`.
- 2026-05-06 21:06 CST: stopped the initial no-harness pre-run before using it
  for comparison. It used `restart_engine_after_early_stop=false`; the first
  TP1 baseline probe already recorded `slo_pass_rate_unrecoverable`, but
  unfinished requests remained live in vLLM and would contaminate the next probe.
- 2026-05-06 21:07 CST: generated the formal clean specs with
  `restart_engine_after_early_stop=true` for both variants and verified the
  normalized diff is still only `llm.use_harness`.
- 2026-05-06 21:09 CST: started formal no-harness run in tmux session
  `qwen27b_tpot25_restart_noharness_16iter_20260506`.

## Results

Pending.