# Qwen27B Chat 0-8k Harness Ablation
Date: 2026-05-10
## Setup
- Host: dash0 (172.27.114.84)
- Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
- Workload: chat, 0-8k input window
- SLO: TTFT <= 4000 ms and TPOT <= 25 ms, target pass rate = 0.95
- Trial budget: 12 tuning iterations per study
- Execution: direct `python3 -m aituner.cli study tune ... --max-trials 12`
- GPU env: `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7` (GPU 3 skipped, matching the `gpu3skip` study IDs)
- Code commit: adc4351
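The SLO above can be expressed as a per-request pass check against the two latency targets. The sketch below is illustrative only (the `(ttft_ms, tpot_ms)` request shape and helper names are assumptions, not aituner internals):

```python
# Illustrative SLO check mirroring the setup above; not aituner code.
# A trial is feasible when at least 95% of its requests meet both targets.
TTFT_MS = 4000.0          # time-to-first-token ceiling
TPOT_MS = 25.0            # time-per-output-token ceiling
TARGET_PASS_RATE = 0.95

def slo_pass_rate(requests):
    """requests: iterable of (ttft_ms, tpot_ms) pairs."""
    reqs = list(requests)
    if not reqs:
        return 0.0
    ok = sum(1 for ttft, tpot in reqs if ttft <= TTFT_MS and tpot <= TPOT_MS)
    return ok / len(reqs)

def trial_feasible(requests):
    return slo_pass_rate(requests) >= TARGET_PASS_RATE
```

Under this reading, the "infeasible" rows in the trial tables below are trials whose measured pass rate fell under 0.95.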
The previous no-harness run was affected by the dash0 migration and had many engine launch failures. This document uses the clean no-harness rerun from 2026-05-09.
## Studies
| Variant | Study ID |
|---|---|
| no-harness rerun | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-noharness-rerun-20260509 |
| harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-20260508 |
## Result
| Variant | Best iter | Best request rate | Best request rate / GPU | Best config summary |
|---|---|---|---|---|
| no-harness rerun | 10 | 0.4050 | 0.2025 | tensor-parallel-size=2, data-parallel-size=1, max-num-batched-tokens=12288 |
| harness | 8 | 1.0967 | 0.2742 | tensor-parallel-size=4, enable-chunked-prefill=true, max-num-batched-tokens=16384 |
The harness run reached a higher incumbent and reached it earlier: final best request rate per GPU improved by about 35.4% over the clean no-harness rerun (0.2742 vs 0.2025).
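The per-GPU column is the total best request rate divided by the number of GPUs the winning config occupies (tensor-parallel-size × data-parallel-size). A quick sketch, with values taken from the Result table:

```python
# Per-GPU throughput = total request rate / (tp * dp).
def per_gpu(request_rate, tp, dp=1):
    return request_rate / (tp * dp)

no_harness = per_gpu(0.4050, tp=2, dp=1)   # -> 0.2025
harness    = per_gpu(1.0967, tp=4, dp=1)   # ~0.2742
gain = harness / no_harness - 1            # ~0.354, i.e. about 35.4%
```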
## Incumbent Curve
Values are incumbent best request rate per GPU after each tuning iteration.
| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness rerun | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 |
| harness | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2696 | 0.2742 | 0.2742 | 0.2742 | 0.2742 | 0.2742 (stop) |
## Trial Details
No-harness rerun:
| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
|---|---|---|---|---|
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2, max-num-batched-tokens=12288 |
| 3 | 0.0308 | 0.0650 | completed | tp=1, dp=4 |
| 4 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=12288 |
| 5 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=16384 |
| 6 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=12288, block-size=32 |
| 7 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=10240 |
| 8 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=7168 |
| 9 | - | 0.0650 | completed, infeasible | tp=1, dp=2 |
| 10 | 0.2025 | 0.2025 | completed | tp=2, dp=1, max-num-batched-tokens=12288 |
| 11 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=10240 |
| 12 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=13312 |
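The incumbent column is a running maximum over completed, feasible trials; infeasible trials leave it unchanged. A sketch reproducing the no-harness row of the incumbent curve (per-trial values copied from the table, `None` marking infeasible trials):

```python
def incumbent_curve(trial_results):
    """Running best over per-GPU trial results; None = infeasible trial."""
    best, curve = 0.0, []
    for r in trial_results:
        if r is not None and r > best:
            best = r
        curve.append(best)
    return curve

no_harness = [0.0650, 0.0617, 0.0308, None, None, None,
              None, None, None, 0.2025, None, None]
# incumbent_curve(no_harness) -> 0.0650 for iters 1-9, then 0.2025 for 10-12
```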
Harness:
| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
|---|---|---|---|---|
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2 |
| 3 | 0.2025 | 0.2025 | completed | tp=2, dp=1 |
| 4 | - | 0.2025 | completed, infeasible | tp=2, chunked prefill, max-num-batched-tokens=16384 |
| 5 | 0.1283 | 0.2025 | completed | tp=2, dp=2 |
| 6 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-seqs=4 |
| 7 | 0.2696 | 0.2696 | completed | tp=4, dp=1 |
| 8 | 0.2742 | 0.2742 | completed | tp=4, chunked prefill, max-num-batched-tokens=16384 |
| 9 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=24576 |
| 10 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=8 |
| 11 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=16 |
| 12 | - | 0.2742 | harness stop | validation exhausted after strong incumbent |
## Interpretation
The clean no-harness rerun eventually found the tp=2 topology at iter 10, so the old migration-tainted no-harness result was indeed too pessimistic. Harness still improves the process in two ways:
- It reaches the tp=2 topology by iter 3 instead of iter 10.
- It then escalates to tp=4 and a nearby batching refinement, reaching 0.2742 req/s/GPU.
The harness effect is not "one iter to best"; it is directional search. It turns bottleneck evidence into topology validation probes, then validates runtime refinements around the stronger incumbent and stops when further nearby probes do not improve.
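One way to read the stop at iter 12 is a patience rule over validation probes: once the incumbent has been validated, a run of consecutive infeasible nearby probes triggers an early stop. The sketch below is a hypothetical reconstruction, not the harness's actual stop logic; the patience value of 3 is an assumption chosen so the rule fires exactly where the trace stops (after the three infeasible probes at iters 9-11):

```python
def should_stop(trial_results, patience=3):
    """Hypothetical harness stop rule (assumption, not actual code).

    trial_results: per-GPU rates in trial order; None = infeasible probe.
    Stop once `patience` consecutive probes come back infeasible.
    """
    stale = 0
    for r in trial_results:
        stale = stale + 1 if r is None else 0
        if stale >= patience:
            return True
    return False

# Harness trace from the table above (iters 1-11).
harness = [0.0650, 0.0617, 0.2025, None, 0.1283, None,
           0.2696, 0.2742, None, None, None]
# should_stop(harness) -> True: iters 9-11 are three infeasible probes in a row
```

Note that the single infeasible probes at iters 4 and 6 do not trigger the rule, because a completed trial in between resets the streak; only the sustained run of failed refinements around the strong incumbent does.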