# Qwen3.5-27B Chat 0-8k Harness Ablation
Date: 2026-05-10
## Setup
- Host: dash0 (172.27.114.84)
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Workload: chat, 0-8k input window
- SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
- Trial budget: 12 tuning iterations per study
- Execution: direct `python3 -m aituner.cli study tune ... --max-trials 12`
- GPU env: `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7`
- Code commit: adc4351
The previous no-harness run was affected by the dash0 migration and had many engine launch failures. This document uses the clean no-harness rerun from 2026-05-09.
## Studies
| Variant | Study ID |
|---|---|
| no-harness rerun | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-noharness-rerun-20260509 |
| harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-20260508 |
## Result
The table below is the raw per-iteration performance for a Fig18-style plot. Use this table as perf[i]; do not replace missing points with max(perf[:i+1]).
Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point under the SLO. stop means the harness stopped before launching another GPU trial.
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness raw perf[i] | 0.0650 | 0.0617 | 0.0308 | NA | NA | NA | NA | NA | NA | 0.2025 | NA | NA |
| harness raw perf[i] | 0.0650 | 0.0617 | 0.2025 | NA | 0.1283 | NA | 0.2696 | 0.2742 | NA | NA | NA | stop |
The raw no-harness curve is not monotonic: iter2 and iter3 are worse than the baseline, and iter4-9 do not produce feasible configs. The monotonic curve below is best-so-far/incumbent tracking, not the measured performance of each proposal.
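The best-so-far transformation that separates the raw curve from the incumbent curve can be sketched in a few lines. This is a minimal illustration using the no-harness row above, with `None` standing in for NA; it is not part of the tuner's code.

```python
# Derive the incumbent (best-so-far) curve from raw per-iteration results.
# None stands in for NA: an infeasible proposal never updates the incumbent.
# Values are req/s/GPU from the no-harness raw perf[i] row above.
raw = [0.0650, 0.0617, 0.0308, None, None, None,
       None, None, None, 0.2025, None, None]

incumbent = []
best = float("-inf")
for perf in raw:
    if perf is not None and perf > best:
        best = perf
    incumbent.append(best)

print(incumbent)
# Matches the no-harness rerun row of the incumbent-curve table:
# 0.0650 for iterations 1-9, then 0.2025 from iteration 10 onward.
```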
| Variant | Best iter | Best request rate | Best request rate / GPU | Best config summary |
|---|---|---|---|---|
| no-harness rerun | 10 | 0.4050 | 0.2025 | tensor-parallel-size=2, data-parallel-size=1, max-num-batched-tokens=12288 |
| harness | 8 | 1.0967 | 0.2742 | tensor-parallel-size=4, enable-chunked-prefill=true, max-num-batched-tokens=16384 |
Harness reached a higher incumbent and did so earlier. Final best request rate per GPU improved by about 35.4% over the clean no-harness rerun.
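The per-GPU numbers and the quoted improvement can be reproduced from the totals, assuming (as the config summaries suggest) that GPU count per trial is tensor-parallel size times data-parallel size:

```python
# Sanity-check the per-GPU rates and the ~35.4% improvement claim.
# Assumption: GPUs used = tensor_parallel * data_parallel.
no_harness_total, no_harness_gpus = 0.4050, 2   # tp=2, dp=1
harness_total, harness_gpus = 1.0967, 4         # tp=4, dp=1

no_harness_per_gpu = no_harness_total / no_harness_gpus   # 0.2025
harness_per_gpu = harness_total / harness_gpus            # 0.2742 (rounded)
improvement_pct = (harness_per_gpu / no_harness_per_gpu - 1) * 100

print(round(no_harness_per_gpu, 4),
      round(harness_per_gpu, 4),
      round(improvement_pct, 1))
```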
## Incumbent Curve
Values are incumbent best request rate per GPU after each tuning iteration. This table is useful for explaining final best selection, but it should not be used as Fig18 raw perf[i].
| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness rerun | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 |
| harness | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2696 | 0.2742 | 0.2742 | 0.2742 | 0.2742 | stop |
For plotting raw perf[i], keep NA points missing or render them as invalid trials. If a plotting script requires numeric values, use 0 only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.
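One way to implement the rule above is to map NA (and the terminal stop) to NaN before plotting, so line plots render gaps for infeasible trials instead of fake zeros or forward-filled values. A minimal sketch using the harness row:

```python
import math

# Prepare raw perf[i] for a Fig18-style plot. NA (infeasible under the SLO)
# and "stop" (harness halted) become NaN, which most plotting libraries
# (e.g. matplotlib's plot()) render as gaps rather than data points.
raw = ["0.0650", "0.0617", "0.2025", "NA", "0.1283", "NA",
       "0.2696", "0.2742", "NA", "NA", "NA", "stop"]

def to_point(value):
    # Non-numeric markers become gaps; everything else is a real measurement.
    return math.nan if value in ("NA", "stop") else float(value)

ys = [to_point(v) for v in raw]
xs = list(range(1, len(ys) + 1))

# Iterations that actually produced a feasible measurement:
print([x for x, y in zip(xs, ys) if not math.isnan(y)])
```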
## Trial Details
No-harness rerun:
| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
|---|---|---|---|---|
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2, max-num-batched-tokens=12288 |
| 3 | 0.0308 | 0.0650 | completed | tp=1, dp=4 |
| 4 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=12288 |
| 5 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=16384 |
| 6 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=12288, block-size=32 |
| 7 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=10240 |
| 8 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=7168 |
| 9 | - | 0.0650 | completed, infeasible | tp=1, dp=2 |
| 10 | 0.2025 | 0.2025 | completed | tp=2, dp=1, max-num-batched-tokens=12288 |
| 11 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=10240 |
| 12 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=13312 |
Harness:
| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
|---|---|---|---|---|
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2 |
| 3 | 0.2025 | 0.2025 | completed | tp=2, dp=1 |
| 4 | - | 0.2025 | completed, infeasible | tp=2, chunked prefill, max-num-batched-tokens=16384 |
| 5 | 0.1283 | 0.2025 | completed | tp=2, dp=2 |
| 6 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-seqs=4 |
| 7 | 0.2696 | 0.2696 | completed | tp=4, dp=1 |
| 8 | 0.2742 | 0.2742 | completed | tp=4, chunked prefill, max-num-batched-tokens=16384 |
| 9 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=24576 |
| 10 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=8 |
| 11 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=16 |
| 12 | - | 0.2742 | harness stop | validation exhausted after strong incumbent |
## Interpretation
The clean no-harness rerun eventually found the tp=2 topology at iter 10, so the old migration-tainted no-harness result was indeed too pessimistic. Harness still improves the process in two ways:
- It reaches the `tp=2` topology by iter 3 instead of iter 10.
- It then escalates to `tp=4` and a nearby batching refinement, reaching 0.2742 req/s/GPU.
The harness effect is not "one iteration to best"; it is directional search: the harness turns bottleneck evidence into topology-validation probes, then validates runtime refinements around the stronger incumbent, and stops when further nearby probes fail to improve.
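The stop behavior at iter 12 can be sketched as a patience rule: after the incumbent stops improving, a few consecutive non-improving or infeasible probes end the study before another GPU trial is launched. This is a hypothetical illustration of the described behavior, not aituner's actual implementation; the input list stands in for successive trial results (req/s/GPU, or `None` if infeasible).

```python
# Hypothetical sketch of the harness stop rule described above:
# stop after `patience` consecutive probes that fail to improve the incumbent.
def tune(results, patience=3):
    best = None
    misses = 0
    for result in results:
        if result is not None and (best is None or result > best):
            best, misses = result, 0   # new incumbent resets the counter
        else:
            misses += 1
            if misses >= patience:     # validation exhausted: harness stop
                break
    return best

# Harness iters 7-12 from the table: two improvements, then three infeasible
# probes trigger the stop, so a sixth trial is never launched.
print(tune([0.2696, 0.2742, None, None, None, 0.9999]))
```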