File: aituner/docs/harness-ablation/qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md

Qwen27B Chat 0-8k Harness Ablation

Date: 2026-05-10

Setup

  • Host: dash0 (172.27.114.84)
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
  • Workload: chat, 0-8k input window
  • SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
  • Trial budget: 12 total tuning iterations per study
  • Execution: direct python3 -m aituner.cli study tune ... --max-trials 12
  • GPU env: CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7
  • Code commit: adc4351
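The SLO above can be read as a per-trial feasibility rule. A minimal sketch, assuming a trial is feasible only when at least 95% of its requests meet both latency bounds (function and input shape are illustrative, not the actual aituner API):

```python
# Illustrative SLO check; names and shapes are assumptions, not aituner code.
TTFT_MS = 4000.0        # TTFT <= 4000 ms
TPOT_MS = 25.0          # TPOT <= 25 ms
TARGET_PASS_RATE = 0.95

def trial_is_feasible(requests):
    """requests: iterable of (ttft_ms, tpot_ms) per completed request."""
    requests = list(requests)
    if not requests:
        return False
    ok = sum(1 for ttft, tpot in requests
             if ttft <= TTFT_MS and tpot <= TPOT_MS)
    return ok / len(requests) >= TARGET_PASS_RATE
```

Trials marked "completed, infeasible" in the tables below completed their benchmark run but failed a check of this shape, so they never become the incumbent.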

The previous no-harness run was affected by the dash0 migration and had many engine launch failures. This document uses the clean no-harness rerun from 2026-05-09.

Studies

| Variant | Study ID |
| --- | --- |
| no-harness rerun | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-noharness-rerun-20260509 |
| harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-20260508 |

Result

| Variant | Best iter | Best request rate | Best request rate / GPU | Best config summary |
| --- | --- | --- | --- | --- |
| no-harness rerun | 10 | 0.4050 | 0.2025 | tensor-parallel-size=2, data-parallel-size=1, max-num-batched-tokens=12288 |
| harness | 8 | 1.0967 | 0.2742 | tensor-parallel-size=4, enable-chunked-prefill=true, max-num-batched-tokens=16384 |

The harness variant reached a higher incumbent, and reached it earlier (iter 8 vs. iter 10). Its final best request rate per GPU is about 35.4% higher than the clean no-harness rerun's.
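The per-GPU column and the 35.4% figure follow from simple normalization. A sketch, assuming each trial occupies tensor-parallel-size × data-parallel-size GPUs (consistent with the tables here; the exact GPU accounting is an assumption):

```python
def per_gpu(request_rate, tp, dp=1):
    # Normalize aggregate request rate by the GPUs the config occupies
    # (assumed to be tp * dp, matching the tables in this document).
    return request_rate / (tp * dp)

# Values from the Result table:
no_harness = per_gpu(0.4050, tp=2)       # 0.2025
harness = per_gpu(1.0967, tp=4)          # ~0.2742
improvement = harness / no_harness - 1   # ~0.354, i.e. ~35.4%
```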

Incumbent Curve

Values are incumbent best request rate per GPU after each tuning iteration.

| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness rerun | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 |
| harness | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2696 | 0.2742 | 0.2742 | 0.2742 | 0.2742 | stop |
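The incumbent curve is just a running maximum over per-iteration trial results, where infeasible trials carry the previous incumbent forward. A minimal sketch (using the no-harness trial results from the Trial Details section below; `None` marks an infeasible trial):

```python
def incumbent_curve(trial_results):
    """Running best request rate per GPU; None (infeasible trial)
    keeps the previous incumbent unchanged."""
    best, curve = 0.0, []
    for r in trial_results:
        if r is not None and r > best:
            best = r
        curve.append(best)
    return curve

# No-harness rerun per-GPU trial results, iterations 1-12:
no_harness = [0.0650, 0.0617, 0.0308, None, None, None,
              None, None, None, 0.2025, None, None]
# incumbent_curve(no_harness) reproduces the no-harness row above.
```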

Trial Details

No-harness rerun:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| --- | --- | --- | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2, max-num-batched-tokens=12288 |
| 3 | 0.0308 | 0.0650 | completed | tp=1, dp=4 |
| 4 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=12288 |
| 5 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=16384 |
| 6 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=12288, block-size=32 |
| 7 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=10240 |
| 8 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=7168 |
| 9 | - | 0.0650 | completed, infeasible | tp=1, dp=2 |
| 10 | 0.2025 | 0.2025 | completed | tp=2, dp=1, max-num-batched-tokens=12288 |
| 11 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=10240 |
| 12 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=13312 |

Harness:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| --- | --- | --- | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2 |
| 3 | 0.2025 | 0.2025 | completed | tp=2, dp=1 |
| 4 | - | 0.2025 | completed, infeasible | tp=2, chunked prefill, max-num-batched-tokens=16384 |
| 5 | 0.1283 | 0.2025 | completed | tp=2, dp=2 |
| 6 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-seqs=4 |
| 7 | 0.2696 | 0.2696 | completed | tp=4, dp=1 |
| 8 | 0.2742 | 0.2742 | completed | tp=4, chunked prefill, max-num-batched-tokens=16384 |
| 9 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=24576 |
| 10 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=8 |
| 11 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=16 |
| 12 | - | 0.2742 | harness stop | validation exhausted after strong incumbent |

Interpretation

The clean no-harness rerun eventually found the tp=2 topology at iter 10, so the old migration-tainted no-harness result was indeed too pessimistic. Harness still improves the process in two ways:

  • It reaches the tp=2 topology by iter 3 instead of iter 10.
  • It then escalates to tp=4 and a nearby batching refinement, reaching 0.2742 req/s/GPU.

The harness effect is not "best config in one iteration"; it is directional search. The harness turns bottleneck evidence into topology validation probes, then validates runtime refinements around the stronger incumbent, and stops once further nearby probes fail to improve it.