aituner/docs/harness-ablation/qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md

Qwen27B Chat 0-8k Harness Ablation

Date: 2026-05-10

Setup

  • Host: dash0 (172.27.114.84)
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
  • Workload: chat, 0-8k input window
  • SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
  • Trial budget: 12 total tuning iterations per study
  • Execution: direct python3 -m aituner.cli study tune ... --max-trials 12
  • GPU env: CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7
  • Code commit: adc4351

The previous no-harness run was affected by the dash0 migration and had many engine launch failures. This document uses the clean no-harness rerun from 2026-05-09.

Studies

| Variant | Study ID |
| --- | --- |
| no-harness rerun | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-noharness-rerun-20260509 |
| harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-20260508 |

Result

The table below is the raw per-iteration performance for a Fig18-style plot. Use this table as perf[i]; do not replace missing points with max(perf[:i+1]).

Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point in the measured search range. stop means the harness stopped before launching another GPU trial.

Important caveat: these runs were produced before the lower-range fallback fix. For same-parallel-size runtime patches, AITuner inherited the incumbent sampling_u as the new search floor. If the config was infeasible above that floor, the old worker wrote NA without searching below the floor. Therefore the NA entries below are not complete Fig18-quality raw performance points; they are "no feasible point above inherited floor." A rerun with the fixed worker is required to fill their true lower-load performance.
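To make the caveat concrete, here is a minimal sketch of the pre-fix worker behavior. The names (`search_best_rate`, `feasible_at`, `inherited_floor`) are illustrative only, not AITuner's actual API; the point is that the old worker never looked below the inherited floor, so a config that is only feasible at lower load is recorded as NA.

```python
def search_best_rate(feasible_at, candidates, inherited_floor):
    """Old-worker sketch: only sample load points at or above the inherited floor.

    feasible_at: predicate telling whether a request rate meets the SLO.
    candidates: ascending list of candidate request rates (req/s/GPU).
    Returns the best feasible rate, or None (recorded as NA).
    """
    searched = [u for u in candidates if u >= inherited_floor]
    feasible = [u for u in searched if feasible_at(u)]
    # If nothing above the floor is feasible, the old worker wrote NA
    # without ever probing below the floor.
    return max(feasible) if feasible else None

# Example: a config that is only feasible at low load, below the inherited floor.
candidates = [0.05, 0.10, 0.15, 0.20, 0.25]
feasible_at = lambda u: u <= 0.10   # SLO holds only up to 0.10 req/s/GPU
print(search_best_rate(feasible_at, candidates, inherited_floor=0.15))  # None -> NA
print(search_best_rate(feasible_at, candidates, inherited_floor=0.0))   # 0.1
```

The fixed worker would fall back to searching below the floor (the second call), which is why a rerun is needed to recover the true lower-load performance of the NA entries.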

| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness raw perf[i] | 0.0650 | 0.0617 | 0.0308 | NA | NA | NA | NA | NA | NA | 0.2025 | NA | NA |
| harness raw perf[i] | 0.0650 | 0.0617 | 0.2025 | NA | 0.1283 | NA | 0.2696 | 0.2742 | NA | NA | NA | stop |

The raw no-harness curve is not monotonic: iter2 and iter3 are worse than the baseline, and iter4-9 do not produce feasible configs. The monotonic curve below is best-so-far/incumbent tracking, not the measured performance of each proposal.

| Variant | Best iter | Best request rate | Best request rate / GPU | Best config summary |
| --- | --- | --- | --- | --- |
| no-harness rerun | 10 | 0.4050 | 0.2025 | tensor-parallel-size=2, data-parallel-size=1, max-num-batched-tokens=12288 |
| harness | 8 | 1.0967 | 0.2742 | tensor-parallel-size=4, enable-chunked-prefill=true, max-num-batched-tokens=16384 |

The harness variant reached a higher incumbent and reached it earlier. The final best request rate per GPU improved by about 35.4% (0.2742 vs 0.2025 req/s/GPU) over the clean no-harness rerun.
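The headline improvement is straightforward to cross-check from the table:

```python
# Cross-checking the headline numbers from the best-result table.
best_no_harness = 0.2025   # req/s/GPU, no-harness rerun, iter 10
best_harness = 0.2742      # req/s/GPU, harness, iter 8
gain = best_harness / best_no_harness - 1
print(f"{gain:.1%}")  # -> 35.4%
```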

Incumbent Curve

Values are incumbent best request rate per GPU after each tuning iteration. This table is useful for explaining final best selection, but it should not be used as Fig18 raw perf[i].

| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness rerun | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 |
| harness | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2696 | 0.2742 | 0.2742 | 0.2742 | 0.2742 | stop |

For plotting raw perf[i], keep NA points missing or render them as invalid trials. If a plotting script requires numeric values, use 0 only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.
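A minimal plotting sketch that follows this rule: NA (and the harness stop at iter 12) map to NaN, which matplotlib renders as gaps in the line rather than forward-filled or zeroed points. The data is taken from the raw perf[i] table above; everything else here is illustrative.

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def to_series(perf):
    # Map NA (None) to NaN so the plotted line breaks at infeasible trials
    # instead of dropping to zero or carrying the incumbent forward.
    return [math.nan if p is None else p for p in perf]

# Raw perf[i] from the Result section; None marks NA, and the harness
# iter-12 entry is None because "stop" is not a measurement.
no_harness = [0.0650, 0.0617, 0.0308, None, None, None,
              None, None, None, 0.2025, None, None]
harness = [0.0650, 0.0617, 0.2025, None, 0.1283, None,
           0.2696, 0.2742, None, None, None, None]

iters = range(1, 13)
plt.plot(iters, to_series(no_harness), marker="o", label="no-harness rerun")
plt.plot(iters, to_series(harness), marker="s", label="harness")
plt.xlabel("tuning iteration")
plt.ylabel("raw perf[i] (req/s/GPU)")
plt.legend()
plt.savefig("fig18_raw_perf.png")
```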

Trial Details

No-harness rerun:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| --- | --- | --- | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2, max-num-batched-tokens=12288 |
| 3 | 0.0308 | 0.0650 | completed | tp=1, dp=4 |
| 4 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=12288 |
| 5 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=16384 |
| 6 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=12288, block-size=32 |
| 7 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=10240 |
| 8 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=7168 |
| 9 | - | 0.0650 | completed, infeasible | tp=1, dp=2 |
| 10 | 0.2025 | 0.2025 | completed | tp=2, dp=1, max-num-batched-tokens=12288 |
| 11 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=10240 |
| 12 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=13312 |

Harness:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| --- | --- | --- | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2 |
| 3 | 0.2025 | 0.2025 | completed | tp=2, dp=1 |
| 4 | - | 0.2025 | completed, infeasible | tp=2, chunked prefill, max-num-batched-tokens=16384 |
| 5 | 0.1283 | 0.2025 | completed | tp=2, dp=2 |
| 6 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-seqs=4 |
| 7 | 0.2696 | 0.2696 | completed | tp=4, dp=1 |
| 8 | 0.2742 | 0.2742 | completed | tp=4, chunked prefill, max-num-batched-tokens=16384 |
| 9 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=24576 |
| 10 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=8 |
| 11 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=16 |
| 12 | - | 0.2742 | harness stop | validation exhausted after strong incumbent |

Interpretation

The clean no-harness rerun eventually found the tp=2 topology at iter 10, so the old migration-tainted no-harness result was indeed too pessimistic. Harness still improves the process in two ways:

  • It reaches the tp=2 topology by iter 3 instead of iter 10.
  • It then escalates to tp=4 and a nearby batching refinement, reaching 0.2742 req/s/GPU.

The harness effect is not "one iter to best"; it is directional search. It turns bottleneck evidence into topology validation probes, then validates runtime refinements around the stronger incumbent and stops when further nearby probes do not improve.