aituner/docs/harness-ablation/qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md

Qwen27B Chat 0-8k Harness Ablation

Date: 2026-05-10

Setup

  • Host: dash0 (172.27.114.84)
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
  • Workload: chat, 0-8k input window
  • SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
  • Trial budget: 12 total tuning iterations per study
  • Execution: direct python3 -m aituner.cli study tune ... --max-trials 12
  • GPU env: CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7
  • Code commit: adc4351

The previous no-harness run was affected by the dash0 migration and had many engine launch failures. This document uses the clean no-harness rerun from 2026-05-09.

Studies

| Variant | Study ID |
| --- | --- |
| no-harness rerun | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-noharness-rerun-20260509 |
| harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-20260508 |

Result

The table below is the raw per-iteration performance for a Fig18-style plot. Use this table as perf[i]; do not replace missing points with max(perf[:i+1]).

Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point in the measured search range. stop means the harness stopped before launching another GPU trial.

Important caveat: these runs were produced before the lower-range fallback fix. For same-parallel-size runtime patches, AITuner inherited the incumbent sampling_u as the new search floor. If the config was infeasible above that floor, the old worker wrote NA without searching below the floor. Therefore the NA entries below are not complete Fig18-quality raw performance points; they are "no feasible point above inherited floor." A rerun with the fixed worker is required to fill their true lower-load performance.
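To make the caveat concrete, here is a minimal sketch of the pre-fix worker behavior. The names (`search_best_rate`, `feasible_at`, `inherited_floor`) are illustrative only, not AITuner's actual API; the point is that the old worker never looked below the inherited floor, so a config that is only feasible at lower load is recorded as NA.

```python
def search_best_rate(feasible_at, candidates, inherited_floor):
    """Old-worker sketch: only sample load points at or above the inherited floor.

    feasible_at: predicate telling whether a request rate meets the SLO.
    candidates: ascending list of candidate request rates (req/s/GPU).
    Returns the best feasible rate, or None (recorded as NA).
    """
    searched = [u for u in candidates if u >= inherited_floor]
    feasible = [u for u in searched if feasible_at(u)]
    # If nothing above the floor is feasible, the old worker wrote NA
    # without ever probing below the floor.
    return max(feasible) if feasible else None

# Example: a config that is only feasible at low load, below the inherited floor.
candidates = [0.05, 0.10, 0.15, 0.20, 0.25]
feasible_at = lambda u: u <= 0.10   # SLO holds only up to 0.10 req/s/GPU
print(search_best_rate(feasible_at, candidates, inherited_floor=0.15))  # None -> NA
print(search_best_rate(feasible_at, candidates, inherited_floor=0.0))   # 0.1
```

The fixed worker would fall back to searching below the floor (the second call), which is why a rerun is needed to recover the true lower-load performance of the NA entries.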

| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness raw perf[i] | 0.0650 | 0.0617 | 0.0308 | NA | NA | NA | NA | NA | NA | 0.2025 | NA | NA |
| harness raw perf[i] | 0.0650 | 0.0617 | 0.2025 | NA | 0.1283 | NA | 0.2696 | 0.2742 | NA | NA | NA | stop |

The raw no-harness curve is not monotonic: iter2 and iter3 are worse than the baseline, and iter4-9 do not produce feasible configs. The monotonic curve below is best-so-far/incumbent tracking, not the measured performance of each proposal.

| Variant | Best iter | Best request rate | Best request rate / GPU | Best config summary |
| --- | --- | --- | --- | --- |
| no-harness rerun | 10 | 0.4050 | 0.2025 | tensor-parallel-size=2, data-parallel-size=1, max-num-batched-tokens=12288 |
| harness | 8 | 1.0967 | 0.2742 | tensor-parallel-size=4, enable-chunked-prefill=true, max-num-batched-tokens=16384 |

The harness variant reached a higher incumbent and reached it earlier. The final best request rate per GPU improved by about 35.4% (0.2742 vs 0.2025 req/s/GPU) over the clean no-harness rerun.
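The headline improvement is straightforward to cross-check from the table:

```python
# Cross-checking the headline numbers from the best-result table.
best_no_harness = 0.2025   # req/s/GPU, no-harness rerun, iter 10
best_harness = 0.2742      # req/s/GPU, harness, iter 8
gain = best_harness / best_no_harness - 1
print(f"{gain:.1%}")  # -> 35.4%
```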

Incumbent Curve

Values are incumbent best request rate per GPU after each tuning iteration. This table is useful for explaining final best selection, but it should not be used as Fig18 raw perf[i].

| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness rerun | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 |
| harness | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2696 | 0.2742 | 0.2742 | 0.2742 | 0.2742 | stop |

For plotting raw perf[i], keep NA points missing or render them as invalid trials. If a plotting script requires numeric values, use 0 only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent.
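A minimal plotting sketch that follows this rule: NA (and the harness stop at iter 12) map to NaN, which matplotlib renders as gaps in the line rather than forward-filled or zeroed points. The data is taken from the raw perf[i] table above; everything else here is illustrative.

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def to_series(perf):
    # Map NA (None) to NaN so the plotted line breaks at infeasible trials
    # instead of dropping to zero or carrying the incumbent forward.
    return [math.nan if p is None else p for p in perf]

# Raw perf[i] from the Result section; None marks NA, and the harness
# iter-12 entry is None because "stop" is not a measurement.
no_harness = [0.0650, 0.0617, 0.0308, None, None, None,
              None, None, None, 0.2025, None, None]
harness = [0.0650, 0.0617, 0.2025, None, 0.1283, None,
           0.2696, 0.2742, None, None, None, None]

iters = range(1, 13)
plt.plot(iters, to_series(no_harness), marker="o", label="no-harness rerun")
plt.plot(iters, to_series(harness), marker="s", label="harness")
plt.xlabel("tuning iteration")
plt.ylabel("raw perf[i] (req/s/GPU)")
plt.legend()
plt.savefig("fig18_raw_perf.png")
```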

Trial Details

No-harness rerun:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| --- | --- | --- | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2, max-num-batched-tokens=12288 |
| 3 | 0.0308 | 0.0650 | completed | tp=1, dp=4 |
| 4 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=12288 |
| 5 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=16384 |
| 6 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=12288, block-size=32 |
| 7 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=10240 |
| 8 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=7168 |
| 9 | - | 0.0650 | completed, infeasible | tp=1, dp=2 |
| 10 | 0.2025 | 0.2025 | completed | tp=2, dp=1, max-num-batched-tokens=12288 |
| 11 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=10240 |
| 12 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=13312 |

Harness:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| --- | --- | --- | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2 |
| 3 | 0.2025 | 0.2025 | completed | tp=2, dp=1 |
| 4 | - | 0.2025 | completed, infeasible | tp=2, chunked prefill, max-num-batched-tokens=16384 |
| 5 | 0.1283 | 0.2025 | completed | tp=2, dp=2 |
| 6 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-seqs=4 |
| 7 | 0.2696 | 0.2696 | completed | tp=4, dp=1 |
| 8 | 0.2742 | 0.2742 | completed | tp=4, chunked prefill, max-num-batched-tokens=16384 |
| 9 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=24576 |
| 10 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=8 |
| 11 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=16 |
| 12 | - | 0.2742 | harness stop | validation exhausted after strong incumbent |

Interpretation

The clean no-harness rerun eventually found the tp=2 topology at iter 10, so the old migration-tainted no-harness result was indeed too pessimistic. Harness still improves the process in two ways:

  • It reaches the tp=2 topology by iter 3 instead of iter 10.
  • It then escalates to tp=4 and a nearby batching refinement, reaching 0.2742 req/s/GPU.

The harness effect is not "one iter to best"; it is directional search. It turns bottleneck evidence into topology validation probes, then validates runtime refinements around the stronger incumbent and stops when further nearby probes do not improve.