File: aituner/docs/harness-ablation/qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md

Qwen27B Chat 0-8k Harness Ablation

Date: 2026-05-10

Setup

  • Host: dash0 (172.27.114.84)
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
  • Workload: chat, 0-8k input window
  • SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
  • Trial budget: 12 total tuning iterations per study
  • Execution: direct python3 -m aituner.cli study tune ... --max-trials 12
  • GPU env: CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7
  • Code commit: adc4351
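The SLO above can be read as a per-trial feasibility rule. A minimal sketch, assuming a trial is feasible only when at least 95% of its requests meet both latency bounds (function and input shape are illustrative, not the actual aituner API):

```python
# Illustrative SLO check; names and shapes are assumptions, not aituner code.
TTFT_MS = 4000.0        # TTFT <= 4000 ms
TPOT_MS = 25.0          # TPOT <= 25 ms
TARGET_PASS_RATE = 0.95

def trial_is_feasible(requests):
    """requests: iterable of (ttft_ms, tpot_ms) per completed request."""
    requests = list(requests)
    if not requests:
        return False
    ok = sum(1 for ttft, tpot in requests
             if ttft <= TTFT_MS and tpot <= TPOT_MS)
    return ok / len(requests) >= TARGET_PASS_RATE
```

Trials marked "completed, infeasible" in the tables below completed their benchmark run but failed a check of this shape, so they never become the incumbent.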

The previous no-harness run was affected by the dash0 migration and had many engine launch failures. This document uses the clean no-harness rerun from 2026-05-09.

Studies

| Variant | Study ID |
| --- | --- |
| no-harness rerun | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-noharness-rerun-20260509 |
| harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-20260508 |

Result

| Variant | Best iter | Best request rate | Best request rate / GPU | Best config summary |
| --- | --- | --- | --- | --- |
| no-harness rerun | 10 | 0.4050 | 0.2025 | tensor-parallel-size=2, data-parallel-size=1, max-num-batched-tokens=12288 |
| harness | 8 | 1.0967 | 0.2742 | tensor-parallel-size=4, enable-chunked-prefill=true, max-num-batched-tokens=16384 |

The harness variant reached a higher incumbent, and reached it earlier (iter 8 vs. iter 10). Its final best request rate per GPU is about 35.4% higher than the clean no-harness rerun's.
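The per-GPU column and the 35.4% figure follow from simple normalization. A sketch, assuming each trial occupies tensor-parallel-size × data-parallel-size GPUs (consistent with the tables here; the exact GPU accounting is an assumption):

```python
def per_gpu(request_rate, tp, dp=1):
    # Normalize aggregate request rate by the GPUs the config occupies
    # (assumed to be tp * dp, matching the tables in this document).
    return request_rate / (tp * dp)

# Values from the Result table:
no_harness = per_gpu(0.4050, tp=2)       # 0.2025
harness = per_gpu(1.0967, tp=4)          # ~0.2742
improvement = harness / no_harness - 1   # ~0.354, i.e. ~35.4%
```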

Incumbent Curve

Values are incumbent best request rate per GPU after each tuning iteration.

| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| no-harness rerun | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 |
| harness | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2696 | 0.2742 | 0.2742 | 0.2742 | 0.2742 | stop |
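The incumbent curve is just a running maximum over per-iteration trial results, where infeasible trials carry the previous incumbent forward. A minimal sketch (using the no-harness trial results from the Trial Details section below; `None` marks an infeasible trial):

```python
def incumbent_curve(trial_results):
    """Running best request rate per GPU; None (infeasible trial)
    keeps the previous incumbent unchanged."""
    best, curve = 0.0, []
    for r in trial_results:
        if r is not None and r > best:
            best = r
        curve.append(best)
    return curve

# No-harness rerun per-GPU trial results, iterations 1-12:
no_harness = [0.0650, 0.0617, 0.0308, None, None, None,
              None, None, None, 0.2025, None, None]
# incumbent_curve(no_harness) reproduces the no-harness row above.
```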

Trial Details

No-harness rerun:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| --- | --- | --- | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2, max-num-batched-tokens=12288 |
| 3 | 0.0308 | 0.0650 | completed | tp=1, dp=4 |
| 4 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=12288 |
| 5 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=16384 |
| 6 | - | 0.0650 | completed, infeasible | tp=1, dp=2, max-num-batched-tokens=12288, block-size=32 |
| 7 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=10240 |
| 8 | - | 0.0650 | completed, infeasible | max-num-batched-tokens=7168 |
| 9 | - | 0.0650 | completed, infeasible | tp=1, dp=2 |
| 10 | 0.2025 | 0.2025 | completed | tp=2, dp=1, max-num-batched-tokens=12288 |
| 11 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=10240 |
| 12 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-batched-tokens=13312 |

Harness:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| --- | --- | --- | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | tp=1, dp=2 |
| 3 | 0.2025 | 0.2025 | completed | tp=2, dp=1 |
| 4 | - | 0.2025 | completed, infeasible | tp=2, chunked prefill, max-num-batched-tokens=16384 |
| 5 | 0.1283 | 0.2025 | completed | tp=2, dp=2 |
| 6 | - | 0.2025 | completed, infeasible | tp=2, dp=1, max-num-seqs=4 |
| 7 | 0.2696 | 0.2696 | completed | tp=4, dp=1 |
| 8 | 0.2742 | 0.2742 | completed | tp=4, chunked prefill, max-num-batched-tokens=16384 |
| 9 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=24576 |
| 10 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=8 |
| 11 | - | 0.2742 | completed, infeasible | tp=4, chunked prefill, max-num-batched-tokens=16384, max-num-seqs=16 |
| 12 | - | 0.2742 | harness stop | validation exhausted after strong incumbent |

Interpretation

The clean no-harness rerun eventually found the tp=2 topology at iter 10, so the old migration-tainted no-harness result was indeed too pessimistic. Harness still improves the process in two ways:

  • It reaches the tp=2 topology by iter 3 instead of iter 10.
  • It then escalates to tp=4 and a nearby batching refinement, reaching 0.2742 req/s/GPU.

The harness effect is not "best config in one iteration"; it is directional search. The harness turns bottleneck evidence into topology validation probes, then validates runtime refinements around the stronger incumbent, and stops once further nearby probes fail to improve it.