# Qwen27B Chat 0-8k Harness Ablation

Date: 2026-05-10

## Setup

- Host: `dash0` (`172.27.114.84`)
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Workload: chat, 0-8k input window
- SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
- Trial budget: 12 total tuning iterations per study
- Execution: direct `python3 -m aituner.cli study tune ... --max-trials 12`
- GPU env: `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7`
- Code commit: `adc4351`

The previous no-harness run was affected by the `dash0` migration and had many engine launch failures. This document uses the clean no-harness rerun from 2026-05-09.

## Studies

| Variant | Study ID |
| --- | --- |
| no-harness rerun | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-noharness-rerun-20260509` |
| harness | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-20260508` |

## Result

The table below is the raw per-iteration performance for a Fig18-style plot. Use it as `perf[i]`; do not replace missing points with `max(perf[:i+1])`. The metric is `best_request_rate_per_gpu` from each trial's own `result.json`. `NA` means the proposed config did not produce a feasible point in the measured search range; `stop` means the harness stopped before launching another GPU trial.

Important caveat: these runs predate the lower-range fallback fix. For same-parallel-size runtime patches, AITuner inherited the incumbent `sampling_u` as the new search floor, and if a config was infeasible above that floor, the old worker wrote `NA` without searching below it. The `NA` entries below are therefore not complete Fig18-quality raw performance points; they mean "no feasible point above the inherited floor." A rerun with the fixed worker is required to fill in their true lower-load performance.

| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness raw `perf[i]` | 0.0650 | 0.0617 | 0.0308 | NA | NA | NA | NA | NA | NA | 0.2025 | NA | NA |
| harness raw `perf[i]` | 0.0650 | 0.0617 | 0.2025 | NA | 0.1283 | NA | 0.2696 | 0.2742 | NA | NA | NA | stop |

The raw no-harness curve is not monotonic: iters 2 and 3 are worse than the baseline, and iters 4-9, 11, and 12 produce no feasible configs. The monotonic curve in the Incumbent Curve section below is best-so-far incumbent tracking, not the measured performance of each proposal.

| Variant | Best iter | Best request rate (req/s) | Best request rate / GPU (req/s/GPU) | Best config summary |
| --- | ---: | ---: | ---: | --- |
| no-harness rerun | 10 | 0.4050 | 0.2025 | `tensor-parallel-size=2`, `data-parallel-size=1`, `max-num-batched-tokens=12288` |
| harness | 8 | 1.0967 | 0.2742 | `tensor-parallel-size=4`, `enable-chunked-prefill=true`, `max-num-batched-tokens=16384` |

Harness reached a higher incumbent and did so earlier: the final best request rate per GPU improved by about 35.4% over the clean no-harness rerun (0.2742 vs 0.2025 req/s/GPU).

## Incumbent Curve

Values are the incumbent best request rate per GPU after each tuning iteration. This table is useful for explaining the final best selection, but it must not be used as Fig18 raw `perf[i]`.

| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness rerun | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 |
| harness | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2696 | 0.2742 | 0.2742 | 0.2742 | 0.2742 | stop |

For plotting raw `perf[i]`, keep `NA` points missing or render them as invalid trials. If a plotting script requires numeric values, use `0` only with an explicit label that this means "no feasible configuration under the configured SLO"; do not forward-fill from the incumbent. A minimal sketch of this rendering follows.
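The sketch below (matplotlib is assumed to be available; the series values are copied from the raw `perf[i]` table above, and the output filename is illustrative) encodes infeasible trials as `NaN` so they draw as gaps rather than zeros:

```python
import math
import matplotlib.pyplot as plt

# Raw per-iteration perf[i]; NaN marks "no feasible point above the
# inherited floor". Harness iter 12 is a harness stop, also left as NaN.
NA = math.nan
iters = list(range(1, 13))
raw_noharness = [0.0650, 0.0617, 0.0308, NA, NA, NA, NA, NA, NA, 0.2025, NA, NA]
raw_harness = [0.0650, 0.0617, 0.2025, NA, 0.1283, NA, 0.2696, 0.2742, NA, NA, NA, NA]

fig, ax = plt.subplots()
# matplotlib skips non-finite points, so NA trials render as gaps, not zeros.
ax.plot(iters, raw_noharness, marker="o", label="no-harness raw perf[i]")
ax.plot(iters, raw_harness, marker="s", label="harness raw perf[i]")
ax.set_xlabel("tuning iteration")
ax.set_ylabel("best_request_rate_per_gpu (req/s/GPU)")
ax.legend()
fig.savefig("fig18_raw_perf.png", dpi=150)
```

Forward-filling the `NaN` points here would reproduce the incumbent table above and overstate how each individual proposal performed.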
## Trial Details

No-harness rerun:

| Iter | Trial result (req/s/GPU) | Incumbent (req/s/GPU) | Status | Config summary |
| ---: | ---: | ---: | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | `tp=1`, `dp=2`, `max-num-batched-tokens=12288` |
| 3 | 0.0308 | 0.0650 | completed | `tp=1`, `dp=4` |
| 4 | - | 0.0650 | completed, infeasible | `max-num-batched-tokens=12288` |
| 5 | - | 0.0650 | completed, infeasible | `tp=1`, `dp=2`, `max-num-batched-tokens=16384` |
| 6 | - | 0.0650 | completed, infeasible | `tp=1`, `dp=2`, `max-num-batched-tokens=12288`, `block-size=32` |
| 7 | - | 0.0650 | completed, infeasible | `max-num-batched-tokens=10240` |
| 8 | - | 0.0650 | completed, infeasible | `max-num-batched-tokens=7168` |
| 9 | - | 0.0650 | completed, infeasible | `tp=1`, `dp=2` |
| 10 | 0.2025 | 0.2025 | completed | `tp=2`, `dp=1`, `max-num-batched-tokens=12288` |
| 11 | - | 0.2025 | completed, infeasible | `tp=2`, `dp=1`, `max-num-batched-tokens=10240` |
| 12 | - | 0.2025 | completed, infeasible | `tp=2`, `dp=1`, `max-num-batched-tokens=13312` |

Harness:

| Iter | Trial result (req/s/GPU) | Incumbent (req/s/GPU) | Status | Config summary |
| ---: | ---: | ---: | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | `tp=1`, `dp=2` |
| 3 | 0.2025 | 0.2025 | completed | `tp=2`, `dp=1` |
| 4 | - | 0.2025 | completed, infeasible | `tp=2`, chunked prefill, `max-num-batched-tokens=16384` |
| 5 | 0.1283 | 0.2025 | completed | `tp=2`, `dp=2` |
| 6 | - | 0.2025 | completed, infeasible | `tp=2`, `dp=1`, `max-num-seqs=4` |
| 7 | 0.2696 | 0.2696 | completed | `tp=4`, `dp=1` |
| 8 | 0.2742 | 0.2742 | completed | `tp=4`, chunked prefill, `max-num-batched-tokens=16384` |
| 9 | - | 0.2742 | completed, infeasible | `tp=4`, chunked prefill, `max-num-batched-tokens=24576` |
| 10 | - | 0.2742 | completed, infeasible | `tp=4`, chunked prefill, `max-num-batched-tokens=16384`, `max-num-seqs=8` |
| 11 | - | 0.2742 | completed, infeasible | `tp=4`, chunked prefill, `max-num-batched-tokens=16384`, `max-num-seqs=16` |
| 12 | - | 0.2742 | harness stop | validation exhausted after strong incumbent |

## Interpretation

The clean no-harness rerun eventually found the `tp=2` topology at iter 10, so the old migration-tainted no-harness result was indeed too pessimistic. The harness still improves the process in two ways:

- It reaches the `tp=2` topology by iter 3 instead of iter 10.
- It then escalates to `tp=4` and a nearby batching refinement, reaching `0.2742 req/s/GPU`.

The harness effect is not "one iteration to best"; it is directional search. It turns bottleneck evidence into topology validation probes, then validates runtime refinements around the stronger incumbent, and stops when further nearby probes do not improve. The sketch below cross-checks the incumbent bookkeeping used throughout this report.
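As a cross-check of the incumbent columns above, a minimal sketch (the `incumbent_curve` helper and the inline data copies are illustrative, not part of AITuner) of the best-so-far derivation:

```python
def incumbent_curve(trial_results):
    """Running best-so-far over per-trial results in req/s/GPU.

    None marks a trial with no feasible point (or a harness stop);
    the incumbent simply carries forward past those iterations.
    """
    best = None
    curve = []
    for result in trial_results:
        if result is not None and (best is None or result > best):
            best = result
        curve.append(best)
    return curve

# Per-trial results copied from the trial tables above; harness iter 12 is
# the harness stop, encoded as None (the table writes `stop`, not a value).
noharness = [0.0650, 0.0617, 0.0308, None, None, None,
             None, None, None, 0.2025, None, None]
harness = [0.0650, 0.0617, 0.2025, None, 0.1283, None,
           0.2696, 0.2742, None, None, None, None]

assert incumbent_curve(noharness) == [0.0650] * 9 + [0.2025] * 3
assert incumbent_curve(harness) == [0.0650] * 2 + [0.2025] * 4 + [0.2696] + [0.2742] * 5
```

This running max is exactly the forward-fill that must not be applied to the raw `perf[i]` series: the incumbent column answers "what is the best config found so far," not "how did this proposal perform."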