# Qwen27B Chat 0-8k Harness Ablation

Date: 2026-05-10

## Setup

- Host: `dash0` (`172.27.114.84`)
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Workload: chat, 0-8k input window
- SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
- Trial budget: 12 total tuning iterations per study
- Execution: direct `python3 -m aituner.cli study tune ... --max-trials 12`
- GPU env: `CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7`
- Code commit: `adc4351`

The previous no-harness run was affected by the `dash0` migration and had many engine launch failures. This document uses the clean no-harness rerun from 2026-05-09.

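To make the SLO line concrete, here is a minimal sketch of one way such a check could be evaluated. It assumes a request passes only when both its TTFT and TPOT are within the limits, and that a load level counts as feasible when the fraction of passing requests is at least the target pass rate. The names `RequestLatency`, `meets_slo`, `pass_rate`, and `is_feasible` are hypothetical and are not taken from `aituner`.

```python
from dataclasses import dataclass

# Limits copied from the Setup list above.
TTFT_LIMIT_MS = 4000.0
TPOT_LIMIT_MS = 25.0
TARGET_PASS_RATE = 0.95


@dataclass(frozen=True)
class RequestLatency:
    """Per-request latency sample (hypothetical structure)."""
    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token


def meets_slo(r: RequestLatency) -> bool:
    # Assumption: a request must satisfy both limits to pass.
    return r.ttft_ms <= TTFT_LIMIT_MS and r.tpot_ms <= TPOT_LIMIT_MS


def pass_rate(requests: list[RequestLatency]) -> float:
    # Fraction of requests that satisfy the SLO; an empty sample passes nothing.
    if not requests:
        return 0.0
    return sum(meets_slo(r) for r in requests) / len(requests)


def is_feasible(requests: list[RequestLatency]) -> bool:
    # Assumption: feasibility means pass rate >= target pass rate.
    return pass_rate(requests) >= TARGET_PASS_RATE


# 19 passing requests and 1 TTFT violation out of 20.
samples = [RequestLatency(1200.0, 18.0)] * 19 + [RequestLatency(5200.0, 18.0)]
print(pass_rate(samples))    # 0.95
print(is_feasible(samples))  # True
```

Under these assumptions a single violation in 20 requests sits exactly on the 0.95 boundary, while a second violation would make the load level infeasible.
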
## Studies

| Variant | Study ID |
| --- | --- |
| no-harness rerun | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-noharness-rerun-20260509` |
| harness | `dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu3skip-12iter-harness-20260508` |

## Result

| Variant | Best iter | Best request rate (req/s) | Best request rate / GPU (req/s/GPU) | Best config summary |
| --- | ---: | ---: | ---: | --- |
| no-harness rerun | 10 | 0.4050 | 0.2025 | `tensor-parallel-size=2`, `data-parallel-size=1`, `max-num-batched-tokens=12288` |
| harness | 8 | 1.0967 | 0.2742 | `tensor-parallel-size=4`, `enable-chunked-prefill=true`, `max-num-batched-tokens=16384` |

The harness run reached a higher incumbent and reached it earlier (iter 8 versus iter 10). Its final best request rate per GPU, `0.2742` versus `0.2025`, is about `35.4%` higher than that of the clean no-harness rerun.

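The per-GPU figures and the `35.4%` improvement can be reproduced with a few lines of arithmetic. The sketch below assumes the GPU count of a configuration is `tensor-parallel-size * data-parallel-size`, and that the best harness configuration uses `data-parallel-size=1` (the result table does not state it). The function names are illustrative only.

```python
def per_gpu_rate(request_rate: float, tensor_parallel: int, data_parallel: int = 1) -> float:
    # Assumption: GPUs used = tensor-parallel-size * data-parallel-size.
    return request_rate / (tensor_parallel * data_parallel)


def relative_improvement(new: float, old: float) -> float:
    # Fractional gain of `new` over `old`.
    return (new - old) / old


# Best request rates (req/s) from the result table.
no_harness = round(per_gpu_rate(0.4050, tensor_parallel=2, data_parallel=1), 4)
harness = round(per_gpu_rate(1.0967, tensor_parallel=4), 4)  # dp=1 assumed

gain_pct = round(relative_improvement(harness, no_harness) * 100, 1)

print(no_harness)  # 0.2025
print(harness)     # 0.2742
print(gain_pct)    # 35.4
```
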
## Incumbent Curve

Values are the incumbent (best-so-far) request rate per GPU, in req/s/GPU, after each tuning iteration. `stop` marks the iteration at which the harness stopped the search.

| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| no-harness rerun | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 |
| harness | 0.0650 | 0.0650 | 0.2025 | 0.2025 | 0.2025 | 0.2025 | 0.2696 | 0.2742 | 0.2742 | 0.2742 | 0.2742 | stop |

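An incumbent curve of this kind is a running maximum over the per-trial results, with infeasible trials leaving the incumbent unchanged. The following sketch illustrates that computation; it is not the `aituner` implementation. Infeasible trials are represented as `None`, and the input values are the per-GPU trial results reported in this document.

```python
from typing import Optional


def incumbent_curve(trial_results: list[Optional[float]]) -> list[Optional[float]]:
    """Return the best-so-far value after each trial.

    `None` marks a trial with no feasible result; it never changes the incumbent.
    """
    best: Optional[float] = None
    curve: list[Optional[float]] = []
    for result in trial_results:
        if result is not None and (best is None or result > best):
            best = result
        curve.append(best)
    return curve


# Per-GPU trial results (req/s/GPU), iterations 1..12 of the no-harness rerun.
no_harness_trials = [0.0650, 0.0617, 0.0308, None, None, None,
                     None, None, None, 0.2025, None, None]

# Iterations 1..11 of the harness run (iteration 12 is the harness stop).
harness_trials = [0.0650, 0.0617, 0.2025, None, 0.1283, None,
                  0.2696, 0.2742, None, None, None]

print(incumbent_curve(no_harness_trials))
print(incumbent_curve(harness_trials))
```

Both outputs should match the corresponding rows of the table above, up to trailing zeros in the formatting.
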
## Trial Details

No-harness rerun:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| ---: | ---: | ---: | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | `tp=1`, `dp=2`, `max-num-batched-tokens=12288` |
| 3 | 0.0308 | 0.0650 | completed | `tp=1`, `dp=4` |
| 4 | - | 0.0650 | completed, infeasible | `max-num-batched-tokens=12288` |
| 5 | - | 0.0650 | completed, infeasible | `tp=1`, `dp=2`, `max-num-batched-tokens=16384` |
| 6 | - | 0.0650 | completed, infeasible | `tp=1`, `dp=2`, `max-num-batched-tokens=12288`, `block-size=32` |
| 7 | - | 0.0650 | completed, infeasible | `max-num-batched-tokens=10240` |
| 8 | - | 0.0650 | completed, infeasible | `max-num-batched-tokens=7168` |
| 9 | - | 0.0650 | completed, infeasible | `tp=1`, `dp=2` |
| 10 | 0.2025 | 0.2025 | completed | `tp=2`, `dp=1`, `max-num-batched-tokens=12288` |
| 11 | - | 0.2025 | completed, infeasible | `tp=2`, `dp=1`, `max-num-batched-tokens=10240` |
| 12 | - | 0.2025 | completed, infeasible | `tp=2`, `dp=1`, `max-num-batched-tokens=13312` |

Harness:

| Iter | Trial result / GPU | Incumbent / GPU | Status | Config summary |
| ---: | ---: | ---: | --- | --- |
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.0617 | 0.0650 | completed | `tp=1`, `dp=2` |
| 3 | 0.2025 | 0.2025 | completed | `tp=2`, `dp=1` |
| 4 | - | 0.2025 | completed, infeasible | `tp=2`, chunked prefill, `max-num-batched-tokens=16384` |
| 5 | 0.1283 | 0.2025 | completed | `tp=2`, `dp=2` |
| 6 | - | 0.2025 | completed, infeasible | `tp=2`, `dp=1`, `max-num-seqs=4` |
| 7 | 0.2696 | 0.2696 | completed | `tp=4`, `dp=1` |
| 8 | 0.2742 | 0.2742 | completed | `tp=4`, chunked prefill, `max-num-batched-tokens=16384` |
| 9 | - | 0.2742 | completed, infeasible | `tp=4`, chunked prefill, `max-num-batched-tokens=24576` |
| 10 | - | 0.2742 | completed, infeasible | `tp=4`, chunked prefill, `max-num-batched-tokens=16384`, `max-num-seqs=8` |
| 11 | - | 0.2742 | completed, infeasible | `tp=4`, chunked prefill, `max-num-batched-tokens=16384`, `max-num-seqs=16` |
| 12 | - | 0.2742 | harness stop | validation exhausted after strong incumbent |

## Interpretation

The clean no-harness rerun eventually found the `tp=2` topology at iter 10, so the old migration-tainted no-harness result was indeed too pessimistic. The harness still improves the search in two ways:

- It reaches the `tp=2` topology by iter 3 instead of iter 10.
- It then escalates to `tp=4` and a nearby batching refinement, reaching `0.2742 req/s/GPU`.

The harness effect is not "one iter to best"; it is directional search. The harness turns bottleneck evidence into topology validation probes, then validates runtime refinements around the stronger incumbent, and stops once further nearby probes fail to improve on it.

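The "reaches it earlier" comparison can be read directly off the incumbent curves. As a small illustration, the helper below (a hypothetical name, not part of `aituner`) returns the first iteration at which a curve reaches a given per-GPU rate, using the curve values reported in this document.

```python
from typing import Optional


def first_iter_reaching(curve: list[float], target: float) -> Optional[int]:
    # Iterations are 1-indexed, matching the tables in this document.
    for iteration, value in enumerate(curve, start=1):
        if value >= target:
            return iteration
    return None  # target never reached within the recorded iterations


# Incumbent request rate per GPU (req/s/GPU) after each iteration.
no_harness_curve = [0.0650] * 9 + [0.2025] * 3
harness_curve = [0.0650, 0.0650, 0.2025, 0.2025, 0.2025, 0.2025,
                 0.2696, 0.2742, 0.2742, 0.2742, 0.2742]

print(first_iter_reaching(no_harness_curve, 0.2025))  # 10
print(first_iter_reaching(harness_curve, 0.2025))     # 3
print(first_iter_reaching(harness_curve, 0.2742))     # 8
print(first_iter_reaching(no_harness_curve, 0.2742))  # None
```
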