Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU)

Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation log with completed run status.
Qwen27B Chat 0-8k Harness Ablation (8-GPU)
Date: 2026-05-13
Supersedes: qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md (7-GPU / gpu3skip setup).
Setup
- Host: dash0
- Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
- Workload: chat_w20260311_1000, chat, 0-8k input window
- SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
- Trial budget: 12 total tuning iterations per study
- Search: sampling_u in [0, 0.0625], tolerance 0.001, max probes 6
- Execution: python3 -m aituner.cli study tune ... --max-trials 12
- GPU env: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 (8x H20)
- Baseline topology: TP=1
- LLM: gpt-5.4
- Code: profile-driven harness planner, post GPU-visibility fix (5c2958e+)
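The SLO and search settings above can be illustrated with a small sketch. This is not the aituner implementation: the function names (`slo_pass_rate`, `is_feasible`, `bisect_max_feasible`) and the `probe` callback are hypothetical, and treating the search as a bisection over `sampling_u` is an assumption. It is at least consistent with the listed settings, since halving the interval [0, 0.0625] six times gives a width of about 0.00098, just under the 0.001 tolerance.

```python
from typing import Callable, Optional, Sequence, Tuple

# Values taken from the Setup list above.
TTFT_LIMIT_MS = 4000.0
TPOT_LIMIT_MS = 25.0
TARGET_PASS_RATE = 0.95


def slo_pass_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (ttft_ms, tpot_ms) samples that meet both latency limits."""
    if not samples:
        return 0.0
    passed = sum(
        1 for ttft, tpot in samples
        if ttft <= TTFT_LIMIT_MS and tpot <= TPOT_LIMIT_MS
    )
    return passed / len(samples)


def is_feasible(samples: Sequence[Tuple[float, float]]) -> bool:
    """A load point is feasible when the pass rate reaches the target."""
    return slo_pass_rate(samples) >= TARGET_PASS_RATE


def bisect_max_feasible(
    probe: Callable[[float], bool],
    lo: float = 0.0,
    hi: float = 0.0625,
    tolerance: float = 0.001,
    max_probes: int = 6,
) -> Optional[float]:
    """Hypothetical bisection: largest probed sampling_u that was feasible.

    `probe(u)` is assumed to run the workload at sampling fraction `u`
    and return True when the SLO target is met. Returns None if no
    probed point was feasible.
    """
    best: Optional[float] = None
    probes = 0
    while probes < max_probes and (hi - lo) > tolerance:
        mid = (lo + hi) / 2
        probes += 1
        if probe(mid):
            best, lo = mid, mid  # feasible: search higher load
        else:
            hi = mid  # infeasible: search lower load
    return best


# Toy probe: pretend every load up to u = 0.02 meets the SLO.
best_u = bisect_max_feasible(lambda u: u <= 0.02)
print(best_u)
```

Under this reading, a trial that returns `None` corresponds to a config with no feasible point, which the Result section reports as NA.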
Studies
| Variant | Study ID |
|---|---|
| no-harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513 |
| harness | dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513 |
Result
Raw per-iteration performance for Fig18-style plot. Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point. fail means engine launch failure.
| Variant | iter1 | iter2 | iter3 | iter4 | iter5 | iter6 | iter7 | iter8 | iter9 | iter10 | iter11 | iter12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness raw perf[i] | 0.0650 | fail | fail | 0.0617 | 0.0650 | 0.1233 | 0.1050 | 0.1233 | 0.0650 | 0.0650 | 0.0617 | 0.1233 |
| harness raw perf[i] | 0.0650 | 0.1992 | 0.2621 | 0.2056 | 0.1544 | 0.2696 | 0.2621 | 0.2621 | 0.2696 | 0.2621 | 0.2621 | 0.2621 |
| Variant | Best iter | Best request rate | Best request rate / GPU | Best config summary |
|---|---|---|---|---|
| no-harness | 6 | 0.1233 | 0.1233 | enable-prefix-caching=false |
| harness | 6 | 1.0783 | 0.2696 | tensor-parallel-size=4, max-num-batched-tokens=7680 |
Harness final best is +118.6% over no-harness.
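The headline number can be reproduced from the summary table with a few lines of arithmetic. The sketch below assumes the per-GPU rate is the request rate divided by the number of GPUs in one engine replica (TP=4 for the harness best, TP=1 for no-harness); that normalization is inferred from the table values, not taken from the aituner code, and the helper names are illustrative.

```python
def per_gpu_rate(request_rate: float, gpus_per_replica: int) -> float:
    """Assumed normalization: request rate divided by GPUs per replica."""
    return request_rate / gpus_per_replica


def relative_gain_pct(candidate: float, baseline: float) -> float:
    """Percentage improvement of candidate over baseline."""
    return (candidate / baseline - 1.0) * 100.0


harness_per_gpu = per_gpu_rate(1.0783, 4)     # harness best, TP=4
noharness_per_gpu = per_gpu_rate(0.1233, 1)   # no-harness best, TP=1

gain = relative_gain_pct(harness_per_gpu, noharness_per_gpu)
print(f"harness per GPU: {harness_per_gpu:.4f}")
print(f"gain over no-harness: {gain:.2f}%")
```

With the unrounded harness value (1.0783 / 4) the gain comes out near 118.63%, which matches the reported +118.6%; starting from the rounded 0.2696 gives a slightly higher figure.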
Incumbent Curve
Best-so-far request rate per GPU after each iteration.
| Variant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no-harness | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.0650 | 0.1233 | 0.1233 | 0.1233 | 0.1233 | 0.1233 | 0.1233 | 0.1233 |
| harness | 0.0650 | 0.1992 | 0.2621 | 0.2621 | 0.2621 | 0.2696 | 0.2696 | 0.2696 | 0.2696 | 0.2696 | 0.2696 | 0.2696 |
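The incumbent curve is a running maximum over the raw per-iteration results, carrying the previous best forward across failed trials. A minimal sketch, with failed launches encoded as `None` (an encoding chosen here for illustration):

```python
from typing import List, Optional


def incumbent_curve(raw: List[Optional[float]]) -> List[Optional[float]]:
    """Best-so-far value after each iteration; None entries are failed trials."""
    best: Optional[float] = None
    curve: List[Optional[float]] = []
    for value in raw:
        if value is not None and (best is None or value > best):
            best = value
        curve.append(best)
    return curve


# Raw per-iteration results copied from the Result table.
no_harness_raw = [0.0650, None, None, 0.0617, 0.0650, 0.1233,
                  0.1050, 0.1233, 0.0650, 0.0650, 0.0617, 0.1233]
harness_raw = [0.0650, 0.1992, 0.2621, 0.2056, 0.1544, 0.2696,
               0.2621, 0.2621, 0.2696, 0.2621, 0.2621, 0.2621]

no_harness_curve = incumbent_curve(no_harness_raw)
harness_curve = incumbent_curve(harness_raw)
print(no_harness_curve)
print(harness_curve)
```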
Trial Details
No-harness:
| Iter | Result / GPU | Incumbent / GPU | Status | Config summary |
|---|---|---|---|---|
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | - | 0.0650 | launch fail | gpu-memory-utilization=0.94, max-num-batched-tokens=16384 |
| 3 | - | 0.0650 | launch fail | enable-chunked-prefill=false |
| 4 | 0.0617 | 0.0650 | completed | data-parallel-size=2 |
| 5 | 0.0650 | 0.0650 | completed | block-size=32 |
| 6 | 0.1233 | 0.1233 | completed | enable-prefix-caching=false |
| 7 | 0.1050 | 0.1233 | completed | enable-prefix-caching=false, block-size=32 |
| 8 | 0.1233 | 0.1233 | completed | enable-prefix-caching=false, max-num-seqs=32 |
| 9 | 0.0650 | 0.1233 | completed | enable-prefix-caching=false, max-num-batched-tokens=4096 |
| 10 | 0.0650 | 0.1233 | completed | enable-prefix-caching=false, max-num-seqs=16 |
| 11 | 0.0617 | 0.1233 | completed | data-parallel-size=2, enable-prefix-caching=false |
| 12 | 0.1233 | 0.1233 | completed | enable-prefix-caching=false (+ torch compile off) |
Harness:
| Iter | Result / GPU | Incumbent / GPU | Status | Config summary |
|---|---|---|---|---|
| 1 | 0.0650 | 0.0650 | completed | baseline |
| 2 | 0.1992 | 0.1992 | completed | tensor-parallel-size=2 |
| 3 | 0.2621 | 0.2621 | completed | tensor-parallel-size=4 |
| 4 | 0.2056 | 0.2621 | completed | tensor-parallel-size=8 |
| 5 | 0.1544 | 0.2621 | completed | tensor-parallel-size=4, data-parallel-size=2 |
| 6 | 0.2696 | 0.2696 | completed | tensor-parallel-size=4, max-num-batched-tokens=7680 |
| 7 | 0.2621 | 0.2696 | completed | tensor-parallel-size=4, enable-chunked-prefill=true, max-num-batched-tokens=12288 |
| 8 | 0.2621 | 0.2696 | completed | tensor-parallel-size=4, max-num-batched-tokens=7424 |
| 9 | 0.2696 | 0.2696 | completed | tensor-parallel-size=4, max-num-batched-tokens=7680, max-num-seqs=64 |
| 10 | 0.2621 | 0.2696 | completed | tensor-parallel-size=4, max-num-batched-tokens=7680, max-num-seqs=56 |
| 11 | 0.2621 | 0.2696 | completed | tensor-parallel-size=4, max-num-batched-tokens=7680, max-num-seqs=60 |
| 12 | 0.2621 | 0.2696 | completed | tensor-parallel-size=4, max-num-batched-tokens=7680, max-num-seqs=63 |
Interpretation
No-harness never tested a TP change in 12 trials. It started from TP=1, hit two launch failures at iters 2 and 3, then spent the remaining budget on data-parallel-size=2 (iters 4 and 11) and runtime knobs (enable-prefix-caching, block-size, max-num-seqs, max-num-batched-tokens). Its best discovery was disabling prefix caching at iter 6, which reached only 0.1233 req/s/GPU.
Harness systematically explored the TP frontier: iter 2 tested TP=2, iter 3 tested TP=4, and iter 4 tested TP=8. The profile-driven planner identified ttft_prefill as the top-ranked bottleneck and proposed increasing TP as the primary relief action. After TP=4 proved best per GPU (0.2621 req/s/GPU), the harness tested TP=4/DP=2 (worse, at 0.1544) and then shifted to runtime refinement within the TP=4 family, where max-num-batched-tokens=7680 gave a marginal improvement to 0.2696 req/s/GPU.
The result shows that topology exploration is critical for this workload: the no-harness LLM never tried a TP>1 configuration, while the harness reached the best observed topology (TP=4) by iter 3 and refined it by iter 6.
Comparison with Previous 7-GPU Run
The 7-GPU (gpu3skip) run from 2026-05-10 used CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7 and is not directly comparable. The harness result on 7-GPU was 0.2742 req/s/GPU (TP=4, chunked-prefill, MBT=16384). On 8-GPU, the harness found a similar TP=4 optimum at 0.2696 req/s/GPU (about 1.7% lower) with slightly different runtime tuning. The core finding is consistent across both runs: the harness accelerates topology discovery and finishes well ahead of no-harness.