Files

Gahow Wang 984eb1f325 Document 8-GPU harness ablation results for qwen27b and qwen235b prefill

Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU)

Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation
log with completed run status.

2026-05-16 21:23:16 +08:00

6.2 KiB

Raw Blame History

Qwen27B Chat 0-8k Harness Ablation (8-GPU)

Date: 2026-05-13

Supersedes: qwen27b-chat-0-8k-ttft4s-tpot25-20260510.md (7-GPU / gpu3skip setup).

Setup

Host: dash0
Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
Workload: chat_w20260311_1000, chat, 0-8k input window
SLO: TTFT <= 4000ms and TPOT <= 25ms, target pass rate = 0.95
Trial budget: 12 total tuning iterations per study
Search: sampling_u in [0, 0.0625], tolerance 0.001, max probes 6
Execution: python3 -m aituner.cli study tune ... --max-trials 12
GPU env: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 (8x H20)
Baseline topology: TP=1
LLM: gpt-5.4
Code: profile-driven harness planner, post GPU-visibility fix (5c2958e+)

Studies

Variant	Study ID
no-harness	`dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-noharness-minprompt-gpt54-20260513`
harness	`dash0-qwen27b-chat-0-8k-ttft4s-tpot25-gpu8-12iter-harness-profileplanner-20260513`

Result

Raw per-iteration performance for Fig18-style plot. Metric: best_request_rate_per_gpu from that trial's own result.json. NA means the proposed config did not produce a feasible point. fail means engine launch failure.

Variant	iter1	iter2	iter3	iter4	iter5	iter6	iter7	iter8	iter9	iter10	iter11	iter12
no-harness raw `perf[i]`	0.0650	fail	fail	0.0617	0.0650	0.1233	0.1050	0.1233	0.0650	0.0650	0.0617	0.1233
harness raw `perf[i]`	0.0650	0.1992	0.2621	0.2056	0.1544	0.2696	0.2621	0.2621	0.2696	0.2621	0.2621	0.2621

Variant	Best iter	Best request rate	Best request rate / GPU	Best config summary
no-harness	6	0.1233	0.1233	`enable-prefix-caching=false`
harness	6	1.0783	0.2696	`tensor-parallel-size=4`, `max-num-batched-tokens=7680`

Harness final best is +118.6% over no-harness.

Incumbent Curve

Best-so-far request rate per GPU after each iteration.

Variant	1	2	3	4	5	6	7	8	9	10	11	12
no-harness	0.0650	0.0650	0.0650	0.0650	0.0650	0.1233	0.1233	0.1233	0.1233	0.1233	0.1233	0.1233
harness	0.0650	0.1992	0.2621	0.2621	0.2621	0.2696	0.2696	0.2696	0.2696	0.2696	0.2696	0.2696

Trial Details

No-harness:

Iter	Result / GPU	Incumbent / GPU	Status	Config summary
1	0.0650	0.0650	completed	baseline
2	-	0.0650	launch fail	`gpu-memory-utilization=0.94`, `max-num-batched-tokens=16384`
3	-	0.0650	launch fail	`enable-chunked-prefill=false`
4	0.0617	0.0650	completed	`data-parallel-size=2`
5	0.0650	0.0650	completed	`block-size=32`
6	0.1233	0.1233	completed	`enable-prefix-caching=false`
7	0.1050	0.1233	completed	`enable-prefix-caching=false`, `block-size=32`
8	0.1233	0.1233	completed	`enable-prefix-caching=false`, `max-num-seqs=32`
9	0.0650	0.1233	completed	`enable-prefix-caching=false`, `max-num-batched-tokens=4096`
10	0.0650	0.1233	completed	`enable-prefix-caching=false`, `max-num-seqs=16`
11	0.0617	0.1233	completed	`data-parallel-size=2`, `enable-prefix-caching=false`
12	0.1233	0.1233	completed	`enable-prefix-caching=false` (+ torch compile off)

Harness:

Iter	Result / GPU	Incumbent / GPU	Status	Config summary
1	0.0650	0.0650	completed	baseline
2	0.1992	0.1992	completed	`tensor-parallel-size=2`
3	0.2621	0.2621	completed	`tensor-parallel-size=4`
4	0.2056	0.2621	completed	`tensor-parallel-size=8`
5	0.1544	0.2621	completed	`tensor-parallel-size=4`, `data-parallel-size=2`
6	0.2696	0.2696	completed	`tensor-parallel-size=4`, `max-num-batched-tokens=7680`
7	0.2621	0.2696	completed	`tensor-parallel-size=4`, `enable-chunked-prefill=true`, `max-num-batched-tokens=12288`
8	0.2621	0.2696	completed	`tensor-parallel-size=4`, `max-num-batched-tokens=7424`
9	0.2696	0.2696	completed	`tensor-parallel-size=4`, `max-num-batched-tokens=7680`, `max-num-seqs=64`
10	0.2621	0.2696	completed	`tensor-parallel-size=4`, `max-num-batched-tokens=7680`, `max-num-seqs=56`
11	0.2621	0.2696	completed	`tensor-parallel-size=4`, `max-num-batched-tokens=7680`, `max-num-seqs=60`
12	0.2621	0.2696	completed	`tensor-parallel-size=4`, `max-num-batched-tokens=7680`, `max-num-seqs=63`

Interpretation

No-harness never tested any TP change in 12 trials. It started from TP=1, encountered two early launch failures, then spent all remaining budget on runtime knobs (enable-prefix-caching, block-size, max-num-seqs, max-num-batched-tokens). Its best discovery was disabling prefix caching at iter 6, reaching only 0.1233 req/s/GPU.

Harness systematically explored the TP frontier: iter 2 tested TP=2, iter 3 tested TP=4, iter 4 tested TP=8. The profile-driven planner identified ttft_prefill as the ranked bottleneck and proposed increasing TP as the primary relief action. After TP=4 proved best per-GPU, the harness tested TP=4/DP=2 (worse) then shifted to runtime refinement within the TP=4 family, settling on max-num-batched-tokens=7680 as the marginal improvement.

The result demonstrates that topology exploration is critical for this workload: the no-harness LLM failed to discover TP>1 configurations entirely, while the harness reached the optimal TP=4 topology by iter 3 and refined it by iter 6.

Comparison with Previous 7-GPU Run

The 7-GPU (gpu3skip) run from 2026-05-10 used CUDA_VISIBLE_DEVICES=0,1,2,4,5,6,7 and is not directly comparable. The harness result on 7-GPU was 0.2742 req/s/GPU (TP=4, chunked-prefill, MBT=16384). On 8-GPU, the harness found a similar TP=4 optimum at 0.2696 req/s/GPU with slightly different runtime tuning. The core finding is consistent: harness accelerates topology discovery and significantly outperforms no-harness.

6.2 KiB Raw Blame History