qwen235b-thinking-prefill-7day-compare

Prefill-only replay of a qwen3-235b-a22b thinking trace (output_length=1), comparing 3 configs across 7 daily 10:00-10:10 windows, scored by request_rate_per_gpu.

Setup

  • Hardware: dash1, 8x H20
  • Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
  • Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
  • Trace set: thinking_w20260321_1000 to thinking_w20260327_1000
  • Window duration: 600s each
  • Request mode: chat
  • Replay override: min_tokens=max_tokens=1
  • SLO:
    • pass target: 95%
    • TTFT <= 2000ms for <=8191 input tokens
    • TTFT <= 4000ms for <=32767 input tokens
    • TTFT <= 6000ms for >32767 input tokens
  • Search:
    • each candidate independently binary-searches its own sampling_u
    • sampling_u in [0, 0.125]
    • max_probes = 6
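The search described above can be sketched as a bounded binary search over sampling_u, keeping the highest load that still meets the tiered TTFT SLO. This is an illustrative sketch, not the aituner implementation: `run_window` is a hypothetical helper that replays one window at a given sampling_u and returns per-request `(input_tokens, ttft_ms)` pairs.

```python
# Sketch of the per-candidate sampling_u search. run_window(sampling_u) is
# a hypothetical stand-in for replaying one 600s window at that load and
# returning a list of (input_tokens, ttft_ms) tuples.

def ttft_slo_ms(input_tokens: int) -> int:
    """Tiered TTFT SLO from the setup above."""
    if input_tokens <= 8191:
        return 2000
    if input_tokens <= 32767:
        return 4000
    return 6000

def slo_pass(results, target=0.95) -> bool:
    """True if at least 95% of requests meet their TTFT tier."""
    ok = sum(1 for toks, ttft in results if ttft <= ttft_slo_ms(toks))
    return ok / len(results) >= target

def search_sampling_u(run_window, lo=0.0, hi=0.125, max_probes=6) -> float:
    """Binary-search the highest sampling_u that still passes the SLO."""
    best = lo
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if slo_pass(run_window(mid)):
            best, lo = mid, mid   # passed: try a higher load
        else:
            hi = mid              # failed: back off
    return best
```

Each candidate runs this search independently, so the reported request_rate_per_gpu for a window is each config's own highest SLO-passing load, not a shared operating point.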

Candidates

  • baseline
    • TP=4, DP=1, EP=off
    • baseline run_qwen235b.sh shape
  • tuned_0323
    • tuned on thinking_w20260323_1000
    • TP=4, DP=1, EP=off
    • max-num-batched-tokens=3072
    • max-num-seqs=32
  • tuned_0327
    • tuned on thinking_w20260327_1000
    • TP=8, DP=1, EP=off
    • max-num-batched-tokens=6144
    • max-num-seqs=48
    • block-size=32
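As a rough illustration only, the tuned_0327 shape maps onto the upstream vLLM CLI roughly as below. This is an assumption based on stock vLLM flag names; the internal engine and the run_qwen235b.sh wrapper may spell or wire these differently.

```shell
# Hypothetical mapping of the tuned_0327 shape to the upstream vLLM CLI.
# Not the actual internal launch command.
vllm serve /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 6144 \
  --max-num-seqs 48 \
  --block-size 32
```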

Run assets

  • Compare root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327
  • Summary: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327/summary.json
  • Report: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327/report.md
  • Compare spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash1_qwen235b_prefill_thinking_7day_compare.json

Aggregate result

  • Wins by request_rate_per_gpu:
    • tuned_0327: 5 / 7
    • baseline: 2 / 7
    • tuned_0323: 0 / 7
  • Mean request_rate_per_gpu:
    • baseline: 0.13845
    • tuned_0323: 0.12756
    • tuned_0327: 0.17232
  • Relative to baseline:
    • tuned_0323: 0.92x mean per-GPU throughput
    • tuned_0327: 1.24x mean per-GPU throughput

Per-day result

Date         baseline     tuned_0323   tuned_0327   Winner
             (req/s/gpu)  (req/s/gpu)  (req/s/gpu)
2026-03-21   0.08500      0.03917      0.14375      tuned_0327
2026-03-22   0.10125      0.12083      0.15313      tuned_0327
2026-03-23   0.12792      0.12792      0.19167      tuned_0327
2026-03-24   0.09000      0.09583      0.11250      tuned_0327
2026-03-25   0.13792      0.13208      0.13146      baseline
2026-03-26   0.32000      0.25917      0.23375      baseline
2026-03-27   0.10708      0.11792      0.24000      tuned_0327
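The aggregate numbers can be re-derived directly from the per-day table; a short sketch:

```python
# Re-derive wins and mean request_rate_per_gpu from the per-day table.
per_day = {
    "2026-03-21": {"baseline": 0.08500, "tuned_0323": 0.03917, "tuned_0327": 0.14375},
    "2026-03-22": {"baseline": 0.10125, "tuned_0323": 0.12083, "tuned_0327": 0.15313},
    "2026-03-23": {"baseline": 0.12792, "tuned_0323": 0.12792, "tuned_0327": 0.19167},
    "2026-03-24": {"baseline": 0.09000, "tuned_0323": 0.09583, "tuned_0327": 0.11250},
    "2026-03-25": {"baseline": 0.13792, "tuned_0323": 0.13208, "tuned_0327": 0.13146},
    "2026-03-26": {"baseline": 0.32000, "tuned_0323": 0.25917, "tuned_0327": 0.23375},
    "2026-03-27": {"baseline": 0.10708, "tuned_0323": 0.11792, "tuned_0327": 0.24000},
}

# Count daily winners (highest request_rate_per_gpu wins the window).
wins = {}
for day, rates in per_day.items():
    w = max(rates, key=rates.get)
    wins[w] = wins.get(w, 0) + 1

# Mean per-GPU throughput per candidate across the 7 windows.
means = {c: sum(d[c] for d in per_day.values()) / len(per_day)
         for c in ("baseline", "tuned_0323", "tuned_0327")}
```

This reproduces the aggregate section: tuned_0327 wins 5/7 and baseline 2/7, with means of 0.13845, 0.12756, and 0.17232 (0.92x and 1.24x relative to baseline).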

Key takeaways

  • tuned_0327 is the only candidate with clear cross-day value. It wins 5/7 windows and improves mean per-GPU throughput by about 24%.
  • tuned_0323 does not generalize. It is slightly more conservative and keeps a high pass rate, but its mean per-GPU throughput falls below baseline (0.92x).
  • The 0327 winner is not universal. On 2026-03-25 and especially 2026-03-26, the 4-GPU baseline is more efficient per GPU than the TP8 tuned shape.
  • The practical reading is that prefill-only tuning has workload-regime sensitivity. TP8 + 6144 + 48 + block-size=32 is a strong default candidate, but not a global static optimum across all days.