# qwen235b-thinking-prefill-7day-compare

qwen3-235b-a22b thinking trace, prefill-only replay with `output_length=1`, comparing 3 configs across 7 daily 10:00-10:10 windows by `request_rate_per_gpu`.
## Setup

- Hardware: dash1, 8x H20
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, baseline aligned to `~/run_qwen235b.sh`
- Trace set: `thinking_w20260321_1000` to `thinking_w20260327_1000`
- Window duration: 600 s each
- Request mode: chat
- Replay override: `min_tokens=max_tokens=1`
- SLO pass target: 95% of requests meet
  - TTFT <= 2000 ms for <= 8191 input tokens
  - TTFT <= 4000 ms for <= 32767 input tokens
  - TTFT <= 6000 ms for > 32767 input tokens
- Search: each candidate independently binary-searches its own `sampling_u` in `[0, 0.125]` with `max_probes = 6`
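The tiered SLO check and the per-candidate binary search over `sampling_u` can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: `replay` stands in for a real 600 s window replay and is an assumption.

```python
# Hypothetical sketch of the per-candidate load search described above.
# `replay(u)` is assumed to run one replay at load level `u` and return
# (input_tokens, ttft_ms) samples; the real harness is not shown here.

def ttft_slo_ms(input_tokens: int) -> int:
    """Tiered TTFT budget from the SLO above."""
    if input_tokens <= 8191:
        return 2000
    if input_tokens <= 32767:
        return 4000
    return 6000

def passes_slo(samples) -> bool:
    """samples: list of (input_tokens, ttft_ms); pass target is 95%."""
    ok = sum(1 for toks, ttft in samples if ttft <= ttft_slo_ms(toks))
    return ok / len(samples) >= 0.95

def search_sampling_u(replay, lo=0.0, hi=0.125, max_probes=6) -> float:
    """Binary-search the highest sampling_u whose replay still passes the SLO."""
    best = lo
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if passes_slo(replay(mid)):
            best, lo = mid, mid   # passing: push load higher
        else:
            hi = mid              # failing: back off
    return best
```

Each candidate gets its own search, so the reported `request_rate_per_gpu` is the highest SLO-passing load that candidate sustains, not a shared fixed rate.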
## Candidates

- `baseline`: TP=4, DP=1, EP=off; the baseline `run_qwen235b.sh` shape
- `tuned_0323`: tuned on `thinking_w20260323_1000`; TP=4, DP=1, EP=off, `max-num-batched-tokens=3072`, `max-num-seqs=32`
- `tuned_0327`: tuned on `thinking_w20260327_1000`; TP=8, DP=1, EP=off, `max-num-batched-tokens=6144`, `max-num-seqs=48`, `block-size=32`
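For orientation, the `tuned_0327` shape maps onto open-source vLLM flags roughly as below. The engine here is an internal vLLM build, so the exact entrypoint and defaults may differ; treat this as an assumed translation, not the actual launch command.

```shell
# Assumed open-source-vLLM equivalent of the tuned_0327 shape.
# The internal engine's entrypoint and extra flags are not shown in the report.
python -m vllm.entrypoints.openai.api_server \
  --model /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 6144 \
  --max-num-seqs 48 \
  --block-size 32
```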
## Run assets

- Compare root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327`
- Summary: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327/summary.json`
- Report: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327/report.md`
- Compare spec: `/home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash1_qwen235b_prefill_thinking_7day_compare.json`
## Aggregate result

- Wins by `request_rate_per_gpu`: `tuned_0327` 5/7, `baseline` 2/7, `tuned_0323` 0/7
- Mean `request_rate_per_gpu`: `baseline` 0.13845, `tuned_0323` 0.12756, `tuned_0327` 0.17232
- Relative to baseline: `tuned_0323` 0.92x mean per-GPU throughput, `tuned_0327` 1.24x mean per-GPU throughput
## Per-day result

| Date | baseline req/s/gpu | tuned_0323 req/s/gpu | tuned_0327 req/s/gpu | Winner |
|---|---|---|---|---|
| 2026-03-21 | 0.08500 | 0.03917 | 0.14375 | tuned_0327 |
| 2026-03-22 | 0.10125 | 0.12083 | 0.15313 | tuned_0327 |
| 2026-03-23 | 0.12792 | 0.12792 | 0.19167 | tuned_0327 |
| 2026-03-24 | 0.09000 | 0.09583 | 0.11250 | tuned_0327 |
| 2026-03-25 | 0.13792 | 0.13208 | 0.13146 | baseline |
| 2026-03-26 | 0.32000 | 0.25917 | 0.23375 | baseline |
| 2026-03-27 | 0.10708 | 0.11792 | 0.24000 | tuned_0327 |
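The aggregate wins, means, and relative-throughput figures follow directly from this table; a short recomputation as a sanity check:

```python
# Recompute the aggregate numbers from the per-day table.
rows = [
    ("2026-03-21", 0.08500, 0.03917, 0.14375),
    ("2026-03-22", 0.10125, 0.12083, 0.15313),
    ("2026-03-23", 0.12792, 0.12792, 0.19167),
    ("2026-03-24", 0.09000, 0.09583, 0.11250),
    ("2026-03-25", 0.13792, 0.13208, 0.13146),
    ("2026-03-26", 0.32000, 0.25917, 0.23375),
    ("2026-03-27", 0.10708, 0.11792, 0.24000),
]
names = ["baseline", "tuned_0323", "tuned_0327"]

# Mean request_rate_per_gpu per candidate.
means = {n: sum(r[i + 1] for r in rows) / len(rows) for i, n in enumerate(names)}

# Daily winner = candidate with the highest per-GPU request rate.
wins = {n: 0 for n in names}
for _, *rates in rows:
    wins[names[rates.index(max(rates))]] += 1
```

This reproduces the aggregate section: means of 0.13845 / 0.12756 / 0.17232, wins of 2 / 0 / 5, and the 0.92x / 1.24x ratios relative to baseline.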
## Key takeaways

- `tuned_0327` is the only candidate with clear cross-day value. It wins 5/7 windows and improves mean per-GPU throughput by about 24%.
- `tuned_0323` does not generalize. It is slightly more conservative and keeps a high pass rate, but its mean per-GPU throughput is below baseline.
- The `0327` winner is not universal. On 2026-03-25 and especially 2026-03-26, the 4-GPU baseline is more efficient per GPU than the TP8 tuned shape.
- The practical reading is that prefill-only tuning has workload-regime sensitivity. `TP8 + 6144 + 48 + block-size=32` is a strong default candidate, but not a global static optimum across all days.