qwen235b-thinking-prefill-7day-compare

Prefill-only replay of a qwen3-235b-a22b thinking trace (output_length=1), comparing 3 configs across 7 daily 10:00-10:10 windows, scored by request_rate_per_gpu.

Setup

  • Hardware: dash1, 8x H20
  • Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
  • Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
  • Trace set: thinking_w20260321_1000 to thinking_w20260327_1000
  • Window duration: 600s each
  • Request mode: chat
  • Replay override: min_tokens=max_tokens=1
  • SLO:
    • pass target: 95%
    • TTFT <= 2000ms for <=8191 input tokens
    • TTFT <= 4000ms for <=32767 input tokens
    • TTFT <= 6000ms for >32767 input tokens
  • Search:
    • each candidate independently binary-searches its own sampling_u
    • sampling_u in [0, 0.125]
    • max_probes = 6
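The search described above can be sketched as a bounded binary search over sampling_u, keeping the highest load that still meets the tiered TTFT SLO. This is an illustrative sketch, not the aituner implementation: `run_window` is a hypothetical helper that replays one window at a given sampling_u and returns per-request `(input_tokens, ttft_ms)` pairs.

```python
# Sketch of the per-candidate sampling_u search. run_window(sampling_u) is
# a hypothetical stand-in for replaying one 600s window at that load and
# returning a list of (input_tokens, ttft_ms) tuples.

def ttft_slo_ms(input_tokens: int) -> int:
    """Tiered TTFT SLO from the setup above."""
    if input_tokens <= 8191:
        return 2000
    if input_tokens <= 32767:
        return 4000
    return 6000

def slo_pass(results, target=0.95) -> bool:
    """True if at least 95% of requests meet their TTFT tier."""
    ok = sum(1 for toks, ttft in results if ttft <= ttft_slo_ms(toks))
    return ok / len(results) >= target

def search_sampling_u(run_window, lo=0.0, hi=0.125, max_probes=6) -> float:
    """Binary-search the highest sampling_u that still passes the SLO."""
    best = lo
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if slo_pass(run_window(mid)):
            best, lo = mid, mid   # passed: try a higher load
        else:
            hi = mid              # failed: back off
    return best
```

Each candidate runs this search independently, so the reported request_rate_per_gpu for a window is each config's own highest SLO-passing load, not a shared operating point.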

Candidates

  • baseline
    • TP=4, DP=1, EP=off
    • baseline run_qwen235b.sh shape
  • tuned_0323
    • tuned on thinking_w20260323_1000
    • TP=4, DP=1, EP=off
    • max-num-batched-tokens=3072
    • max-num-seqs=32
  • tuned_0327
    • tuned on thinking_w20260327_1000
    • TP=8, DP=1, EP=off
    • max-num-batched-tokens=6144
    • max-num-seqs=48
    • block-size=32
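As a rough illustration only, the tuned_0327 shape maps onto the upstream vLLM CLI roughly as below. This is an assumption based on stock vLLM flag names; the internal engine and the run_qwen235b.sh wrapper may spell or wire these differently.

```shell
# Hypothetical mapping of the tuned_0327 shape to the upstream vLLM CLI.
# Not the actual internal launch command.
vllm serve /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 6144 \
  --max-num-seqs 48 \
  --block-size 32
```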

Run assets

  • Compare root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327
  • Summary: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327/summary.json
  • Report: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen235b-prefill-thinking-7day-baseline-vs-0323-vs-0327/report.md
  • Compare spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash1_qwen235b_prefill_thinking_7day_compare.json

Aggregate result

  • Wins by request_rate_per_gpu:
    • tuned_0327: 5 / 7
    • baseline: 2 / 7
    • tuned_0323: 0 / 7
  • Mean request_rate_per_gpu:
    • baseline: 0.13845
    • tuned_0323: 0.12756
    • tuned_0327: 0.17232
  • Relative to baseline:
    • tuned_0323: 0.92x mean per-GPU throughput
    • tuned_0327: 1.24x mean per-GPU throughput

Per-day result

Date         baseline     tuned_0323   tuned_0327   Winner
             (req/s/gpu)  (req/s/gpu)  (req/s/gpu)
2026-03-21   0.08500      0.03917      0.14375      tuned_0327
2026-03-22   0.10125      0.12083      0.15313      tuned_0327
2026-03-23   0.12792      0.12792      0.19167      tuned_0327
2026-03-24   0.09000      0.09583      0.11250      tuned_0327
2026-03-25   0.13792      0.13208      0.13146      baseline
2026-03-26   0.32000      0.25917      0.23375      baseline
2026-03-27   0.10708      0.11792      0.24000      tuned_0327
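The aggregate numbers can be re-derived directly from the per-day table; a short sketch:

```python
# Re-derive wins and mean request_rate_per_gpu from the per-day table.
per_day = {
    "2026-03-21": {"baseline": 0.08500, "tuned_0323": 0.03917, "tuned_0327": 0.14375},
    "2026-03-22": {"baseline": 0.10125, "tuned_0323": 0.12083, "tuned_0327": 0.15313},
    "2026-03-23": {"baseline": 0.12792, "tuned_0323": 0.12792, "tuned_0327": 0.19167},
    "2026-03-24": {"baseline": 0.09000, "tuned_0323": 0.09583, "tuned_0327": 0.11250},
    "2026-03-25": {"baseline": 0.13792, "tuned_0323": 0.13208, "tuned_0327": 0.13146},
    "2026-03-26": {"baseline": 0.32000, "tuned_0323": 0.25917, "tuned_0327": 0.23375},
    "2026-03-27": {"baseline": 0.10708, "tuned_0323": 0.11792, "tuned_0327": 0.24000},
}

# Count daily winners (highest request_rate_per_gpu wins the window).
wins = {}
for day, rates in per_day.items():
    w = max(rates, key=rates.get)
    wins[w] = wins.get(w, 0) + 1

# Mean per-GPU throughput per candidate across the 7 windows.
means = {c: sum(d[c] for d in per_day.values()) / len(per_day)
         for c in ("baseline", "tuned_0323", "tuned_0327")}
```

This reproduces the aggregate section: tuned_0327 wins 5/7 and baseline 2/7, with means of 0.13845, 0.12756, and 0.17232 (0.92x and 1.24x relative to baseline).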

Key takeaways

  • tuned_0327 is the only candidate with clear cross-day value. It wins 5/7 windows and improves mean per-GPU throughput by about 24%.
  • tuned_0323 does not generalize. It is slightly more conservative and keeps a high pass rate, but its mean per-GPU throughput falls below baseline (0.92x).
  • The 0327 winner is not universal. On 2026-03-25 and especially 2026-03-26, the 4-GPU baseline is more efficient per GPU than the TP8 tuned shape.
  • The practical reading is that prefill-only tuning has workload-regime sensitivity. TP8 + 6144 + 48 + block-size=32 is a strong default candidate, but not a global static optimum across all days.