qwen235b-thinking-prefill-ttft-1s-2s-0-32k

This study replays the qwen3-235b-a22b thinking trace in prefill-only mode (output_length=1) on internal vLLM (/usr/local/bin/vllm), tuned on the 0~32k input bucket under a stricter stepped TTFT SLO (1s/2s).

Setup

  • Hardware: dash1, 8x H20
  • Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
  • Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
  • Baseline topology: TP=4, DP=1, EP=1
  • Trace: thinking_w20260327_1000
  • Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
  • Window duration: 600s (10:00-10:10, 2026-03-27)
  • Request mode: chat
  • Replay override: min_tokens=max_tokens=1
  • Input bucket: 0 <= input_length <= 32768
  • SLO:
    • pass target: 95%
    • TTFT <= 1000ms for input_length <= 8191
    • TTFT <= 2000ms for 8192 <= input_length <= 32767
    • TTFT <= 2000ms fallback bucket (any longer input; in this bucket only input_length = 32768)
  • Search:
    • sampling_u in [0, 0.125]
    • max_probes = 6
    • 12 trials total
  • Proposal model: codex / gpt-5.4
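The stepped SLO in the list above can be sketched as a small pass-rate check. This is a minimal illustration, not the aituner implementation: the function names and the `(input_length, ttft_ms)` record shape are assumptions, and the inclusive boundaries simply follow the `<=` bounds listed above.

```python
# Stepped TTFT limits from the Setup list: (max input tokens, TTFT limit in ms).
SLO_STEPS = [(8191, 1000.0), (32767, 2000.0)]
FALLBACK_LIMIT_MS = 2000.0  # inputs longer than the last step
PASS_TARGET = 0.95


def ttft_limit_ms(input_length: int) -> float:
    """Return the TTFT limit that applies to a request of this input length."""
    for max_tokens, limit_ms in SLO_STEPS:
        if input_length <= max_tokens:
            return limit_ms
    return FALLBACK_LIMIT_MS


def pass_rate(requests) -> float:
    """requests: non-empty iterable of (input_length, ttft_ms) pairs (hypothetical shape)."""
    requests = list(requests)
    passed = sum(1 for length, ttft in requests if ttft <= ttft_limit_ms(length))
    return passed / len(requests)


def meets_slo(requests) -> bool:
    """A probe point is feasible when the pass rate reaches the 95% target."""
    return pass_rate(requests) >= PASS_TARGET


if __name__ == "__main__":
    # Synthetic sample: 19 fast short requests and one slow one.
    sample = [(100, 50.0)] * 19 + [(100, 1500.0)]
    print(pass_rate(sample), meets_slo(sample))
```

Under this reading, a request with 8191 input tokens must finish prefill within 1000ms, while one with 8192 tokens gets the 2000ms limit.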

Run assets

  • Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash1-qwen235b-prefill-thinking-run5-ttft-1s-2s-0-32k-topology
  • State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash1-qwen235b-prefill-thinking-run5-ttft-1s-2s-0-32k-topology/state.json
  • Log: /home/admin/cpfs/wjh/aituner/aituner/logs/q235b_prefill_1s2s_0_32k.log
  • Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash1_qwen235b_prefill_thinking_run5_ttft_1s_2s_0_32k.json

Best result

  • Best trial: trial-0011
  • Best config:
    • tensor-parallel-size=8
    • data-parallel-size=1
    • enable-expert-parallel=false
    • max-num-batched-tokens=4096
    • max-num-seqs=16
    • VLLM_ENABLE_TORCH_COMPILE=0
  • Best sampling_u: 0.073767628521
  • Best request rate: 1.8516666666666666 req/s
  • Best request rate per GPU: 0.23145833333333332 req/s/gpu
  • Best pass rate: 0.9558955895589559

Compared with baseline:

  • trial-0001 (baseline, TP4, 4 GPUs): 0.47 req/s, 0.1175 req/s/gpu
  • trial-0011 (best, TP8, 8 GPUs): 1.8516666666666666 req/s, 0.23145833333333332 req/s/gpu
  • Raw throughput gain: 3.94x
  • Per-GPU throughput gain: 1.97x
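The gains above follow directly from the reported request rates. The short calculation below reproduces them, assuming the GPU count of each trial is TP x DP (4 for the baseline, 8 for the best trial), which matches the per-GPU figures in the report.

```python
# Figures copied from the report.
baseline_rate, baseline_gpus = 0.47, 4        # trial-0001, TP4/DP1
best_rate, best_gpus = 1.8516666666666666, 8  # trial-0011, TP8/DP1

# Per-GPU throughput normalizes the request rate by the GPUs each topology uses.
baseline_per_gpu = baseline_rate / baseline_gpus
best_per_gpu = best_rate / best_gpus

raw_gain = best_rate / baseline_rate
per_gpu_gain = best_per_gpu / baseline_per_gpu

print(f"baseline per GPU: {baseline_per_gpu:.4f} req/s/gpu")
print(f"best per GPU:     {best_per_gpu:.4f} req/s/gpu")
print(f"raw gain:         {raw_gain:.2f}x")
print(f"per-GPU gain:     {per_gpu_gain:.2f}x")
```

Because the best trial uses twice as many GPUs as the baseline, the per-GPU gain is exactly half the raw gain.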

Best-point latency:

  • baseline trial-0001 TTFT mean/p50/p90/p95/p99 = 236.68 / 75.19 / 294.39 / 1378.79 / 3118.86 ms
  • best trial-0011 TTFT mean/p50/p90/p95/p99 = 223.70 / 65.67 / 261.69 / 1065.31 / 3648.34 ms

12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| trial-0001 | baseline TP4/DP1/EP-off, compile on | 0.4700 req/s, feasible |
| trial-0002 | TP4/DP2, EP-off | probe-search failure |
| trial-0003 | TP4/DP1/EP4, max-num-batched-tokens=4096 | launch fail |
| trial-0004 | VLLM_ENABLE_TORCH_COMPILE=0, max-num-batched-tokens=6144 | infeasible |
| trial-0005 | compile off, max-num-batched-tokens=4096 | infeasible |
| trial-0006 | compile off, max-num-seqs=32 | infeasible |
| trial-0007 | compile off, TP8/DP1/EP-off | 1.3817 req/s, feasible |
| trial-0008 | trial-0007 + max-num-seqs=32 | 1.5983 req/s, feasible |
| trial-0009 | trial-0008 + max-num-batched-tokens=6144 | 1.8017 req/s, feasible |
| trial-0010 | trial-0008 + max-num-batched-tokens=4096 | 1.8300 req/s, feasible |
| trial-0011 | trial-0010 + max-num-seqs=16 | 1.8517 req/s, feasible, best |
| trial-0012 | trial-0011 + max-num-batched-tokens=3072 | infeasible |
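As a cross-check on the summary, the best trial can be recovered by taking the highest request rate among the feasible trials. The sketch below hard-codes the rounded rates from the summary; the `TRIALS` structure is purely illustrative and is not the format of state.json.

```python
# (trial, request rate in req/s); None means no feasible rate was found
# (probe-search failure, launch failure, or infeasible).
TRIALS = [
    ("trial-0001", 0.4700),
    ("trial-0002", None),
    ("trial-0003", None),
    ("trial-0004", None),
    ("trial-0005", None),
    ("trial-0006", None),
    ("trial-0007", 1.3817),
    ("trial-0008", 1.5983),
    ("trial-0009", 1.8017),
    ("trial-0010", 1.8300),
    ("trial-0011", 1.8517),
    ("trial-0012", None),
]

feasible = [(name, rate) for name, rate in TRIALS if rate is not None]
best_name, best_rate = max(feasible, key=lambda item: item[1])
baseline_rate = dict(TRIALS)["trial-0001"]

print(f"feasible trials: {len(feasible)} of {len(TRIALS)}")
print(f"best: {best_name} at {best_rate:.4f} req/s")
for name, rate in feasible:
    # Gain of each feasible trial relative to the baseline trial-0001.
    print(f"{name}: {rate / baseline_rate:.2f}x baseline")
```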

Key insights

  • Under the stricter 1s/2s TTFT SLO, the main win still came from topology: TP4 -> TP8.
  • TP4/DP2 and EP4 remain negative evidence in this stack: the former failed in probe search, the latter at engine launch.
  • Runtime-only tuning inside the 4-GPU topology (trial-0004 to trial-0006) never beat baseline; all three trials were infeasible. The useful search space opened only after moving to TP8/DP1.
  • After the TP8 switch, the winning runtime shape became more conservative than in the looser prefill studies: max-num-batched-tokens=4096 and max-num-seqs=16.
  • Even under this much tighter TTFT target, the TP8 shape improves raw throughput by 3.94x and per-GPU throughput by 1.97x over baseline.

Recommendation

For qwen235b thinking prefill-only on the 0~32k bucket under the 1s/2s stepped TTFT SLO, use:

  • tensor-parallel-size=8
  • data-parallel-size=1
  • enable-expert-parallel=false
  • max-num-batched-tokens=4096
  • max-num-seqs=16
  • VLLM_ENABLE_TORCH_COMPILE=0

Keep the rest of the run_qwen235b.sh baseline unchanged.
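One way to apply the recommendation is to render it as a flag fragment and splice it into the baseline launch script. The helper below is a hypothetical sketch: it assumes the internal vLLM build accepts these options as `--name value` CLI flags, that boolean options are presence-only flags (so `enable-expert-parallel=false` means omitting the flag), and that VLLM_ENABLE_TORCH_COMPILE is read from the environment. None of this is confirmed by run_qwen235b.sh, which is not shown here.

```python
import shlex

# Recommended settings from this study.
RECOMMENDED = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "enable-expert-parallel": False,
    "max-num-batched-tokens": 4096,
    "max-num-seqs": 16,
}
ENV = {"VLLM_ENABLE_TORCH_COMPILE": "0"}


def to_cli_args(options: dict) -> list:
    """Render options as CLI arguments (assumed flag style, see lead-in)."""
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Presence-only flag: emit it when True, omit it when False.
            if value:
                args.append(f"--{key}")
        else:
            args.extend([f"--{key}", str(value)])
    return args


env_prefix = " ".join(f"{name}={shlex.quote(value)}" for name, value in ENV.items())
flag_fragment = " ".join(shlex.quote(arg) for arg in to_cli_args(RECOMMENDED))

if __name__ == "__main__":
    print(env_prefix)
    print(flag_fragment)
```

The printed fragment only covers the six tuned settings; every other option should come from the unchanged baseline script.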