qwen235b-thinking-prefill-ttft-1s-2s-0-32k

qwen3-235b-a22b thinking trace, replayed prefill-only (output_length=1) on internal vLLM (/usr/local/bin/vllm) and tuned on the 0~32k input bucket under a stricter stepped TTFT SLO.

Setup

Hardware: dash1, 8x H20
Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
Baseline topology: TP=4, DP=1, EP=1
Trace: thinking_w20260327_1000
Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
Window duration: 600s (10:00-10:10, 2026-03-27)
Request mode: chat
Replay override: min_tokens=max_tokens=1
Input bucket: 0 <= input_length <= 32768
SLO:
- pass target: 95% of requests within their TTFT limit
- TTFT <= 1000ms for <=8191 input tokens
- TTFT <= 2000ms for <=32767 input tokens
- TTFT <= 2000ms for anything longer (fallback bucket)
Search:
- sampling_u in [0, 0.125]
- max_probes = 6
- 12 trials total
Proposal model: codex / gpt-5.4
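
The stepped SLO above can be read as a per-request lookup followed by a pass-rate check. The following is a minimal sketch of that reading, not the tuner's actual implementation: the function names and the (input_length, ttft_ms) sample layout are hypothetical, and it assumes a request is assigned the limit of the first step whose token bound it fits under.

```python
from typing import Iterable, List, Tuple

# Stepped TTFT limits from the Setup section: (max input tokens, TTFT limit in ms).
TTFT_STEPS_MS: List[Tuple[int, float]] = [(8191, 1000.0), (32767, 2000.0)]
FALLBACK_TTFT_MS = 2000.0  # applies when no step matches
PASS_TARGET = 0.95


def ttft_limit_ms(input_length: int) -> float:
    """Return the TTFT limit of the first step that covers input_length."""
    for max_tokens, limit_ms in TTFT_STEPS_MS:
        if input_length <= max_tokens:
            return limit_ms
    return FALLBACK_TTFT_MS


def pass_rate(samples: Iterable[Tuple[int, float]]) -> float:
    """Fraction of (input_length, ttft_ms) samples that meet their limit."""
    rows = list(samples)
    if not rows:
        return 0.0
    passed = sum(1 for length, ttft in rows if ttft <= ttft_limit_ms(length))
    return passed / len(rows)


def is_feasible(samples: Iterable[Tuple[int, float]]) -> bool:
    """A probe is feasible when the pass rate reaches the 95% target."""
    return pass_rate(samples) >= PASS_TARGET
```

Under this reading, a 9000-token request with a 1500 ms TTFT passes, while an 8000-token request with the same TTFT fails.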

Run assets

Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash1-qwen235b-prefill-thinking-run5-ttft-1s-2s-0-32k-topology
State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash1-qwen235b-prefill-thinking-run5-ttft-1s-2s-0-32k-topology/state.json
Log: /home/admin/cpfs/wjh/aituner/aituner/logs/q235b_prefill_1s2s_0_32k.log
Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash1_qwen235b_prefill_thinking_run5_ttft_1s_2s_0_32k.json

Best result

Best trial: trial-0011
Best config:
- tensor-parallel-size=8
- data-parallel-size=1
- enable-expert-parallel=false
- max-num-batched-tokens=4096
- max-num-seqs=16
- VLLM_ENABLE_TORCH_COMPILE=0
Best sampling_u: 0.073767628521
Best request rate: 1.8516666666666666 req/s
Best request rate per GPU: 0.23145833333333332 req/s/gpu
Best pass rate: 0.9558955895589559

Compared with baseline:

trial-0001 (TP4, 4 GPUs): 0.47 req/s, 0.1175 req/s/gpu
trial-0011 (TP8, 8 GPUs): 1.8516666666666666 req/s, 0.23145833333333332 req/s/gpu
Raw throughput gain: 3.94x
Per-GPU throughput gain: 1.97x
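
The two gain figures follow directly from the request rates and GPU counts. A short sketch of the arithmetic, assuming the GPU count of each trial is TP x DP (4 for the baseline, 8 for the best trial); the variable names are illustrative only:

```python
# Request rates reported for the baseline and best trials.
baseline_rate = 0.47                 # req/s, trial-0001
best_rate = 1.8516666666666666       # req/s, trial-0011

# Assumption: GPUs used = tensor-parallel-size * data-parallel-size.
baseline_gpus = 4 * 1
best_gpus = 8 * 1

baseline_per_gpu = baseline_rate / baseline_gpus
best_per_gpu = best_rate / best_gpus

raw_gain = best_rate / baseline_rate
per_gpu_gain = best_per_gpu / baseline_per_gpu

print(f"raw gain: {raw_gain:.2f}x, per-GPU gain: {per_gpu_gain:.2f}x")
```

Doubling the GPU count halves the per-GPU gain relative to the raw gain, which is why 3.94x raw corresponds to 1.97x per GPU.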

Best-point latency:

baseline trial-0001 TTFT mean/p50/p90/p95/p99 = 236.68 / 75.19 / 294.39 / 1378.79 / 3118.86 ms
best trial-0011 TTFT mean/p50/p90/p95/p99 = 223.70 / 65.67 / 261.69 / 1065.31 / 3648.34 ms

12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP4/DP1/EP-off`, compile on | `0.4700 req/s`, feasible |
| `trial-0002` | `TP4/DP2`, `EP-off` | probe-search failure |
| `trial-0003` | `TP4/DP1/EP4`, `max-num-batched-tokens=4096` | launch fail |
| `trial-0004` | `VLLM_ENABLE_TORCH_COMPILE=0`, `max-num-batched-tokens=6144` | infeasible |
| `trial-0005` | compile off, `max-num-batched-tokens=4096` | infeasible |
| `trial-0006` | compile off, `max-num-seqs=32` | infeasible |
| `trial-0007` | compile off, `TP8/DP1/EP-off` | `1.3817 req/s`, feasible |
| `trial-0008` | `trial-0007 + max-num-seqs=32` | `1.5983 req/s`, feasible |
| `trial-0009` | `trial-0008 + max-num-batched-tokens=6144` | `1.8017 req/s`, feasible |
| `trial-0010` | `trial-0008 + max-num-batched-tokens=4096` | `1.8300 req/s`, feasible |
| `trial-0011` | `trial-0010 + max-num-seqs=16` | `1.8517 req/s`, feasible, best |
| `trial-0012` | `trial-0011 + max-num-batched-tokens=3072` | infeasible |
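
As a cross-check on the summary, the best trial can be recovered by taking the highest request rate among the feasible trials. The sketch below transcribes the rates from the summary by hand; the tuple layout is hypothetical and `None` marks any trial without a feasible rate (failed, launch fail, or infeasible).

```python
from typing import List, Optional, Tuple

# (trial id, feasible request rate in req/s or None), transcribed from the summary.
TRIALS: List[Tuple[str, Optional[float]]] = [
    ("trial-0001", 0.4700),
    ("trial-0002", None),   # probe-search failure
    ("trial-0003", None),   # launch fail
    ("trial-0004", None),   # infeasible
    ("trial-0005", None),   # infeasible
    ("trial-0006", None),   # infeasible
    ("trial-0007", 1.3817),
    ("trial-0008", 1.5983),
    ("trial-0009", 1.8017),
    ("trial-0010", 1.8300),
    ("trial-0011", 1.8517),
    ("trial-0012", None),   # infeasible
]

feasible = [(name, rate) for name, rate in TRIALS if rate is not None]
best_name, best_rate = max(feasible, key=lambda item: item[1])

# Share of the improvement that came from runtime tuning after the TP8 switch.
tp8_first_rate = dict(feasible)["trial-0007"]
runtime_gain_on_tp8 = best_rate / tp8_first_rate

print(best_name, f"{runtime_gain_on_tp8:.2f}x over trial-0007")
```

Of the 12 trials, 6 produced a feasible rate, and runtime tuning on top of the first TP8 trial added roughly a further 1.34x.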

Key insights

Under the stricter 1s/2s TTFT SLO, the main win still came from topology: TP4 -> TP8 (trial-0007) was the first change to beat baseline.
TP4/DP2 and EP4 remain negative evidence in this stack: the former failed in probe search (trial-0002), the latter at engine launch (trial-0003).
Runtime-only tuning inside the 4-GPU topology never beat baseline: trial-0004 through trial-0006 were all infeasible. The useful search space opened only after moving to TP8/DP1.
After the TP8 switch, the winning runtime shape became more conservative than in the looser prefill studies: max-num-batched-tokens=4096 and max-num-seqs=16.
Even under this much tighter TTFT target, the TP8 shape materially improves on baseline: 3.94x raw throughput and 1.97x per-GPU throughput.

Recommendation

For qwen235b thinking prefill-only on the 0~32k bucket under the 1s/2s stepped TTFT SLO, use:

tensor-parallel-size=8
data-parallel-size=1
enable-expert-parallel=false
max-num-batched-tokens=4096
max-num-seqs=16
VLLM_ENABLE_TORCH_COMPILE=0

Keep the rest of the run_qwen235b.sh baseline unchanged.
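
One way to carry the recommendation into a launch script is to render it as an environment assignment plus CLI flags. This is only a sketch: the contents of run_qwen235b.sh are not shown here, so the helper below is hypothetical, and it assumes the internal vLLM build accepts these settings as `--name value` flags and treats enable-expert-parallel as a switch that is simply omitted when false.

```python
import shlex
from typing import Dict, List, Union

# Recommended overrides from this study.
RECOMMENDED_FLAGS: Dict[str, Union[int, bool]] = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "enable-expert-parallel": False,
    "max-num-batched-tokens": 4096,
    "max-num-seqs": 16,
}
RECOMMENDED_ENV: Dict[str, str] = {"VLLM_ENABLE_TORCH_COMPILE": "0"}


def render_flags(flags: Dict[str, Union[int, bool]]) -> List[str]:
    """Turn settings into argv tokens; boolean switches appear only when true."""
    argv: List[str] = []
    for name, value in flags.items():
        if isinstance(value, bool):
            if value:
                argv.append(f"--{name}")
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


def render_fragment(env: Dict[str, str], flags: Dict[str, Union[int, bool]]) -> str:
    """Return a shell-quoted fragment: env assignments followed by flags."""
    env_part = [f"{key}={shlex.quote(val)}" for key, val in env.items()]
    return " ".join(env_part + [shlex.quote(tok) for tok in render_flags(flags)])


print(render_fragment(RECOMMENDED_ENV, RECOMMENDED_FLAGS))
```

The printed fragment is meant to be merged by hand into the existing baseline command, replacing the corresponding baseline values and leaving all other arguments as they are.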
