Files
aituner/docs/qwen235b-thinking-prefill/ttft-tight-0327.md

4.2 KiB

qwen235b-thinking-prefill-ttft-tight-0327

qwen3-235b-a22b thinking trace, prefill-only replay with output_length=1, internal vLLM (/usr/local/bin/vllm), tuned on thinking_w20260327_1000 under tighter stepped TTFT SLO.

Setup

  • Hardware: dash0, 8x H20
  • Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
  • Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
  • Baseline topology: TP=4, DP=1, EP=1
  • Trace: thinking_w20260327_1000
  • Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
  • Window duration: 600s (10:00-10:10, 2026-03-27)
  • Request mode: chat
  • Replay override: min_tokens=max_tokens=1
  • SLO:
    • pass target: 95%
    • TTFT <= 2000ms for <=8191 input tokens
    • TTFT <= 4000ms for <=32767 input tokens
    • TTFT <= 6000ms for >32767 input tokens
  • Search:
    • sampling_u in [0, 0.125]
    • max_probes = 6
    • 12 trials total
  • Proposal model: codex / gpt-5.4
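The stepped TTFT SLO above can be sketched as a simple pass-rate check. This is a minimal illustration, assuming each replayed request is reduced to an (input_tokens, ttft_ms) pair; the function names are hypothetical, not the study's actual harness:

```python
def ttft_slo_ms(input_tokens: int) -> int:
    """Stepped TTFT budget from the study spec."""
    if input_tokens <= 8191:
        return 2000
    if input_tokens <= 32767:
        return 4000
    return 6000

def pass_rate(requests):
    """Fraction of (input_tokens, ttft_ms) pairs meeting their bucket's budget."""
    ok = sum(1 for toks, ttft in requests if ttft <= ttft_slo_ms(toks))
    return ok / len(requests)
```

A config is feasible at a given sampling_u when `pass_rate(...)` meets the 0.95 target.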

Run assets

  • Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run2-ttft-tight-topology
  • State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run2-ttft-tight-topology/state.json
  • Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash0_qwen235b_prefill_thinking_run2_ttft_tight.json

Best result

  • Best trial: trial-0012
  • Best config:
    • tensor-parallel-size=8
    • data-parallel-size=1
    • enable-expert-parallel=false
    • max-num-batched-tokens=6144
    • max-num-seqs=48
    • block-size=32
  • Best sampling_u: 0.098106
  • Best request rate: 2.4967 req/s
  • Best request rate per GPU: 0.3121 req/s/gpu
  • Best pass rate: 0.9506

Compared with baseline:

  • trial-0001 (baseline, TP4/DP1, 4 GPUs): 0.4717 req/s, 0.1179 req/s/gpu
  • trial-0012 (best, TP8/DP1, 8 GPUs): 2.4967 req/s, 0.3121 req/s/gpu
  • Raw throughput gain: 5.29x
  • Per-GPU throughput gain: 2.65x
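The two gain figures follow from the GPU counts implied by the reported per-GPU rates (assuming the TP4/DP1 baseline occupies 4 GPUs and the best TP8/DP1 config 8):

```python
baseline_rps, best_rps = 0.4717, 2.4967
baseline_gpus, best_gpus = 4, 8  # TP4/DP1 baseline vs TP8/DP1 best

raw_gain = best_rps / baseline_rps                                      # ~5.29x
per_gpu_gain = (best_rps / best_gpus) / (baseline_rps / baseline_gpus)  # ~2.65x
```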

Compared with the looser TTFT study on the same 2026-03-27 window:

  • looser-SLO best: 3.0350 req/s, 0.3794 req/s/gpu
  • tighter-SLO best: 2.4967 req/s, 0.3121 req/s/gpu
  • throughput retained: 82.26%
  • throughput drop: 17.74%
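The retained/drop percentages are just the ratio of the two best request rates:

```python
looser_best, tighter_best = 3.035, 2.4967
retained = tighter_best / looser_best  # ~0.8226, i.e. 82.26% retained
drop = 1 - retained                    # ~17.74% drop
```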

Best-point latency:

  • TTFT mean/p50/p90/p95/p99 = 413.92 / 67.86 / 1456.32 / 2286.90 / 5326.23 ms

12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| trial-0001 | baseline TP4/DP1/EP-off, max-num-batched-tokens=8192 | 0.4717 req/s, feasible |
| trial-0002 | TP4/DP2 | probe-search failure |
| trial-0003 | TP8/DP1/EP-off | 1.9200 req/s, feasible |
| trial-0004 | TP8/DP1/EP8 | launch fail |
| trial-0005 | trial-0003 + max-num-batched-tokens=6144 | 2.2517 req/s, feasible |
| trial-0006 | trial-0003 + max-num-batched-tokens=4096 | infeasible |
| trial-0007 | trial-0003 + max-num-batched-tokens=5120 | infeasible |
| trial-0008 | trial-0003 + max-num-batched-tokens=5632 | infeasible |
| trial-0009 | trial-0005 + max-num-seqs=32 | infeasible |
| trial-0010 | trial-0005 + max-num-seqs=48 | infeasible |
| trial-0011 | trial-0005 + block-size=32 | infeasible |
| trial-0012 | trial-0005 + max-num-seqs=48, block-size=32 | 2.4967 req/s, feasible, best |

Key insights

  • This study runs on the same 2026-03-27 window as the earlier run; the only change is the tighter stepped TTFT SLO.
  • The best topology still moved to TP8/DP1/no-EP; tighter TTFT did not change the topology conclusion.
  • Tighter TTFT did change the runtime sweet spot: the best runtime shape is no longer the looser study's 3712-token batch, but max-num-batched-tokens=6144 with max-num-seqs=48 and block-size=32.
  • DP2 and EP remained negative evidence in this stack: TP4/DP2 failed during probing, and TP8 + EP8 failed at launch.
  • Relative to the looser TTFT study on the same day, stricter TTFT costs about 17.7% throughput, but the tuned result still keeps a large margin over baseline.

Recommendation

For the tighter stepped TTFT SLO on thinking_w20260327_1000, use:

  • tensor-parallel-size=8
  • data-parallel-size=1
  • enable-expert-parallel=false
  • max-num-batched-tokens=6144
  • max-num-seqs=48
  • block-size=32

Keep the rest of the run_qwen235b.sh baseline unchanged.
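As a sketch only, the recommended flags map onto an open-source-style vLLM launch like the following; the internal /usr/local/bin/vllm build and the run_qwen235b.sh wrapper may spell these differently, so treat this as illustrative rather than the exact command:

```shell
# Illustrative only: flag spellings follow the open-source vLLM CLI;
# verify against the internal build and keep other run_qwen235b.sh flags as-is.
/usr/local/bin/vllm serve \
  /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 8 \
  --data-parallel-size 1 \
  --max-num-batched-tokens 6144 \
  --max-num-seqs 48 \
  --block-size 32
# Expert parallelism stays off: simply omit --enable-expert-parallel.
```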