# qwen235b-thinking-prefill
qwen3-235b-a22b thinking trace, prefill-only replay with `output_length=1`, internal vLLM (`/usr/local/bin/vllm`), compared by request rate per GPU (`request_rate_per_gpu`).
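A prefill-only replay forces a single output token per request, so the measured latency is essentially TTFT. A minimal sketch of that override, assuming a hypothetical trace row with a `messages` field (the field name and helper are illustrative, not the study's actual tooling):

```python
def to_replay_request(trace_row: dict) -> dict:
    """Turn a trace row into a chat-mode replay request that only
    exercises prefill: min_tokens = max_tokens = 1 (the study's
    replay override), so latency ~= TTFT."""
    return {
        "messages": trace_row["messages"],  # assumed trace field name
        "max_tokens": 1,                    # generate exactly one token
        "min_tokens": 1,                    # forbid stopping before it
    }
```

The override is applied per request at replay time, regardless of what the original trace recorded.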
## Setup
- Hardware: dash0, 8x H20
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Engine: internal vLLM, baseline aligned to `~/run_qwen235b.sh`
- Baseline topology: TP=4, DP=1, EP=1
- Trace: thinking_w20260327_1000
- Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
- Window duration: 600s (10:00-10:10, 2026-03-27)
- Request mode: chat
- Replay override: `min_tokens=max_tokens=1`
- SLO pass target: 95% of requests with
  - TTFT <= 3000ms for <=4096 input tokens
  - TTFT <= 6000ms for <=32768 input tokens
  - TTFT <= 9000ms for >32768 input tokens
- Search: `sampling_u` in [0, 0.125], max_probes = 6, 12 trials total
- Proposal model: codex / gpt-5.4
## Run assets
- Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology
- State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology/state.json
- Log: /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_prefill_thinking_run1_ttft_topology.log
- Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash0_qwen235b_prefill_thinking_run1_ttft.json
## Best result
- Best trial: trial-0010
- Best config: `tensor-parallel-size=8`, `data-parallel-size=1`, `enable-expert-parallel=false`, `max-num-batched-tokens=3712`
- Best `sampling_u`: 0.120422363281
- Best request rate: 3.035 req/s
- Best request rate per GPU: 0.379375 req/s/gpu
- Best pass rate: 0.9533
- Compared with baseline:
  - trial-0001: 0.8117 req/s, 0.2029 req/s/gpu
  - trial-0010: 3.0350 req/s, 0.3794 req/s/gpu
- Raw throughput gain: 3.74x
- Per-GPU throughput gain: 1.87x
- Best-point latency: TTFT mean/p50/p90/p95/p99 = 863.84 / 253.58 / 2392.48 / 3154.26 / 5377.00 ms
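The two headline gains can be reproduced from the reported rates; the only extra assumption is the GPU count per topology (4 GPUs for the TP4 baseline, 8 for the TP8 best point):

```python
baseline_rate = 0.8116666666666666  # trial-0001, TP4/DP1 -> 4 GPUs
best_rate = 3.035                   # trial-0010, TP8/DP1 -> 8 GPUs

baseline_per_gpu = baseline_rate / 4  # ~0.2029 req/s/gpu
best_per_gpu = best_rate / 8          # 0.379375 req/s/gpu

raw_gain = best_rate / baseline_rate            # whole-deployment speedup
per_gpu_gain = best_per_gpu / baseline_per_gpu  # efficiency speedup
print(f"{raw_gain:.2f}x raw, {per_gpu_gain:.2f}x per-GPU")  # 3.74x raw, 1.87x per-GPU
```

The per-GPU gain is exactly half the raw gain here because the best point uses twice as many GPUs as the baseline.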
## 12-trial summary

| Trial | Proposed config delta | Result |
|---|---|---|
| trial-0001 | baseline TP4/DP1/EP-off, max-num-batched-tokens=8192 | 0.8117 req/s, feasible |
| trial-0002 | DP=2, max-num-batched-tokens=4096 | probe-time runtime failure |
| trial-0003 | DP=2, max-num-batched-tokens=8192 | probe-time runtime failure |
| trial-0004 | EP=4, enable-expert-parallel=true | launch fail |
| trial-0005 | max-num-batched-tokens=4096 | infeasible |
| trial-0006 | TP=8, DP=1, max-num-batched-tokens=4096 | 2.8600 req/s, feasible |
| trial-0007 | trial-0006 + max-num-batched-tokens=3072 | infeasible |
| trial-0008 | trial-0006 + max-num-batched-tokens=3584 | 2.9667 req/s, feasible |
| trial-0009 | trial-0006 + max-num-batched-tokens=3328 | infeasible |
| trial-0010 | trial-0006 + max-num-batched-tokens=3712 | 3.0350 req/s, feasible, best |
| trial-0011 | trial-0010 + max-num-batched-tokens=3840 | infeasible |
| trial-0012 | trial-0010 + max-num-batched-tokens=3776 | infeasible |
## Key insights
- The main win came from topology first, then local batch-shape refinement. `TP4 -> TP8` was the key change. `TP4/DP2` was not just suboptimal; it was unstable at runtime under probing and should be treated as negative evidence for this stack. `EP=4` on the baseline `TP4/DP1` path failed at launch with `group_gemm's contiguous kernel requires deepgemm`, so EP is currently not a viable direction here.
- After switching to `TP8/DP1/EP-off`, the remaining gain came from tightening `max-num-batched-tokens` from 4096 into a narrow sweet spot around 3584~3712.
- Too-small prefill batches (3072, 3328) and too-large ones (3776, 3840) both hurt the TTFT tail enough to miss the 95% target.
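The trial-0006 through trial-0012 refinement amounts to a bracketing sweep over `max-num-batched-tokens`; a toy version of the selection step, seeded with this study's outcomes (`None` marks an infeasible trial):

```python
# max-num-batched-tokens -> achieved req/s at >=95% TTFT pass rate,
# or None when the SLO was missed; values copied from the trial table.
results = {
    3072: None,    # too small: TTFT tail blows up
    3328: None,
    3584: 2.9667,
    3712: 3.0350,  # sweet spot
    3776: None,    # too large: TTFT tail blows up again
    3840: None,
    4096: 2.8600,  # trial-0006 starting point
}

feasible = {k: v for k, v in results.items() if v is not None}
best_tokens = max(feasible, key=feasible.get)
print(best_tokens, feasible[best_tokens])  # 3712 3.035
```

Note the feasible region is not monotone: 4096 passes while 3776 and 3840 fail, which is why the study probed both sides of each feasible point.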
## Current recommendation
Use trial-0010 as the default serving shape for this workload:
- `tensor-parallel-size=8`
- `data-parallel-size=1`
- `enable-expert-parallel=false`
- `max-num-batched-tokens=3712`
- keep the rest of the `run_qwen235b.sh` baseline unchanged