qwen235b-thinking-decode

Tuning study for qwen3-235b-a22b on a thinking-trace workload, decode_only mode, using the internal vLLM build (/usr/local/bin/vllm). SLO: p95-equivalent pass-rate target 95% (at least 95% of requests must meet the latency limit), TPOT <= 40 ms; TTFT not enforced.

Run assets

  • Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology
  • State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology/state.json
  • Log: /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.log
  • Spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.json
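
To inspect these assets directly, a minimal shell sketch (assuming `jq` is available on the host; the paths are the ones listed above):

```bash
# Pretty-print the study state file.
jq . /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology/state.json | less

# Follow the tuner log while a study is running.
tail -f /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.log
```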

Best result

  • Best trial: trial-0009
  • Best config:
    • tensor-parallel-size=2
    • data-parallel-size=4
    • expert-parallel-size=8
    • max-num-seqs=128
    • max-num-batched-tokens=256
  • Best sampling_u: 0.013218402863
  • Best request rate: 0.2816666666666667 req/s
  • Best pass rate: 0.9704142011834319

Compared with baseline:

  • trial-0001: 0.12666666666666668 req/s
  • trial-0009: 0.2816666666666667 req/s
  • Throughput gain: 2.22x
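
(The gain is the ratio of best to baseline request rate: 0.2817 / 0.1267 ≈ 2.22.)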

Best-point latency:

  • TPOT mean/p50/p90/p95/p99 = 27.49 / 26.50 / 38.73 / 39.57 / 40.73 ms
  • TTFT mean/p50/p90/p95/p99 = 216.33 / 199.62 / 298.56 / 308.91 / 317.00 ms

12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| trial-0001 | baseline | 0.1267 req/s, feasible |
| trial-0002 | TP=2, DP=4 | 0.2450 req/s, feasible |
| trial-0003 | TP=1, DP=8, EP=8 | infeasible |
| trial-0004 | TP=2, DP=4, EP=4 | launch fail |
| trial-0005 | gpu-memory-utilization=0.8, max-num-seqs=256 | infeasible |
| trial-0006 | max-num-seqs=128 | infeasible |
| trial-0007 | block-size=128 | infeasible |
| trial-0008 | max-num-batched-tokens=384 | infeasible |
| trial-0009 | TP=2, DP=4, EP=8, max-num-seqs=128, max-num-batched-tokens=256 | 0.2817 req/s, feasible, best |
| trial-0010 | trial-0009 + block-size=128 | infeasible |
| trial-0011 | TP=1, DP=8, EP=8, max-num-seqs=128, max-num-batched-tokens=256 | infeasible |
| trial-0012 | TP=2, DP=4, EP=8, max-num-seqs=96, max-num-batched-tokens=192 | infeasible |

Key insights

  • Removing VLLM_USE_FLASHINFER_SAMPLER, CUDA_DEVICE_MAX_CONNECTIONS, and VLLM_ENABLE_TBO_OPT from the tunable space did not block progress; the win came from the parallelism search space.
  • The first strong improvement was topology: TP4/DP2/EP8 -> TP2/DP4/EP8.
  • TP1/DP8/EP8 launched, but did not beat TP2/DP4/EP8.
  • EP4 under TP2/DP4 failed at launch and should be treated as negative evidence for this stack.
  • After topology settled at TP2/DP4/EP8, the useful runtime refinement was tighter decode batching: max-num-seqs=128, max-num-batched-tokens=256.

Current recommendation

Use the trial-0009 shape as the default decode-only serving config for this workload (a launch sketch follows the flag list):

  • tensor-parallel-size=2
  • data-parallel-size=4
  • expert-parallel-size=8
  • max-num-seqs=128
  • max-num-batched-tokens=256
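
A minimal launch sketch, assuming the internal vLLM build accepts these flags under the names recorded in the study; the model path and port below are placeholders, not values taken from the run:

```bash
/usr/local/bin/vllm serve /path/to/qwen3-235b-a22b \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --expert-parallel-size 8 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 256 \
  --port 8000
```

Note that --expert-parallel-size follows the study's flag naming; upstream vLLM exposes expert parallelism differently (e.g. --enable-expert-parallel), so verify the spelling against the internal build's --help output before deploying.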