qwen235b-thinking-decode
Workload: qwen3-235b-a22b thinking trace, decode_only mode, internal vLLM build (/usr/local/bin/vllm). SLO: p95-equivalent pass target 95%, TPOT <= 40 ms; TTFT not enforced.
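For reference, a minimal sketch of how this gate reads, assuming "p95-equivalent pass target 95%" means at least 95% of requests must keep TPOT within the 40 ms budget. Names and shapes here are illustrative, not aituner's actual schema:

```python
def passes_slo(per_request_tpot_ms: list[float],
               tpot_budget_ms: float = 40.0,
               pass_target: float = 0.95) -> bool:
    """Feasible when >= 95% of requests keep TPOT within 40 ms.
    TTFT is measured but not enforced in this run."""
    within = sum(t <= tpot_budget_ms for t in per_request_tpot_ms)
    return within / len(per_request_tpot_ms) >= pass_target
```

Under this reading, trial-0009 (pass rate 0.9704, p95 TPOT 39.57 ms) clears the gate with little headroom.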
Run assets
- Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology
- State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology/state.json
- Log: /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.log
- Spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.json
Best result
- Best trial: trial-0009
- Best config: tensor-parallel-size=2, data-parallel-size=4, expert-parallel-size=8, max-num-seqs=128, max-num-batched-tokens=256
- Best sampling_u: 0.013218402863
- Best request rate: 0.2816666666666667 req/s
- Best pass rate: 0.9704142011834319
Compared with baseline:
- trial-0001: 0.12666666666666668 req/s; trial-0009: 0.2816666666666667 req/s
- Throughput gain: 2.22x (0.2817 / 0.1267 ≈ 2.22)
Best-point latency:
- TPOT mean/p50/p90/p95/p99 = 27.49 / 26.50 / 38.73 / 39.57 / 40.73 ms
- TTFT mean/p50/p90/p95/p99 = 216.33 / 199.62 / 298.56 / 308.91 / 317.00 ms

p95 TPOT (39.57 ms) sits just under the 40 ms budget; p99 slightly exceeds it, which is acceptable because only the p95-equivalent target is enforced.
12-trial summary
| Trial | Proposed config delta | Result |
|---|---|---|
| trial-0001 | baseline | 0.1267 req/s, feasible |
| trial-0002 | TP=2, DP=4 | 0.2450 req/s, feasible |
| trial-0003 | TP=1, DP=8, EP=8 | infeasible |
| trial-0004 | TP=2, DP=4, EP=4 | launch fail |
| trial-0005 | gpu-memory-utilization=0.8, max-num-seqs=256 | infeasible |
| trial-0006 | max-num-seqs=128 | infeasible |
| trial-0007 | block-size=128 | infeasible |
| trial-0008 | max-num-batched-tokens=384 | infeasible |
| trial-0009 | TP=2, DP=4, EP=8, max-num-seqs=128, max-num-batched-tokens=256 | 0.2817 req/s, feasible, best |
| trial-0010 | trial-0009 + block-size=128 | infeasible |
| trial-0011 | TP=1, DP=8, EP=8, max-num-seqs=128, max-num-batched-tokens=256 | infeasible |
| trial-0012 | TP=2, DP=4, EP=8, max-num-seqs=96, max-num-batched-tokens=192 | infeasible |
Key insights
- Removing VLLM_USE_FLASHINFER_SAMPLER, CUDA_DEVICE_MAX_CONNECTIONS, and VLLM_ENABLE_TBO_OPT from the tunable space did not block progress; the win came from the parallelism space.
- The first strong improvement was topology: TP4/DP2/EP8 -> TP2/DP4/EP8. TP1/DP8/EP8 launched but did not beat TP2/DP4/EP8. EP4 under TP2/DP4 failed at launch and should be treated as negative evidence for this stack (see the sketch after this list).
- After topology settled at TP2/DP4/EP8, the useful runtime refinement was tighter decode batching: max-num-seqs=128 and max-num-batched-tokens=256.
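One plausible explanation for the EP4 launch failure, assuming the internal build inherits upstream vLLM's constraint that the expert-parallel degree equals TP x DP:

```python
# Hedged sanity check: upstream vLLM derives EP = TP x DP when expert
# parallelism is enabled, so TP=2/DP=4 only admits EP=8. If the internal
# build keeps that invariant, EP=4 is an invalid shape, not a tuning miss.
def ep_shape_ok(tp: int, dp: int, ep: int) -> bool:
    return ep == tp * dp

assert ep_shape_ok(2, 4, 8)        # trial-0009: launches
assert not ep_shape_ok(2, 4, 4)    # trial-0004: launch fail
assert ep_shape_ok(1, 8, 8)        # trial-0003/0011: launch, but infeasible
```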
Current recommendation
Use the trial-0009 shape as the default decode-only serving config for this workload:
tensor-parallel-size=2, data-parallel-size=4, expert-parallel-size=8, max-num-seqs=128, max-num-batched-tokens=256
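A minimal sketch of materializing this config into a launch command. The flag names and values are verbatim from this report; the `serve` subcommand and the model path are assumptions about the internal build and should be adjusted to the actual deployment:

```python
import subprocess

# trial-0009 flags, verbatim from this report
BEST_CONFIG = {
    "tensor-parallel-size": 2,
    "data-parallel-size": 4,
    "expert-parallel-size": 8,
    "max-num-seqs": 128,
    "max-num-batched-tokens": 256,
}

# ASSUMPTIONS: the `serve` subcommand and model path are placeholders
# for the internal vLLM build; only the flags above come from the run.
cmd = ["/usr/local/bin/vllm", "serve", "Qwen/Qwen3-235B-A22B"]
for flag, value in BEST_CONFIG.items():
    cmd += [f"--{flag}", str(value)]

subprocess.run(cmd, check=True)
```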