qwen235b-thinking-decode

Tuning study for qwen3-235b-a22b on a thinking-trace workload, decode_only mode, using the internal vLLM build (/usr/local/bin/vllm). SLO: p95-equivalent pass-rate target 95% (at least 95% of requests must meet the latency limit), TPOT <= 40 ms; TTFT not enforced.

Run assets

  • Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology
  • State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology/state.json
  • Log: /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.log
  • Spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.json
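
To inspect these assets directly, a minimal shell sketch (assuming `jq` is available on the host; the paths are the ones listed above):

```bash
# Pretty-print the study state file.
jq . /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology/state.json | less

# Follow the tuner log while a study is running.
tail -f /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.log
```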

Best result

  • Best trial: trial-0009
  • Best config:
    • tensor-parallel-size=2
    • data-parallel-size=4
    • expert-parallel-size=8
    • max-num-seqs=128
    • max-num-batched-tokens=256
  • Best sampling_u: 0.013218402863
  • Best request rate: 0.2816666666666667 req/s
  • Best pass rate: 0.9704142011834319

Compared with baseline:

  • trial-0001: 0.12666666666666668 req/s
  • trial-0009: 0.2816666666666667 req/s
  • Throughput gain: 2.22x
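
(The gain is the ratio of best to baseline request rate: 0.2817 / 0.1267 ≈ 2.22.)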

Best-point latency:

  • TPOT mean/p50/p90/p95/p99 = 27.49 / 26.50 / 38.73 / 39.57 / 40.73 ms
  • TTFT mean/p50/p90/p95/p99 = 216.33 / 199.62 / 298.56 / 308.91 / 317.00 ms

12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| trial-0001 | baseline | 0.1267 req/s, feasible |
| trial-0002 | TP=2, DP=4 | 0.2450 req/s, feasible |
| trial-0003 | TP=1, DP=8, EP=8 | infeasible |
| trial-0004 | TP=2, DP=4, EP=4 | launch fail |
| trial-0005 | gpu-memory-utilization=0.8, max-num-seqs=256 | infeasible |
| trial-0006 | max-num-seqs=128 | infeasible |
| trial-0007 | block-size=128 | infeasible |
| trial-0008 | max-num-batched-tokens=384 | infeasible |
| trial-0009 | TP=2, DP=4, EP=8, max-num-seqs=128, max-num-batched-tokens=256 | 0.2817 req/s, feasible, best |
| trial-0010 | trial-0009 + block-size=128 | infeasible |
| trial-0011 | TP=1, DP=8, EP=8, max-num-seqs=128, max-num-batched-tokens=256 | infeasible |
| trial-0012 | TP=2, DP=4, EP=8, max-num-seqs=96, max-num-batched-tokens=192 | infeasible |

Key insights

  • Removing VLLM_USE_FLASHINFER_SAMPLER, CUDA_DEVICE_MAX_CONNECTIONS, and VLLM_ENABLE_TBO_OPT from the tunable space did not block progress; the win came from the parallelism search space.
  • The first strong improvement was topology: TP4/DP2/EP8 -> TP2/DP4/EP8.
  • TP1/DP8/EP8 launched, but did not beat TP2/DP4/EP8.
  • EP4 under TP2/DP4 failed at launch and should be treated as negative evidence for this stack.
  • After topology settled at TP2/DP4/EP8, the useful runtime refinement was tighter decode batching: max-num-seqs=128, max-num-batched-tokens=256.

Current recommendation

Use the trial-0009 shape as the default decode-only serving config for this workload (a launch sketch follows the flag list):

  • tensor-parallel-size=2
  • data-parallel-size=4
  • expert-parallel-size=8
  • max-num-seqs=128
  • max-num-batched-tokens=256
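
A minimal launch sketch, assuming the internal vLLM build accepts these flags under the names recorded in the study; the model path and port below are placeholders, not values taken from the run:

```bash
/usr/local/bin/vllm serve /path/to/qwen3-235b-a22b \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --expert-parallel-size 8 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 256 \
  --port 8000
```

Note that --expert-parallel-size follows the study's flag naming; upstream vLLM exposes expert parallelism differently (e.g. --enable-expert-parallel), so verify the spelling against the internal build's --help output before deploying.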