# qwen235b-thinking-decode-0323

qwen3-235b-a22b thinking trace, decode_only mode, internal vLLM (`/usr/local/bin/vllm`), tuned on the thinking_w20260323_1000 window with a TPOT <= 40 ms SLO.

## Setup

- Hardware: dash2, 8x H20
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Engine: internal vLLM, decode-only mode with `--kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'`
- Baseline topology: TP=4, DP=2, EP=8
- Trace: thinking_w20260323_1000
- Trace source: trace_windows/traces/thinking_w20260323_1000.jsonl
- Window duration: 600 s (10:00-10:10, 2026-03-23)
- Request mode: decode_only
- SLO:
  - pass target: 95%
  - TPOT <= 40 ms
  - TTFT not enforced
- Search:
  - sampling_u in [0, 0.125]
  - max_probes = 6
  - 12 trials total
- Proposal model: codex / gpt-5.4
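The feasibility rule above can be sketched as follows. This is a minimal illustration of the SLO check, not aituner's actual implementation: a trial passes when at least 95% of its requests keep per-token latency (TPOT) at or under 40 ms.

```python
# Hypothetical helper illustrating the SLO check described in this report.
TPOT_SLO_MS = 40.0
PASS_TARGET = 0.95

def pass_rate(tpots_ms):
    """Fraction of requests whose TPOT meets the 40 ms SLO."""
    return sum(t <= TPOT_SLO_MS for t in tpots_ms) / len(tpots_ms)

def feasible(tpots_ms):
    """A trial is feasible when the pass rate reaches the 95% target."""
    return pass_rate(tpots_ms) >= PASS_TARGET

# Example: 19 of 20 requests under 40 ms -> pass rate 0.95, feasible.
sample = [24.0] * 19 + [41.0]
print(pass_rate(sample))  # 0.95
print(feasible(sample))   # True
```

Note that TTFT is deliberately absent: only TPOT is enforced in this study.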

## Run assets

- Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology
- State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json
- Spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash2_qwen235b_decode_thinking_run1_0323_tpot40_topology.json
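For reference, a launch command for the winning configuration might look like the sketch below. Flag names are taken from this report's config keys; this is an internal vLLM build, so the exact spelling (in particular the expert-parallel flag, which differs in upstream vLLM) may not match upstream.

```shell
# Hypothetical launch sketch for trial-0007 (baseline topology + best delta).
# Flag names follow this report's config keys; the internal vLLM build may
# differ from upstream vLLM (upstream uses --enable-expert-parallel).
/usr/local/bin/vllm serve \
  /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --expert-parallel-size 8 \
  --gpu-memory-utilization 0.86 \
  --kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'
```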

## Best result

- Best trial: trial-0007
- Best config delta:
  - gpu-memory-utilization=0.86
- Best topology: unchanged from baseline
  - tensor-parallel-size=4
  - data-parallel-size=2
  - expert-parallel-size=8
- Best sampling_u: 0.033736228943
- Best request rate: 0.4833 req/s
- Best request rate per GPU: 0.0604 req/s/gpu
- Best pass rate: 0.9552

Compared with baseline:

- trial-0001: 0.4333 req/s (0.0542 req/s/gpu)
- trial-0007: 0.4833 req/s (0.0604 req/s/gpu)
- Throughput gain: 1.12x
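The derived figures above follow directly from the two measured request rates and the 8-GPU node size; the arithmetic is shown here for checkability, using the report's own numbers.

```python
# Reproduce the derived metrics from this report's measured request rates.
NUM_GPUS = 8  # dash2: 8x H20

baseline_rps = 0.43333333333333335  # trial-0001
best_rps = 0.48333333333333334      # trial-0007

per_gpu = best_rps / NUM_GPUS       # best rate spread across the node
gain = best_rps / baseline_rps      # relative throughput improvement

print(round(per_gpu, 6))  # 0.060417
print(round(gain, 2))     # 1.12
```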

Best-point latency:

- TPOT mean/p50/p90/p95/p99 = 26.18 / 24.04 / 38.46 / 39.55 / 40.76 ms

## 12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| trial-0001 | baseline TP4/DP2/EP8 | 0.4333 req/s, feasible |
| trial-0002 | EP=4 | launch fail |
| trial-0003 | gpu-memory-utilization=0.8, max-num-seqs=224 | infeasible |
| trial-0004 | gpu-memory-utilization=0.8 | infeasible |
| trial-0005 | gpu-memory-utilization=0.82 | 0.4650 req/s, feasible |
| trial-0006 | gpu-memory-utilization=0.84 | infeasible |
| trial-0007 | gpu-memory-utilization=0.86 | 0.4833 req/s, feasible, best |
| trial-0008 | gpu-memory-utilization=0.87 | infeasible |
| trial-0009 | gpu-memory-utilization=0.86, block-size=32 | launch fail |
| trial-0010 | gpu-memory-utilization=0.86, max-num-seqs=208 | infeasible |
| trial-0011 | gpu-memory-utilization=0.86, max-num-seqs=176 | infeasible |
| trial-0012 | gpu-memory-utilization=0.86, max-num-batched-tokens=896 | infeasible |

## Key insights

- This 0323 window did not produce a better topology than the baseline: the winning move was a small memory-headroom increase, not a TP/DP/EP change.
- EP=4 failed at launch under the current deployment shape, so the search quickly converged away from topology changes.
- The best point sits close to the SLO edge (TPOT p95 ~= 39.55 ms against a 40 ms target), leaving little remaining headroom.
- Compared with the heavier 0327 decode-only tuning, 0323 is a milder window: the baseline is already strong, and tuning adds only about 11.5%.

## Recommendation

For a 0323-like decode-only thinking window, keep the baseline TP4/DP2/EP8 topology and use:

- gpu-memory-utilization=0.86

Do not treat this run as evidence that 0323 prefers the same topology changes as 0327; this study mainly supports a small memory-headroom refinement.