# qwen235b-thinking-decode-0323
qwen3-235b-a22b thinking trace, decode_only mode, internal vLLM (/usr/local/bin/vllm), tuned on thinking_w20260323_1000 with a 95% TPOT <= 40 ms target.
## Setup

- Hardware: dash2, 8x H20
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Engine: internal vLLM, decode-only mode with --kv-transfer-config {"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"} (launch sketch below)
- Baseline topology: TP=4, DP=2, EP=8
- Trace: thinking_w20260323_1000
- Trace source: trace_windows/traces/thinking_w20260323_1000.jsonl
- Window duration: 600 s (10:00-10:10, 2026-03-23)
- Request mode: decode_only
- SLO:
  - pass target: 95% of requests with TPOT <= 40 ms; TTFT not enforced
- Search: sampling_u in [0, 0.125], max_probes = 6, 12 trials total
- Proposal model: codex / gpt-5.4
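
For reference, a minimal sketch of how the baseline launch flags above could be assembled. The `serve` subcommand and the exact flag spellings accepted by the internal vLLM build are assumptions; all values are taken from the Setup list.

```python
# Sketch only: rebuild the baseline decode-only launch command from the flags
# listed in this report. The internal vLLM binary is assumed to accept these
# flag names (tensor/data/expert parallel sizes, kv-transfer-config).
import json
import shlex

MODEL = "/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717"

# Connector JSON copied verbatim from the Engine bullet above.
kv_transfer_config = {"kv_connector": "DecodeBenchConnector", "kv_role": "kv_both"}

baseline_args = [
    "/usr/local/bin/vllm", "serve", MODEL,
    "--tensor-parallel-size", "4",
    "--data-parallel-size", "2",
    "--expert-parallel-size", "8",
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]

print(shlex.join(baseline_args))
```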
## Run assets

- Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology
- State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json (inspection sketch below)
- Spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash2_qwen235b_decode_thinking_run1_0323_tpot40_topology.json
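The state file is plain JSON; a minimal sketch for peeking at its top-level structure when auditing the run, without assuming anything about its schema:

```python
# Print the top-level layout of the study's state.json; the schema is not
# documented in this report, so no specific keys are assumed.
import json

STATE = ("/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/"
         "dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json")

with open(STATE) as f:
    state = json.load(f)

if isinstance(state, dict):
    for key, value in state.items():
        print(key, type(value).__name__)
else:
    print(type(state).__name__, len(state))
```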
## Best result

- Best trial: trial-0007
- Best config delta: gpu-memory-utilization=0.86
- Best topology (unchanged from baseline): tensor-parallel-size=4, data-parallel-size=2, expert-parallel-size=8
- Best sampling_u: 0.033736228943
- Best request rate: 0.48333333333333334 req/s
- Best request rate per GPU: 0.06041666666666667 req/s/gpu
- Best pass rate: 0.9551724137931035
- Compared with baseline:
  - trial-0001: 0.43333333333333335 req/s, 0.05416666666666667 req/s/gpu
  - trial-0007: 0.48333333333333334 req/s, 0.06041666666666667 req/s/gpu
- Throughput gain: 1.12x (recomputed in the check below)
- Best-point latency: TPOT mean/p50/p90/p95/p99 = 26.18 / 24.04 / 38.46 / 39.55 / 40.76 ms
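
The derived figures follow directly from the raw per-trial numbers; a quick sanity check is shown below. The pass-rate helper assumes per-request TPOT samples are available from the benchmark output (tpot_ms is a hypothetical variable, not an artifact of this run).

```python
# Recompute the derived metrics reported above from the raw per-trial numbers.
NUM_GPUS = 8  # dash2: 8x H20

baseline_rps = 0.43333333333333335   # trial-0001
best_rps = 0.48333333333333334       # trial-0007

print(f"per-GPU rate: {best_rps / NUM_GPUS:.6f} req/s/gpu")   # ~0.060417
print(f"throughput gain: {best_rps / baseline_rps:.3f}x")     # ~1.115x, reported as 1.12x

def tpot_pass_rate(tpot_ms, slo_ms=40.0):
    # Share of requests whose per-output-token latency meets the 40 ms target.
    return sum(t <= slo_ms for t in tpot_ms) / len(tpot_ms)

# A trial counts as feasible when tpot_pass_rate(...) >= 0.95;
# trial-0007 reports ~0.9552, i.e. just above the target.
```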
## 12-trial summary

| Trial | Proposed config delta | Result |
|---|---|---|
| trial-0001 | baseline TP4/DP2/EP8 | 0.4333 req/s, feasible |
| trial-0002 | EP=4 | launch fail |
| trial-0003 | gpu-memory-utilization=0.8, max-num-seqs=224 | infeasible |
| trial-0004 | gpu-memory-utilization=0.8 | infeasible |
| trial-0005 | gpu-memory-utilization=0.82 | 0.4650 req/s, feasible |
| trial-0006 | gpu-memory-utilization=0.84 | infeasible |
| trial-0007 | gpu-memory-utilization=0.86 | 0.4833 req/s, feasible, best |
| trial-0008 | gpu-memory-utilization=0.87 | infeasible |
| trial-0009 | gpu-memory-utilization=0.86, block-size=32 | launch fail |
| trial-0010 | gpu-memory-utilization=0.86, max-num-seqs=208 | infeasible |
| trial-0011 | gpu-memory-utilization=0.86, max-num-seqs=176 | infeasible |
| trial-0012 | gpu-memory-utilization=0.86, max-num-batched-tokens=896 | infeasible |
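
Each proposed delta in the table is applied on top of the fixed baseline rather than replacing it. A minimal sketch of that merge is below; the flag spellings follow the report, and the merge logic is illustrative, not the tuner's actual code.

```python
# Illustrative only: map a trial's "config delta" onto the fixed baseline.
# Topology flags stay as-is and only the listed knobs are overridden or added.
BASELINE = {
    "tensor-parallel-size": 4,
    "data-parallel-size": 2,
    "expert-parallel-size": 8,
}

def trial_flags(delta):
    cfg = {**BASELINE, **delta}
    return [f"--{flag}={value}" for flag, value in cfg.items()]

# trial-0007, the winning point: baseline topology plus more memory headroom.
print(trial_flags({"gpu-memory-utilization": 0.86}))
# trial-0003: memory change plus a tighter concurrency cap (infeasible in this run).
print(trial_flags({"gpu-memory-utilization": 0.8, "max-num-seqs": 224}))
```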
## Key insights

- This 0323 window did not produce a better topology than the baseline. The winning move was a small memory-headroom increase, not a TP/DP/EP change. EP=4 was not viable under the current deployment shape and failed at launch, so the run quickly converged away from topology changes.
- The best point sits very close to the SLO edge: TPOT p95 ~= 39.55 ms, so the remaining headroom is small.
- Compared with the heavier 0327 decode-only tuning, 0323 is a milder window: the baseline is already strong, and tuning only adds about 11.5%.
## Recommendation

For a 0323-like decode-only thinking window, keep the baseline TP4/DP2/EP8 topology and set gpu-memory-utilization=0.86.

Do not treat this run as evidence that 0323 prefers the same topology changes as 0327; this study mainly supports a small residency-headroom refinement.