# qwen235b-thinking-decode-0323

qwen3-235b-a22b thinking trace, decode_only mode, internal vLLM (`/usr/local/bin/vllm`), tuned on the thinking_w20260323_1000 window with a TPOT <= 40 ms SLO.

## Setup

- Hardware: dash2, 8x H20
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Engine: internal vLLM, decode-only mode with `--kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'`
- Baseline topology: TP=4, DP=2, EP=8
- Trace: thinking_w20260323_1000
- Trace source: trace_windows/traces/thinking_w20260323_1000.jsonl
- Window duration: 600 s (10:00-10:10, 2026-03-23)
- Request mode: decode_only
- SLO:
  - pass target: 95%
  - TPOT <= 40 ms
  - TTFT not enforced
- Search:
  - sampling_u in [0, 0.125]
  - max_probes = 6
  - 12 trials total
- Proposal model: codex / gpt-5.4
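The feasibility rule above can be sketched as follows. This is a minimal illustration of the SLO check, not aituner's actual implementation: a trial passes when at least 95% of its requests keep per-token latency (TPOT) at or under 40 ms.

```python
# Hypothetical helper illustrating the SLO check described in this report.
TPOT_SLO_MS = 40.0
PASS_TARGET = 0.95

def pass_rate(tpots_ms):
    """Fraction of requests whose TPOT meets the 40 ms SLO."""
    return sum(t <= TPOT_SLO_MS for t in tpots_ms) / len(tpots_ms)

def feasible(tpots_ms):
    """A trial is feasible when the pass rate reaches the 95% target."""
    return pass_rate(tpots_ms) >= PASS_TARGET

# Example: 19 of 20 requests under 40 ms -> pass rate 0.95, feasible.
sample = [24.0] * 19 + [41.0]
print(pass_rate(sample))  # 0.95
print(feasible(sample))   # True
```

Note that TTFT is deliberately absent: only TPOT is enforced in this study.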

## Run assets

- Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology
- State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json
- Spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash2_qwen235b_decode_thinking_run1_0323_tpot40_topology.json
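For reference, a launch command for the winning configuration might look like the sketch below. Flag names are taken from this report's config keys; this is an internal vLLM build, so the exact spelling (in particular the expert-parallel flag, which differs in upstream vLLM) may not match upstream.

```shell
# Hypothetical launch sketch for trial-0007 (baseline topology + best delta).
# Flag names follow this report's config keys; the internal vLLM build may
# differ from upstream vLLM (upstream uses --enable-expert-parallel).
/usr/local/bin/vllm serve \
  /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --expert-parallel-size 8 \
  --gpu-memory-utilization 0.86 \
  --kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'
```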

## Best result

- Best trial: trial-0007
- Best config delta:
  - gpu-memory-utilization=0.86
- Best topology: unchanged from baseline
  - tensor-parallel-size=4
  - data-parallel-size=2
  - expert-parallel-size=8
- Best sampling_u: 0.033736228943
- Best request rate: 0.4833 req/s
- Best request rate per GPU: 0.0604 req/s/gpu
- Best pass rate: 0.9552

Compared with baseline:

- trial-0001: 0.4333 req/s (0.0542 req/s/gpu)
- trial-0007: 0.4833 req/s (0.0604 req/s/gpu)
- Throughput gain: 1.12x
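The derived figures above follow directly from the two measured request rates and the 8-GPU node size; the arithmetic is shown here for checkability, using the report's own numbers.

```python
# Reproduce the derived metrics from this report's measured request rates.
NUM_GPUS = 8  # dash2: 8x H20

baseline_rps = 0.43333333333333335  # trial-0001
best_rps = 0.48333333333333334      # trial-0007

per_gpu = best_rps / NUM_GPUS       # best rate spread across the node
gain = best_rps / baseline_rps      # relative throughput improvement

print(round(per_gpu, 6))  # 0.060417
print(round(gain, 2))     # 1.12
```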

Best-point latency:

- TPOT mean/p50/p90/p95/p99 = 26.18 / 24.04 / 38.46 / 39.55 / 40.76 ms

## 12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| trial-0001 | baseline TP4/DP2/EP8 | 0.4333 req/s, feasible |
| trial-0002 | EP=4 | launch fail |
| trial-0003 | gpu-memory-utilization=0.8, max-num-seqs=224 | infeasible |
| trial-0004 | gpu-memory-utilization=0.8 | infeasible |
| trial-0005 | gpu-memory-utilization=0.82 | 0.4650 req/s, feasible |
| trial-0006 | gpu-memory-utilization=0.84 | infeasible |
| trial-0007 | gpu-memory-utilization=0.86 | 0.4833 req/s, feasible, best |
| trial-0008 | gpu-memory-utilization=0.87 | infeasible |
| trial-0009 | gpu-memory-utilization=0.86, block-size=32 | launch fail |
| trial-0010 | gpu-memory-utilization=0.86, max-num-seqs=208 | infeasible |
| trial-0011 | gpu-memory-utilization=0.86, max-num-seqs=176 | infeasible |
| trial-0012 | gpu-memory-utilization=0.86, max-num-batched-tokens=896 | infeasible |

## Key insights

- This 0323 window did not produce a better topology than the baseline: the winning move was a small memory-headroom increase, not a TP/DP/EP change.
- EP=4 failed at launch under the current deployment shape, so the search quickly converged away from topology changes.
- The best point sits close to the SLO edge (TPOT p95 ~= 39.55 ms against a 40 ms target), leaving little remaining headroom.
- Compared with the heavier 0327 decode-only tuning, 0323 is a milder window: the baseline is already strong, and tuning adds only about 11.5%.

## Recommendation

For a 0323-like decode-only thinking window, keep the baseline TP4/DP2/EP8 topology and use:

- gpu-memory-utilization=0.86

Do not treat this run as evidence that 0323 prefers the same topology changes as 0327; this study mainly supports a small memory-headroom refinement.