# qwen235b-thinking-decode-0323

qwen3-235b-a22b `thinking` trace, `decode_only` mode, internal vLLM (`/usr/local/bin/vllm`), tuned on `thinking_w20260323_1000` with `TPOT <= 40ms`.

## Setup

- Hardware: `dash2`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, decode-only mode with `--kv-transfer-config {"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}`
- Baseline topology: `TP=4, DP=2, EP=8`
- Trace: `thinking_w20260323_1000`
  - Trace source: `trace_windows/traces/thinking_w20260323_1000.jsonl`
  - Window duration: `600s` (`10:00-10:10`, `2026-03-23`)
  - Request mode: `decode_only`
- SLO:
  - pass target: `95%`
  - `TPOT <= 40ms`
  - `TTFT` not enforced
- Search:
  - `sampling_u in [0, 0.125]`
  - `max_probes = 6`
  - `12` trials total
- Proposal model: `codex / gpt-5.4`

## Run assets

- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash2_qwen235b_decode_thinking_run1_0323_tpot40_topology.json`

## Best result

- Best trial: `trial-0007`
- Best config delta:
  - `gpu-memory-utilization=0.86`
- Best topology (unchanged from baseline):
  - `tensor-parallel-size=4`
  - `data-parallel-size=2`
  - `expert-parallel-size=8`
- Best `sampling_u`: `0.033736228943`
- Best request rate: `0.48333333333333334 req/s`
- Best request rate per GPU: `0.06041666666666667 req/s/gpu`
- Best pass rate: `0.9551724137931035`

Compared with baseline:

- `trial-0001`: `0.43333333333333335 req/s`, `0.05416666666666667 req/s/gpu`
- `trial-0007`: `0.48333333333333334 req/s`, `0.06041666666666667 req/s/gpu`
- Throughput gain: `1.12x`

Best-point latency:

- `TPOT mean/p50/p90/p95/p99 = 26.18 / 24.04 / 38.46 / 39.55 / 40.76 ms`

## 12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP4/DP2/EP8` | `0.4333 req/s`, feasible |
| `trial-0002` | `EP=4` | launch fail |
| `trial-0003` | `gpu-memory-utilization=0.8`, `max-num-seqs=224` | infeasible |
| `trial-0004` | `gpu-memory-utilization=0.8` | infeasible |
| `trial-0005` | `gpu-memory-utilization=0.82` | `0.4650 req/s`, feasible |
| `trial-0006` | `gpu-memory-utilization=0.84` | infeasible |
| `trial-0007` | `gpu-memory-utilization=0.86` | `0.4833 req/s`, feasible, best |
| `trial-0008` | `gpu-memory-utilization=0.87` | infeasible |
| `trial-0009` | `gpu-memory-utilization=0.86`, `block-size=32` | launch fail |
| `trial-0010` | `gpu-memory-utilization=0.86`, `max-num-seqs=208` | infeasible |
| `trial-0011` | `gpu-memory-utilization=0.86`, `max-num-seqs=176` | infeasible |
| `trial-0012` | `gpu-memory-utilization=0.86`, `max-num-batched-tokens=896` | infeasible |

## Key insights

- This `0323` window did not produce a better topology than the baseline. The winning move was a small memory-headroom increase, not a TP/DP/EP change.
- `EP=4` was not viable under the current deployment shape and failed at launch, so the run quickly converged away from topology changes.
- The best point is very close to the SLO edge: `TPOT p95 ~= 39.55ms`, so the remaining headroom is small (see the feasibility-check sketch below).
- Compared with the heavier `0327` decode-only tuning, `0323` is a milder window: the baseline is already strong, and tuning only adds about `11.5%`.
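The feasible/infeasible labels in the trial table reflect the per-request TPOT distribution against the `95%` pass target at `40ms`. Below is a minimal sketch of that check, assuming per-request TPOT samples in milliseconds are available; the tuner's actual aggregation may differ, and the helper names `tpot_pass_rate` / `is_feasible` are illustrative, not taken from the run assets.

```python
# Minimal sketch of the TPOT SLO check, assuming per-request TPOT samples (ms)
# are available. The real tuner's aggregation may differ; names are illustrative.

def tpot_pass_rate(tpot_ms: list[float], slo_ms: float = 40.0) -> float:
    """Fraction of requests whose time-per-output-token meets the SLO."""
    if not tpot_ms:
        return 0.0
    return sum(1 for t in tpot_ms if t <= slo_ms) / len(tpot_ms)


def is_feasible(tpot_ms: list[float], slo_ms: float = 40.0, pass_target: float = 0.95) -> bool:
    """A trial is feasible when at least `pass_target` of requests satisfy TPOT <= slo_ms."""
    return tpot_pass_rate(tpot_ms, slo_ms) >= pass_target
```

Under this reading, the best trial clears the `0.95` target only narrowly (pass rate `~0.9552`), consistent with its TPOT p95 sitting just under `40ms`.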
## Recommendation

For a `0323`-like decode-only `thinking` window, keep the baseline `TP4/DP2/EP8` topology and use:

- `gpu-memory-utilization=0.86`

Do not treat this run as evidence that `0323` prefers the same topology changes as `0327`; this study mainly supports a small residency-headroom refinement. A hedged launch sketch with these settings follows below.
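To make the recommendation concrete, here is a hedged sketch of assembling the recommended flags into a launch command. The flag names and the kv-transfer-config JSON are copied from this report; the binary is an internal vLLM build, so the exact CLI surface, the `serve` subcommand, and the model-path invocation below are assumptions rather than a verified command.

```python
# Hedged sketch: assemble the recommended configuration (baseline TP4/DP2/EP8
# plus the tuned gpu-memory-utilization=0.86) into a launch command for the
# internal vLLM binary. The `serve` subcommand and exact flag names are
# assumptions; adjust to the internal build's actual CLI.
import json
import subprocess

MODEL = "/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717"

kv_transfer_config = {"kv_connector": "DecodeBenchConnector", "kv_role": "kv_both"}

cmd = [
    "/usr/local/bin/vllm", "serve", MODEL,
    "--tensor-parallel-size", "4",        # baseline topology, kept unchanged
    "--data-parallel-size", "2",
    "--expert-parallel-size", "8",
    "--gpu-memory-utilization", "0.86",   # the only tuned delta (trial-0007)
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]

if __name__ == "__main__":
    subprocess.run(cmd, check=True)
```

Note that the trials which additionally lowered `max-num-seqs`, changed `block-size`, or capped `max-num-batched-tokens` were infeasible or failed to launch, so none of those flags are carried into this sketch.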