From b960607d8f45dd6303cf609806a9f77c556ecab6 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Fri, 10 Apr 2026 17:33:08 +0800
Subject: [PATCH] Add qwen235b thinking decode tuning note

---
 docs/qwen235b-thinking-decode/README.md | 69 +++++++++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100644 docs/qwen235b-thinking-decode/README.md

diff --git a/docs/qwen235b-thinking-decode/README.md b/docs/qwen235b-thinking-decode/README.md
new file mode 100644
index 0000000..9cdf212
--- /dev/null
+++ b/docs/qwen235b-thinking-decode/README.md
@@ -0,0 +1,69 @@

# qwen235b-thinking-decode

Tuning note for the qwen3-235b-a22b `thinking` trace in `decode_only` mode on the internal vLLM build (`/usr/local/bin/vllm`). SLO: `p95-equivalent pass target 95%` with `TPOT <= 40ms`; `TTFT` is not enforced.

## Run assets

- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology/state.json`
- Log: `/home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.log`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.json`

## Best result

- Best trial: `trial-0009`
- Best config:
  - `tensor-parallel-size=2`
  - `data-parallel-size=4`
  - `expert-parallel-size=8`
  - `max-num-seqs=128`
  - `max-num-batched-tokens=256`
- Best `sampling_u`: `0.013218402863`
- Best request rate: `0.2816666666666667 req/s`
- Best pass rate: `0.9704142011834319`

Compared with the baseline:

- `trial-0001`: `0.12666666666666668 req/s`
- `trial-0009`: `0.2816666666666667 req/s`
- Throughput gain: `2.22x` (`0.2817 / 0.1267 ≈ 2.22`)

Best-point latency:

- `TPOT mean/p50/p90/p95/p99 = 27.49 / 26.50 / 38.73 / 39.57 / 40.73 ms`
- `TTFT mean/p50/p90/p95/p99 = 216.33 / 199.62 / 298.56 / 308.91 / 317.00 ms`

## 12-trial summary

Config deltas are relative to the `trial-0001` baseline topology `TP4/DP2/EP8` (TP = tensor parallel, DP = data parallel, EP = expert parallel). In the result column, `infeasible` means the server launched but the trial missed the SLO pass target; `launch fail` means the server did not come up.

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline | `0.1267 req/s`, feasible |
| `trial-0002` | `TP=2, DP=4` | `0.2450 req/s`, feasible |
| `trial-0003` | `TP=1, DP=8, EP=8` | infeasible |
| `trial-0004` | `TP=2, DP=4, EP=4` | launch fail |
| `trial-0005` | `gpu-memory-utilization=0.8`, `max-num-seqs=256` | infeasible |
| `trial-0006` | `max-num-seqs=128` | infeasible |
| `trial-0007` | `block-size=128` | infeasible |
| `trial-0008` | `max-num-batched-tokens=384` | infeasible |
| `trial-0009` | `TP=2, DP=4, EP=8`, `max-num-seqs=128`, `max-num-batched-tokens=256` | `0.2817 req/s`, feasible, best |
| `trial-0010` | `trial-0009 + block-size=128` | infeasible |
| `trial-0011` | `TP=1, DP=8, EP=8`, `max-num-seqs=128`, `max-num-batched-tokens=256` | infeasible |
| `trial-0012` | `TP=2, DP=4, EP=8`, `max-num-seqs=96`, `max-num-batched-tokens=192` | infeasible |

## Key insights

- Removing `VLLM_USE_FLASHINFER_SAMPLER`, `CUDA_DEVICE_MAX_CONNECTIONS`, and `VLLM_ENABLE_TBO_OPT` from the tunable space did not block progress; the win came from the parallelism search space.
- The first strong improvement was topology: `TP4/DP2/EP8 -> TP2/DP4/EP8`. Note that every explored topology keeps `TP*DP = 8`, so the replica size stays fixed and only the TP/DP split moves.
- `TP1/DP8/EP8` launched, but was infeasible in both attempts (`trial-0003`, `trial-0011`), so it never beat `TP2/DP4/EP8`.
- `EP4` under `TP2/DP4` failed at launch and should be treated as negative evidence for this stack.
- After topology settled at `TP2/DP4/EP8`, the useful runtime refinement was tighter decode batching: `max-num-seqs=128`, `max-num-batched-tokens=256`. A quick SLO re-check on the reported numbers is sketched below.
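As a sanity check that the best point clears the SLO, here is a minimal sketch that replays the feasibility arithmetic on the numbers reported above, assuming the pass criterion is per-request `TPOT <= 40ms` with a 95% target (the `p95-equivalent` reading of the header). It is plain shell plus `awk`, so it needs no cluster access.

```bash
# Minimal sketch: re-check trial-0009 against the SLO using only the
# numbers reported in this note (pass target 0.95, TPOT bound 40 ms).
pass_rate=0.9704142011834319
tpot_p95_ms=39.57
awk -v p="$pass_rate" -v t="$tpot_p95_ms" 'BEGIN {
  ok = (p >= 0.95 && t <= 40.0)
  printf "pass_rate=%.4f (>= 0.95), TPOT p95=%.2f ms (<= 40 ms) -> %s\n",
         p, t, ok ? "PASS" : "FAIL"
}'
```

Consistent with this reading, the p95 TPOT (`39.57 ms`) sits under the bound while the p99 (`40.73 ms`) does not, which matches a pass rate of about 97%.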
## Current recommendation

Use the `trial-0009` shape as the default decode-only serving config for this workload:

- `tensor-parallel-size=2`
- `data-parallel-size=4`
- `expert-parallel-size=8`
- `max-num-seqs=128`
- `max-num-batched-tokens=256`
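For convenience, a minimal launch sketch with this shape follows. The flag names are the ones used throughout this note for the internal vLLM build; the model path and port are placeholders, not values from the study, and `expert-parallel-size` in particular should be checked against the internal build's CLI (upstream vLLM enables expert parallelism via `--enable-expert-parallel` instead).

```bash
# Sketch of a decode-only launch with the trial-0009 shape.
# Flag names follow this note's internal vLLM build; the model path
# and port below are placeholders, not values from the study.
/usr/local/bin/vllm serve /path/to/qwen3-235b-a22b \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --expert-parallel-size 8 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 256 \
  --port 8000
```

Re-validate against the SLO after any vLLM upgrade: `trial-0010` shows that adding even a single extra knob (`block-size=128`) flipped this shape to infeasible.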