# qwen235b-thinking-decode
Tuning run for the qwen3-235b-a22b `thinking` trace in `decode_only` mode on internal vLLM (`/usr/local/bin/vllm`). SLO: p95-equivalent pass target `95%`, `TPOT <= 40ms`; `TTFT` not enforced.
## Setup
- Hardware: `dash0`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, decode-only mode with `--kv-transfer-config {"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}` (see the launch sketch below)
- Baseline topology: `TP=4, DP=2, EP=8`
- Trace: `thinking_w20260327_1000`
- Trace source: `trace_windows/traces/thinking_w20260327_1000.jsonl`
- Window duration: `600s` (`10:00-10:10`, `2026-03-27`)
- Request mode: `decode_only`
- SLO:
  - pass target: `95%`
  - `TPOT <= 40ms`
  - `TTFT` not enforced
- Search:
  - `sampling_u in [0, 0.125]`
  - `max_probes = 6`
  - `12` trials total
- Proposal model: `codex / gpt-5.4`
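As a concrete reference, the engine and topology settings above map onto roughly the following launch command. This is a hedged sketch: the `vllm serve` form and flag spellings (in particular `--expert-parallel-size`) are assumptions about the internal build; the model path and connector JSON are copied verbatim from the setup.

```bash
# Hedged sketch of the baseline (trial-0001) decode-only launch.
# Flag names mirror the knobs listed above; the internal build may name
# some of them differently.
/usr/local/bin/vllm serve \
  /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --expert-parallel-size 8 \
  --kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'
```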
## Run assets
- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash0-qwen235b-decode-thinking-run5-tpot40-topology/state.json`
- Log: `/home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.log`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash0_qwen235b_decode_thinking_run5_tpot40_topology.json`
## Best result
- Best trial: `trial-0009`
- Best config:
  - `tensor-parallel-size=2`
  - `data-parallel-size=4`
  - `expert-parallel-size=8`
  - `max-num-seqs=128`
  - `max-num-batched-tokens=256`
- Best `sampling_u`: `0.013218402863`
- Best request rate: `0.2816666666666667 req/s`
- Best pass rate: `0.9704142011834319`
Compared with baseline:

- `trial-0001`: `0.12666666666666668 req/s`
- `trial-0009`: `0.2816666666666667 req/s`
- Throughput gain: `2.22x` (`0.2817 / 0.1267 ≈ 2.22`)
Best-point latency:

- `TPOT mean/p50/p90/p95/p99 = 27.49 / 26.50 / 38.73 / 39.57 / 40.73 ms`
- `TTFT mean/p50/p90/p95/p99 = 216.33 / 199.62 / 298.56 / 308.91 / 317.00 ms`
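These percentiles line up with the reported best pass rate: p95 TPOT (`39.57 ms`) sits just under the `40 ms` bound while p99 (`40.73 ms`) sits just above it, so a per-request pass rate near `0.97` is consistent. A minimal sketch of that per-request check, assuming a placeholder input file with one TPOT value in milliseconds per line (not an artifact this harness emits):

```bash
# Hedged sketch: fraction of requests meeting the TPOT <= 40 ms SLO.
# "tpot_ms.txt" (one per-request TPOT value in ms per line) is a placeholder
# input, not a file produced by this study.
awk '$1 <= 40 { pass++ } { total++ } END { printf "pass rate: %.4f\n", pass / total }' tpot_ms.txt
```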
## 12-trial summary
| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline | `0.1267 req/s`, feasible |
| `trial-0002` | `TP=2, DP=4` | `0.2450 req/s`, feasible |
| `trial-0003` | `TP=1, DP=8, EP=8` | infeasible |
| `trial-0004` | `TP=2, DP=4, EP=4` | launch fail |
| `trial-0005` | `gpu-memory-utilization=0.8`, `max-num-seqs=256` | infeasible |
| `trial-0006` | `max-num-seqs=128` | infeasible |
| `trial-0007` | `block-size=128` | infeasible |
| `trial-0008` | `max-num-batched-tokens=384` | infeasible |
| `trial-0009` | `TP=2, DP=4, EP=8`, `max-num-seqs=128`, `max-num-batched-tokens=256` | `0.2817 req/s`, feasible, best |
| `trial-0010` | `trial-0009 + block-size=128` | infeasible |
| `trial-0011` | `TP=1, DP=8, EP=8`, `max-num-seqs=128`, `max-num-batched-tokens=256` | infeasible |
| `trial-0012` | `TP=2, DP=4, EP=8`, `max-num-seqs=96`, `max-num-batched-tokens=192` | infeasible |
## Key insights
- Removing `VLLM_USE_FLASHINFER_SAMPLER`, `CUDA_DEVICE_MAX_CONNECTIONS`, and `VLLM_ENABLE_TBO_OPT` from the tunable space did not block progress; the win came from the parallelism (topology) search space.
- The first strong improvement was topology: `TP4/DP2/EP8 -> TP2/DP4/EP8`.
- `TP1/DP8/EP8` launched but did not beat `TP2/DP4/EP8`.
- `EP4` under `TP2/DP4` failed at launch and should be treated as negative evidence for this stack.
- After topology settled at `TP2/DP4/EP8`, the useful runtime refinement was tighter decode batching: `max-num-seqs=128`, `max-num-batched-tokens=256`.
- Harness mechanism and ablation notes are in `one-shot-mechanism-ablation-20260502.md`.
## Current recommendation
Use the `trial-0009` shape as the default decode-only serving config for this workload (launch sketch below):

- `tensor-parallel-size=2`
- `data-parallel-size=4`
- `expert-parallel-size=8`
- `max-num-seqs=128`
- `max-num-batched-tokens=256`
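A hedged launch sketch for this shape, under the same assumption as in the setup section that the internal build accepts these knobs as same-named CLI flags:

```bash
# Hedged sketch of the recommended trial-0009 serving shape.
/usr/local/bin/vllm serve \
  /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --expert-parallel-size 8 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 256 \
  --kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'
```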