From ee9ec3c60bc9505893ad55bb4b7d49a92c32c10b Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Mon, 13 Apr 2026 09:33:02 +0800
Subject: [PATCH] docs: add qwen235b decode 0323 summary

---
 docs/qwen235b-thinking-decode/0323.md | 85 +++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100644 docs/qwen235b-thinking-decode/0323.md

diff --git a/docs/qwen235b-thinking-decode/0323.md b/docs/qwen235b-thinking-decode/0323.md
new file mode 100644
index 0000000..f4d52a3
--- /dev/null
+++ b/docs/qwen235b-thinking-decode/0323.md
@@ -0,0 +1,85 @@
+# qwen235b-thinking-decode-0323
+
+qwen3-235b-a22b `thinking` trace, `decode_only` mode, internal vLLM (`/usr/local/bin/vllm`), tuned on `thinking_w20260323_1000` with `TPOT <= 40ms`.
+
+## Setup
+
+- Hardware: `dash2`, `8x H20`
+- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
+- Engine: internal vLLM, decode-only mode with `--kv-transfer-config {"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}`
+- Baseline topology: `TP=4, DP=2, EP=8`
+- Trace: `thinking_w20260323_1000`
+- Trace source: `trace_windows/traces/thinking_w20260323_1000.jsonl`
+- Window duration: `600s` (`10:00-10:10`, `2026-03-23`)
+- Request mode: `decode_only`
+- SLO:
+  - pass target: `95%`
+  - `TPOT <= 40ms`
+  - `TTFT` not enforced
+- Search:
+  - `sampling_u in [0, 0.125]`
+  - `max_probes = 6`
+  - `12` trials total
+- Proposal model: `codex / gpt-5.4`
+
+## Run assets
+
+- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology`
+- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json`
+- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash2_qwen235b_decode_thinking_run1_0323_tpot40_topology.json`
+
+## Best result
+
+- Best trial: `trial-0007`
+- Best config delta:
+  - `gpu-memory-utilization=0.86`
+- Best topology stayed unchanged:
+  - `tensor-parallel-size=4`
+  - `data-parallel-size=2`
+  - `expert-parallel-size=8`
+- Best `sampling_u`: `0.033736228943`
+- Best request rate: `0.48333333333333334 req/s`
+- Best request rate per GPU: `0.06041666666666667 req/s/gpu`
+- Best pass rate: `0.9551724137931035`
+
+Compared with baseline:
+
+- `trial-0001`: `0.43333333333333335 req/s`, `0.05416666666666667 req/s/gpu`
+- `trial-0007`: `0.48333333333333334 req/s`, `0.06041666666666667 req/s/gpu`
+- Throughput gain: `1.12x`
+
+Best-point latency:
+
+- `TPOT mean/p50/p90/p95/p99 = 26.18 / 24.04 / 38.46 / 39.55 / 40.76 ms`
+
+## 12-trial summary
+
+| Trial | Proposed config delta | Result |
+| --- | --- | --- |
+| `trial-0001` | baseline `TP4/DP2/EP8` | `0.4333 req/s`, feasible |
+| `trial-0002` | `EP=4` | launch fail |
+| `trial-0003` | `gpu-memory-utilization=0.8`, `max-num-seqs=224` | infeasible |
+| `trial-0004` | `gpu-memory-utilization=0.8` | infeasible |
+| `trial-0005` | `gpu-memory-utilization=0.82` | `0.4650 req/s`, feasible |
+| `trial-0006` | `gpu-memory-utilization=0.84` | infeasible |
+| `trial-0007` | `gpu-memory-utilization=0.86` | `0.4833 req/s`, feasible, best |
+| `trial-0008` | `gpu-memory-utilization=0.87` | infeasible |
+| `trial-0009` | `gpu-memory-utilization=0.86`, `block-size=32` | launch fail |
+| `trial-0010` | `gpu-memory-utilization=0.86`, `max-num-seqs=208` | infeasible |
+| `trial-0011` | `gpu-memory-utilization=0.86`, `max-num-seqs=176` | infeasible |
+| `trial-0012` | `gpu-memory-utilization=0.86`, `max-num-batched-tokens=896` | infeasible |
+
+## Key insights
+
+- This `0323` window did not produce a better topology than the baseline. The winning move was a small memory-headroom increase, not a TP/DP/EP change.
+- `EP=4` was not viable under the current deployment shape and failed at launch, so the run quickly converged away from topology changes.
+- The best point is very close to the SLO edge: `TPOT p95 ~= 39.55ms`, so the remaining headroom is small.
+- Compared with the heavier `0327` decode-only tuning, `0323` is a milder window: baseline is already strong, and tuning only adds about `11.5%`.
+
+## Recommendation
+
+For a `0323`-like decode-only `thinking` window, keep the baseline `TP4/DP2/EP8` topology and use:
+
+- `gpu-memory-utilization=0.86`
+
+Do not treat this run as evidence that `0323` prefers the same topology changes as `0327`; this study mainly supports a small residency-headroom refinement.
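As a sanity check on the headline numbers, the per-GPU rate, the `1.12x` gain, and the SLO pass comparison can be reproduced from values copied verbatim out of the summary above. This is an illustrative script only; `NUM_GPUS` is a name introduced here, reflecting the `8x H20` host:

```python
# Illustrative check of the headline numbers in this summary.
# All inputs are copied from the tables above; nothing here is measured.

NUM_GPUS = 8  # dash2, 8x H20

baseline_rate = 0.43333333333333335  # trial-0001, req/s
best_rate = 0.48333333333333334      # trial-0007, req/s

# Per-GPU throughput at the best point.
best_rate_per_gpu = best_rate / NUM_GPUS
print(f"best per-GPU rate: {best_rate_per_gpu:.6f} req/s/gpu")  # 0.060417

# Throughput gain of trial-0007 over the baseline.
gain = best_rate / baseline_rate
print(f"gain: {gain:.2f}x")  # 1.12x

# Feasibility under the study's SLO: at least 95% of requests
# must meet TPOT <= 40ms (TTFT is not enforced).
pass_rate = 0.9551724137931035
print(f"feasible: {pass_rate >= 0.95}")  # True
```

The `1.12x` figure is the rounded ratio `0.4833 / 0.4333`, i.e. the roughly `11.5%` uplift cited in the key insights.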
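For reference, the recommended settings assemble into a launch command along these lines. This is a hedged sketch, not a verified invocation: the study uses an internal vLLM build, so the exact entry point and flag spellings may differ; the flag names are taken from this document, and the `serve` subcommand is an assumption.

```shell
# Hypothetical launch sketch for the recommended 0323 decode-only setup.
# Flag spellings follow this document's naming; verify against the
# internal vLLM build before use.
/usr/local/bin/vllm serve \
  /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --expert-parallel-size 8 \
  --gpu-memory-utilization 0.86 \
  --kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'
```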