# qwen235b-thinking-decode-0323

Tuning summary for qwen3-235b-a22b on the `thinking` trace in `decode_only` mode, served by internal vLLM (`/usr/local/bin/vllm`) and tuned on the `thinking_w20260323_1000` window under a `TPOT <= 40ms` SLO.

## Setup

- Hardware: `dash2`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, decode-only mode with `--kv-transfer-config {"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}` (launch sketch after this list)
- Baseline topology: `TP=4, DP=2, EP=8`
- Trace: `thinking_w20260323_1000`
  - Trace source: `trace_windows/traces/thinking_w20260323_1000.jsonl`
  - Window duration: `600s` (`10:00-10:10`, `2026-03-23`)
  - Request mode: `decode_only`
- SLO:
  - pass target: `95%`
  - `TPOT <= 40ms`
  - `TTFT` not enforced
- Search:
  - `sampling_u in [0, 0.125]`
  - `max_probes = 6`
  - `12` trials total
  - Proposal model: `codex / gpt-5.4`
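
For concreteness, a launch sketch for the baseline deployment follows. It assumes the internal build exposes a `vllm serve`-style CLI with the flag names used in this report; `DecodeBenchConnector` and `expert-parallel-size` are internal pieces rather than open-source vLLM features, so treat this as illustrative only.

```python
# Illustrative only: assemble the baseline launch command from the values in
# this report. The internal build's actual entrypoint and flags may differ.
import json
import subprocess

MODEL = "/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717"

# Decode-only benchmarking connector named in this report (internal component).
kv_transfer = {"kv_connector": "DecodeBenchConnector", "kv_role": "kv_both"}

cmd = [
    "/usr/local/bin/vllm", "serve", MODEL,
    "--tensor-parallel-size", "4",   # baseline TP=4
    "--data-parallel-size", "2",     # baseline DP=2
    "--expert-parallel-size", "8",   # baseline EP=8 (internal flag)
    "--kv-transfer-config", json.dumps(kv_transfer),
]
subprocess.run(cmd, check=True)  # blocks for the lifetime of the server
```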

## Run assets

- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash2_qwen235b_decode_thinking_run1_0323_tpot40_topology.json`

## Best result

- Best trial: `trial-0007`
- Best config delta:
  - `gpu-memory-utilization=0.86`
- Best topology stayed unchanged:
  - `tensor-parallel-size=4`
  - `data-parallel-size=2`
  - `expert-parallel-size=8`
- Best `sampling_u`: `0.033736228943`
- Best request rate: `0.48333333333333334 req/s`
- Best request rate per GPU: `0.06041666666666667 req/s/gpu`
- Best pass rate: `0.9551724137931035`

Compared with baseline:

- `trial-0001`: `0.43333333333333335 req/s`, `0.05416666666666667 req/s/gpu`
- `trial-0007`: `0.48333333333333334 req/s`, `0.06041666666666667 req/s/gpu`
- Throughput gain: `1.12x`
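
As a quick cross-check, the gain and per-GPU figures follow directly from the two trial rates (plain arithmetic, nothing report-specific):

```python
# Recompute the comparison figures from the raw trial rates.
baseline = 0.43333333333333335  # trial-0001, req/s
best = 0.48333333333333334      # trial-0007, req/s
gpus = 8                        # 8x H20

print(best / baseline)  # 1.1153..., reported as 1.12x (~11.5% gain)
print(baseline / gpus)  # 0.054166..., trial-0001 req/s/gpu
print(best / gpus)      # 0.060416..., trial-0007 req/s/gpu
```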

Best-point latency:

- `TPOT mean/p50/p90/p95/p99 = 26.18 / 24.04 / 38.46 / 39.55 / 40.76 ms`
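
The feasible/infeasible verdicts in this report reduce to one check: the fraction of requests whose TPOT meets the `40ms` SLO, compared against the `95%` pass target. A minimal sketch follows; `tpot_ms` is a hypothetical per-request sample array, and aituner's actual bookkeeping is not shown in this report.

```python
# Sketch: derive the SLO verdict and the latency summary from per-request
# TPOT samples. `tpot_ms` is hypothetical input; the real tuner may differ.
import numpy as np

def slo_verdict(tpot_ms: np.ndarray, slo_ms: float = 40.0, target: float = 0.95) -> bool:
    pass_rate = float(np.mean(tpot_ms <= slo_ms))  # fraction of requests within SLO
    return pass_rate >= target                     # feasible iff pass rate >= target

def tpot_summary(tpot_ms: np.ndarray) -> dict:
    # mean/p50/p90/p95/p99, mirroring the report's latency line
    return {"mean": float(tpot_ms.mean()),
            **{f"p{q}": float(np.percentile(tpot_ms, q)) for q in (50, 90, 95, 99)}}
```

The reported pass rate `0.9551724...` is consistent with, for example, 277 of 290 requests finishing within `40ms`, and with `p95 ~= 39.55ms` landing just inside the limit while `p99 ~= 40.76ms` falls outside it.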

## 12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP4/DP2/EP8` | `0.4333 req/s`, feasible |
| `trial-0002` | `EP=4` | launch fail |
| `trial-0003` | `gpu-memory-utilization=0.8`, `max-num-seqs=224` | infeasible |
| `trial-0004` | `gpu-memory-utilization=0.8` | infeasible |
| `trial-0005` | `gpu-memory-utilization=0.82` | `0.4650 req/s`, feasible |
| `trial-0006` | `gpu-memory-utilization=0.84` | infeasible |
| `trial-0007` | `gpu-memory-utilization=0.86` | `0.4833 req/s`, feasible, best |
| `trial-0008` | `gpu-memory-utilization=0.87` | infeasible |
| `trial-0009` | `gpu-memory-utilization=0.86`, `block-size=32` | launch fail |
| `trial-0010` | `gpu-memory-utilization=0.86`, `max-num-seqs=208` | infeasible |
| `trial-0011` | `gpu-memory-utilization=0.86`, `max-num-seqs=176` | infeasible |
| `trial-0012` | `gpu-memory-utilization=0.86`, `max-num-batched-tokens=896` | infeasible |
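
Each feasible row reports the highest request rate found for that config under the SLO. The report does not describe aituner's probe policy; below is a hedged sketch of one plausible scheme, a bisection over request rate using the `max_probes = 6` budget from Setup. `run_trace_at_rate` and the `hi` rate cap are hypothetical.

```python
# Hedged sketch of per-trial rate probing: find the highest request rate that
# still meets the SLO, spending at most `max_probes` replay measurements.
from typing import Callable

def probe_max_feasible_rate(
    run_trace_at_rate: Callable[[float], bool],  # True iff pass rate >= target at this rate
    lo: float = 0.0,
    hi: float = 1.0,          # assumed upper bound on req/s for this workload
    max_probes: int = 6,
) -> float:
    best = 0.0
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if run_trace_at_rate(mid):
            best, lo = mid, mid  # feasible: remember it and push higher
        else:
            hi = mid             # infeasible: back off
    return best
```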

## Key insights

- This `0323` window did not produce a better topology than the baseline. The winning move was a small memory-headroom increase, not a TP/DP/EP change.
- `EP=4` was not viable under the current deployment shape and failed at launch, so the run quickly converged away from topology changes.
- The best point sits right at the SLO edge: `TPOT p95 ~= 39.55ms` against the `40ms` limit, so the remaining headroom is small.
- Compared with the heavier `0327` decode-only tuning, `0323` is a milder window: the baseline is already strong, and tuning adds only about `11.5%` throughput.

## Recommendation

For a `0323`-like decode-only `thinking` window, keep the baseline `TP4/DP2/EP8` topology and use:

- `gpu-memory-utilization=0.86`

Do not treat this run as evidence that `0323` prefers the same topology changes as `0327`; this study mainly supports a small residency-headroom refinement.