# qwen235b-thinking-decode-0323
qwen3-235b-a22b `thinking` trace, `decode_only` mode, internal vLLM (`/usr/local/bin/vllm`), tuned on `thinking_w20260323_1000` with `TPOT <= 40ms`.
## Setup
- Hardware: `dash2`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, decode-only mode with `--kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'`
- Baseline topology: `TP=4, DP=2, EP=8`
- Trace: `thinking_w20260323_1000`
- Trace source: `trace_windows/traces/thinking_w20260323_1000.jsonl`
- Window duration: `600s` (`10:00-10:10`, `2026-03-23`)
- Request mode: `decode_only`
- SLO:
  - pass target: `95%`
  - `TPOT <= 40ms`
  - `TTFT` not enforced
- Search:
  - `sampling_u in [0, 0.125]`
  - `max_probes = 6`
  - `12` trials total
- Proposal model: `codex / gpt-5.4`
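The setup above can be sketched as a launch command. This is a minimal sketch assuming the flag spellings used in this report; the binary is an internal vLLM build, and `DecodeBenchConnector` is an internal connector, so neither is guaranteed to exist in upstream vLLM.

```python
import json

# Assumed launch shape for the decode-only benchmark engine on dash2.
# Flag names and values are copied from this report; the internal vLLM
# build may spell some of them differently.
kv_transfer_config = {
    "kv_connector": "DecodeBenchConnector",  # internal decode-bench connector
    "kv_role": "kv_both",
}

args = [
    "/usr/local/bin/vllm", "serve",
    "/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717",
    "--tensor-parallel-size", "4",
    "--data-parallel-size", "2",
    "--expert-parallel-size", "8",
    "--gpu-memory-utilization", "0.86",  # best config delta (trial-0007)
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]

# TP x DP replicas must cover the 8x H20 GPUs.
assert 4 * 2 == 8
print(" ".join(args))
```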
## Run assets
- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/dash2-qwen235b-decode-thinking-run1-0323-tpot40-topology/state.json`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-decode/specs/dash2_qwen235b_decode_thinking_run1_0323_tpot40_topology.json`
## Best result
- Best trial: `trial-0007`
- Best config delta:
  - `gpu-memory-utilization=0.86`
- Best topology (unchanged from baseline):
  - `tensor-parallel-size=4`
  - `data-parallel-size=2`
  - `expert-parallel-size=8`
- Best `sampling_u`: `0.0337`
- Best request rate: `0.4833 req/s`
- Best request rate per GPU: `0.0604 req/s/gpu`
- Best pass rate: `0.9552`
Compared with baseline:
- `trial-0001`: `0.4333 req/s`, `0.0542 req/s/gpu`
- `trial-0007`: `0.4833 req/s`, `0.0604 req/s/gpu`
- Throughput gain: `1.12x`
Best-point latency:
- `TPOT mean/p50/p90/p95/p99 = 26.18 / 24.04 / 38.46 / 39.55 / 40.76 ms`
## 12-trial summary
| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP4/DP2/EP8` | `0.4333 req/s`, feasible |
| `trial-0002` | `EP=4` | launch fail |
| `trial-0003` | `gpu-memory-utilization=0.8`, `max-num-seqs=224` | infeasible |
| `trial-0004` | `gpu-memory-utilization=0.8` | infeasible |
| `trial-0005` | `gpu-memory-utilization=0.82` | `0.4650 req/s`, feasible |
| `trial-0006` | `gpu-memory-utilization=0.84` | infeasible |
| `trial-0007` | `gpu-memory-utilization=0.86` | `0.4833 req/s`, feasible, best |
| `trial-0008` | `gpu-memory-utilization=0.87` | infeasible |
| `trial-0009` | `gpu-memory-utilization=0.86`, `block-size=32` | launch fail |
| `trial-0010` | `gpu-memory-utilization=0.86`, `max-num-seqs=208` | infeasible |
| `trial-0011` | `gpu-memory-utilization=0.86`, `max-num-seqs=176` | infeasible |
| `trial-0012` | `gpu-memory-utilization=0.86`, `max-num-batched-tokens=896` | infeasible |
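"Feasible" in the table is determined by the SLO in Setup: a trial passes when at least `95%` of its requests meet `TPOT <= 40ms` at the probed request rate. A minimal sketch of that check; the helper names are illustrative, not taken from the tuner:

```python
def tpot_pass_rate(tpots_ms, slo_ms=40.0):
    """Fraction of requests whose time-per-output-token meets the SLO."""
    return sum(t <= slo_ms for t in tpots_ms) / len(tpots_ms)

def is_feasible(tpots_ms, slo_ms=40.0, target=0.95):
    # A trial is feasible when the pass rate reaches the 95% target.
    return tpot_pass_rate(tpots_ms, slo_ms) >= target

# Illustrative data only; the real run measured pass rate 0.9552 at trial-0007.
sample = [24.0] * 95 + [41.0] * 5
print(is_feasible(sample))  # 95/100 = 0.95 >= 0.95 -> True
```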
## Key insights
- This `0323` window did not produce a better topology than the baseline. The winning move was a small memory-headroom increase, not a TP/DP/EP change.
- `EP=4` was not viable under the current deployment shape and failed at launch, so the run quickly converged away from topology changes.
- The best point sits right at the SLO edge: `TPOT p95 ~= 39.55ms` against the `40ms` target, and `p99` already exceeds it, so the remaining headroom is minimal.
- Compared with the heavier `0327` decode-only tuning, `0323` is a milder window: baseline is already strong, and tuning only adds about `11.5%`.
## Recommendation
For a `0323`-like decode-only `thinking` window, keep the baseline `TP4/DP2/EP8` topology and use:
- `gpu-memory-utilization=0.86`
Do not treat this run as evidence that `0323` prefers the same topology changes as `0327`; this study mainly supports a small residency-headroom refinement.