# qwen235b-thinking-prefill-ttft-tight-0327

qwen3-235b-a22b `thinking` trace, prefill-only replay with `output_length=1`, internal vLLM (`/usr/local/bin/vllm`), tuned on `thinking_w20260327_1000` under a tighter stepped TTFT SLO.

## Setup
- Hardware: `dash0`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, baseline aligned to `~/run_qwen235b.sh`
- Baseline topology: `TP=4, DP=1, EP=1`
- Trace: `thinking_w20260327_1000`
- Trace source: `trace_windows/traces/thinking_w20260327_1000.jsonl`
- Window duration: `600s` (`10:00-10:10`, `2026-03-27`)
- Request mode: `chat`
- Replay override: `min_tokens=max_tokens=1`
- SLO (see the check sketched after this list):
  - pass target: `95%`
  - `TTFT <= 2000ms` for `<=8191` input tokens
  - `TTFT <= 4000ms` for `<=32767` input tokens
  - `TTFT <= 6000ms` for `>32767` input tokens
- Search:
  - `sampling_u in [0, 0.125]`
  - `max_probes = 6`
  - `12` trials total
- Proposal model: `codex / gpt-5.4`
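
For reference, a minimal sketch of the stepped SLO check above, assuming per-request `(input_tokens, ttft_ms)` records; the function names and record shape are illustrative, not the tuner's actual interface:

```python
# Minimal sketch of the stepped TTFT SLO described in Setup. The record
# shape and function names are illustrative; the tuner's own scoring code
# is not reproduced here.
from typing import Iterable, Tuple


def ttft_bound_ms(input_tokens: int) -> float:
    """TTFT bound for a single request, stepped by input length."""
    if input_tokens <= 8191:
        return 2000.0
    if input_tokens <= 32767:
        return 4000.0
    return 6000.0


def slo_pass_rate(requests: Iterable[Tuple[int, float]]) -> float:
    """Fraction of (input_tokens, ttft_ms) records meeting their bound."""
    records = list(requests)
    passed = sum(1 for toks, ttft in records if ttft <= ttft_bound_ms(toks))
    return passed / len(records)


# A trial is feasible at a given request rate when the pass rate reaches
# the 95% target, e.g. slo_pass_rate(replayed_records) >= 0.95.
```
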
## Run assets
- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run2-ttft-tight-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run2-ttft-tight-topology/state.json`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash0_qwen235b_prefill_thinking_run2_ttft_tight.json`

## Best result
- Best trial: `trial-0012`
- Best config:
  - `tensor-parallel-size=8`
  - `data-parallel-size=1`
  - `enable-expert-parallel=false`
  - `max-num-batched-tokens=6144`
  - `max-num-seqs=48`
  - `block-size=32`
- Best `sampling_u`: `0.098106384277`
- Best request rate: `2.4966666666666666 req/s`
- Best request rate per GPU: `0.3120833333333333 req/s/gpu`
- Best pass rate: `0.9506008010680908`

Compared with baseline (a quick arithmetic check follows this list):

- `trial-0001`: `0.4716666666666667 req/s`, `0.11791666666666667 req/s/gpu`
- `trial-0012`: `2.4966666666666666 req/s`, `0.3120833333333333 req/s/gpu`
- Raw throughput gain: `5.29x`
- Per-GPU throughput gain: `2.65x`
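
The quick check referenced above, with GPU counts taken from the topologies (`TP4` baseline on 4 GPUs, `TP8` best trial on 8):

```python
# Raw vs per-GPU gain: the best trial runs on twice as many GPUs as the
# baseline (TP8 vs TP4), so the per-GPU gain is half the raw gain.
baseline_rps, baseline_gpus = 0.4716666666666667, 4   # trial-0001
best_rps, best_gpus = 2.4966666666666666, 8           # trial-0012

raw_gain = best_rps / baseline_rps                                      # ~5.29
per_gpu_gain = (best_rps / best_gpus) / (baseline_rps / baseline_gpus)  # ~2.65

print(f"raw gain: {raw_gain:.2f}x, per-GPU gain: {per_gpu_gain:.2f}x")
```
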

Compared with the looser TTFT study on the same `2026-03-27` window:

- looser-SLO best: `3.035 req/s`, `0.379375 req/s/gpu`
- tighter-SLO best: `2.4966666666666666 req/s`, `0.3120833333333333 req/s/gpu`
- throughput retained: `82.26%`
- throughput drop: `17.74%`

Best-point latency:

- `TTFT mean/p50/p90/p95/p99 = 413.92 / 67.86 / 1456.32 / 2286.90 / 5326.23 ms`

Note that p95 TTFT sits above the 2000ms first-tier bound; the run still clears the 95% pass target because requests with longer inputs are judged against the 4000ms and 6000ms tiers.

## 12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP4/DP1/EP-off`, `max-num-batched-tokens=8192` | `0.4717 req/s`, feasible |
| `trial-0002` | `TP4/DP2` | probe-search failure |
| `trial-0003` | `TP8/DP1/EP-off` | `1.9200 req/s`, feasible |
| `trial-0004` | `TP8/DP1/EP8` | launch fail |
| `trial-0005` | `trial-0003 + max-num-batched-tokens=6144` | `2.2517 req/s`, feasible |
| `trial-0006` | `trial-0003 + max-num-batched-tokens=4096` | infeasible |
| `trial-0007` | `trial-0003 + max-num-batched-tokens=5120` | infeasible |
| `trial-0008` | `trial-0003 + max-num-batched-tokens=5632` | infeasible |
| `trial-0009` | `trial-0005 + max-num-seqs=32` | infeasible |
| `trial-0010` | `trial-0005 + max-num-seqs=48` | infeasible |
| `trial-0011` | `trial-0005 + block-size=32` | infeasible |
| `trial-0012` | `trial-0005 + max-num-seqs=48, block-size=32` | `2.4967 req/s`, feasible, best |

## Key insights

- This tuning run also uses the `2026-03-27` window, not a different day; the only change is the tighter stepped TTFT SLO.
- The best topology still moved to `TP8/DP1/no-EP`; the tighter TTFT targets did not change the topology conclusion.
- The tighter TTFT targets did change the runtime sweet spot: the best runtime shape is not the looser-study `3712`-token batch, but `6144` batched tokens with `max-num-seqs=48` and `block-size=32`.
- `DP2` and `EP` remained negative results in this stack: `TP4/DP2` failed during probing, and `TP8 + EP8` failed at launch.
- Relative to the looser TTFT study on the same day, the stricter TTFT targets cost about `17.7%` throughput, but the tuned result still keeps a large margin over baseline.

## Recommendation

For the tighter stepped TTFT SLO on `thinking_w20260327_1000`, use:

- `tensor-parallel-size=8`
- `data-parallel-size=1`
- `enable-expert-parallel=false`
- `max-num-batched-tokens=6144`
- `max-num-seqs=48`
- `block-size=32`

Keep the rest of the `run_qwen235b.sh` baseline unchanged.
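
For illustration only, one way the recommended flags could be applied. The real launcher is the internal `run_qwen235b.sh` (not reproduced here), so the `vllm serve` invocation below is an assumed stand-in rather than the study's exact command:

```python
# Hypothetical launch sketch: builds a `vllm serve`-style command with the
# flags recommended above. The internal launcher (~/run_qwen235b.sh) may
# differ; treat this as a starting point, not the study's exact command.
import subprocess

MODEL = "/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717"

cmd = [
    "vllm", "serve", MODEL,
    "--tensor-parallel-size", "8",
    "--data-parallel-size", "1",
    # expert parallelism stays disabled: --enable-expert-parallel is omitted
    "--max-num-batched-tokens", "6144",
    "--max-num-seqs", "48",
    "--block-size", "32",
]
subprocess.run(cmd, check=True)
```
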