docs: add q235b prefill 0-32k tight summary
docs/qwen235b-thinking-prefill/ttft-1s-2s-0-32k.md
# qwen235b-thinking-prefill-ttft-1s-2s-0-32k

Prefill-only replay of the qwen3-235b-a22b `thinking` trace (`output_length=1`) on internal vLLM (`/usr/local/bin/vllm`), tuned on the `0~32k` input bucket under a stricter stepped TTFT SLO.

## Setup

- Hardware: `dash1`, `8x H20`
- Model: `/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717`
- Engine: internal vLLM, baseline aligned to `~/run_qwen235b.sh`
- Baseline topology: `TP=4, DP=1, EP=1`
- Trace: `thinking_w20260327_1000`
- Trace source: `trace_windows/traces/thinking_w20260327_1000.jsonl`
- Window duration: `600s` (`10:00-10:10`, `2026-03-27`)
- Request mode: `chat`
- Replay override: `min_tokens=max_tokens=1`
- Input bucket: `0 <= input_length <= 32768`
- SLO:
  - pass target: `95%`
  - `TTFT <= 1000ms` for `<=8191` input tokens
  - `TTFT <= 2000ms` for `<=32767` input tokens
  - `TTFT <= 2000ms` fallback bucket
- Search:
  - `sampling_u in [0, 0.125]`
  - `max_probes = 6`
  - `12` trials total
- Proposal model: `codex / gpt-5.4`

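The stepped SLO in the setup list can be read as a per-request threshold lookup followed by a pass-rate check. The following is a minimal sketch, assuming the tiers are evaluated in ascending order of input length and that the pass rate is the fraction of requests meeting their limit. The function names and constants are illustrative and are not taken from the tuner.

```python
# Illustrative sketch of the stepped TTFT SLO; all names here are
# hypothetical and not part of the tuner's actual API.
SLO_TIERS = [
    (8191, 1000.0),   # input_length <= 8191  -> TTFT <= 1000 ms
    (32767, 2000.0),  # input_length <= 32767 -> TTFT <= 2000 ms
]
FALLBACK_TTFT_MS = 2000.0  # any longer input, e.g. exactly 32768 tokens
PASS_TARGET = 0.95


def ttft_limit_ms(input_length: int) -> float:
    """Return the TTFT limit of the first tier that covers input_length."""
    for max_len, limit_ms in SLO_TIERS:
        if input_length <= max_len:
            return limit_ms
    return FALLBACK_TTFT_MS


def request_passes(input_length: int, ttft_ms: float) -> bool:
    """A request passes when its TTFT is within its tier's limit."""
    return ttft_ms <= ttft_limit_ms(input_length)


def is_feasible(samples: list[tuple[int, float]]) -> bool:
    """samples holds (input_length, ttft_ms) pairs for one probe."""
    passed = sum(request_passes(n, t) for n, t in samples)
    return passed / len(samples) >= PASS_TARGET
```

Under this reading, a probe with 19 passing requests out of 20 sits exactly at the `95%` target and would still count as feasible.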
## Run assets

- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash1-qwen235b-prefill-thinking-run5-ttft-1s-2s-0-32k-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash1-qwen235b-prefill-thinking-run5-ttft-1s-2s-0-32k-topology/state.json`
- Log: `/home/admin/cpfs/wjh/aituner/aituner/logs/q235b_prefill_1s2s_0_32k.log`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash1_qwen235b_prefill_thinking_run5_ttft_1s_2s_0_32k.json`

## Best result

- Best trial: `trial-0011`
- Best config:
  - `tensor-parallel-size=8`
  - `data-parallel-size=1`
  - `enable-expert-parallel=false`
  - `max-num-batched-tokens=4096`
  - `max-num-seqs=16`
  - `VLLM_ENABLE_TORCH_COMPILE=0`
- Best `sampling_u`: `0.073767628521`
- Best request rate: `1.8516666666666666 req/s`
- Best request rate per GPU: `0.23145833333333332 req/s/gpu`
- Best pass rate: `0.9558955895589559`

Compared with baseline:

- `trial-0001`: `0.47 req/s`, `0.1175 req/s/gpu`
- `trial-0011`: `1.8516666666666666 req/s`, `0.23145833333333332 req/s/gpu`
- Raw throughput gain: `3.94x`
- Per-GPU throughput gain: `1.97x`

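The two gain figures follow directly from the request rates and GPU counts listed above. A short sketch of the arithmetic, assuming the per-GPU rate is simply the request rate divided by the number of GPUs in the topology (4 for the `TP4` baseline, 8 for the `TP8` best trial):

```python
# Request rates are copied from the comparison above; GPU counts follow
# from the TP4/DP1 and TP8/DP1 topologies.
baseline = {"req_per_s": 0.47, "gpus": 4}                # trial-0001
best = {"req_per_s": 1.8516666666666666, "gpus": 8}      # trial-0011


def per_gpu(trial: dict) -> float:
    """Request rate normalised by the number of GPUs used."""
    return trial["req_per_s"] / trial["gpus"]


raw_gain = best["req_per_s"] / baseline["req_per_s"]
per_gpu_gain = per_gpu(best) / per_gpu(baseline)

print(f"raw gain:     {raw_gain:.2f}x")
print(f"per-GPU gain: {per_gpu_gain:.2f}x")
```

Because the best trial uses twice as many GPUs as the baseline, the per-GPU gain is exactly half of the raw gain.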
Best-point latency:

- baseline `trial-0001` TTFT mean/p50/p90/p95/p99 = `236.68 / 75.19 / 294.39 / 1378.79 / 3118.86 ms`
- best `trial-0011` TTFT mean/p50/p90/p95/p99 = `223.70 / 65.67 / 261.69 / 1065.31 / 3648.34 ms`

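To make the latency comparison easier to read, the relative change per statistic can be computed from the two rows above. This is only a sketch over the reported numbers; it assumes each row was measured at that trial's own best request rate, so the two rows are not at equal load.

```python
# TTFT statistics in milliseconds, copied from the two rows above.
LABELS = ("mean", "p50", "p90", "p95", "p99")
baseline_ms = (236.68, 75.19, 294.39, 1378.79, 3118.86)  # trial-0001
best_ms = (223.70, 65.67, 261.69, 1065.31, 3648.34)      # trial-0011


def relative_change(old: float, new: float) -> float:
    """Signed fractional change from old to new (negative means faster)."""
    return (new - old) / old


changes = {
    label: relative_change(old, new)
    for label, old, new in zip(LABELS, baseline_ms, best_ms)
}

for label, change in changes.items():
    print(f"{label:>4}: {change:+.1%}")
```

Mean, p50, p90, and p95 all drop in the best trial, while p99 rises from `3118.86 ms` to `3648.34 ms`.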
## 12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP4/DP1/EP-off`, compile on | `0.4700 req/s`, feasible |
| `trial-0002` | `TP4/DP2`, `EP-off` | probe-search failure |
| `trial-0003` | `TP4/DP1/EP4`, `max-num-batched-tokens=4096` | launch fail |
| `trial-0004` | `VLLM_ENABLE_TORCH_COMPILE=0`, `max-num-batched-tokens=6144` | infeasible |
| `trial-0005` | compile off, `max-num-batched-tokens=4096` | infeasible |
| `trial-0006` | compile off, `max-num-seqs=32` | infeasible |
| `trial-0007` | compile off, `TP8/DP1/EP-off` | `1.3817 req/s`, feasible |
| `trial-0008` | `trial-0007 + max-num-seqs=32` | `1.5983 req/s`, feasible |
| `trial-0009` | `trial-0008 + max-num-batched-tokens=6144` | `1.8017 req/s`, feasible |
| `trial-0010` | `trial-0008 + max-num-batched-tokens=4096` | `1.8300 req/s`, feasible |
| `trial-0011` | `trial-0010 + max-num-seqs=16` | `1.8517 req/s`, feasible, best |
| `trial-0012` | `trial-0011 + max-num-batched-tokens=3072` | infeasible |

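The selection of the best trial from the table can be reproduced in a few lines. This sketch assumes the winner is the feasible trial with the highest request rate; the tuple layout is illustrative and does not reflect the format of `state.json`.

```python
# (trial id, outcome, request rate in req/s or None), copied from the table.
trials = [
    ("trial-0001", "feasible", 0.4700),
    ("trial-0002", "probe-search failure", None),
    ("trial-0003", "launch fail", None),
    ("trial-0004", "infeasible", None),
    ("trial-0005", "infeasible", None),
    ("trial-0006", "infeasible", None),
    ("trial-0007", "feasible", 1.3817),
    ("trial-0008", "feasible", 1.5983),
    ("trial-0009", "feasible", 1.8017),
    ("trial-0010", "feasible", 1.8300),
    ("trial-0011", "feasible", 1.8517),
    ("trial-0012", "infeasible", None),
]

# Only feasible trials carry a request rate and compete for "best".
feasible = [t for t in trials if t[1] == "feasible"]
best_trial = max(feasible, key=lambda t: t[2])

print(f"{len(feasible)} of {len(trials)} trials feasible")
print(f"best: {best_trial[0]} at {best_trial[2]:.4f} req/s")
```

Every feasible trial other than the baseline uses the `TP8` topology, which matches the first point under Key insights.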
## Key insights

- Under the stricter `1s/2s` TTFT SLO, the largest gain still came from the topology change `TP4 -> TP8`.
- `TP4/DP2` and `EP4` remain negative evidence in this stack. The former failed in probe search; the latter failed at engine launch.
- Runtime-only tuning inside the 4-GPU topology did not beat baseline: `trial-0004` through `trial-0006` were all infeasible. The useful search space opened only after moving to `TP8/DP1`.
- After the `TP8` switch, the winning runtime shape was more conservative than in the looser prefill studies: `max-num-batched-tokens=4096` and `max-num-seqs=16`.
- Even under this much tighter TTFT target, the `TP8` shape improves on baseline by `3.94x` in raw throughput and `1.97x` in per-GPU throughput.

## Recommendation

For `qwen235b thinking prefill-only` on the `0~32k` bucket under the `1s/2s` stepped TTFT SLO, use:

- `tensor-parallel-size=8`
- `data-parallel-size=1`
- `enable-expert-parallel=false`
- `max-num-batched-tokens=4096`
- `max-num-seqs=16`
- `VLLM_ENABLE_TORCH_COMPILE=0`

Keep the rest of the `run_qwen235b.sh` baseline unchanged.

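As a rough illustration of how the recommended overrides map onto a launch command, the sketch below renders them as a shell line. It assumes the internal build accepts the upstream `vllm serve` flag names and treats `enable-expert-parallel` as a boolean switch that is omitted when false. The contents of `run_qwen235b.sh` are not shown in this document, so only the overrides appear here and `build_command` is a hypothetical helper.

```python
import shlex

# Paths and values are taken from the Setup and Recommendation sections.
VLLM_BIN = "/usr/local/bin/vllm"
MODEL = "/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717"

RECOMMENDED_FLAGS = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "enable-expert-parallel": False,  # boolean switch: omitted when False
    "max-num-batched-tokens": 4096,
    "max-num-seqs": 16,
}
RECOMMENDED_ENV = {"VLLM_ENABLE_TORCH_COMPILE": "0"}


def build_command(model: str, flags: dict, env: dict) -> str:
    """Render env overrides plus a `vllm serve` invocation as one shell line."""
    argv = [VLLM_BIN, "serve", model]
    for name, value in flags.items():
        if isinstance(value, bool):
            if value:
                argv.append(f"--{name}")
        else:
            argv += [f"--{name}", str(value)]
    env_prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    return f"{env_prefix} {shlex.join(argv)}"


print(build_command(MODEL, RECOMMENDED_FLAGS, RECOMMENDED_ENV))
```

In practice these values would be edited into the existing baseline script rather than launched standalone, since the remaining baseline arguments must stay unchanged.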