qwen235b-thinking-prefill-ttft-1s-2s-0-32k
qwen3-235b-a22b thinking trace, prefill-only replay with output_length=1, internal vLLM (/usr/local/bin/vllm), tuned on the 0~32k input bucket under a stricter stepped TTFT SLO.
Setup
- Hardware: dash1, 8x H20
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
- Baseline topology: TP=4, DP=1, EP=1
- Trace: thinking_w20260327_1000
- Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
- Window duration: 600s (10:00-10:10, 2026-03-27)
- Request mode: chat
- Replay override: min_tokens=max_tokens=1
- Input bucket: 0 <= input_length <= 32768
- SLO:
  - pass target: 95%
  - TTFT <= 1000ms for <=8191 input tokens
  - TTFT <= 2000ms for <=32767 input tokens
  - TTFT <= 2000ms fallback bucket
- Search:
  - sampling_u in [0, 0.125]
  - max_probes = 6
  - 12 trials total
- Proposal model: codex / gpt-5.4
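
The stepped SLO above can be read as a per-request limit lookup followed by a pass-rate check. The following is a minimal sketch of that reading, not the tuner's actual evaluator: the function names are hypothetical, and it assumes every replayed request contributes one `(input_tokens, ttft_ms)` sample and that the pass rate is compared against the target with `>=`.

```python
from typing import Iterable, List, Tuple

# Stepped TTFT limits from the Setup list: (max input tokens, TTFT limit in ms).
TTFT_STEPS_MS: List[Tuple[int, float]] = [(8191, 1000.0), (32767, 2000.0)]
FALLBACK_TTFT_MS = 2000.0  # fallback bucket for inputs above the last step
PASS_TARGET = 0.95         # required fraction of passing requests


def ttft_limit_ms(input_tokens: int) -> float:
    """Return the TTFT limit that applies to a request of this input length."""
    for max_tokens, limit_ms in TTFT_STEPS_MS:
        if input_tokens <= max_tokens:
            return limit_ms
    return FALLBACK_TTFT_MS


def pass_rate(samples: Iterable[Tuple[int, float]]) -> float:
    """Fraction of (input_tokens, ttft_ms) samples that meet their own limit."""
    items = list(samples)
    if not items:
        raise ValueError("no samples to evaluate")
    passed = sum(1 for tokens, ttft in items if ttft <= ttft_limit_ms(tokens))
    return passed / len(items)


def is_feasible(samples: Iterable[Tuple[int, float]]) -> bool:
    """Assumed feasibility rule: pass rate at or above the target."""
    return pass_rate(samples) >= PASS_TARGET
```

Because the limit depends on input length, a trace-wide TTFT percentile is not directly comparable to either threshold; only the per-request pass rate is.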
Run assets
- Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash1-qwen235b-prefill-thinking-run5-ttft-1s-2s-0-32k-topology
- State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash1-qwen235b-prefill-thinking-run5-ttft-1s-2s-0-32k-topology/state.json
- Log: /home/admin/cpfs/wjh/aituner/aituner/logs/q235b_prefill_1s2s_0_32k.log
- Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash1_qwen235b_prefill_thinking_run5_ttft_1s_2s_0_32k.json
Best result
- Best trial: trial-0011
- Best config:
  - tensor-parallel-size=8
  - data-parallel-size=1
  - enable-expert-parallel=false
  - max-num-batched-tokens=4096
  - max-num-seqs=16
  - VLLM_ENABLE_TORCH_COMPILE=0
- Best sampling_u: 0.073767628521
- Best request rate: 1.8516666666666666 req/s
- Best request rate per GPU: 0.23145833333333332 req/s/gpu
- Best pass rate: 0.9558955895589559
Compared with baseline:
- trial-0001: 0.47 req/s, 0.1175 req/s/gpu
- trial-0011: 1.8516666666666666 req/s, 0.23145833333333332 req/s/gpu
- Raw throughput gain: 3.94x
- Per-GPU throughput gain: 1.97x
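
The gains above follow from simple arithmetic on the reported rates. The sketch below reproduces them, assuming the baseline occupies 4 GPUs (TP=4, DP=1) and the best trial occupies 8 (TP=8, DP=1). The implied request count additionally assumes that the request rate is the number of replayed requests divided by the 600 s window, which the report does not state explicitly.

```python
WINDOW_S = 600  # trace window duration from the Setup section

# Reported request rates; GPU counts are inferred from TP x DP.
baseline = {"name": "trial-0001", "rate": 0.47, "gpus": 4}
best = {"name": "trial-0011", "rate": 1.8516666666666666, "gpus": 8}


def per_gpu(trial: dict) -> float:
    """Request rate normalized by the number of GPUs the topology uses."""
    return trial["rate"] / trial["gpus"]


raw_gain = best["rate"] / baseline["rate"]
per_gpu_gain = per_gpu(best) / per_gpu(baseline)

# Assumption: rate = replayed requests / window length.
implied_requests = round(best["rate"] * WINDOW_S)

print(f"raw gain:     {raw_gain:.2f}x")
print(f"per-GPU gain: {per_gpu_gain:.2f}x")
print(f"implied requests in window: {implied_requests}")
```

Doubling the GPU count halves the per-GPU gain relative to the raw gain, which is why 3.94x raw corresponds to 1.97x per GPU.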
Best-point latency:
- baseline trial-0001 TTFT mean/p50/p90/p95/p99 = 236.68 / 75.19 / 294.39 / 1378.79 / 3118.86 ms
- best trial-0011 TTFT mean/p50/p90/p95/p99 = 223.70 / 65.67 / 261.69 / 1065.31 / 3648.34 ms
12-trial summary
| Trial | Proposed config delta | Result |
|---|---|---|
| trial-0001 | baseline TP4/DP1/EP-off, compile on | 0.4700 req/s, feasible |
| trial-0002 | TP4/DP2, EP-off | probe-search failure |
| trial-0003 | TP4/DP1/EP4, max-num-batched-tokens=4096 | launch fail |
| trial-0004 | VLLM_ENABLE_TORCH_COMPILE=0, max-num-batched-tokens=6144 | infeasible |
| trial-0005 | compile off, max-num-batched-tokens=4096 | infeasible |
| trial-0006 | compile off, max-num-seqs=32 | infeasible |
| trial-0007 | compile off, TP8/DP1/EP-off | 1.3817 req/s, feasible |
| trial-0008 | trial-0007 + max-num-seqs=32 | 1.5983 req/s, feasible |
| trial-0009 | trial-0008 + max-num-batched-tokens=6144 | 1.8017 req/s, feasible |
| trial-0010 | trial-0008 + max-num-batched-tokens=4096 | 1.8300 req/s, feasible |
| trial-0011 | trial-0010 + max-num-seqs=16 | 1.8517 req/s, feasible, best |
| trial-0012 | trial-0011 + max-num-batched-tokens=3072 | infeasible |
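
Trials 0008 through 0012 are described as deltas on an earlier trial, so the full configuration of each has to be resolved by walking the parent chain. The sketch below does that for the TP8 branch of the table. The `DELTAS` structure and `resolve` helper are hypothetical and are not the tuner's state format; settings absent from a delta are assumed to stay at their baseline values and are not modeled here.

```python
from typing import Dict, Optional, Tuple

# trial -> (parent trial or None, settings changed relative to the parent).
DELTAS: Dict[str, Tuple[Optional[str], dict]] = {
    "trial-0007": (None, {
        "VLLM_ENABLE_TORCH_COMPILE": "0",
        "tensor-parallel-size": 8,
        "data-parallel-size": 1,
        "enable-expert-parallel": False,
    }),
    "trial-0008": ("trial-0007", {"max-num-seqs": 32}),
    "trial-0009": ("trial-0008", {"max-num-batched-tokens": 6144}),
    "trial-0010": ("trial-0008", {"max-num-batched-tokens": 4096}),
    "trial-0011": ("trial-0010", {"max-num-seqs": 16}),
    "trial-0012": ("trial-0011", {"max-num-batched-tokens": 3072}),
}


def resolve(trial: str) -> dict:
    """Apply deltas from the root of the chain down to the given trial."""
    parent, delta = DELTAS[trial]
    config = resolve(parent) if parent is not None else {}
    config.update(delta)  # later deltas override inherited values
    return config


best = resolve("trial-0011")
for key in sorted(best):
    print(f"{key}={best[key]}")
```

Note that trial-0010 branches from trial-0008 rather than trial-0009, so the 6144 setting never enters the chain that leads to trial-0011.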
Key insights
- Under the stricter 1s/2s TTFT SLO, the main win still came from topology first: TP4 -> TP8.
- TP4/DP2 and EP4 remain negative evidence in this stack. The former failed in probe search; the latter failed at engine launch.
- Runtime-only tuning inside the 4-GPU topology did not beat baseline at all: trial-0004 through trial-0006 were all infeasible. The useful search space opened only after moving to TP8/DP1.
- After the TP8 switch, the winning runtime shape became more conservative than the looser prefill studies: max-num-batched-tokens=4096 and max-num-seqs=16.
- This run shows that even under a much tighter TTFT target, the TP8 shape still improves both raw throughput (3.94x) and per-GPU throughput (1.97x) over baseline.
Recommendation
For qwen235b thinking prefill-only on the 0~32k bucket under the 1s/2s stepped TTFT SLO, use:
- tensor-parallel-size=8
- data-parallel-size=1
- enable-expert-parallel=false
- max-num-batched-tokens=4096
- max-num-seqs=16
- VLLM_ENABLE_TORCH_COMPILE=0
Keep the rest of the run_qwen235b.sh baseline unchanged.
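
One way to apply the recommendation is to render it as an environment assignment plus command-line flags and merge those into the existing launch script. The sketch below is illustrative only: `render_overrides` is a hypothetical helper, it assumes upper-case keys are environment variables, and it assumes a boolean set to false is expressed by omitting the flag. The internal vLLM build and run_qwen235b.sh may handle any of these differently.

```python
from typing import Dict, List, Tuple

# Recommended overrides, copied from the list above.
BEST_CONFIG: Dict[str, object] = {
    "tensor-parallel-size": 8,
    "data-parallel-size": 1,
    "enable-expert-parallel": False,
    "max-num-batched-tokens": 4096,
    "max-num-seqs": 16,
    "VLLM_ENABLE_TORCH_COMPILE": 0,
}


def render_overrides(config: Dict[str, object]) -> Tuple[List[str], List[str]]:
    """Split a config into env assignments and CLI flag tokens."""
    env: List[str] = []
    flags: List[str] = []
    for key, value in config.items():
        if key.isupper():
            # Assumption: upper-case keys are environment variables.
            env.append(f"{key}={value}")
        elif isinstance(value, bool):
            # Assumption: boolean flags are present when true, omitted when false.
            if value:
                flags.append(f"--{key}")
        else:
            flags.extend([f"--{key}", str(value)])
    return env, flags


env, flags = render_overrides(BEST_CONFIG)
print("env:  ", " ".join(env))
print("flags:", " ".join(flags))
```

The output contains only the overrides; the model path, port, and every other baseline argument are expected to come from run_qwen235b.sh unchanged.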