# qwen235b-thinking-prefill-ttft-tight-0327

qwen3-235b-a22b thinking trace, prefill-only replay with output_length=1, internal vLLM (/usr/local/bin/vllm), tuned on thinking_w20260327_1000 under a tighter stepped TTFT SLO.
## Setup

- Hardware: dash0, 8x H20
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
- Baseline topology: TP=4, DP=1, EP=1
- Trace: thinking_w20260327_1000
- Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
- Window duration: 600 s (10:00-10:10, 2026-03-27)
- Request mode: chat
- Replay override: min_tokens=max_tokens=1
- SLO pass target: 95% of requests must meet the stepped TTFT limit:
  - TTFT <= 2000 ms for <= 8191 input tokens
  - TTFT <= 4000 ms for <= 32767 input tokens
  - TTFT <= 6000 ms for > 32767 input tokens
- Search: sampling_u in [0, 0.125], max_probes = 6, 12 trials total
- Proposal model: codex / gpt-5.4
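The stepped pass check above can be sketched as a small helper. This is an illustrative sketch, not the tuner's actual code; the function names are hypothetical:

```python
def ttft_limit_ms(input_tokens: int) -> int:
    """Return the stepped TTFT SLO limit (ms) for one request."""
    if input_tokens <= 8191:
        return 2000
    if input_tokens <= 32767:
        return 4000
    return 6000

def pass_rate(requests) -> float:
    """requests: iterable of (input_tokens, ttft_ms) pairs.

    A trial is feasible when this rate is >= 0.95.
    """
    reqs = list(requests)
    ok = sum(1 for toks, ttft in reqs if ttft <= ttft_limit_ms(toks))
    return ok / len(reqs)

# Example: 3 of 4 requests meet their step, so the rate is 0.75 (< 0.95 target).
sample = [(1024, 1500.0), (9000, 3900.0), (40000, 5800.0), (2048, 2500.0)]
print(pass_rate(sample))  # -> 0.75
```

Note that each request is judged against the limit for its own input-length band, so a TTFT p95 above 2000 ms (as at the best point below) can still pass when the slow requests fall in the longer-input bands.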
## Run assets

- Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run2-ttft-tight-topology
- State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run2-ttft-tight-topology/state.json
- Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash0_qwen235b_prefill_thinking_run2_ttft_tight.json
## Best result

- Best trial: trial-0012
- Best config: tensor-parallel-size=8, data-parallel-size=1, enable-expert-parallel=false, max-num-batched-tokens=6144, max-num-seqs=48, block-size=32
- Best sampling_u: 0.098106384277
- Best request rate: 2.4967 req/s
- Best request rate per GPU: 0.3121 req/s/gpu
- Best pass rate: 0.9506
Compared with baseline:

- trial-0001: 0.4717 req/s, 0.1179 req/s/gpu
- trial-0012: 2.4967 req/s, 0.3121 req/s/gpu
- Raw throughput gain: 5.29x
- Per-GPU throughput gain: 2.65x

Compared with the looser TTFT study on the same 2026-03-27 window:

- Looser-SLO best: 3.035 req/s, 0.3794 req/s/gpu
- Tighter-SLO best: 2.4967 req/s, 0.3121 req/s/gpu
- Throughput retained: 82.26%
- Throughput drop: 17.74%
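The derived ratios above follow directly from the reported rates. A quick arithmetic check (rates copied from this report; the variable names are illustrative):

```python
# Rates reported above; trial-0001 ran on 4 GPUs (TP4), trial-0012 on 8 (TP8).
baseline_rps = 0.4716666666666667    # trial-0001 best request rate
best_rps = 2.4966666666666666        # trial-0012 best request rate
looser_best_rps = 3.035              # looser-SLO study's best, same window

raw_gain = best_rps / baseline_rps                 # raw throughput gain
per_gpu_gain = (best_rps / 8) / (baseline_rps / 4) # per-GPU gain: half of raw
retained = best_rps / looser_best_rps              # throughput kept vs looser SLO

print(f"raw gain:     {raw_gain:.2f}x")    # -> 5.29x
print(f"per-GPU gain: {per_gpu_gain:.2f}x")  # -> 2.65x
print(f"retained:     {retained:.2%}")       # -> 82.26%
```

The per-GPU gain is exactly half the raw gain because the best config doubles the GPU count (TP4 to TP8).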
Best-point latency:
TTFT mean/p50/p90/p95/p99 = 413.92 / 67.86 / 1456.32 / 2286.90 / 5326.23 ms
## 12-trial summary

| Trial | Proposed config delta | Result |
|---|---|---|
| trial-0001 | baseline TP4/DP1/EP-off, max-num-batched-tokens=8192 | 0.4717 req/s, feasible |
| trial-0002 | TP4/DP2 | probe-search failure |
| trial-0003 | TP8/DP1/EP-off | 1.9200 req/s, feasible |
| trial-0004 | TP8/DP1/EP8 | launch fail |
| trial-0005 | trial-0003 + max-num-batched-tokens=6144 | 2.2517 req/s, feasible |
| trial-0006 | trial-0003 + max-num-batched-tokens=4096 | infeasible |
| trial-0007 | trial-0003 + max-num-batched-tokens=5120 | infeasible |
| trial-0008 | trial-0003 + max-num-batched-tokens=5632 | infeasible |
| trial-0009 | trial-0005 + max-num-seqs=32 | infeasible |
| trial-0010 | trial-0005 + max-num-seqs=48 | infeasible |
| trial-0011 | trial-0005 + block-size=32 | infeasible |
| trial-0012 | trial-0005 + max-num-seqs=48, block-size=32 | 2.4967 req/s, feasible, best |
## Key insights

- This tuning run is also on 2026-03-27, not a different day; the only change is the tighter stepped TTFT SLO.
- The best topology still moved to TP8/DP1/no-EP; tighter TTFT did not change the topology conclusion.
- Tighter TTFT did change the runtime sweet spot: the best runtime shape is not the looser study's 3712-token batch, but max-num-batched-tokens=6144 + max-num-seqs=48 + block-size=32.
- DP2 and EP remained negative evidence in this stack: TP4/DP2 failed during probing, and TP8 + EP8 failed at launch.
- Relative to the looser TTFT study on the same day, stricter TTFT costs about 17.7% throughput, but the tuned result still keeps a large margin over baseline.
## Recommendation

For the tighter stepped TTFT SLO on thinking_w20260327_1000, use:

- tensor-parallel-size=8
- data-parallel-size=1
- enable-expert-parallel=false
- max-num-batched-tokens=6144
- max-num-seqs=48
- block-size=32

Keep the rest of the run_qwen235b.sh baseline unchanged.
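As a launch sketch, the recommended config maps onto serve flags as below. This assumes the internal build mirrors upstream vLLM's `vllm serve` entrypoint and flag spellings; in practice the flags would be applied through the run_qwen235b.sh wrapper instead:

```shell
# Hypothetical invocation: best config on top of the baseline.
# enable-expert-parallel=false means the --enable-expert-parallel
# flag is simply not passed.
vllm serve /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 8 \
  --data-parallel-size 1 \
  --max-num-batched-tokens 6144 \
  --max-num-seqs 48 \
  --block-size 32
```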