# qwen235b-thinking-prefill
qwen3-235b-a22b thinking trace, prefill-only replay with `output_length=1`, internal vLLM (`/usr/local/bin/vllm`), compared by request rate per GPU (`request_rate_per_gpu`).
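A prefill-only replay forces a single output token per request, so the measured latency is essentially TTFT. A minimal sketch of that override, assuming a hypothetical trace row with a `messages` field (the field name and helper are illustrative, not the study's actual tooling):

```python
def to_replay_request(trace_row: dict) -> dict:
    """Turn a trace row into a chat-mode replay request that only
    exercises prefill: min_tokens = max_tokens = 1 (the study's
    replay override), so latency ~= TTFT."""
    return {
        "messages": trace_row["messages"],  # assumed trace field name
        "max_tokens": 1,                    # generate exactly one token
        "min_tokens": 1,                    # forbid stopping before it
    }
```

The override is applied per request at replay time, regardless of what the original trace recorded.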
## Setup
- Hardware: dash0, 8x H20
- Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
- Engine: internal vLLM, baseline aligned to `~/run_qwen235b.sh`
- Baseline topology: TP=4, DP=1, EP=1
- Trace: thinking_w20260327_1000
- Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
- Window duration: 600s (10:00-10:10, 2026-03-27)
- Request mode: chat
- Replay override: `min_tokens=max_tokens=1`
- SLO pass target: 95% of requests with
  - TTFT <= 3000ms for <=4096 input tokens
  - TTFT <= 6000ms for <=32768 input tokens
  - TTFT <= 9000ms for >32768 input tokens
- Search: `sampling_u` in [0, 0.125], max_probes = 6, 12 trials total
- Proposal model: codex / gpt-5.4
## Run assets
- Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology
- State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology/state.json
- Log: /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_prefill_thinking_run1_ttft_topology.log
- Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash0_qwen235b_prefill_thinking_run1_ttft.json
## Best result
- Best trial: trial-0010
- Best config: `tensor-parallel-size=8`, `data-parallel-size=1`, `enable-expert-parallel=false`, `max-num-batched-tokens=3712`
- Best `sampling_u`: 0.120422363281
- Best request rate: 3.035 req/s
- Best request rate per GPU: 0.379375 req/s/gpu
- Best pass rate: 0.9533
- Compared with baseline:
  - trial-0001: 0.8117 req/s, 0.2029 req/s/gpu
  - trial-0010: 3.0350 req/s, 0.3794 req/s/gpu
- Raw throughput gain: 3.74x
- Per-GPU throughput gain: 1.87x
- Best-point latency: TTFT mean/p50/p90/p95/p99 = 863.84 / 253.58 / 2392.48 / 3154.26 / 5377.00 ms
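The two headline gains can be reproduced from the reported rates; the only extra assumption is the GPU count per topology (4 GPUs for the TP4 baseline, 8 for the TP8 best point):

```python
baseline_rate = 0.8116666666666666  # trial-0001, TP4/DP1 -> 4 GPUs
best_rate = 3.035                   # trial-0010, TP8/DP1 -> 8 GPUs

baseline_per_gpu = baseline_rate / 4  # ~0.2029 req/s/gpu
best_per_gpu = best_rate / 8          # 0.379375 req/s/gpu

raw_gain = best_rate / baseline_rate            # whole-deployment speedup
per_gpu_gain = best_per_gpu / baseline_per_gpu  # efficiency speedup
print(f"{raw_gain:.2f}x raw, {per_gpu_gain:.2f}x per-GPU")  # 3.74x raw, 1.87x per-GPU
```

The per-GPU gain is exactly half the raw gain here because the best point uses twice as many GPUs as the baseline.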
## 12-trial summary

| Trial | Proposed config delta | Result |
|---|---|---|
| trial-0001 | baseline TP4/DP1/EP-off, max-num-batched-tokens=8192 | 0.8117 req/s, feasible |
| trial-0002 | DP=2, max-num-batched-tokens=4096 | probe-time runtime failure |
| trial-0003 | DP=2, max-num-batched-tokens=8192 | probe-time runtime failure |
| trial-0004 | EP=4, enable-expert-parallel=true | launch fail |
| trial-0005 | max-num-batched-tokens=4096 | infeasible |
| trial-0006 | TP=8, DP=1, max-num-batched-tokens=4096 | 2.8600 req/s, feasible |
| trial-0007 | trial-0006 + max-num-batched-tokens=3072 | infeasible |
| trial-0008 | trial-0006 + max-num-batched-tokens=3584 | 2.9667 req/s, feasible |
| trial-0009 | trial-0006 + max-num-batched-tokens=3328 | infeasible |
| trial-0010 | trial-0006 + max-num-batched-tokens=3712 | 3.0350 req/s, feasible, best |
| trial-0011 | trial-0010 + max-num-batched-tokens=3840 | infeasible |
| trial-0012 | trial-0010 + max-num-batched-tokens=3776 | infeasible |
## Key insights
- The main win came from topology first, then local batch-shape refinement. `TP4 -> TP8` was the key change. `TP4/DP2` was not just suboptimal; it was unstable at runtime under probing and should be treated as negative evidence for this stack. `EP=4` on the baseline `TP4/DP1` path failed at launch with `group_gemm's contiguous kernel requires deepgemm`, so EP is currently not a viable direction here.
- After switching to `TP8/DP1/EP-off`, the remaining gain came from tightening `max-num-batched-tokens` from 4096 into a narrow sweet spot around 3584~3712.
- Too-small prefill batches (3072, 3328) and too-large ones (3776, 3840) both hurt the TTFT tail enough to miss the 95% target.
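The trial-0006 through trial-0012 refinement amounts to a bracketing sweep over `max-num-batched-tokens`; a toy version of the selection step, seeded with this study's outcomes (`None` marks an infeasible trial):

```python
# max-num-batched-tokens -> achieved req/s at >=95% TTFT pass rate,
# or None when the SLO was missed; values copied from the trial table.
results = {
    3072: None,    # too small: TTFT tail blows up
    3328: None,
    3584: 2.9667,
    3712: 3.0350,  # sweet spot
    3776: None,    # too large: TTFT tail blows up again
    3840: None,
    4096: 2.8600,  # trial-0006 starting point
}

feasible = {k: v for k, v in results.items() if v is not None}
best_tokens = max(feasible, key=feasible.get)
print(best_tokens, feasible[best_tokens])  # 3712 3.035
```

Note the feasible region is not monotone: 4096 passes while 3776 and 3840 fail, which is why the study probed both sides of each feasible point.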
## Current recommendation
Use trial-0010 as the default serving shape for this workload:
- `tensor-parallel-size=8`
- `data-parallel-size=1`
- `enable-expert-parallel=false`
- `max-num-batched-tokens=3712`
- keep the rest of the `run_qwen235b.sh` baseline unchanged