qwen235b-thinking-prefill

Prefill-only replay of a qwen3-235b-a22b thinking trace with output_length=1, run on the internal vLLM build (/usr/local/bin/vllm); configurations are compared by request_rate_per_gpu.

Setup

  • Hardware: dash0, 8x H20
  • Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
  • Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
  • Baseline topology: TP=4, DP=1, EP=1
  • Trace: thinking_w20260327_1000
  • Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
  • Window duration: 600s (10:00-10:10, 2026-03-27)
  • Request mode: chat
  • Replay override: min_tokens=max_tokens=1
  • SLO (a pass-rate check is sketched after this list):
    • pass target: 95%
    • TTFT <= 3000ms for <=4096 input tokens
    • TTFT <= 6000ms for <=32768 input tokens
    • TTFT <= 9000ms for >32768 input tokens
  • Search:
    • sampling_u in [0, 0.125]
    • max_probes = 6
    • 12 trials total
  • Proposal model: codex / gpt-5.4
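
For reference, below is a minimal sketch of the tiered TTFT SLO check that decides feasibility. It is illustrative only; the record fields (`input_tokens`, `ttft_ms`) and the function names are assumptions for this sketch, not aituner's actual schema or API.

```python
# Minimal sketch of the tiered TTFT SLO above. Field names (input_tokens,
# ttft_ms) are illustrative assumptions, not aituner's actual schema.

def ttft_budget_ms(input_tokens: int) -> float:
    """Per-request TTFT budget: 3s / 6s / 9s by input-length tier."""
    if input_tokens <= 4096:
        return 3000.0
    if input_tokens <= 32768:
        return 6000.0
    return 9000.0

def pass_rate(records: list[dict]) -> float:
    """Fraction of replayed requests whose TTFT met their tier's budget."""
    ok = sum(r["ttft_ms"] <= ttft_budget_ms(r["input_tokens"]) for r in records)
    return ok / len(records)

def is_feasible(records: list[dict], target: float = 0.95) -> bool:
    """A probed request rate counts as feasible when >= 95% of requests pass."""
    return pass_rate(records) >= target
```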

Run assets

  • Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology
  • State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run1-ttft-topology/state.json
  • Log: /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen235b_prefill_thinking_run1_ttft_topology.log
  • Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash0_qwen235b_prefill_thinking_run1_ttft.json

Best result

  • Best trial: trial-0010
  • Best config:
    • tensor-parallel-size=8
    • data-parallel-size=1
    • enable-expert-parallel=false
    • max-num-batched-tokens=3712
  • Best sampling_u: 0.120422363281
  • Best request rate: 3.035 req/s
  • Best request rate per GPU: 0.379375 req/s/gpu
  • Best pass rate: 0.9533

Compared with baseline:

  • trial-0001 (baseline, TP4/DP1): 0.8117 req/s, 0.2029 req/s/gpu
  • trial-0010 (best, TP8/DP1): 3.0350 req/s, 0.3794 req/s/gpu
  • Raw throughput gain: 3.74x
  • Per-GPU throughput gain: 1.87x (worked out below)
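
The gain figures follow directly from the measured rates and GPU counts; because trial-0010 spreads its load over 8 GPUs versus the baseline's 4, the per-GPU gain is exactly half the raw gain:

```python
# Worked arithmetic behind the gain figures above.
baseline_rate, baseline_gpus = 0.8117, 4   # trial-0001: TP4/DP1
best_rate, best_gpus = 3.0350, 8           # trial-0010: TP8/DP1

raw_gain = best_rate / baseline_rate                                      # ~3.74x
per_gpu_gain = (best_rate / best_gpus) / (baseline_rate / baseline_gpus)  # ~1.87x
```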

Best-point latency:

  • TTFT mean/p50/p90/p95/p99 = 863.84 / 253.58 / 2392.48 / 3154.26 / 5377.00 ms

12-trial summary

Trial       Proposed config delta                                  Result
trial-0001  baseline TP4/DP1/EP-off, max-num-batched-tokens=8192   0.8117 req/s, feasible
trial-0002  DP=2, max-num-batched-tokens=4096                      probe-time runtime failure
trial-0003  DP=2, max-num-batched-tokens=8192                      probe-time runtime failure
trial-0004  EP=4, enable-expert-parallel=true                      launch failure
trial-0005  max-num-batched-tokens=4096                            infeasible
trial-0006  TP=8, DP=1, max-num-batched-tokens=4096                2.8600 req/s, feasible
trial-0007  trial-0006 + max-num-batched-tokens=3072               infeasible
trial-0008  trial-0006 + max-num-batched-tokens=3584               2.9667 req/s, feasible
trial-0009  trial-0006 + max-num-batched-tokens=3328               infeasible
trial-0010  trial-0006 + max-num-batched-tokens=3712               3.0350 req/s, feasible, best
trial-0011  trial-0010 + max-num-batched-tokens=3840               infeasible
trial-0012  trial-0010 + max-num-batched-tokens=3776               infeasible

Key insights

  • The main win came from topology first, then local batch-shape refinement. TP4 -> TP8 was the key change.
  • TP4/DP2 was not just suboptimal; it was unstable at runtime under probing and should be treated as negative evidence for this stack.
  • EP=4 on the baseline TP4/DP1 path failed at launch with "group_gemm's contiguous kernel requires deepgemm", so EP is not currently a viable direction on this stack.
  • After switching to TP8/DP1/EP-off, the remaining gain came from tightening max-num-batched-tokens from 4096 into a narrow sweet spot around 3584-3712 (an illustrative bisection sketch follows this list).
  • Both too-small prefill batches (3072, 3328) and too-large ones (3776, 3840) hurt the TTFT tail enough to miss the 95% target.
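
The refinement pattern in trials 0007-0012 amounts to bracketing the feasible window of max-num-batched-tokens from both sides. As an illustrative sketch only (this is not aituner's actual proposal logic), one edge of that window can be narrowed by bisection, where `probe(v)` is an assumed callback that runs a trial at value `v` and reports whether the 95% target held:

```python
# Illustrative only: not aituner's actual proposal logic. Sketch of
# narrowing one edge of the feasible max-num-batched-tokens window.

def bisect_edge(good: int, bad: int, probe, step: int = 64) -> int:
    """Bisect between a known-feasible value `good` and a known-infeasible
    value `bad` until they are one `step` apart; returns the best feasible
    value found. probe(v) runs a trial and returns True when v holds SLO."""
    while abs(bad - good) > step:
        mid = (good + bad) // 2
        mid -= mid % step  # snap to the 64-token grid seen in the trials
        if mid in (good, bad):
            break
        if probe(mid):
            good = mid
        else:
            bad = mid
    return good

# e.g. bisect_edge(3712, 3840, probe) probes 3776; when that fails SLO,
# 3712 stands as the best feasible value, matching trials 0011-0012.
```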

Current recommendation

Use trial-0010 as the default serving shape for this workload:

  • tensor-parallel-size=8
  • data-parallel-size=1
  • enable-expert-parallel=false
  • max-num-batched-tokens=3712
  • keep the rest of the run_qwen235b.sh baseline unchanged (a hypothetical launch sketch follows)
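
As a concrete starting point, here is a hypothetical launch sketch for this shape. Only the tuned settings come from this study; the internal build's exact entrypoint and the remaining run_qwen235b.sh flags are assumptions, so treat this as a template rather than the verified command.

```python
# Hypothetical launch sketch for the recommended trial-0010 shape.
# Only the tuned settings below come from this study; the entrypoint and
# any other baseline flags from run_qwen235b.sh are assumptions.
import subprocess

MODEL = "/home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717"

subprocess.run([
    "/usr/local/bin/vllm", "serve", MODEL,
    "--tensor-parallel-size", "8",
    "--data-parallel-size", "1",
    "--max-num-batched-tokens", "3712",
    # --enable-expert-parallel is a boolean switch; leaving it off keeps
    # expert parallelism disabled, as in trial-0010.
], check=True)
```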