Files
aituner/docs/qwen235b-thinking-prefill/ttft-tight-0327.md

4.2 KiB

qwen235b-thinking-prefill-ttft-tight-0327

qwen3-235b-a22b thinking trace, prefill-only replay with output_length=1, internal vLLM (/usr/local/bin/vllm), tuned on thinking_w20260327_1000 under tighter stepped TTFT SLO.

Setup

  • Hardware: dash0, 8x H20
  • Model: /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717
  • Engine: internal vLLM, baseline aligned to ~/run_qwen235b.sh
  • Baseline topology: TP=4, DP=1, EP=1
  • Trace: thinking_w20260327_1000
  • Trace source: trace_windows/traces/thinking_w20260327_1000.jsonl
  • Window duration: 600s (10:00-10:10, 2026-03-27)
  • Request mode: chat
  • Replay override: min_tokens=max_tokens=1
  • SLO:
    • pass target: 95%
    • TTFT <= 2000ms for <=8191 input tokens
    • TTFT <= 4000ms for <=32767 input tokens
    • TTFT <= 6000ms for >32767 input tokens
  • Search:
    • sampling_u in [0, 0.125]
    • max_probes = 6
    • 12 trials total
  • Proposal model: codex / gpt-5.4
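The stepped TTFT SLO above can be sketched as a simple pass-rate check. This is a minimal illustration, assuming each replayed request is reduced to an (input_tokens, ttft_ms) pair; the function names are hypothetical, not the study's actual harness:

```python
def ttft_slo_ms(input_tokens: int) -> int:
    """Stepped TTFT budget from the study spec."""
    if input_tokens <= 8191:
        return 2000
    if input_tokens <= 32767:
        return 4000
    return 6000

def pass_rate(requests):
    """Fraction of (input_tokens, ttft_ms) pairs meeting their bucket's budget."""
    ok = sum(1 for toks, ttft in requests if ttft <= ttft_slo_ms(toks))
    return ok / len(requests)
```

A config is feasible at a given sampling_u when `pass_rate(...)` meets the 0.95 target.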

Run assets

  • Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run2-ttft-tight-topology
  • State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-prefill/dash0-qwen235b-prefill-thinking-run2-ttft-tight-topology/state.json
  • Spec: /home/admin/cpfs/wjh/aituner/aituner/configs/examples/dash0_qwen235b_prefill_thinking_run2_ttft_tight.json

Best result

  • Best trial: trial-0012
  • Best config:
    • tensor-parallel-size=8
    • data-parallel-size=1
    • enable-expert-parallel=false
    • max-num-batched-tokens=6144
    • max-num-seqs=48
    • block-size=32
  • Best sampling_u: 0.098106
  • Best request rate: 2.4967 req/s
  • Best request rate per GPU: 0.3121 req/s/gpu
  • Best pass rate: 0.9506

Compared with baseline:

  • trial-0001 (baseline, TP4/DP1, 4 GPUs): 0.4717 req/s, 0.1179 req/s/gpu
  • trial-0012 (best, TP8/DP1, 8 GPUs): 2.4967 req/s, 0.3121 req/s/gpu
  • Raw throughput gain: 5.29x
  • Per-GPU throughput gain: 2.65x
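The two gain figures follow from the GPU counts implied by the reported per-GPU rates (assuming the TP4/DP1 baseline occupies 4 GPUs and the best TP8/DP1 config 8):

```python
baseline_rps, best_rps = 0.4717, 2.4967
baseline_gpus, best_gpus = 4, 8  # TP4/DP1 baseline vs TP8/DP1 best

raw_gain = best_rps / baseline_rps                                      # ~5.29x
per_gpu_gain = (best_rps / best_gpus) / (baseline_rps / baseline_gpus)  # ~2.65x
```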

Compared with the looser TTFT study on the same 2026-03-27 window:

  • looser-SLO best: 3.0350 req/s, 0.3794 req/s/gpu
  • tighter-SLO best: 2.4967 req/s, 0.3121 req/s/gpu
  • throughput retained: 82.26%
  • throughput drop: 17.74%
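The retained/drop percentages are just the ratio of the two best request rates:

```python
looser_best, tighter_best = 3.035, 2.4967
retained = tighter_best / looser_best  # ~0.8226, i.e. 82.26% retained
drop = 1 - retained                    # ~17.74% drop
```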

Best-point latency:

  • TTFT mean/p50/p90/p95/p99 = 413.92 / 67.86 / 1456.32 / 2286.90 / 5326.23 ms

12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| trial-0001 | baseline TP4/DP1/EP-off, max-num-batched-tokens=8192 | 0.4717 req/s, feasible |
| trial-0002 | TP4/DP2 | probe-search failure |
| trial-0003 | TP8/DP1/EP-off | 1.9200 req/s, feasible |
| trial-0004 | TP8/DP1/EP8 | launch fail |
| trial-0005 | trial-0003 + max-num-batched-tokens=6144 | 2.2517 req/s, feasible |
| trial-0006 | trial-0003 + max-num-batched-tokens=4096 | infeasible |
| trial-0007 | trial-0003 + max-num-batched-tokens=5120 | infeasible |
| trial-0008 | trial-0003 + max-num-batched-tokens=5632 | infeasible |
| trial-0009 | trial-0005 + max-num-seqs=32 | infeasible |
| trial-0010 | trial-0005 + max-num-seqs=48 | infeasible |
| trial-0011 | trial-0005 + block-size=32 | infeasible |
| trial-0012 | trial-0005 + max-num-seqs=48, block-size=32 | 2.4967 req/s, feasible, best |

Key insights

  • This study runs on the same 2026-03-27 window as the earlier run; the only change is the tighter stepped TTFT SLO.
  • The best topology still moved to TP8/DP1/no-EP; tighter TTFT did not change the topology conclusion.
  • Tighter TTFT did change the runtime sweet spot: the best runtime shape is no longer the looser study's 3712-token batch, but max-num-batched-tokens=6144 with max-num-seqs=48 and block-size=32.
  • DP2 and EP remained negative evidence in this stack: TP4/DP2 failed during probing, and TP8 + EP8 failed at launch.
  • Relative to the looser TTFT study on the same day, stricter TTFT costs about 17.7% throughput, but the tuned result still keeps a large margin over baseline.

Recommendation

For the tighter stepped TTFT SLO on thinking_w20260327_1000, use:

  • tensor-parallel-size=8
  • data-parallel-size=1
  • enable-expert-parallel=false
  • max-num-batched-tokens=6144
  • max-num-seqs=48
  • block-size=32

Keep the rest of the run_qwen235b.sh baseline unchanged.
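As a sketch only, the recommended flags map onto an open-source-style vLLM launch like the following; the internal /usr/local/bin/vllm build and the run_qwen235b.sh wrapper may spell these differently, so treat this as illustrative rather than the exact command:

```shell
# Illustrative only: flag spellings follow the open-source vLLM CLI;
# verify against the internal build and keep other run_qwen235b.sh flags as-is.
/usr/local/bin/vllm serve \
  /home/admin/resource/model/464482ce.qwen3-235b-a22b/256k-0717 \
  --tensor-parallel-size 8 \
  --data-parallel-size 1 \
  --max-num-batched-tokens 6144 \
  --max-num-seqs 48 \
  --block-size 32
# Expert parallelism stays off: simply omit --enable-expert-parallel.
```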