qwen27b-chat-pd-colocation

qwen3.5-27b chat trace, 0–8k input bucket, served with the internal vLLM (/usr/local/bin/vllm). The baseline is aligned to ~/run_qwen27b.sh, and configurations are compared by request_rate_per_gpu.

Setup

  • Hardware: dash0, 8x H20
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
  • Engine: internal vLLM, PD-colocation baseline from ~/run_qwen27b.sh
  • Baseline topology: TP=1, DP=1, EP=1
  • Trace: chat_w20260311_1000
  • Trace source: trace_windows/traces/chat_w20260311_1000.jsonl
  • Window duration: 600s (10:00-10:10, 2026-03-11)
  • Request mode: chat
  • Input bucket: 0 <= input_length <= 8192
  • SLO:
    • pass target: 95%
    • TTFT <= 2000ms for <=4096 input tokens
    • TTFT <= 4000ms for <=32768 input tokens
    • TTFT <= 6000ms for >32768 input tokens
    • TPOT <= 50ms
  • Search:
    • sampling_u in [0, 0.0625]
    • max_probes = 6
    • 12 trials total
  • Proposal model: codex / gpt-5.4
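
The tiered SLO above can be sketched as a pass-rate check. The thresholds and the 95% target are taken from this report; the function names and the per-request record shape are hypothetical, not aituner's actual API:

```python
# Hedged sketch of the SLO pass check described above. Thresholds come from
# the report; `slo_pass_rate` and the (input_tokens, ttft_ms, tpot_ms) record
# shape are illustrative stand-ins, not aituner's real interface.

def ttft_limit_ms(input_tokens: int) -> int:
    """TTFT budget, tiered by input length."""
    if input_tokens <= 4096:
        return 2000
    if input_tokens <= 32768:
        return 4000
    return 6000

TPOT_LIMIT_MS = 50
PASS_TARGET = 0.95

def slo_pass_rate(requests) -> float:
    """requests: iterable of (input_tokens, ttft_ms, tpot_ms) tuples."""
    ok = sum(
        1
        for input_tokens, ttft_ms, tpot_ms in requests
        if ttft_ms <= ttft_limit_ms(input_tokens) and tpot_ms <= TPOT_LIMIT_MS
    )
    return ok / len(requests)

def feasible(requests) -> bool:
    """A trial is feasible when at least 95% of requests meet both SLOs."""
    return slo_pass_rate(requests) >= PASS_TARGET
```

A request passes only if both its TTFT (against its input-length tier) and its TPOT are within budget; a trial is "feasible" when the pass rate reaches the 95% target.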

Run assets

  • Study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology
  • State: /home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology/state.json
  • Log: /home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen27b_tight_slo_run9_0_8k_codex_topology.log
  • Spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/specs/dash0_qwen27b_tight_slo_run9_0_8k_codex_topology.json

Best result

  • Best trial: trial-0004
  • Best config:
    • tensor-parallel-size=2
    • data-parallel-size=1
  • Best sampling_u: 0.013061523438
  • Best request rate: 0.405 req/s
  • Best request rate per GPU: 0.2025 req/s/gpu
  • Best pass rate: 0.963

Compared with baseline:

  • trial-0001: 0.035 req/s, 0.035 req/s/gpu
  • trial-0004: 0.405 req/s, 0.2025 req/s/gpu
  • Raw throughput gain: 11.57x
  • Per-GPU throughput gain: 5.79x
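
The quoted gains follow directly from the two trials' rates and GPU counts. A minimal arithmetic check (values copied from the report; the helper name is illustrative):

```python
# Reproduce the baseline-comparison arithmetic above. `per_gpu_rate` is a
# hypothetical helper: request rate normalized by the GPUs the topology uses.

def per_gpu_rate(req_s: float, tp: int, dp: int) -> float:
    return req_s / (tp * dp)

baseline = per_gpu_rate(0.035, tp=1, dp=1)   # trial-0001, 1 GPU
best = per_gpu_rate(0.405, tp=2, dp=1)       # trial-0004, 2 GPUs

raw_gain = 0.405 / 0.035        # raw throughput gain, ~11.57x
per_gpu_gain = best / baseline  # per-GPU throughput gain, ~5.79x
```

The per-GPU gain is half the raw gain because trial-0004 spends two GPUs (TP=2) where the baseline spends one.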

12-trial summary

| Trial      | Proposed config delta                                  | Result                                        |
| ---------- | ------------------------------------------------------ | --------------------------------------------- |
| trial-0001 | baseline TP=1/DP=1                                     | 0.0350 req/s, 0.0350 req/s/gpu, feasible      |
| trial-0002 | DP=2                                                   | 0.1233 req/s, 0.0617 req/s/gpu, feasible      |
| trial-0003 | DP=4                                                   | 0.1567 req/s, 0.0392 req/s/gpu, feasible      |
| trial-0004 | TP=2, DP=1                                             | 0.4050 req/s, 0.2025 req/s/gpu, feasible, best |
| trial-0005 | trial-0004 + max-num-batched-tokens=16384              | infeasible                                    |
| trial-0006 | trial-0004 + max-num-seqs=24                           | infeasible                                    |
| trial-0007 | trial-0004 + max-num-batched-tokens=12288              | infeasible                                    |
| trial-0008 | trial-0004 + block-size=32                             | infeasible                                    |
| trial-0009 | trial-0004 + gpu-memory-utilization=0.93               | infeasible                                    |
| trial-0010 | trial-0004 + max-num-seqs=16, max-num-batched-tokens=6144 | infeasible                                 |
| trial-0011 | trial-0004 + enable-prefix-caching=false               | infeasible                                    |
| trial-0012 | trial-0004 + block-size=128                            | infeasible                                    |

Key insights

  • The baseline must be the real ~/run_qwen27b.sh TP=1 shape. Against that baseline, TP=2, DP=1 is clearly better on both raw throughput and request_rate_per_gpu.
  • Pure DP scaling helped from DP=1 to DP=2, but DP=4 already lost per-GPU efficiency. The main win came from TP=2, not from adding more replicas.
  • Once the topology settled at TP=2/DP=1, the remaining bottleneck was the TTFT tail, not TPOT. Later runtime-only trials generally failed around 0.435 req/s with pass_rate ≈ 0.89: TPOT p95 stayed acceptable while TTFT p95 sat near 2.5–3.0s.
  • For this 0–8k chat bucket, the useful topology search space was small but important. Without per-topology sampling_u search isolation, this result would have been easy to miss.
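
One way the per-topology sampling_u search could work is a bisection over the [0, 0.0625] interval with at most 6 probes, keeping the highest value whose run is still feasible. This is a hypothetical sketch; aituner's actual probe policy may differ, and `run_benchmark` is a stand-in for the real trial evaluation:

```python
# Hypothetical per-topology sampling_u search: binary-search [0, 0.0625] with
# at most max_probes benchmark runs, tracking the largest feasible value.
# `run_benchmark(u)` is an assumed callback returning True when the trial at
# sampling_u=u passes the SLO target; it is not aituner's real interface.

def search_sampling_u(run_benchmark, lo=0.0, hi=0.0625, max_probes=6):
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if run_benchmark(mid):
            best = mid  # feasible: push the load higher next probe
            lo = mid
        else:
            hi = mid    # infeasible: back off
    return best
```

Running the search independently per topology matters because the feasible sampling_u region shifts with the shape; reusing one topology's probe results for another can miss a better shape entirely, as the TP=2 result here illustrates.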

Current recommendation

Use trial-0004 as the default serving shape for this workload:

  • tensor-parallel-size=2
  • data-parallel-size=1
  • keep the rest of the run_qwen27b.sh baseline unchanged