# qwen27b-chat-pd-colocation

qwen3.5-27b chat trace, 0~8k input bucket, internal vLLM (`/usr/local/bin/vllm`), baseline aligned to `~/run_qwen27b.sh`, compared by `request_rate_per_gpu`.
## Setup

- Hardware: dash0, 8x H20
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Engine: internal vLLM, PD-colocation baseline from `~/run_qwen27b.sh`
- Baseline topology: `TP=1, DP=1, EP=1`
- Trace: `chat_w20260311_1000`
- Trace source: `trace_windows/traces/chat_w20260311_1000.jsonl`
- Window duration: 600s (10:00-10:10, 2026-03-11)
- Request mode: chat
- Input bucket: `0 <= input_length <= 8192`
- SLO (pass target: 95%; see the sketch after this list):
  - `TTFT <= 2000ms` for `<= 4096` input tokens
  - `TTFT <= 4000ms` for `<= 32768` input tokens
  - `TTFT <= 6000ms` for `> 32768` input tokens
  - `TPOT <= 50ms`
- Search: `sampling_u in [0, 0.0625]`, `max_probes = 6`, 12 trials total
- Proposal model: codex / gpt-5.4
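For reference, a minimal sketch of how a single request is judged against this tiered SLO and how a trial's pass rate is computed. The function and field names are illustrative assumptions, not the actual aituner implementation:

```python
# Sketch of the tiered SLO check; names are hypothetical, not aituner's API.
def request_passes_slo(input_tokens: int, ttft_ms: float, tpot_ms: float) -> bool:
    """Return True if one request meets the tiered TTFT limit and the TPOT limit."""
    if input_tokens <= 4096:
        ttft_limit_ms = 2000.0
    elif input_tokens <= 32768:
        ttft_limit_ms = 4000.0
    else:
        ttft_limit_ms = 6000.0
    return ttft_ms <= ttft_limit_ms and tpot_ms <= 50.0

def trial_is_feasible(requests: list[dict]) -> bool:
    """A trial is feasible when at least 95% of its requests meet the SLO."""
    passed = sum(
        request_passes_slo(r["input_tokens"], r["ttft_ms"], r["tpot_ms"])
        for r in requests
    )
    return passed / len(requests) >= 0.95
```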
## Run assets

- Study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
- State: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology/state.json`
- Log: `/home/admin/cpfs/wjh/aituner/aituner/logs/dash0_qwen27b_tight_slo_run9_0_8k_codex_topology.log`
- Spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/specs/dash0_qwen27b_tight_slo_run9_0_8k_codex_topology.json`
## Best result

- Best trial: trial-0004
- Best config: `tensor-parallel-size=2`, `data-parallel-size=1`
- Best `sampling_u`: 0.013061523438
- Best request rate: 0.405 req/s
- Best request rate per GPU: 0.2025 req/s/gpu
- Best pass rate: 0.9629629629629629 (~96.3%)

Compared with baseline:

- trial-0001: 0.035 req/s, 0.035 req/s/gpu
- trial-0004: 0.405 req/s, 0.2025 req/s/gpu
- Raw throughput gain: 11.57x
- Per-GPU throughput gain: 5.79x
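The per-GPU figures fall out of normalizing the raw rate by GPU count, assuming GPU count = TP x DP (consistent with every number in this report). A quick check of the arithmetic above:

```python
# Check: per-GPU rate = raw rate / (TP * DP); gains are ratios vs. trial-0001.
baseline_rate, baseline_gpus = 0.035, 1 * 1  # trial-0001: TP=1, DP=1
best_rate, best_gpus = 0.405, 2 * 1          # trial-0004: TP=2, DP=1

raw_gain = best_rate / baseline_rate
per_gpu_gain = (best_rate / best_gpus) / (baseline_rate / baseline_gpus)
print(f"raw {raw_gain:.2f}x, per-GPU {per_gpu_gain:.2f}x")  # raw 11.57x, per-GPU 5.79x
```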
## 12-trial summary

| Trial | Proposed config delta | Result |
|---|---|---|
| trial-0001 | baseline TP1/DP1 | 0.0350 req/s, 0.0350 req/s/gpu, feasible |
| trial-0002 | DP=2 | 0.1233 req/s, 0.0617 req/s/gpu, feasible |
| trial-0003 | DP=4 | 0.1567 req/s, 0.0392 req/s/gpu, feasible |
| trial-0004 | TP=2, DP=1 | 0.4050 req/s, 0.2025 req/s/gpu, feasible, best |
| trial-0005 | trial-0004 + max-num-batched-tokens=16384 | infeasible |
| trial-0006 | trial-0004 + max-num-seqs=24 | infeasible |
| trial-0007 | trial-0004 + max-num-batched-tokens=12288 | infeasible |
| trial-0008 | trial-0004 + block-size=32 | infeasible |
| trial-0009 | trial-0004 + gpu-memory-utilization=0.93 | infeasible |
| trial-0010 | trial-0004 + max-num-seqs=16, max-num-batched-tokens=6144 | infeasible |
| trial-0011 | trial-0004 + enable-prefix-caching=false | infeasible |
| trial-0012 | trial-0004 + block-size=128 | infeasible |
## Key insights

- The baseline must be the real `~/run_qwen27b.sh` `TP=1` shape. Under that correct baseline, `TP=2, DP=1` is clearly better on both raw throughput and `request_rate_per_gpu`.
- Pure DP scaling helped from `DP1 -> DP2`, but `DP4` already lost per-GPU efficiency. The main win came from `TP2`, not from adding more replicas.
- After the topology settled at `TP2/DP1`, the remaining bottleneck was the `TTFT` tail, not `TPOT`. Later runtime-only trials generally failed around `0.435 req/s` with `pass_rate ~= 0.89`, while `TPOT p95` stayed acceptable and `TTFT p95` stayed near `2.5s~3.0s`.
- For this `0~8k` chat bucket, the useful topology search space was small but important. Without per-topology `sampling_u` search isolation (a sketch of that probe loop follows below), this result would have been easy to miss.
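A minimal sketch of what per-topology `sampling_u` isolation means here: each topology gets its own bounded probe budget over `[0, 0.0625]` instead of sharing one global search. The bisection strategy, monotonicity assumption, and function names below are all assumptions for illustration; aituner's real search may differ:

```python
# Hypothetical per-topology probe loop; not aituner's actual search code.
def search_sampling_u(run_trial, lo=0.0, hi=0.0625, max_probes=6):
    """Bisect sampling_u within [lo, hi], keeping the highest feasible value.

    Assumes `run_trial(u)` replays the trace at the request rate implied by u
    and returns True when the 95% SLO pass target is met, and that feasibility
    is monotone in u over this interval.
    """
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if run_trial(mid):
            best, lo = mid, mid  # feasible: push the rate higher
        else:
            hi = mid             # infeasible: back off

    return best

# Each topology (TP1/DP1, DP=2, DP=4, TP2/DP1, ...) gets its own isolated
# search, so a strong topology is never judged at a sampling_u tuned for a
# weak one.
```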
## Current recommendation

Use trial-0004 as the default serving shape for this workload:

- `tensor-parallel-size=2`, `data-parallel-size=1`
- keep the rest of the `run_qwen27b.sh` baseline unchanged
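As a hedged illustration only (the actual launch lives in `~/run_qwen27b.sh`, which is not reproduced here), the delta amounts to two flags on the serve command; the flag names come from the trial config above, and everything else stays as the baseline script sets it:

```python
# Hypothetical launcher showing only the trial-0004 delta; the remaining
# flags should be copied unchanged from ~/run_qwen27b.sh.
import subprocess

MODEL = "/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal"

subprocess.run(
    [
        "/usr/local/bin/vllm", "serve", MODEL,
        "--tensor-parallel-size", "2",  # trial-0004 topology
        "--data-parallel-size", "1",    # stated explicitly to match trial-0004
        # ... all other run_qwen27b.sh baseline flags unchanged ...
    ],
    check=True,
)
```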