# qwen27b-chat-0-8k-7day-compare

qwen3.5-27b chat trace, 0~8k input bucket: tuned-best vs. baseline cross-day comparison on internal vLLM (`/usr/local/bin/vllm`), scored by request_rate_per_gpu.
## Setup

- Hardware: dash1, 8x H20
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Engine: internal vLLM
- Baseline: empty patch over the study spec baseline, aligned to `~/run_qwen27b.sh` (TP=1, DP=1)
- Tuned best source: `trial-0004` from `dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
- Tuned best config: `tensor-parallel-size=2`, `data-parallel-size=1`
- Trace family: chat
- Input bucket: 0 <= input_length <= 8192
- Time range scanned: 2026-03-11 to 2026-03-17
- Available windows in this slot: 7 (`chat_w20260311_1000`, `chat_w20260312_1000`, `chat_w20260313_1000`, `chat_w20260314_1000`, `chat_w20260315_1000`, `chat_w20260316_1000`, `chat_w20260317_1000`)
- Window duration: 600s (10:00-10:10)
- Request mode: chat
- SLO pass target: 95% of requests must meet the tiered TTFT budget (TTFT <= 2000ms for <=4096 input tokens, TTFT <= 4000ms for <=32768 input tokens, TTFT <= 6000ms for >32768 input tokens) and TPOT <= 50ms
- Search: binary search on `sampling_u`, max_probes = 6
- Proposal model for tuned source: codex / gpt-5.4
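The feasibility search above can be sketched as a capped bisection over `sampling_u`. This is a hypothetical illustration, not the tuner's actual code: `is_feasible` stands in for a real replay of a window at the given load, and `ttft_budget_ms` only encodes the tiered TTFT budgets from the SLO above.

```python
def ttft_budget_ms(input_tokens: int) -> int:
    """Tiered TTFT budget (ms) from the SLO pass target above."""
    if input_tokens <= 4096:
        return 2000
    if input_tokens <= 32768:
        return 4000
    return 6000


def search_sampling_u(is_feasible, lo=0.0, hi=1.0, max_probes=6):
    """Binary search for the highest sampling_u that still passes the SLO.

    is_feasible(u) is assumed to replay the window at load u and return
    True iff the pass target is met. Returns None when no probe ever
    passed, which is how an 'incomparable' window arises.
    """
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if is_feasible(mid):
            best = mid   # feasible: push the load higher
            lo = mid
        else:
            hi = mid     # infeasible: back off
    return best
```

With `max_probes = 6` the search narrows `sampling_u` to within 1/64 of the initial interval; a side where every probe fails returns `None`.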
## Run assets

- Compare root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare`
- Summary: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/summary.json`
- Report: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/report.md`
- Compare spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/specs/qwen27b_chat_0_8k_compare_dash1.json`
- Tuned study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
## Tuned-source result

- Best trial: trial-0004
- Best config: `tensor-parallel-size=2`, `data-parallel-size=1`
- Best `sampling_u`: 0.013061523438
- Best request rate: 0.405 req/s
- Best request rate per GPU: 0.2025 req/s/gpu
- Best pass rate: 0.963
Compared with the single-day baseline on `chat_w20260311_1000`:

- trial-0001: 0.035 req/s, 0.035 req/s/gpu
- trial-0004: 0.405 req/s, 0.2025 req/s/gpu
- Raw throughput gain: 11.57x
- Per-GPU throughput gain: 5.79x
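The two gain figures follow directly from the trial numbers once GPU counts are accounted for. A minimal worked sketch (GPU counts are inferred from TP x DP):

```python
# trial-0001: TP=1, DP=1 -> 1 GPU; trial-0004: TP=2, DP=1 -> 2 GPUs
baseline_rps, baseline_gpus = 0.035, 1 * 1
tuned_rps, tuned_gpus = 0.405, 2 * 1

raw_gain = tuned_rps / baseline_rps
per_gpu_gain = (tuned_rps / tuned_gpus) / (baseline_rps / baseline_gpus)

print(f"raw: {raw_gain:.2f}x, per-GPU: {per_gpu_gain:.2f}x")
# raw: 11.57x, per-GPU: 5.79x
```

The per-GPU gain is exactly half the raw gain here because the tuned shape uses twice the GPUs.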
## 12-trial summary

| Trial | Proposed config delta | Result |
|---|---|---|
| trial-0001 | baseline (TP=1, DP=1) | 0.0350 req/s, 0.0350 req/s/gpu, feasible |
| trial-0002 | DP=2 | 0.1233 req/s, 0.0617 req/s/gpu, feasible |
| trial-0003 | DP=4 | 0.1567 req/s, 0.0392 req/s/gpu, feasible |
| trial-0004 | TP=2, DP=1 | 0.4050 req/s, 0.2025 req/s/gpu, feasible, best |
| trial-0005 | trial-0004 + max-num-batched-tokens=16384 | infeasible |
| trial-0006 | trial-0004 + max-num-seqs=24 | infeasible |
| trial-0007 | trial-0004 + max-num-batched-tokens=12288 | infeasible |
| trial-0008 | trial-0004 + block-size=32 | infeasible |
| trial-0009 | trial-0004 + gpu-memory-utilization=0.93 | infeasible |
| trial-0010 | trial-0004 + max-num-seqs=16, max-num-batched-tokens=6144 | infeasible |
| trial-0011 | trial-0004 + enable-prefix-caching=false | infeasible |
| trial-0012 | trial-0004 + block-size=128 | infeasible |
## Aggregate result

- Comparable wins: tuned 5, baseline 0
- Incomparable windows: 2
- Baseline mean request rate: 0.046 req/s
- Tuned mean request rate: 0.4724 req/s
- Baseline mean request rate per GPU: 0.046 req/s/gpu
- Tuned mean request rate per GPU: 0.2362 req/s/gpu
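The aggregate numbers can be reproduced from the per-window rates (`None` marks a window where the baseline found no feasible `sampling_u`). Note the asymmetry: baseline means are taken over the five comparable windows only, while tuned means cover all seven. This is a sketch of the aggregation rule, not the compare tool's actual code.

```python
# Per-window req/s/gpu, 2026-03-11 .. 2026-03-17.
baseline = [0.035, None, 0.03166666666666667, 0.021666666666666667,
            0.12166666666666667, 0.02, None]
tuned = [0.21416666666666667, 0.28, 0.265, 0.24083333333333334,
         0.23083333333333333, 0.2275, 0.195]

comparable = [b for b in baseline if b is not None]
baseline_mean = sum(comparable) / len(comparable)    # 0.046 req/s/gpu
tuned_mean = sum(tuned) / len(tuned)                 # ~0.2362 req/s/gpu
tuned_wins = sum(1 for b, t in zip(baseline, tuned)
                 if b is not None and t > b)         # 5
```

The tuned mean request rate (0.4724 req/s) is then just the per-GPU mean times the 2 GPUs of the TP=2, DP=1 shape.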
## Per-window result

| Window | Date | Baseline req/s/gpu | Tuned req/s/gpu | Winner |
|---|---|---|---|---|
| chat_w20260311_1000 | 2026-03-11 | 0.0350 | 0.2142 | tuned |
| chat_w20260312_1000 | 2026-03-12 | None | 0.2800 | incomparable |
| chat_w20260313_1000 | 2026-03-13 | 0.0317 | 0.2650 | tuned |
| chat_w20260314_1000 | 2026-03-14 | 0.0217 | 0.2408 | tuned |
| chat_w20260315_1000 | 2026-03-15 | 0.1217 | 0.2308 | tuned |
| chat_w20260316_1000 | 2026-03-16 | 0.0200 | 0.2275 | tuned |
| chat_w20260317_1000 | 2026-03-17 | None | 0.1950 | incomparable |
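The winner column follows a simple rule: a missing baseline rate yields `incomparable` rather than a tuned win. A hypothetical helper (not the compare tool's code) capturing that rule:

```python
def winner(baseline_rpg, tuned_rpg):
    """Classify one window by per-GPU request rate."""
    if baseline_rpg is None or tuned_rpg is None:
        return "incomparable"  # no feasible operating point on one side
    return "tuned" if tuned_rpg > baseline_rpg else "baseline"
```

This keeps the two no-feasible-baseline days out of the win count instead of inflating the tuned side.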
## Key insights

- The tuning itself was simple and topology-driven: the winning patch is only TP=1 -> TP=2, and every later runtime-only tweak failed to beat it.
- This compare does not support the conclusion that the tuned config lacks generalization. Across the full 7-day slice, tuned wins every directly comparable window.
- The two incomparable days are not execution failures. The baseline completed probing but never found a single feasible `sampling_u` under the target SLO, while the tuned config still found feasible operating points.
- The tuned TP=2, DP=1 shape is materially more robust than the TP=1, DP=1 baseline for this 0~8k chat bucket.
- The weekend windows do not break the result. 2026-03-14 is another clear tuned win, and even on 2026-03-15, where the baseline is relatively stronger than on other days, tuned still wins by about 1.90x on req/s/gpu.
- The throughput gap remains large even after normalizing by GPU count, so this is not just a raw-card-count artifact.
## Recommendation

For qwen27b chat 0~8k, keep the tuned TP=2, DP=1 serving shape as the default candidate over the TP=1, DP=1 baseline, and treat cross-day robustness as confirmed on the full 7-day window set.