qwen27b-chat-0-8k-7day-compare
qwen3.5-27b chat trace, 0~8k input bucket, tuned-best vs baseline cross-day compare on internal vLLM (/usr/local/bin/vllm), compared by request_rate_per_gpu.
Setup
- Hardware: `dash1`, `8x H20`
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Engine: internal vLLM
- Baseline: empty patch over the study spec baseline, aligned to `~/run_qwen27b.sh`
  - `TP=1, DP=1`
- Tuned best source: `trial-0004` from `dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
- Tuned best config:
  - `tensor-parallel-size=2`
  - `data-parallel-size=1`
- Trace family: `chat`
- Input bucket: `0 <= input_length <= 8192`
- Time range scanned: `2026-03-11` to `2026-03-17`
- Available windows in this slot: `7`
  - `chat_w20260311_1000`
  - `chat_w20260312_1000`
  - `chat_w20260313_1000`
  - `chat_w20260314_1000`
  - `chat_w20260315_1000`
  - `chat_w20260316_1000`
  - `chat_w20260317_1000`
- Window duration: `600s` (`10:00-10:10`)
- Request mode: `chat`
- SLO:
  - pass target: `95%`
  - `TTFT <= 2000ms` for `<=4096` input tokens
  - `TTFT <= 4000ms` for `<=32768` input tokens
  - `TTFT <= 6000ms` for `>32768` input tokens
  - `TPOT <= 50ms`
- Search:
  - binary search on `sampling_u`
  - `max_probes = 6`
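The SLO tiers and the probe budget above can be sketched in a few lines. This is an illustration only, not the tuner's actual implementation: the function names are hypothetical, a request is assumed to count as good only when it meets both its TTFT and TPOT limit, and `sampling_u` is assumed to be a load knob in `[0, 1]` where feasibility is monotone (if a value passes, every smaller value passes too).

```python
from typing import Callable, Optional

PASS_TARGET = 0.95     # fraction of requests that must meet the SLO
TPOT_LIMIT_MS = 50.0   # per-output-token latency limit
MAX_PROBES = 6         # probe budget from the Search settings


def ttft_limit_ms(input_tokens: int) -> float:
    """Tiered TTFT limit taken from the SLO list."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def window_passes(requests: list[tuple[int, float, float]]) -> bool:
    """requests holds (input_tokens, ttft_ms, tpot_ms) for each request.

    Assumption: the window passes when at least PASS_TARGET of its
    requests meet both the TTFT and the TPOT limit.
    """
    if not requests:
        return False
    good = sum(
        1
        for tokens, ttft, tpot in requests
        if ttft <= ttft_limit_ms(tokens) and tpot <= TPOT_LIMIT_MS
    )
    return good / len(requests) >= PASS_TARGET


def search_max_feasible_u(
    probe: Callable[[float], bool],
    lo: float = 0.0,
    hi: float = 1.0,
    max_probes: int = MAX_PROBES,
) -> Optional[float]:
    """Binary search for the largest sampling_u that still passes.

    Returns None when no probe passes within the budget.
    """
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if probe(mid):
            best, lo = mid, mid   # feasible: try a higher load
        else:
            hi = mid              # infeasible: back off
    return best
```

Under these assumptions, a search that returns `None` corresponds to a configuration that completed all probes without finding any feasible operating point.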
Run assets
- Compare root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare`
- Summary: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/summary.json`
- Report: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/report.md`
- Compare spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/specs/qwen27b_chat_0_8k_compare_dash1.json`
- Tuned study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
Aggregate result
- Comparable wins: tuned `5`, baseline `0`
- Incomparable windows: `2`
- Baseline mean request rate: `0.046 req/s`
- Tuned mean request rate: `0.4723809523809524 req/s`
- Baseline mean request rate per GPU: `0.046 req/s/gpu`
- Tuned mean request rate per GPU: `0.2361904761904762 req/s/gpu`
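The per-GPU figures are consistent with dividing each raw request rate by the number of GPUs the serving shape occupies. A minimal sketch, assuming that GPU count equals `TP * DP` (the helper name is hypothetical, not taken from the compare tooling):

```python
def request_rate_per_gpu(request_rate: float, tp: int, dp: int) -> float:
    """Normalize a raw request rate by the GPUs used (assumed to be TP * DP)."""
    return request_rate / (tp * dp)


# Mean request rates from the aggregate result.
baseline_per_gpu = request_rate_per_gpu(0.046, tp=1, dp=1)
tuned_per_gpu = request_rate_per_gpu(0.4723809523809524, tp=2, dp=1)

print(f"baseline: {baseline_per_gpu:.4f} req/s/gpu")
print(f"tuned:    {tuned_per_gpu:.4f} req/s/gpu")
```

The baseline runs on one GPU, so its raw and per-GPU rates coincide, while the tuned rate is halved because `TP=2` occupies two GPUs.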
Per-window result
| Window | Date | Baseline req/s/gpu | Tuned req/s/gpu | Winner |
|---|---|---|---|---|
| chat_w20260311_1000 | 2026-03-11 | 0.035 | 0.21416666666666667 | tuned |
| chat_w20260312_1000 | 2026-03-12 | None | 0.28 | incomparable |
| chat_w20260313_1000 | 2026-03-13 | 0.03166666666666667 | 0.265 | tuned |
| chat_w20260314_1000 | 2026-03-14 | 0.021666666666666667 | 0.24083333333333334 | tuned |
| chat_w20260315_1000 | 2026-03-15 | 0.12166666666666667 | 0.23083333333333333 | tuned |
| chat_w20260316_1000 | 2026-03-16 | 0.02 | 0.2275 | tuned |
| chat_w20260317_1000 | 2026-03-17 | None | 0.195 | incomparable |
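The aggregate counts and means can be recomputed from the per-window rows. The sketch below copies the table values by hand and uses a hypothetical `winner` rule (a window is incomparable when either side has no feasible result). It is a cross-check, not the logic that produced `summary.json`.

```python
from statistics import mean
from typing import Optional

# (window, baseline req/s/gpu or None, tuned req/s/gpu), copied from the table.
ROWS = [
    ("chat_w20260311_1000", 0.035, 0.21416666666666667),
    ("chat_w20260312_1000", None, 0.28),
    ("chat_w20260313_1000", 0.03166666666666667, 0.265),
    ("chat_w20260314_1000", 0.021666666666666667, 0.24083333333333334),
    ("chat_w20260315_1000", 0.12166666666666667, 0.23083333333333333),
    ("chat_w20260316_1000", 0.02, 0.2275),
    ("chat_w20260317_1000", None, 0.195),
]


def winner(baseline: Optional[float], tuned: Optional[float]) -> str:
    """Assumed rule: no feasible result on either side means incomparable."""
    if baseline is None or tuned is None:
        return "incomparable"
    if tuned > baseline:
        return "tuned"
    if baseline > tuned:
        return "baseline"
    return "tie"


winners = [winner(b, t) for _, b, t in ROWS]

# Baseline is averaged over the windows where it found a feasible point,
# tuned over all windows.
baseline_mean = mean(b for _, b, _ in ROWS if b is not None)
tuned_mean = mean(t for _, _, t in ROWS)

print(winners.count("tuned"), winners.count("baseline"), winners.count("incomparable"))
print(f"{baseline_mean:.4f} {tuned_mean:.4f}")
```

With these inputs, the reported per-GPU means are reproduced when the baseline is averaged over its five feasible windows and the tuned config over all seven, so the two means do not cover the same set of windows.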
Key insights
- This compare does not support the conclusion that the tuned config lacks generalization. Across the full 7-day slice, tuned wins every directly comparable window.
- The two `incomparable` days are not execution failures. Baseline completed probing but never found a single feasible `sampling_u` under the target SLO, while tuned still found feasible operating points.
- The tuned `TP=2, DP=1` shape is materially more robust than the `TP=1, DP=1` baseline for this `0~8k` chat bucket.
- The weekend windows do not break the result. `2026-03-14` is another clear tuned win, and even on `2026-03-15`, where baseline is relatively stronger than other days, tuned still wins by about `1.90x` on `req/s/gpu`.
- The throughput gap remains large even after normalizing by GPU count, so this is not just a raw-card-count artifact.
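The per-day gap behind these points can be checked by dividing tuned by baseline `req/s/gpu` on each comparable day. A small sketch with the values copied by hand from the per-window result (variable names are illustrative):

```python
# date -> (baseline req/s/gpu, tuned req/s/gpu) for the five comparable windows.
COMPARABLE = {
    "2026-03-11": (0.035, 0.21416666666666667),
    "2026-03-13": (0.03166666666666667, 0.265),
    "2026-03-14": (0.021666666666666667, 0.24083333333333334),
    "2026-03-15": (0.12166666666666667, 0.23083333333333333),
    "2026-03-16": (0.02, 0.2275),
}

# Per-GPU speedup of tuned over baseline for each day.
speedups = {day: tuned / base for day, (base, tuned) in COMPARABLE.items()}

# The day with the narrowest gap is the most conservative data point.
smallest_day = min(speedups, key=speedups.get)

for day, ratio in speedups.items():
    print(f"{day}: {ratio:.2f}x")
print(f"smallest gap: {smallest_day}")
```

The narrowest gap falls on `2026-03-15`, so the roughly `1.90x` figure quoted for that day is the lower bound of the per-GPU speedup across the comparable windows.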
Recommendation
For qwen27b chat 0~8k, keep using the tuned TP=2, DP=1 serving shape as the default candidate over the TP=1, DP=1 baseline, and treat cross-day robustness as confirmed on the full 7-day window set.