aituner/docs/qwen27b-chat-0-8k-7day-compare/README.md
qwen27b-chat-0-8k-7day-compare

Cross-day comparison of the tuned-best config vs the baseline for the qwen3.5-27b chat trace, 0~8k input bucket, on internal vLLM (/usr/local/bin/vllm). Runs are compared by request_rate_per_gpu.

Setup

  • Hardware: dash1, 8x H20
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
  • Engine: internal vLLM
  • Baseline: empty patch over the study spec baseline, aligned to ~/run_qwen27b.sh TP=1, DP=1
  • Tuned best source: trial-0004 from dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology
  • Tuned best config:
    • tensor-parallel-size=2
    • data-parallel-size=1
  • Trace family: chat
  • Input bucket: 0 <= input_length <= 8192
  • Time range scanned: 2026-03-11 to 2026-03-17
  • Available windows in this slot: 7
    • chat_w20260311_1000
    • chat_w20260312_1000
    • chat_w20260313_1000
    • chat_w20260314_1000
    • chat_w20260315_1000
    • chat_w20260316_1000
    • chat_w20260317_1000
  • Window duration: 600s (10:00-10:10)
  • Request mode: chat
  • SLO:
    • pass target: 95%
    • TTFT <= 2000ms for <=4096 input tokens
    • TTFT <= 4000ms for <=32768 input tokens
    • TTFT <= 6000ms for >32768 input tokens
    • TPOT <= 50ms
  • Search:
    • binary search on sampling_u
    • max_probes = 6
  • Proposal model for tuned source: codex / gpt-5.4
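
The search described above (binary search on sampling_u, at most 6 probes, feasibility defined by the tiered TTFT/TPOT SLO with a 95% pass target) can be sketched as follows. This is an illustrative reconstruction, not aituner's actual code; the measure() callback and all helper names are assumptions.

```python
# Illustrative sketch of the probe loop: binary search on sampling_u, where a
# probe is "feasible" when at least pass_target of requests meet the SLO.

def ttft_limit_ms(input_tokens: int) -> int:
    """Tiered TTFT limit from the SLO above."""
    if input_tokens <= 4096:
        return 2000
    if input_tokens <= 32768:
        return 4000
    return 6000

def slo_pass_rate(requests) -> float:
    """Fraction of requests meeting both the TTFT and TPOT limits."""
    ok = sum(
        1 for r in requests
        if r["ttft_ms"] <= ttft_limit_ms(r["input_tokens"]) and r["tpot_ms"] <= 50
    )
    return ok / len(requests)

def search_sampling_u(measure, lo=0.0, hi=1.0, max_probes=6, pass_target=0.95):
    """Return the highest feasible sampling_u found, or None if every probe fails.

    measure(u) is assumed to run one probe at load u and return per-request
    latency records.
    """
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if slo_pass_rate(measure(mid)) >= pass_target:
            best = mid   # feasible: push the load higher
            lo = mid
        else:
            hi = mid     # infeasible: back off
    return best
```

Returning None on an all-infeasible search matches how the incomparable baseline windows below behave: probing completes, but no feasible operating point is recorded.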

Run assets

  • Compare root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare
  • Summary: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/summary.json
  • Report: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/report.md
  • Compare spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/specs/qwen27b_chat_0_8k_compare_dash1.json
  • Tuned study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology

Tuned-source result

  • Best trial: trial-0004
  • Best config:
    • tensor-parallel-size=2
    • data-parallel-size=1
  • Best sampling_u: 0.013061523438
  • Best request rate: 0.405 req/s
  • Best request rate per GPU: 0.2025 req/s/gpu
  • Best pass rate: 0.963

Compared with the single-day baseline on chat_w20260311_1000:

  • trial-0001: 0.035 req/s, 0.035 req/s/gpu
  • trial-0004: 0.405 req/s, 0.2025 req/s/gpu
  • Raw throughput gain: 11.57x
  • Per-GPU throughput gain: 5.79x
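
The two gain figures follow from a simple normalization, sketched below. Deriving the GPU count as tensor_parallel * data_parallel is an assumption about how request_rate_per_gpu is computed; the numbers match the report.

```python
# Minimal sketch of the per-GPU normalization behind the gains above.

def per_gpu_rate(req_per_s: float, tp: int, dp: int) -> float:
    # Assumption: GPU count = tensor-parallel size * data-parallel size.
    return req_per_s / (tp * dp)

baseline = per_gpu_rate(0.035, tp=1, dp=1)   # trial-0001, TP=1/DP=1
tuned = per_gpu_rate(0.405, tp=2, dp=1)      # trial-0004, TP=2/DP=1

raw_gain = 0.405 / 0.035        # ~11.57x raw throughput
per_gpu_gain = tuned / baseline # ~5.79x after normalizing by GPU count
```

The per-GPU gain is roughly half the raw gain because the tuned shape uses twice as many GPUs per replica.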

12-trial summary

| Trial | Proposed config delta | Result |
| --- | --- | --- |
| trial-0001 | baseline TP1/DP1 | 0.0350 req/s, 0.0350 req/s/gpu, feasible |
| trial-0002 | DP=2 | 0.1233 req/s, 0.0617 req/s/gpu, feasible |
| trial-0003 | DP=4 | 0.1567 req/s, 0.0392 req/s/gpu, feasible |
| trial-0004 | TP=2, DP=1 | 0.4050 req/s, 0.2025 req/s/gpu, feasible, best |
| trial-0005 | trial-0004 + max-num-batched-tokens=16384 | infeasible |
| trial-0006 | trial-0004 + max-num-seqs=24 | infeasible |
| trial-0007 | trial-0004 + max-num-batched-tokens=12288 | infeasible |
| trial-0008 | trial-0004 + block-size=32 | infeasible |
| trial-0009 | trial-0004 + gpu-memory-utilization=0.93 | infeasible |
| trial-0010 | trial-0004 + max-num-seqs=16, max-num-batched-tokens=6144 | infeasible |
| trial-0011 | trial-0004 + enable-prefix-caching=false | infeasible |
| trial-0012 | trial-0004 + block-size=128 | infeasible |

Aggregate result

  • Comparable wins: tuned 5, baseline 0
  • Incomparable windows: 2
  • Baseline mean request rate (over the 5 comparable windows): 0.046 req/s
  • Tuned mean request rate (over all 7 windows): 0.4724 req/s
  • Baseline mean request rate per GPU: 0.046 req/s/gpu
  • Tuned mean request rate per GPU: 0.2362 req/s/gpu

Per-window result

| Window | Date | Baseline req/s/gpu | Tuned req/s/gpu | Winner |
| --- | --- | --- | --- | --- |
| chat_w20260311_1000 | 2026-03-11 | 0.0350 | 0.2142 | tuned |
| chat_w20260312_1000 | 2026-03-12 | None | 0.2800 | incomparable |
| chat_w20260313_1000 | 2026-03-13 | 0.0317 | 0.2650 | tuned |
| chat_w20260314_1000 | 2026-03-14 | 0.0217 | 0.2408 | tuned |
| chat_w20260315_1000 | 2026-03-15 | 0.1217 | 0.2308 | tuned |
| chat_w20260316_1000 | 2026-03-16 | 0.0200 | 0.2275 | tuned |
| chat_w20260317_1000 | 2026-03-17 | None | 0.1950 | incomparable |
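
The aggregate numbers can be reproduced directly from the per-window rates. The sketch below uses the unrounded per-window values so the means match exactly; the aggregation rule (baseline mean over comparable windows only, tuned mean over all windows) is inferred from the figures, not taken from aituner's code.

```python
# Reproduce the aggregate result from the per-window req/s/gpu rates.
# None marks a window where the baseline found no feasible operating point.

baseline = [0.035, None, 0.03166666666666667, 0.021666666666666667,
            0.12166666666666667, 0.02, None]
tuned = [0.21416666666666667, 0.28, 0.265, 0.24083333333333334,
         0.23083333333333333, 0.2275, 0.195]

# A window is comparable only when the baseline produced a rate at all.
comparable = [(b, t) for b, t in zip(baseline, tuned) if b is not None]
wins = sum(t > b for b, t in comparable)
incomparable = baseline.count(None)

# Baseline mean over comparable windows; tuned mean over all windows.
baseline_mean = sum(b for b, _ in comparable) / len(comparable)
tuned_mean = sum(tuned) / len(tuned)
```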

Key insights

  • The tuning itself was simple and topology-driven. The winning patch only changes TP1 -> TP2; every later runtime-only tweak came back infeasible, so none beat it.
  • This compare gives no support to the idea that the tuned config lacks generalization: across the full 7-day slice, tuned wins every directly comparable window.
  • The two incomparable days are not execution failures. Baseline completed probing but never found a single feasible sampling_u under the target SLO, while tuned still found feasible operating points.
  • The tuned TP=2, DP=1 shape is materially more robust than the TP=1, DP=1 baseline for this 0~8k chat bucket.
  • The weekend windows do not break the result. 2026-03-14 is another clear tuned win, and even on 2026-03-15, where the baseline is stronger than on other days, tuned still wins by about 1.90x on req/s/gpu.
  • The throughput gap remains large even after normalizing by GPU count, so this is not just a raw-card-count artifact.

Recommendation

For qwen27b chat 0~8k, keep using the tuned TP=2, DP=1 serving shape as the default candidate over the TP=1, DP=1 baseline, and treat cross-day robustness as confirmed on the full 7-day window set.