qwen27b-chat-0-8k-7day-compare

Cross-day comparison of the tuned-best config against the baseline for the qwen3.5-27b chat trace, 0~8k input bucket, on internal vLLM (/usr/local/bin/vllm). Windows are compared by request_rate_per_gpu.

Setup

  • Hardware: dash1, 8x H20
  • Model: /home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal
  • Engine: internal vLLM
  • Baseline: empty patch over the study spec baseline, aligned to ~/run_qwen27b.sh (TP=1, DP=1)
  • Tuned best source: trial-0004 from dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology
  • Tuned best config:
    • tensor-parallel-size=2
    • data-parallel-size=1
  • Trace family: chat
  • Input bucket: 0 <= input_length <= 8192
  • Time range scanned: 2026-03-11 to 2026-03-17
  • Available windows in this slot: 7
    • chat_w20260311_1000
    • chat_w20260312_1000
    • chat_w20260313_1000
    • chat_w20260314_1000
    • chat_w20260315_1000
    • chat_w20260316_1000
    • chat_w20260317_1000
  • Window duration: 600s (10:00-10:10)
  • Request mode: chat
  • SLO:
    • pass target: 95%
    • TTFT <= 2000ms for <=4096 input tokens
    • TTFT <= 4000ms for <=32768 input tokens
    • TTFT <= 6000ms for >32768 input tokens
    • TPOT <= 50ms
  • Search:
    • binary search on sampling_u
    • max_probes = 6
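
The SLO check and the probe search listed above can be sketched as follows. This is a minimal illustration, not the aituner implementation: the function names, the (input_tokens, ttft_ms, tpot_ms) record shape, the [0, 1] search range for sampling_u, and the assumption that a higher sampling_u means more load (so feasibility is monotonic) are all hypothetical. It also assumes a request passes only if it meets both its TTFT tier and the TPOT limit.

```python
from typing import Callable, Optional, Sequence, Tuple


def ttft_limit_ms(input_tokens: int) -> int:
    """TTFT ceiling for one request, using the tiers from the SLO list."""
    if input_tokens <= 4096:
        return 2000
    if input_tokens <= 32768:
        return 4000
    return 6000


def probe_passes(requests: Sequence[Tuple[int, float, float]],
                 pass_target: float = 0.95,
                 tpot_limit_ms: float = 50.0) -> bool:
    """Return True if enough requests meet both the TTFT and TPOT limits.

    Each record is (input_tokens, ttft_ms, tpot_ms); this shape is assumed.
    """
    if not requests:
        return False
    ok = sum(1 for tokens, ttft, tpot in requests
             if ttft <= ttft_limit_ms(tokens) and tpot <= tpot_limit_ms)
    return ok / len(requests) >= pass_target


def search_sampling_u(is_feasible: Callable[[float], bool],
                      lo: float = 0.0,
                      hi: float = 1.0,
                      max_probes: int = 6) -> Optional[float]:
    """Binary search for the largest feasible sampling_u within max_probes.

    Returns None when no probe is feasible, which is how a window with no
    feasible operating point would show up in the results.
    """
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if is_feasible(mid):
            best, lo = mid, mid   # feasible: try a higher load
        else:
            hi = mid              # infeasible: back off
    return best
```

With a probe budget of 6, the search resolves sampling_u to within 1/64 of the starting range, so a window whose feasible region lies below the lowest probed point reports no result at all.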

Run assets

  • Compare root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare
  • Summary: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/summary.json
  • Report: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/report.md
  • Compare spec: /home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/specs/qwen27b_chat_0_8k_compare_dash1.json
  • Tuned study root: /home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology

Aggregate result

  • Comparable wins: tuned 5, baseline 0
  • Incomparable windows: 2
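  • Baseline mean request rate: 0.046 req/s (mean over the 5 windows where baseline found a feasible point)
  • Tuned mean request rate: 0.4723809523809524 req/s (mean over all 7 windows)
  • Baseline mean request rate per GPU: 0.046 req/s/gpu (1 GPU at TP=1, DP=1)
  • Tuned mean request rate per GPU: 0.2361904761904762 req/s/gpu (2 GPUs at TP=2, DP=1)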
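
The per-GPU figures are consistent with dividing the raw request rate by the number of GPUs a config occupies. A small sketch of that normalization, assuming the GPU count is TP x DP (the helper name is hypothetical):

```python
def request_rate_per_gpu(request_rate: float, tp: int, dp: int) -> float:
    """Normalize a request rate by GPU count, assumed here to be tp * dp."""
    return request_rate / (tp * dp)


# Aggregate request rates reported above.
baseline_per_gpu = request_rate_per_gpu(0.046, tp=1, dp=1)
tuned_per_gpu = request_rate_per_gpu(0.4723809523809524, tp=2, dp=1)

print(f"baseline: {baseline_per_gpu:.4f} req/s/gpu")
print(f"tuned:    {tuned_per_gpu:.4f} req/s/gpu")
```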

Per-window result

Window               Date        Baseline req/s/gpu    Tuned req/s/gpu       Winner
chat_w20260311_1000  2026-03-11  0.035                 0.21416666666666667   tuned
chat_w20260312_1000  2026-03-12  None                  0.28                  incomparable
chat_w20260313_1000  2026-03-13  0.03166666666666667   0.265                 tuned
chat_w20260314_1000  2026-03-14  0.021666666666666667  0.24083333333333334   tuned
chat_w20260315_1000  2026-03-15  0.12166666666666667   0.23083333333333333   tuned
chat_w20260316_1000  2026-03-16  0.02                  0.2275                tuned
chat_w20260317_1000  2026-03-17  None                  0.195                 incomparable
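
The winner labels and aggregate means can be reproduced from the per-window values. The sketch below assumes a window is "incomparable" whenever either side has no feasible result (None), and that each mean is taken over the windows where that side has a value. The ROWS layout and the winner helper are illustrative, not the compare tool's actual schema.

```python
from collections import Counter
from statistics import mean
from typing import Optional

# (window, baseline req/s/gpu, tuned req/s/gpu), copied from the table above.
ROWS = [
    ("chat_w20260311_1000", 0.035, 0.21416666666666667),
    ("chat_w20260312_1000", None, 0.28),
    ("chat_w20260313_1000", 0.03166666666666667, 0.265),
    ("chat_w20260314_1000", 0.021666666666666667, 0.24083333333333334),
    ("chat_w20260315_1000", 0.12166666666666667, 0.23083333333333333),
    ("chat_w20260316_1000", 0.02, 0.2275),
    ("chat_w20260317_1000", None, 0.195),
]


def winner(baseline: Optional[float], tuned: Optional[float]) -> str:
    """Label one window; a missing value on either side is incomparable."""
    if baseline is None or tuned is None:
        return "incomparable"
    if tuned > baseline:
        return "tuned"
    if baseline > tuned:
        return "baseline"
    return "tie"


counts = Counter(winner(b, t) for _, b, t in ROWS)
baseline_mean = mean(b for _, b, _ in ROWS if b is not None)
tuned_mean = mean(t for _, _, t in ROWS if t is not None)

print(dict(counts))
print(f"baseline mean: {baseline_mean:.4f} req/s/gpu")
print(f"tuned mean:    {tuned_mean:.4f} req/s/gpu")
```

Under these assumptions the baseline mean covers 5 windows and the tuned mean covers 7, which matches the aggregate figures reported earlier.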

Key insights

  • This compare does not support the conclusion that the tuned config lacks generalization. Across the full 7-day slice, tuned wins all 5 directly comparable windows and baseline wins none.
  • The two incomparable days are not execution failures. Baseline completed probing but never found a single feasible sampling_u under the target SLO, while tuned still found feasible operating points.
  • The tuned TP=2, DP=1 shape is materially more robust than the TP=1, DP=1 baseline for this 0~8k chat bucket: tuned found a feasible operating point in 7 of 7 windows, baseline in 5 of 7.
  • The weekend windows do not break the result. 2026-03-14 is another clear tuned win, and even on 2026-03-15, where baseline is relatively stronger than other days, tuned still wins by about 1.90x on req/s/gpu.
  • The throughput gap remains large even after normalizing by GPU count, so this is not just a raw-card-count artifact. On the 5 comparable windows, tuned delivers between about 1.90x and 11.4x the baseline req/s/gpu.
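
The per-window gap can be checked by dividing tuned req/s/gpu by baseline req/s/gpu on the comparable days. This is a small sketch using the per-window values from this compare; the dictionary layout is illustrative only.

```python
# date -> (baseline req/s/gpu, tuned req/s/gpu) for the 5 comparable windows.
COMPARABLE = {
    "2026-03-11": (0.035, 0.21416666666666667),
    "2026-03-13": (0.03166666666666667, 0.265),
    "2026-03-14": (0.021666666666666667, 0.24083333333333334),
    "2026-03-15": (0.12166666666666667, 0.23083333333333333),
    "2026-03-16": (0.02, 0.2275),
}

# Ratio of tuned to baseline throughput per GPU for each day.
speedup = {day: tuned / base for day, (base, tuned) in COMPARABLE.items()}

for day, ratio in speedup.items():
    print(f"{day}: {ratio:.2f}x")

weakest_day = min(speedup, key=speedup.get)
print(f"smallest gap: {weakest_day} at {speedup[weakest_day]:.2f}x")
```

The smallest ratio falls on 2026-03-15, the day where baseline is strongest, which is the roughly 1.90x figure cited above.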

Recommendation

For qwen27b chat 0~8k, keep using the tuned TP=2, DP=1 serving shape as the default candidate over the TP=1, DP=1 baseline, and treat cross-day robustness as confirmed on the full 7-day window set.