# qwen27b-chat-0-8k-7day-compare
Cross-day comparison of the tuned-best config against the baseline for the qwen3.5-27b `chat` trace, `0~8k` input bucket, on internal vLLM (`/usr/local/bin/vllm`). Windows are scored by `request_rate_per_gpu`.
## Setup
- Hardware: `dash1`, `8x H20`
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Engine: internal vLLM
- Baseline: an empty patch over the study-spec baseline, aligned to `~/run_qwen27b.sh` (`TP=1, DP=1`)
- Tuned best source: `trial-0004` from `dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
- Tuned best config:
  - `tensor-parallel-size=2`
  - `data-parallel-size=1`
- Trace family: `chat`
- Input bucket: `0 <= input_length <= 8192`
- Time range scanned: `2026-03-11` to `2026-03-17`
- Available windows in this slot: `7`
  - `chat_w20260311_1000`
  - `chat_w20260312_1000`
  - `chat_w20260313_1000`
  - `chat_w20260314_1000`
  - `chat_w20260315_1000`
  - `chat_w20260316_1000`
  - `chat_w20260317_1000`
- Window duration: `600s` (`10:00-10:10`)
- Request mode: `chat`
- SLO:
  - pass target: `95%` of requests must meet the latency targets below
  - `TTFT <= 2000ms` for `<=4096` input tokens
  - `TTFT <= 4000ms` for `<=32768` input tokens
  - `TTFT <= 6000ms` for `>32768` input tokens
  - `TPOT <= 50ms`
- Search:
  - binary search on `sampling_u` (sketched in code after this list)
  - `max_probes = 6`
- Proposal model for tuned source: `codex / gpt-5.4`
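
The search loop is easiest to see in code. Below is a minimal Python sketch of the tiered TTFT check and the binary search over `sampling_u`; the function names, the `[0, 1]` search bounds, and the `probe` callback are illustrative assumptions, not the aituner API.

```python
def ttft_limit_ms(input_tokens: int) -> float:
    """Tiered TTFT limit from the study SLO."""
    if input_tokens <= 4096:
        return 2000.0
    if input_tokens <= 32768:
        return 4000.0
    return 6000.0


def request_passes(req) -> bool:
    """A request passes if both its TTFT and TPOT meet the SLO."""
    return (req.ttft_ms <= ttft_limit_ms(req.input_tokens)
            and req.tpot_ms <= 50.0)


def window_is_feasible(requests, pass_target: float = 0.95) -> bool:
    """A probe is feasible if at least 95% of replayed requests pass."""
    passed = sum(request_passes(r) for r in requests)
    return passed / len(requests) >= pass_target


def search_best_u(probe, lo: float = 0.0, hi: float = 1.0, max_probes: int = 6):
    """Binary search for the highest feasible sampling_u.

    `probe(u)` is assumed to replay the 600s window at sampling_u=u and
    return per-request records with .ttft_ms, .tpot_ms, .input_tokens.
    """
    best = None
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if window_is_feasible(probe(mid)):
            best, lo = mid, mid  # feasible: push load higher
        else:
            hi = mid             # infeasible: back off
    return best  # None if no probe was feasible
```

A run where no probe is feasible yields no operating point at all, which is exactly how the two `incomparable` baseline days below arise.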
## Run assets
- Compare root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare`
- Summary: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/summary.json`
- Report: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/report.md`
- Compare spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/specs/qwen27b_chat_0_8k_compare_dash1.json`
- Tuned study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
## Tuned-source result
- Best trial: `trial-0004`
- Best config:
  - `tensor-parallel-size=2`
  - `data-parallel-size=1`
- Best `sampling_u`: `0.013061523438`
- Best request rate: `0.405 req/s`
- Best request rate per GPU: `0.2025 req/s/gpu`
- Best pass rate: `0.963`
Compared with the single-day baseline on `chat_w20260311_1000`:
- `trial-0001`: `0.035 req/s`, `0.035 req/s/gpu`
- `trial-0004`: `0.405 req/s`, `0.2025 req/s/gpu`
- Raw throughput gain: `11.57x`
- Per-GPU throughput gain: `5.79x`
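
Both gain figures are plain arithmetic over these numbers; per-GPU rate divides the raw rate by the GPUs a shape occupies (`TP * DP`, which the per-trial numbers below are consistent with). A quick check in Python:

```python
# Gains on chat_w20260311_1000. Per-GPU rate = raw rate / (TP * DP),
# so the tuned TP=2 shape halves its raw advantage when normalized.
baseline_rate, tuned_rate = 0.035, 0.405   # req/s
baseline_gpus, tuned_gpus = 1 * 1, 2 * 1   # TP * DP per shape

raw_gain = tuned_rate / baseline_rate                                       # ~11.57
per_gpu_gain = (tuned_rate / tuned_gpus) / (baseline_rate / baseline_gpus)  # ~5.79
```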
## 12-trial summary
| Trial | Proposed config delta | Result |
| --- | --- | --- |
| `trial-0001` | baseline `TP1/DP1` | `0.0350 req/s`, `0.0350 req/s/gpu`, feasible |
| `trial-0002` | `DP=2` | `0.1233 req/s`, `0.0617 req/s/gpu`, feasible |
| `trial-0003` | `DP=4` | `0.1567 req/s`, `0.0392 req/s/gpu`, feasible |
| `trial-0004` | `TP=2, DP=1` | `0.4050 req/s`, `0.2025 req/s/gpu`, feasible, best |
| `trial-0005` | `trial-0004 + max-num-batched-tokens=16384` | infeasible |
| `trial-0006` | `trial-0004 + max-num-seqs=24` | infeasible |
| `trial-0007` | `trial-0004 + max-num-batched-tokens=12288` | infeasible |
| `trial-0008` | `trial-0004 + block-size=32` | infeasible |
| `trial-0009` | `trial-0004 + gpu-memory-utilization=0.93` | infeasible |
| `trial-0010` | `trial-0004 + max-num-seqs=16, max-num-batched-tokens=6144` | infeasible |
| `trial-0011` | `trial-0004 + enable-prefix-caching=false` | infeasible |
| `trial-0012` | `trial-0004 + block-size=128` | infeasible |
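
For reference, a winning config delta like `trial-0004`'s maps onto an engine launch roughly as below. This is a sketch using the open-source vLLM CLI flag spellings; the internal build at `/usr/local/bin/vllm` is assumed to accept the same names.

```python
import subprocess

MODEL = "/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal"

# trial-0004: the only delta over the TP1/DP1 baseline is the topology.
argv = [
    "/usr/local/bin/vllm", "serve", MODEL,
    "--tensor-parallel-size", "2",
    "--data-parallel-size", "1",
]

# The runtime-only tweaks from trials 0005-0012 would be appended the same
# way, e.g. ["--max-num-batched-tokens", "16384"] for trial-0005; all of
# them were infeasible under this SLO.
subprocess.run(argv, check=True)
```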
## Aggregate result
- Comparable wins: tuned `5`, baseline `0`
- Incomparable windows: `2`
- Baseline mean request rate: `0.046 req/s` (over its `5` comparable windows)
- Tuned mean request rate: `0.4724 req/s` (over all `7` windows)
- Baseline mean request rate per GPU: `0.046 req/s/gpu`
- Tuned mean request rate per GPU: `0.2362 req/s/gpu`
## Per-window result
| Window | Date | Baseline req/s/gpu | Tuned req/s/gpu | Winner |
| --- | --- | ---: | ---: | --- |
| `chat_w20260311_1000` | `2026-03-11` | `0.0350` | `0.2142` | `tuned` |
| `chat_w20260312_1000` | `2026-03-12` | `n/a` | `0.2800` | `incomparable` |
| `chat_w20260313_1000` | `2026-03-13` | `0.0317` | `0.2650` | `tuned` |
| `chat_w20260314_1000` | `2026-03-14` | `0.0217` | `0.2408` | `tuned` |
| `chat_w20260315_1000` | `2026-03-15` | `0.1217` | `0.2308` | `tuned` |
| `chat_w20260316_1000` | `2026-03-16` | `0.0200` | `0.2275` | `tuned` |
| `chat_w20260317_1000` | `2026-03-17` | `n/a` | `0.1950` | `incomparable` |
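
A minimal sketch of how the aggregate numbers above fall out of these rows (values here are the rounded table entries, not the raw `summary.json` floats, whose exact schema is assumed):

```python
rows = {
    # window: (baseline req/s/gpu, tuned req/s/gpu); None marks the n/a
    # rows, where the baseline found no feasible operating point.
    "chat_w20260311_1000": (0.0350, 0.2142),
    "chat_w20260312_1000": (None,   0.2800),
    "chat_w20260313_1000": (0.0317, 0.2650),
    "chat_w20260314_1000": (0.0217, 0.2408),
    "chat_w20260315_1000": (0.1217, 0.2308),
    "chat_w20260316_1000": (0.0200, 0.2275),
    "chat_w20260317_1000": (None,   0.1950),
}

tuned_wins = baseline_wins = incomparable = 0
baseline_rates, tuned_rates = [], []
for baseline, tuned in rows.values():
    tuned_rates.append(tuned)
    if baseline is None:
        incomparable += 1  # counted as incomparable, not as a tuned win
        continue
    baseline_rates.append(baseline)
    if tuned > baseline:
        tuned_wins += 1
    else:
        baseline_wins += 1

print(tuned_wins, baseline_wins, incomparable)    # 5 0 2
print(sum(baseline_rates) / len(baseline_rates))  # ~0.046, 5 comparable windows
print(sum(tuned_rates) / len(tuned_rates))        # ~0.2362, all 7 windows
```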
## Key insights
- The tuning that produced the source config was simple and topology-driven: the winning patch is just `TP1 -> TP2`, and every later runtime-only tweak failed to beat it.
- This compare gives no support to the claim that the tuned config lacks generalization: across the full 7-day slice, tuned wins every directly comparable window.
- The two `incomparable` days are not execution failures. The baseline completed its probes but never found a single feasible `sampling_u` under the target SLO, while the tuned config still found feasible operating points.
- The tuned `TP=2, DP=1` shape is materially more robust than the `TP=1, DP=1` baseline for this `0~8k` chat bucket.
- The weekend windows do not break the result. `2026-03-14` is another clear tuned win, and even on `2026-03-15`, where the baseline is relatively stronger than on other days, tuned still wins by about `1.90x` on `req/s/gpu`.
- The throughput gap remains large even after normalizing by GPU count, so this is not just a raw-card-count artifact.
## Recommendation
For `qwen27b chat 0~8k`, keep using the tuned `TP=2, DP=1` serving shape as the default candidate over the `TP=1, DP=1` baseline, and treat cross-day robustness as confirmed on the full 7-day window set.