diff --git a/docs/qwen27b-chat-0-8k-7day-compare/README.md b/docs/qwen27b-chat-0-8k-7day-compare/README.md
new file mode 100644
index 0000000..7fccd0b
--- /dev/null
+++ b/docs/qwen27b-chat-0-8k-7day-compare/README.md
@@ -0,0 +1,72 @@
# qwen27b-chat-0-8k-7day-compare

Cross-day comparison of the tuned-best configuration against the baseline for the qwen3.5-27b `chat` trace, `0~8k` input bucket, on internal vLLM (`/usr/local/bin/vllm`). Windows are compared by `request_rate_per_gpu`.

## Setup

- Hardware: `dash1`, `8x H20`
- Model: `/home/admin/resource/model/464482ce/qwen3.5-27b/256k-0223-internal`
- Engine: internal vLLM
- Baseline: empty patch over the study-spec baseline, aligned to `~/run_qwen27b.sh` (`TP=1, DP=1`)
- Tuned-best source: `trial-0004` from `dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`
- Tuned-best config:
  - `tensor-parallel-size=2`
  - `data-parallel-size=1`
- Trace family: `chat`
- Input bucket: `0 <= input_length <= 8192`
- Time range scanned: `2026-03-11` to `2026-03-17`
- Available windows in this slot: `5`
  - `chat_w20260311_1000`
  - `chat_w20260312_1000`
  - `chat_w20260313_1000`
  - `chat_w20260316_1000`
  - `chat_w20260317_1000`
- Window duration: `600s` (`10:00-10:10`)
- Request mode: `chat`
- SLO:
  - pass target: `95%`
  - `TTFT <= 2000ms` for `<= 4096` input tokens
  - `TTFT <= 4000ms` for `<= 32768` input tokens
  - `TTFT <= 6000ms` for `> 32768` input tokens
  - `TPOT <= 50ms`
- Search:
  - binary search on `sampling_u`
  - `max_probes = 6`

## Run assets

- Compare root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare`
- Summary: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/summary.json`
- Report: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/dash1-qwen27b-chat-0-8k-7days-compare/report.md`
- Compare spec: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-compare/specs/qwen27b_chat_0_8k_compare_dash1.json`
- Tuned study root: `/home/admin/cpfs/wjh/aituner/aituner/.aituner-tight/dash0-qwen27b-tight-slo-10min-run9-chat-0-8k-codex-topology`

## Aggregate result

- Comparable wins: tuned `3`, baseline `0`
- Incomparable windows: `2`
- Baseline mean request rate: `~0.0289 req/s`
- Tuned mean request rate: `0.477 req/s`
- Baseline mean request rate per GPU: `~0.0289 req/s/gpu` (1 GPU per replica, `TP=1`)
- Tuned mean request rate per GPU: `0.2385 req/s/gpu` (2 GPUs per replica, `TP=2`)

## Per-window result

Rates are rounded to four decimal places; `None` means that side never found a feasible operating point under the SLO.

| Window | Date | Baseline req/s/gpu | Tuned req/s/gpu | Winner |
| --- | --- | ---: | ---: | --- |
| `chat_w20260311_1000` | `2026-03-11` | `0.0350` | `0.2142` | `tuned` |
| `chat_w20260312_1000` | `2026-03-12` | `None` | `0.2800` | `incomparable` |
| `chat_w20260313_1000` | `2026-03-13` | `0.0317` | `0.2650` | `tuned` |
| `chat_w20260316_1000` | `2026-03-16` | `0.0200` | `0.2383` | `tuned` |
| `chat_w20260317_1000` | `2026-03-17` | `None` | `0.1950` | `incomparable` |

## Key insights

- This comparison does not support the conclusion that the tuned config lacks generalization: on the available days, tuned wins every directly comparable window.
- The two `incomparable` days are not execution failures. The baseline completed probing but never found a single feasible `sampling_u` under the target SLO, while tuned still found feasible operating points.
- The tuned `TP=2, DP=1` shape is materially more robust than the `TP=1, DP=1` baseline for this `0~8k` chat bucket.
- The throughput gap is large even after normalizing by GPU count, so it is not merely an artifact of the tuned config using more cards.

## Recommendation

For `qwen27b chat 0~8k`, keep the tuned `TP=2, DP=1` serving shape as the default candidate over the `TP=1, DP=1` baseline, and treat cross-day robustness as confirmed on the currently available windows.
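The aggregate numbers above follow directly from the per-window table. A minimal sketch of that reduction, with per-window values at full precision and hypothetical variable names (the real `summary.json` schema may differ):

```python
# Per-window results at full precision (rounded in the report table).
# The baseline replica uses 1 GPU (TP=1), the tuned replica 2 GPUs (TP=2).
BASELINE_GPUS = 1
TUNED_GPUS = 2

windows = [
    # (window, baseline req/s/gpu or None, tuned req/s/gpu)
    ("chat_w20260311_1000", 0.035, 0.21416666666666667),
    ("chat_w20260312_1000", None, 0.28),
    ("chat_w20260313_1000", 0.03166666666666667, 0.265),
    ("chat_w20260316_1000", 0.02, 0.23833333333333334),
    ("chat_w20260317_1000", None, 0.195),
]

def winner(baseline, tuned):
    # A window is comparable only when both sides found a feasible
    # operating point; otherwise it is reported as "incomparable".
    if baseline is None or tuned is None:
        return "incomparable"
    return "tuned" if tuned > baseline else "baseline"

results = {w: winner(b, t) for w, b, t in windows}

# Means are taken over windows where each side produced a number; the raw
# (non-normalized) rate is recovered by multiplying back by GPU count.
baseline_vals = [b for _, b, _ in windows if b is not None]
tuned_vals = [t for _, _, t in windows]
baseline_mean_per_gpu = sum(baseline_vals) / len(baseline_vals)
tuned_mean_per_gpu = sum(tuned_vals) / len(tuned_vals)
tuned_mean_raw = tuned_mean_per_gpu * TUNED_GPUS
```

Running this reproduces the aggregate section: 3 tuned wins, 2 incomparable windows, and the `0.2385 req/s/gpu` / `0.477 req/s` tuned means.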
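The search procedure under Setup (binary search on `sampling_u` with `max_probes = 6`) can be sketched as follows. This is an assumption-laden illustration, not the harness's actual code: `probe` is a hypothetical callback that replays the window at a given `sampling_u` and reports whether the 95% SLO pass target is met.

```python
def find_max_feasible_u(probe, lo=0.0, hi=1.0, max_probes=6):
    """Binary-search for the highest sampling_u whose probe meets the SLO.

    `probe(u) -> bool` is a hypothetical callback: True iff replaying the
    window at sampling fraction `u` satisfies the TTFT/TPOT pass target.
    """
    best = None  # None = no feasible operating point found; this is how
                 # a window ends up "incomparable" in the report.
    for _ in range(max_probes):
        mid = (lo + hi) / 2
        if probe(mid):
            best = mid  # feasible: remember it and push load higher
            lo = mid
        else:
            hi = mid    # SLO violated: back off
    return best
```

With six probes the search brackets the feasibility boundary to within `1/2**6` of the initial range; a baseline whose `probe` fails at every tried `sampling_u` returns `None`, matching the two incomparable days.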