diff --git a/docs/qwen27b-chat-0-8k-7day-compare/README.md b/docs/qwen27b-chat-0-8k-7day-compare/README.md
index 7fccd0b..c6d9eaf 100644
--- a/docs/qwen27b-chat-0-8k-7day-compare/README.md
+++ b/docs/qwen27b-chat-0-8k-7day-compare/README.md
@@ -15,10 +15,12 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day
 - Trace family: `chat`
 - Input bucket: `0 <= input_length <= 8192`
 - Time range scanned: `2026-03-11` to `2026-03-17`
-- Available windows in this slot: `5`
+- Available windows in this slot: `7`
   - `chat_w20260311_1000`
   - `chat_w20260312_1000`
   - `chat_w20260313_1000`
+  - `chat_w20260314_1000`
+  - `chat_w20260315_1000`
   - `chat_w20260316_1000`
   - `chat_w20260317_1000`
 - Window duration: `600s` (`10:00-10:10`)
@@ -43,12 +45,12 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day
 
 ## Aggregate result
 
-- Comparable wins: tuned `3`, baseline `0`
+- Comparable wins: tuned `5`, baseline `0`
 - Incomparable windows: `2`
-- Baseline mean request rate: `0.02888888888888889 req/s`
-- Tuned mean request rate: `0.47700000000000004 req/s`
-- Baseline mean request rate per GPU: `0.02888888888888889 req/s/gpu`
-- Tuned mean request rate per GPU: `0.23850000000000002 req/s/gpu`
+- Baseline mean request rate: `0.046 req/s`
+- Tuned mean request rate: `0.4723809523809524 req/s`
+- Baseline mean request rate per GPU: `0.046 req/s/gpu`
+- Tuned mean request rate per GPU: `0.2361904761904762 req/s/gpu`
 
 ## Per-window result
 
@@ -57,16 +59,19 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day
 | `chat_w20260311_1000` | `2026-03-11` | `0.035` | `0.21416666666666667` | `tuned` |
 | `chat_w20260312_1000` | `2026-03-12` | `None` | `0.28` | `incomparable` |
 | `chat_w20260313_1000` | `2026-03-13` | `0.03166666666666667` | `0.265` | `tuned` |
-| `chat_w20260316_1000` | `2026-03-16` | `0.02` | `0.23833333333333334` | `tuned` |
+| `chat_w20260314_1000` | `2026-03-14` | `0.021666666666666667` | `0.24083333333333334` | `tuned` |
+| `chat_w20260315_1000` | `2026-03-15` | `0.12166666666666667` | `0.23083333333333333` | `tuned` |
+| `chat_w20260316_1000` | `2026-03-16` | `0.02` | `0.2275` | `tuned` |
 | `chat_w20260317_1000` | `2026-03-17` | `None` | `0.195` | `incomparable` |
 
 ## Key insights
 
-- This compare does not support the conclusion that the tuned config lacks generalization. On the available days, tuned wins every directly comparable window.
+- This compare does not support the conclusion that the tuned config lacks generalization. Across the full 7-day slice, tuned wins every directly comparable window.
 - The two `incomparable` days are not execution failures. Baseline completed probing but never found a single feasible `sampling_u` under the target SLO, while tuned still found feasible operating points.
 - The tuned `TP=2, DP=1` shape is materially more robust than the `TP=1, DP=1` baseline for this `0~8k` chat bucket.
-- The throughput gap is large even after normalizing by GPU count, so this is not just a raw-card-count artifact.
+- The weekend windows do not break the result. `2026-03-14` is another clear tuned win, and even on `2026-03-15`, where baseline is relatively stronger than other days, tuned still wins by about `1.90x` on `req/s/gpu`.
+- The throughput gap remains large even after normalizing by GPU count, so this is not just a raw-card-count artifact.
 
 ## Recommendation
 
-For `qwen27b chat 0~8k`, keep using the tuned `TP=2, DP=1` serving shape as the default candidate over the `TP=1, DP=1` baseline, and treat cross-day robustness as confirmed on the currently available windows.
+For `qwen27b chat 0~8k`, keep using the tuned `TP=2, DP=1` serving shape as the default candidate over the `TP=1, DP=1` baseline, and treat cross-day robustness as confirmed on the full 7-day window set.
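
Review note: the updated aggregate numbers can be cross-checked against the per-window rows. The sketch below is not part of the repository. It assumes the two numeric table columns are baseline and tuned `req/s/gpu` (the column headers are outside the diff context), that the tuned shape uses 2 GPUs (`TP=2, DP=1`), and that the baseline mean is taken over the 5 windows with a value while the tuned mean is taken over all 7. The names `WINDOWS`, `TUNED_GPUS`, and `summarize` are hypothetical.

```python
from statistics import mean

# (baseline, tuned) per-window values copied from the table rows in the diff.
# Assumed unit: req/s/gpu. None marks a window where baseline found no
# feasible operating point.
WINDOWS = {
    "chat_w20260311_1000": (0.035, 0.21416666666666667),
    "chat_w20260312_1000": (None, 0.28),
    "chat_w20260313_1000": (0.03166666666666667, 0.265),
    "chat_w20260314_1000": (0.021666666666666667, 0.24083333333333334),
    "chat_w20260315_1000": (0.12166666666666667, 0.23083333333333333),
    "chat_w20260316_1000": (0.02, 0.2275),
    "chat_w20260317_1000": (None, 0.195),
}
TUNED_GPUS = 2  # assumption: TP=2, DP=1 occupies two GPUs


def summarize(windows, tuned_gpus=TUNED_GPUS):
    """Recompute the aggregate figures from the per-window values."""
    comparable = [(b, t) for b, t in windows.values() if b is not None]
    tuned_all = [t for _, t in windows.values()]
    return {
        "tuned_wins": sum(t > b for b, t in comparable),
        "baseline_wins": sum(b > t for b, t in comparable),
        "incomparable": len(windows) - len(comparable),
        # Baseline mean covers only windows where baseline produced a value.
        "baseline_mean_per_gpu": mean(b for b, _ in comparable),
        # Tuned mean covers all windows, including the incomparable ones.
        "tuned_mean_per_gpu": mean(tuned_all),
        "tuned_mean_total": mean(tuned_all) * tuned_gpus,
    }


if __name__ == "__main__":
    for key, value in summarize(WINDOWS).items():
        print(f"{key}: {value}")
    b, t = WINDOWS["chat_w20260315_1000"]
    print(f"2026-03-15 tuned/baseline per GPU: {t / b:.2f}x")
```

Under these assumptions the recomputed values agree with the new aggregate lines. Note that the two means are taken over different window sets, so the headline ratio between them is not a like-for-like comparison; restricting the tuned mean to the 5 comparable windows would be a small change to `summarize`.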