docs: update qwen27b 7-day compare

2026-04-13 09:16:31 +08:00
parent 4625fba487
commit a1b96f7dd2
1 changed file with 15 additions and 10 deletions
--- a/docs/qwen27b-chat-0-8k-7day-compare/README.md
+++ b/docs/qwen27b-chat-0-8k-7day-compare/README.md
@@ -15,10 +15,12 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day
 - Trace family: `chat`
 - Input bucket: `0 <= input_length <= 8192`
 - Time range scanned: `2026-03-11` to `2026-03-17`
- Available windows in this slot: `5`
+- Available windows in this slot: `7`
  - `chat_w20260311_1000`
  - `chat_w20260312_1000`
  - `chat_w20260313_1000`
+  - `chat_w20260314_1000`
+  - `chat_w20260315_1000`
  - `chat_w20260316_1000`
  - `chat_w20260317_1000`
 - Window duration: `600s` (`10:00-10:10`)
@@ -43,12 +45,12 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day

 ## Aggregate result

- Comparable wins: tuned `3`, baseline `0`
+- Comparable wins: tuned `5`, baseline `0`
 - Incomparable windows: `2`
- Baseline mean request rate: `0.02888888888888889 req/s`
- Tuned mean request rate: `0.47700000000000004 req/s`
- Baseline mean request rate per GPU: `0.02888888888888889 req/s/gpu`
- Tuned mean request rate per GPU: `0.23850000000000002 req/s/gpu`
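+- Baseline mean request rate (over the `5` windows where baseline found a feasible point): `0.046 req/s`
+- Tuned mean request rate (over all `7` windows): `0.4723809523809524 req/s`
+- Baseline mean request rate per GPU (same `5` windows): `0.046 req/s/gpu`
+- Tuned mean request rate per GPU (all `7` windows): `0.2361904761904762 req/s/gpu`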

 ## Per-window result

@@ -57,16 +59,19 @@ qwen3.5-27b `chat` trace, `0~8k` input bucket, tuned-best vs baseline cross-day
 | `chat_w20260311_1000` | `2026-03-11` | `0.035` | `0.21416666666666667` | `tuned` |
 | `chat_w20260312_1000` | `2026-03-12` | `None` | `0.28` | `incomparable` |
 | `chat_w20260313_1000` | `2026-03-13` | `0.03166666666666667` | `0.265` | `tuned` |
-| `chat_w20260316_1000` | `2026-03-16` | `0.02` | `0.23833333333333334` | `tuned` |
+| `chat_w20260314_1000` | `2026-03-14` | `0.021666666666666667` | `0.24083333333333334` | `tuned` |
+| `chat_w20260315_1000` | `2026-03-15` | `0.12166666666666667` | `0.23083333333333333` | `tuned` |
+| `chat_w20260316_1000` | `2026-03-16` | `0.02` | `0.2275` | `tuned` |
 | `chat_w20260317_1000` | `2026-03-17` | `None` | `0.195` | `incomparable` |

 ## Key insights

- This compare does not support the conclusion that the tuned config lacks generalization. On the available days, tuned wins every directly comparable window.
+- This compare does not support the conclusion that the tuned config lacks generalization. Across the full 7-day slice, tuned wins every directly comparable window.
 - The two `incomparable` days are not execution failures. Baseline completed probing but never found a single feasible `sampling_u` under the target SLO, while tuned still found feasible operating points.
 - The tuned `TP=2, DP=1` shape is materially more robust than the `TP=1, DP=1` baseline for this `0~8k` chat bucket.
- The throughput gap is large even after normalizing by GPU count, so this is not just a raw-card-count artifact.
+- The weekend windows do not break the result. `2026-03-14` is another clear tuned win, and even on `2026-03-15`, where baseline posts its strongest result of the week (`0.12166666666666667 req/s/gpu`), tuned still wins by about `1.90x` on `req/s/gpu`.
+- The throughput gap remains large even after normalizing by GPU count, so the advantage is not just an artifact of the tuned shape using more GPUs.

 ## Recommendation

-For `qwen27b chat 0~8k`, keep using the tuned `TP=2, DP=1` serving shape as the default candidate over the `TP=1, DP=1` baseline, and treat cross-day robustness as confirmed on the currently available windows.
+For `qwen27b chat 0~8k`, keep using the tuned `TP=2, DP=1` serving shape as the default candidate over the `TP=1, DP=1` baseline, and treat cross-day robustness as confirmed on the full 7-day window set.
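
Review note: the aggregate figures in this diff can be re-derived from the per-window table. The sketch below is a minimal check, not the project's actual tooling. It assumes that the two numeric table columns are baseline and tuned `req/s/gpu`, that `None` marks a window where baseline found no feasible operating point, that the higher per-GPU rate wins a window, and that GPU count equals `TP * DP` (1 for baseline, 2 for tuned). The names `WINDOWS` and `summarize` are hypothetical.

```python
from statistics import mean

# Per-window (baseline, tuned) values copied from the table in the diff.
# Assumption: both columns are req/s/gpu; None means baseline had no feasible point.
WINDOWS = {
    "chat_w20260311_1000": (0.035, 0.21416666666666667),
    "chat_w20260312_1000": (None, 0.28),
    "chat_w20260313_1000": (0.03166666666666667, 0.265),
    "chat_w20260314_1000": (0.021666666666666667, 0.24083333333333334),
    "chat_w20260315_1000": (0.12166666666666667, 0.23083333333333333),
    "chat_w20260316_1000": (0.02, 0.2275),
    "chat_w20260317_1000": (None, 0.195),
}

# Assumption: GPUs per deployment = TP * DP.
BASELINE_GPUS = 1  # TP=1, DP=1
TUNED_GPUS = 2     # TP=2, DP=1


def summarize(windows):
    """Recompute win counts and mean rates from per-window values."""
    comparable = [(b, t) for b, t in windows.values() if b is not None]
    baseline_rates = [b for b, _ in comparable]      # feasible windows only
    tuned_rates = [t for _, t in windows.values()]   # all windows
    return {
        "tuned_wins": sum(t > b for b, t in comparable),
        "baseline_wins": sum(b > t for b, t in comparable),
        "incomparable": len(windows) - len(comparable),
        "baseline_mean_per_gpu": mean(baseline_rates),
        "tuned_mean_per_gpu": mean(tuned_rates),
        "baseline_mean": mean(baseline_rates) * BASELINE_GPUS,
        "tuned_mean": mean(tuned_rates) * TUNED_GPUS,
    }


summary = summarize(WINDOWS)

# Per-GPU speedup on the day where baseline is strongest.
b15, t15 = WINDOWS["chat_w20260315_1000"]
ratio_0315 = t15 / b15

for key, value in summary.items():
    print(f"{key}: {value}")
print(f"2026-03-15 tuned/baseline per GPU: {ratio_0315:.2f}x")
```

Under these assumptions the baseline mean covers only the five feasible windows while the tuned mean covers all seven, so the two averages are not taken over the same window set. That is worth keeping in mind when comparing them directly.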