Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)

The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:40:38 +08:00
parent 4f45b546a1
commit 77af4ded2a
1 changed file with 19 additions and 0 deletions
--- a/docs/harness-ablation/stop-b-e2e-20260615.md
+++ b/docs/harness-ablation/stop-b-e2e-20260615.md
@@ -30,6 +30,25 @@ config, is the bound).

 Incumbent: **trial-0001 (TP1), 2.90 req/s/GPU — never beaten.**

+> **⚠️ The per-GPU trajectory above is NOT a valid benchmark — it validates only
+> the Stop-B *mechanics*.** Two confounds:
+> 1. **Trace-ceiling saturation.** TP2·DP2 and TP4 reached `best_sampling_u≈0.98`
+>    (still feasible after consuming nearly the whole window), so their *true* peak
+>    per-GPU rate is higher than the 2.09 shown — we ran out of offered load to push
+>    them to their boundary. Only TP1 (u=0.31), TP2 (u=0.48) and DP2 (u=0.48)
+>    found real boundaries. The `sampling_u` axis maxes at the full trace, so any
+>    config that sustains more than the window's offered rate cannot be measured.
+> 2. **Smoke regime.** This run inherited `replay_time_scale=0.1` +
+>    `max_requests_per_probe=512` (README: convergence test, *not* a benchmark) —
+>    compressed arrivals distort A and the 512 cap imposes a ~8.4 req/s ceiling.
+>
+> The below-ceiling TP1 (2.90) > TP2 (2.21) ordering *may* be real for this model
+> (Qwen3-30B-A3B is an MoE with ~3B active params → little compute per token → TP
+> adds all-reduce overhead with little benefit), which differs from the dense
+> Qwen3.5-27B where TP2 wins. But this run cannot establish it. A valid benchmark
+> needs `scale=1.0`, no cap, and enough offered-load headroom that strong configs
+> are not trace-saturated — see the 27B TP A/B follow-up.
+
 ## Phase-5 acceptance

 - **No regression.** The primary metric `request_rate_per_gpu` stayed 2.90 the whole