From 77af4ded2a746514a4c6f7433e4b4f33fdab422b Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Mon, 15 Jun 2026 18:40:38 +0800
Subject: [PATCH] Flag Stop-B e2e per-GPU trajectory as non-benchmark
 (saturation + smoke regime)

The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/harness-ablation/stop-b-e2e-20260615.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/docs/harness-ablation/stop-b-e2e-20260615.md b/docs/harness-ablation/stop-b-e2e-20260615.md
index 91ec0e9..e08b5f0 100644
--- a/docs/harness-ablation/stop-b-e2e-20260615.md
+++ b/docs/harness-ablation/stop-b-e2e-20260615.md
@@ -30,6 +30,25 @@ config, is the bound).
 
 Incumbent: **trial-0001 (TP1), 2.90 req/s/GPU — never beaten.**
 
+> **⚠️ The per-GPU trajectory above is NOT a valid benchmark — it validates only
+> the Stop-B *mechanics*.** Two confounds:
+> 1. **Trace-ceiling saturation.** TP2·DP2 and TP4 reached `best_sampling_u≈0.98`
+>    (still feasible after consuming nearly the whole window), so their *true*
+>    per-GPU peak is higher than the 2.09 shown: we ran out of offered load to push
+>    them to their boundary. Only TP1 (u=0.31), TP2 (u=0.48) and DP2 (u=0.48)
+>    found real boundaries. The `sampling_u` axis maxes at the full trace, so any
+>    config that sustains more than the window's offered rate cannot be measured.
+> 2. **Smoke regime.** This run inherited `replay_time_scale=0.1` and
+>    `max_requests_per_probe=512` (README: convergence test, *not* a benchmark).
+>    Compressed arrivals distort A, and the 512 cap imposes a ~8.4 req/s ceiling.
+>
+> The below-ceiling TP1 (2.90) > TP2 (2.21) ordering *may* be real for this model
+> (Qwen3-30B-A3B is an MoE with ~3B active params → little compute per token → TP
+> adds all-reduce overhead with little benefit), which differs from the dense
+> Qwen3.5-27B where TP2 wins. But this run cannot establish it. A valid benchmark
+> needs `replay_time_scale=1.0`, no request cap, and enough offered-load headroom
+> that strong configs are not trace-saturated; see the 27B TP A/B follow-up.
+
 ## Phase-5 acceptance
 
 - **No regression.** The primary metric `request_rate_per_gpu` stayed 2.90 the whole