From 77af4ded2a746514a4c6f7433e4b4f33fdab422b Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Mon, 15 Jun 2026 18:40:38 +0800
Subject: [PATCH] Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)

The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4
saturated the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is
underestimated, and the run used the smoke regime (scale=0.1 + 512 cap). The
TP1>TP2 ordering may be real for the small-active MoE but this run cannot
establish it; the 27B TP A/B is the valid follow-up.

Co-Authored-By: Claude Opus 4.8
---
 docs/harness-ablation/stop-b-e2e-20260615.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/docs/harness-ablation/stop-b-e2e-20260615.md b/docs/harness-ablation/stop-b-e2e-20260615.md
index 91ec0e9..e08b5f0 100644
--- a/docs/harness-ablation/stop-b-e2e-20260615.md
+++ b/docs/harness-ablation/stop-b-e2e-20260615.md
@@ -30,6 +30,25 @@ config, is the bound).
 
 Incumbent: **trial-0001 (TP1), 2.90 req/s/GPU — never beaten.**
 
+> **⚠️ The per-GPU trajectory above is NOT a valid benchmark — it validates only
+> the Stop-B *mechanics*.** Two confounds:
+> 1. **Trace-ceiling saturation.** TP2·DP2 and TP4 reached `best_sampling_u≈0.98`
+> (still feasible after consuming ~the whole window), so their *true* peak
+> per-GPU is higher than the 2.09 shown — we ran out of offered load to push
+> them to their boundary. Only TP1 (u=0.31), TP2 (u=0.48) and DP2 (u=0.48)
+> found real boundaries. The `sampling_u` axis maxes at the full trace, so any
+> config that sustains more than the window's offered rate cannot be measured.
+> 2. **Smoke regime.** This run inherited `replay_time_scale=0.1` +
+> `max_requests_per_probe=512` (README: convergence test, *not* a benchmark) —
+> compressed arrivals distort A and the 512 cap imposes a ~8.4 req/s ceiling.
+>
+> The below-ceiling TP1 (2.90) > TP2 (2.21) ordering *may* be real for this model
+> (Qwen3-30B-A3B is an MoE with ~3B active params → little compute per token → TP
+> adds all-reduce overhead with little benefit), which differs from the dense
+> Qwen3.5-27B where TP2 wins. But this run cannot establish it. A valid benchmark
+> needs `scale=1.0`, no cap, and enough offered-load headroom that strong configs
+> are not trace-saturated — see the 27B TP A/B follow-up.
+
 ## Phase-5 acceptance
 
 - **No regression.** The primary metric `request_rate_per_gpu` stayed 2.90 the whole