Gahow Wang 77af4ded2a Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)

The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 18:40:38 +08:00

Stop-B end-to-end validation (Phase 5) — 2026-06-15

Branch feat/two-stop. Real agentic loop on dash0: Qwen3-30B-A3B / vLLM 0.11.1 / 8×H20, gpt-5.4 (via codex/prism) proposing configs, Stop-A enabled to accelerate each evaluation, use_harness=True so the Stop-B deterministic validator + LLM-stop veto are active. Config dash0_qwen30b_a3b_stopB_e2e.json, search.high=1.0, max_probes=6, --max-trials 8.

Two stop paths exercised

Run A (search.high=0.125) — the default config already saturates the offered-load search range, so Stop-B fired immediately via the search-high-saturation path: stop_authorized_by: validator, reason "the incumbent's highest measured probe is feasible and within the binary-search resolution of search.high." Correct measurement-ceiling stop (no point proposing configs when the load range, not the config, is the bound).
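The search-high-saturation condition quoted above can be sketched roughly as follows. This is an illustration, not the validator's actual code: the function name, the `search_low` parameter, and the resolution formula (interval width after `max_probes` halvings) are assumptions. Only `search.high` and `max_probes` come from the run config, and the probe value in the example is hypothetical.

```python
def search_high_saturated(highest_feasible_probe: float,
                          search_low: float,
                          search_high: float,
                          max_probes: int) -> bool:
    """Return True when the best feasible probe is within one
    binary-search step of search_high (assumed definition)."""
    # Assumed resolution: width of the search interval after
    # max_probes successive halvings.
    resolution = (search_high - search_low) / (2 ** max_probes)
    return (search_high - highest_feasible_probe) <= resolution


# Hypothetical probe value; search.high=0.125 and max_probes=6 as in Run A.
saturated = search_high_saturated(0.124, 0.0, 0.125, 6)
```

When this check is true, the load range rather than the config is the binding limit, which is why stopping immediately is the correct outcome.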

Run B (search.high=1.0) — forces a real multi-iteration search:

trial	TP	DP	EP	feasible	raw req/s	req/s/GPU	source
0001	1	1	1	✓	2.90	2.900	baseline
0002	2	1	1	✓	4.42	2.208	harness TP-seed
0003	2	1	2	✗ launch-fail	—	—	harness (EP)
0004	1	2	1	✓	4.42	2.208	LLM (after veto)
0005	2	2	1	✓	8.37	2.092	harness
0006	2	2	2	✗ launch-fail	—	—	harness (EP)
0007	4	1	1	✓	8.37	2.092	LLM
(0008)	—	—	—	STOP	—	—	LLM stop, honored after veto budget

Incumbent: trial-0001 (TP1), 2.90 req/s/GPU — never beaten.
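The per-GPU column and the incumbent choice can be reproduced with a short sketch. It assumes that the GPU count of a config is TP × DP, which is inferred from the table rather than stated in the harness, and the `Trial` class and `pick_incumbent` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    trial_id: str
    tp: int
    dp: int
    raw_req_s: Optional[float]  # None marks a launch failure

    @property
    def per_gpu(self) -> Optional[float]:
        # Assumption inferred from the table: GPUs used = TP * DP.
        if self.raw_req_s is None:
            return None
        return self.raw_req_s / (self.tp * self.dp)


def pick_incumbent(trials: List[Trial]) -> Trial:
    """Pick the feasible trial with the highest per-GPU request rate."""
    feasible = [t for t in trials if t.per_gpu is not None]
    return max(feasible, key=lambda t: t.per_gpu)


# Raw req/s values copied from the Run B table.
TRIALS = [
    Trial("0001", 1, 1, 2.90),
    Trial("0002", 2, 1, 4.42),
    Trial("0003", 2, 1, None),   # EP launch failure
    Trial("0004", 1, 2, 4.42),
    Trial("0005", 2, 2, 8.37),
    Trial("0006", 2, 2, None),   # EP launch failure
    Trial("0007", 4, 1, 8.37),
]

incumbent = pick_incumbent(TRIALS)
```

With the rounded raw rates this yields 2.21 and 2.0925 for the two- and four-GPU configs; the table's 2.208 and 2.092 were presumably computed from unrounded measurements. The selection still lands on trial 0001.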

⚠️ The per-GPU trajectory above is NOT a valid benchmark — it validates only the Stop-B mechanics. Two confounds:

Trace-ceiling saturation. TP2·DP2 and TP4 reached best_sampling_u≈0.98 (still feasible after consuming nearly the whole window), so their true per-GPU peak is higher than the 2.09 shown: we ran out of offered load before reaching their boundary. Only TP1 (u=0.31), TP2 (u=0.48) and DP2 (u=0.48) found real boundaries. The sampling_u axis tops out at the full trace, so any config that sustains more than the window's offered rate cannot be measured.
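A simple guard against this confound is to flag any config whose best_sampling_u sits near the top of the axis. The sketch below uses the u values reported above; the 0.95 threshold and the helper name are assumptions chosen for illustration, not harness settings.

```python
SATURATION_U = 0.95  # assumed threshold, not taken from the run config

# best_sampling_u per config, as reported for Run B.
BEST_SAMPLING_U = {
    "TP1": 0.31,
    "TP2": 0.48,
    "DP2": 0.48,
    "TP2-DP2": 0.98,
    "TP4": 0.98,
}


def is_trace_saturated(u: float, threshold: float = SATURATION_U) -> bool:
    """A config at or above the threshold never reached its own boundary,
    so its measured peak is only a lower bound."""
    return u >= threshold


saturated_configs = sorted(
    name for name, u in BEST_SAMPLING_U.items() if is_trace_saturated(u)
)
```

Any config in `saturated_configs` should be reported as "at least X req/s/GPU" rather than compared directly against configs that found a real boundary.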

Smoke regime. This run inherited replay_time_scale=0.1 + max_requests_per_probe=512 (README: convergence test, not a benchmark) — compressed arrivals distort A and the 512 cap imposes a ~8.4 req/s ceiling.

The below-ceiling TP1 (2.90) > TP2 (2.21) ordering may be real for this model (Qwen3-30B-A3B is an MoE with ~3B active params → little compute per token → TP adds all-reduce overhead with little benefit), which differs from the dense Qwen3.5-27B where TP2 wins. But this run cannot establish it. A valid benchmark needs scale=1.0, no cap, and enough offered-load headroom that strong configs are not trace-saturated — see the 27B TP A/B follow-up.
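The regime requirements above could be enforced with a pre-flight check before a run is labelled a benchmark. This is a hypothetical sketch: it assumes the two settings live as top-level keys in a flat config dict, which may not match the real layout of the JSON config, and `benchmark_blockers` is an invented helper name.

```python
from typing import List


def benchmark_blockers(config: dict) -> List[str]:
    """Return the reasons a run config is smoke-regime rather than
    benchmark-grade. An empty list means no blocker was found."""
    blockers = []
    # Assumed flat layout; the real config may nest these keys.
    if config.get("replay_time_scale", 1.0) != 1.0:
        blockers.append("replay_time_scale must be 1.0")
    if config.get("max_requests_per_probe") is not None:
        blockers.append("max_requests_per_probe must be unset")
    return blockers


# The smoke settings this run inherited.
smoke = {"replay_time_scale": 0.1, "max_requests_per_probe": 512}
problems = benchmark_blockers(smoke)
```

The check covers only the two regime settings; offered-load headroom still has to be confirmed from the measured best_sampling_u values after the run.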

Phase-5 acceptance

No regression. The primary metric request_rate_per_gpu stayed at 2.90 for the whole run. Scaling TP/DP raised raw throughput (4.42, 8.37) but lowered per-GPU efficiency (2.21, 2.09); the loop correctly kept the TP1 baseline as incumbent and never adopted a worse-per-GPU config. (Matches the paper: long-prompt, low-reuse chat prefers small TP for per-GPU efficiency.)
Stop-B authority validated live. At trial 4 the LLM tried to stop (should_stop=true); the deterministic validator vetoed it (validator_did_not_authorize_stop, continue_harness_guided_search), forcing one more confirmation (DP2, which also failed to beat baseline). After the budget, the LLM's later, well-justified stop was honored (stop_authorized_by: llm_after_veto_budget). The bounded veto behaved exactly as designed.
Auditable reason chain. Every stop/veto carries a diagnosis grounded in the measured evidence (e.g. "increasing TP 1→2 lowers per-GPU efficiency even though token latency improves … EP is explicitly blocked by launch-failure evidence").
Launch-failure robustness. Two EP configs (trial-0003, 0006) failed to launch under vLLM 0.11.1; the harness recorded them as hard-negative evidence and the LLM explicitly stopped proposing EP.
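The bounded-veto behaviour described in the acceptance points can be summarised as a small decision function. This is a sketch of the observed behaviour only: the function name and signature are hypothetical, the budget of one veto is inferred from this run (the validator vetoed once, then deferred), and whether a validator stop also requires the LLM to request one is not known from the log. The returned labels are the ones that appear in the run output.

```python
from typing import Tuple


def resolve_stop(llm_wants_stop: bool,
                 validator_authorizes: bool,
                 vetoes_used: int,
                 veto_budget: int = 1) -> Tuple[bool, str]:
    """Decide whether the loop stops and who authorized it.

    Assumed precedence: the deterministic validator can always stop;
    an LLM stop is vetoed until the veto budget is spent.
    """
    if validator_authorizes:
        return True, "validator"
    if not llm_wants_stop:
        return False, "continue"  # hypothetical label for the no-stop case
    if vetoes_used < veto_budget:
        return False, "validator_did_not_authorize_stop"
    return True, "llm_after_veto_budget"


first_attempt = resolve_stop(True, False, vetoes_used=0)
second_attempt = resolve_stop(True, False, vetoes_used=1)
```

Under these assumptions `first_attempt` corresponds to the veto at trial 4 and `second_attempt` to the final stop that was honored.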

Notes / limitations

For this workload the baseline (TP1) is already per-GPU optimal, so iterations-to-best = 1; the remaining trials are the loop confirming no config beats baseline before stopping. A workload with an under-tuned default would show an improving trajectory; this run validates the stop/no-regression machinery, not a tuning win.
The final stop came via llm_after_veto_budget (validator vetoed once, then deferred), not a pure deterministic validator stop — because the deterministic conditions (3-within-2%, saturation, validation-exhausted) did not cleanly fire when every trial was a distinct config with a distinct per-GPU rate. The validator acted as the guard (preventing premature stop), which is its designed role.
Seven trials exceed the paper's 3–6 average; the count was inflated by the wider search range, the two EP launch failures, and the veto. Acceptable for a validation run.
LLM token: the non-interactive shell lacks OPENAI_API_KEY; export it from the codex auth.json (~/.codex/auth.json) before the run.
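The 3-within-2% condition named in the notes above can be illustrated with a short check. The exact definition used by the validator is not given here, so this sketch assumes one plausible reading: at least three feasible trials have a per-GPU rate within 2% of the best. The function name and parameters are hypothetical.

```python
from typing import List


def plateau_reached(per_gpu_rates: List[float],
                    window: int = 3,
                    tolerance: float = 0.02) -> bool:
    """True when at least `window` feasible trials lie within
    `tolerance` (relative) of the best per-GPU rate.
    Assumes all rates are positive."""
    if not per_gpu_rates:
        return False
    best = max(per_gpu_rates)
    near_best = [r for r in per_gpu_rates if (best - r) / best <= tolerance]
    return len(near_best) >= window


# Feasible per-GPU rates from Run B, in trial order.
run_b = [2.900, 2.208, 2.208, 2.092, 2.092]
fired = plateau_reached(run_b)
```

Under this reading only the baseline itself is within 2% of the best rate, so the condition does not fire, consistent with the final stop arriving via llm_after_veto_budget instead.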
