aituner/docs/harness-ablation/stop-b-e2e-20260615.md
Gahow Wang 77af4ded2a Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:40:38 +08:00

# Stop-B end-to-end validation (Phase 5) — 2026-06-15
Branch `feat/two-stop`. Real agentic loop on dash0: Qwen3-30B-A3B / vLLM 0.11.1 /
8×H20, `gpt-5.4` (via codex/prism) proposing configs, Stop-A enabled to accelerate
each evaluation, `use_harness=True` so the Stop-B deterministic validator + LLM-stop
veto are active. Config `dash0_qwen30b_a3b_stopB_e2e.json`, `max_probes=6`,
`--max-trials 8`; `search.high` is 0.125 in the default config (Run A) and 1.0 for Run B.
## Two stop paths exercised
**Run A (`search.high=0.125`)** — the default config already saturates the offered-load
search range, so Stop-B fired immediately via the **search-high-saturation** path:
`stop_authorized_by: validator`, reason *"the incumbent's highest measured probe is
feasible and within the binary-search resolution of search.high."* Correct
measurement-ceiling stop (no point proposing configs when the load range, not the
config, is the bound).
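
The saturation condition quoted above can be illustrated with a minimal sketch. It assumes a plain bisection over offered load on `[search_low, search_high]` and a resolution of `(search_high - search_low) / 2**max_probes`. The function names, `search_low=0.0`, and that resolution formula are illustrative assumptions, not the harness's actual code.

```python
def highest_feasible_probe(is_feasible, search_low, search_high, max_probes):
    """Bisect the offered-load range; return the highest feasible probe, or None."""
    low, high = search_low, search_high
    best = None
    for _ in range(max_probes):
        mid = (low + high) / 2
        if is_feasible(mid):
            best = mid   # feasible: push the lower bound up
            low = mid
        else:
            high = mid   # infeasible: pull the upper bound down
    return best


def search_high_saturated(best, search_low, search_high, max_probes):
    """True when the best feasible probe sits within one resolution step of search_high."""
    if best is None:
        return False
    resolution = (search_high - search_low) / 2 ** max_probes  # assumed definition
    return (search_high - best) <= resolution


# Run A regime: every probe is feasible, so the search climbs to the ceiling.
best_a = highest_feasible_probe(lambda u: True, 0.0, 0.125, max_probes=6)
saturated_a = search_high_saturated(best_a, 0.0, 0.125, max_probes=6)

# Hypothetical contrast: a config whose real boundary lies well below search.high.
best_b = highest_feasible_probe(lambda u: u <= 0.05, 0.0, 0.125, max_probes=6)
saturated_b = search_high_saturated(best_b, 0.0, 0.125, max_probes=6)
```

When `saturated_a` is true the load range, not the config, is the bound, which is the situation in which the validator authorizes the stop.
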

**Run B (`search.high=1.0`)** forces a real multi-iteration search:

| trial | TP | DP | EP | feasible | raw req/s | **req/s/GPU** | source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0001 | 1 | 1 | 1 | ✓ | 2.90 | **2.900** | baseline |
| 0002 | 2 | 1 | 1 | ✓ | 4.42 | 2.208 | harness TP-seed |
| 0003 | 2 | 1 | 2 | ✗ launch-fail | — | — | harness (EP) |
| 0004 | 1 | 2 | 1 | ✓ | 4.42 | 2.208 | LLM (after veto) |
| 0005 | 2 | 2 | 1 | ✓ | 8.37 | 2.092 | harness |
| 0006 | 2 | 2 | 2 | ✗ launch-fail | — | — | harness (EP) |
| 0007 | 4 | 1 | 1 | ✓ | 8.37 | 2.092 | LLM |
| (0008) | — | — | — | **STOP** | — | — | LLM stop, honored after veto budget |

Incumbent: **trial-0001 (TP1), 2.90 req/s/GPU, never beaten.**

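
The per-GPU column and the incumbent choice can be reproduced with a short sketch. It assumes the GPU count of a config is TP × DP (EP is taken to reuse the same ranks), which is consistent with the table but not stated in it. The `Trial` class and `incumbent` helper are illustrative names. Recomputing from the rounded raw rates gives 2.21 where the table shows 2.208, so the last digit can differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    trial_id: str
    tp: int
    dp: int
    raw_req_s: Optional[float]  # None when the config failed to launch

    @property
    def gpus(self) -> int:
        # Assumption: GPUs used = TP x DP; EP does not add GPUs.
        return self.tp * self.dp

    @property
    def per_gpu(self) -> Optional[float]:
        if self.raw_req_s is None:
            return None
        return self.raw_req_s / self.gpus


# Raw rates copied from the Run B table.
TRIALS = [
    Trial("0001", 1, 1, 2.90),
    Trial("0002", 2, 1, 4.42),
    Trial("0003", 2, 1, None),  # EP launch failure
    Trial("0004", 1, 2, 4.42),
    Trial("0005", 2, 2, 8.37),
    Trial("0006", 2, 2, None),  # EP launch failure
    Trial("0007", 4, 1, 8.37),
]


def incumbent(trials: List[Trial]) -> Trial:
    """Return the feasible trial with the highest per-GPU request rate."""
    feasible = [t for t in trials if t.per_gpu is not None]
    return max(feasible, key=lambda t: t.per_gpu)


best = incumbent(TRIALS)
for t in TRIALS:
    rate = "launch-fail" if t.per_gpu is None else f"{t.per_gpu:.3f}"
    print(t.trial_id, t.gpus, rate)
print("incumbent:", best.trial_id)
```
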
> **⚠️ The per-GPU trajectory above is NOT a valid benchmark — it validates only
> the Stop-B *mechanics*.** Two confounds:
> 1. **Trace-ceiling saturation.** TP2·DP2 and TP4 reached `best_sampling_u≈0.98`
> (still feasible after consuming nearly the whole window), so their *true* peak
> per-GPU rate is higher than the 2.09 shown: we ran out of offered load before
> reaching their boundary. Only TP1 (u=0.31), TP2 (u=0.48) and DP2 (u=0.48)
> found real boundaries. The `sampling_u` axis maxes out at the full trace, so any
> config that sustains more than the window's offered rate cannot be measured.
> 2. **Smoke regime.** This run inherited `replay_time_scale=0.1` +
> `max_requests_per_probe=512` (README: convergence test, *not* a benchmark) —
> compressed arrivals distort A and the 512 cap imposes a ~8.4 req/s ceiling.
>
> The below-ceiling TP1 (2.90) > TP2 (2.21) ordering *may* be real for this model
> (Qwen3-30B-A3B is an MoE with ~3B active params → little compute per token → TP
> adds all-reduce overhead with little benefit), which differs from the dense
> Qwen3.5-27B where TP2 wins. But this run cannot establish it. A valid benchmark
> needs `scale=1.0`, no cap, and enough offered-load headroom that strong configs
> are not trace-saturated — see the 27B TP A/B follow-up.
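
The first confound can be turned into a simple post-run check: flag any config whose `best_sampling_u` sits near the top of the trace, and treat its per-GPU rate as a lower bound rather than a measurement. The sketch below uses the `u` values reported above. The 0.95 threshold and the label strings are assumptions chosen for illustration, not harness settings.

```python
SATURATION_U = 0.95  # hypothetical cutoff for "consumed nearly the whole trace"

# best_sampling_u per config, as reported in this run.
BEST_SAMPLING_U = {
    "TP1": 0.31,
    "TP2": 0.48,
    "DP2": 0.48,
    "TP2-DP2": 0.98,
    "TP4": 0.98,
}


def classify(u: float, threshold: float = SATURATION_U) -> str:
    """Label a result as a real boundary or a trace-saturated lower bound."""
    return "lower-bound" if u >= threshold else "boundary-found"


report = {name: classify(u) for name, u in BEST_SAMPLING_U.items()}
for name, label in report.items():
    print(f"{name:8s} u={BEST_SAMPLING_U[name]:.2f} {label}")
```

A run whose strongest configs all come back as `lower-bound` should not be read as a per-GPU ranking.
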
## Phase-5 acceptance
- **No regression.** The primary metric `request_rate_per_gpu` stayed at 2.90 for the
whole run. Scaling TP/DP raised *raw* throughput (4.42, 8.37) but lowered per-GPU
efficiency (2.21, 2.09); the loop correctly kept the TP1 baseline as incumbent and
never adopted a worse-per-GPU config. (Matches the paper: long-prompt, low-reuse
chat prefers small TP for per-GPU efficiency.)
- **Stop-B authority validated live.** At trial 4 the LLM tried to stop
(`should_stop=true`); the deterministic validator **vetoed** it
(`validator_did_not_authorize_stop`, `continue_harness_guided_search`), forcing one
more confirmation (DP2, which also failed to beat baseline). After the budget, the
LLM's later, well-justified stop was honored (`stop_authorized_by:
llm_after_veto_budget`). The bounded veto behaved exactly as designed.
- **Auditable reason chain.** Every stop/veto carries a diagnosis grounded in the
measured evidence (e.g. *"increasing TP 1→2 lowers per-GPU efficiency even though
token latency improves … EP is explicitly blocked by launch-failure evidence"*).
- **Launch-failure robustness.** Two EP configs (trial-0003, 0006) failed to launch
under vLLM 0.11.1; the harness recorded them as hard-negative evidence and the LLM
explicitly stopped proposing EP.
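
The bounded-veto behavior described in the acceptance items can be sketched as a small decision function. This is a reading of the log labels above under stated assumptions, not the harness's implementation: the `StopDecision` type, the function signature, the `no_stop_requested` and `deterministic_condition_met` strings, and a veto budget of 1 are all illustrative. The strings `validator`, `llm_after_veto_budget`, `validator_did_not_authorize_stop`, and `continue_harness_guided_search` are taken from the run.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StopDecision:
    stop: bool
    stop_authorized_by: Optional[str]
    reason: str
    next_action: Optional[str] = None


def decide_stop(
    llm_should_stop: bool,
    validator_authorizes: bool,
    vetoes_used: int,
    veto_budget: int = 1,  # assumed: the run shows exactly one veto
) -> Tuple[StopDecision, int]:
    """Return the decision and the updated veto count."""
    if validator_authorizes:
        # A deterministic condition fired; the validator owns the stop.
        return StopDecision(True, "validator", "deterministic_condition_met"), vetoes_used
    if not llm_should_stop:
        return StopDecision(False, None, "no_stop_requested",
                            "continue_harness_guided_search"), vetoes_used
    if vetoes_used < veto_budget:
        # LLM wants to stop but the validator has not authorized it: veto once.
        return StopDecision(False, None, "validator_did_not_authorize_stop",
                            "continue_harness_guided_search"), vetoes_used + 1
    # Budget exhausted: defer to the LLM's stop.
    return StopDecision(True, "llm_after_veto_budget", "veto_budget_exhausted"), vetoes_used


# Replay of this run: first LLM stop is vetoed, the later one is honored.
first, used = decide_stop(llm_should_stop=True, validator_authorizes=False, vetoes_used=0)
second, used = decide_stop(llm_should_stop=True, validator_authorizes=False, vetoes_used=used)
```

Keeping the veto count outside the function makes the budget explicit and easy to audit per run.
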
## Notes / limitations
- For this workload the baseline (TP1) is already per-GPU optimal, so iterations-to-
*best* = 1; the remaining trials are the loop *confirming* no config beats baseline
before stopping. A workload with an under-tuned default would show an improving
trajectory; this run validates the stop/no-regression machinery, not a tuning win.
- The final stop came via `llm_after_veto_budget` (validator vetoed once, then
deferred), not a pure deterministic validator stop — because the deterministic
conditions (3-within-2%, saturation, validation-exhausted) did not cleanly fire
when every trial was a distinct config with a distinct per-GPU rate. The validator
acted as the *guard* (preventing premature stop), which is its designed role.
- 7 trials > the paper's 36 average, inflated by the wider search range, 2 EP
launch-failures, and the veto. Acceptable for a validation run.
- LLM token: the non-interactive shell lacks `OPENAI_API_KEY`; export it from the
codex `auth.json` (`~/.codex/auth.json`) before the run.
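
To illustrate why the "3-within-2%" condition did not fire here, the sketch below uses one plausible reading of it: the last three feasible per-GPU rates must all lie within 2% of the best rate seen so far. The actual definition in the validator is not given in this document, so the function name, window, and tolerance semantics are assumptions.

```python
from typing import List


def plateau_reached(per_gpu_rates: List[float], window: int = 3, tol: float = 0.02) -> bool:
    """True when the last `window` rates are all within `tol` of the best rate."""
    if len(per_gpu_rates) < window:
        return False
    best = max(per_gpu_rates)
    recent = per_gpu_rates[-window:]
    return all((best - r) / best <= tol for r in recent)


# Feasible per-GPU rates from Run B, in trial order (0001, 0002, 0004, 0005, 0007).
RUN_B = [2.900, 2.208, 2.208, 2.092, 2.092]
fired = plateau_reached(RUN_B)

# Hypothetical contrast: three near-identical results would trigger the plateau stop.
fired_flat = plateau_reached([2.90, 2.88, 2.89])
```

Under this reading the recent trials sit more than 20% below the TP1 baseline, so no plateau is detected and the validator can only act as a guard.
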