Document Stop-B end-to-end validation (Phase 5)

Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both Stop-B paths: search-high-saturation (validator-authorized immediate stop) and multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90 req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is correctly never adopted (no regression). The Phase-4 authority model is exercised live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later justified stop is honored after the veto budget. EP launch-failures handled as hard-negative evidence. Auditable reason chains throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:58:44 +08:00
parent 0b6beafeb8
commit 90c3eb51c8
1 changed files with 67 additions and 0 deletions
--- a/docs/harness-ablation/stop-b-e2e-20260615.md
+++ b/docs/harness-ablation/stop-b-e2e-20260615.md
@@ -0,0 +1,67 @@
+# Stop-B end-to-end validation (Phase 5) — 2026-06-15
+
+Branch `feat/two-stop`. Real agentic loop on dash0: Qwen3-30B-A3B / vLLM 0.11.1 /
+8×H20, `gpt-5.4` (via codex/prism) proposing configs, Stop-A enabled to accelerate
+each evaluation, `use_harness=True` so the Stop-B deterministic validator + LLM-stop
+veto are active. Config `dash0_qwen30b_a3b_stopB_e2e.json` with `search.high=1.0`
+(the Run B setting; Run A used `search.high=0.125`), `max_probes=6`, `--max-trials 8`.
+
+## Two stop paths exercised
+
+**Run A (`search.high=0.125`)** — the default config already saturates the offered-load
+search range, so Stop-B fired immediately via the **search-high-saturation** path:
+`stop_authorized_by: validator`, reason *"the incumbent's highest measured probe is
+feasible and within the binary-search resolution of search.high."* This is the correct
+measurement-ceiling stop: proposing further configs is pointless when the load range,
+not the config, is the bound.
+
+**Run B (`search.high=1.0`)** — forces a real multi-iteration search:
+
+| trial | TP | DP | EP | feasible | raw req/s | **req/s/GPU** | source |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| 0001 | 1 | 1 | 1 | ✓ | 2.90 | **2.900** | baseline |
+| 0002 | 2 | 1 | 1 | ✓ | 4.42 | 2.208 | harness TP-seed |
+| 0003 | 2 | 1 | 2 | ✗ launch-fail | — | — | harness (EP) |
+| 0004 | 1 | 2 | 1 | ✓ | 4.42 | 2.208 | LLM (after veto) |
+| 0005 | 2 | 2 | 1 | ✓ | 8.37 | 2.092 | harness |
+| 0006 | 2 | 2 | 2 | ✗ launch-fail | — | — | harness (EP) |
+| 0007 | 4 | 1 | 1 | ✓ | 8.37 | 2.092 | LLM |
+| (0008) | — | — | — | **STOP** | — | — | LLM stop, honored after veto budget |
+
+Incumbent: **trial-0001 (TP1), 2.90 req/s/GPU — never beaten.**
+
+## Phase-5 acceptance
+
+- **No regression.** The primary metric `request_rate_per_gpu` stayed 2.90 the whole
+  run. Scaling TP/DP raised *raw* throughput (4.42, 8.37) but lowered per-GPU
+  efficiency (2.21, 2.09); the loop correctly kept the TP1 baseline as incumbent and
+  never adopted a worse-per-GPU config. (Matches the paper: long-prompt, low-reuse
+  chat prefers small TP for per-GPU efficiency.)
+- **Stop-B authority validated live.** At trial 4 the LLM tried to stop
+  (`should_stop=true`); the deterministic validator **vetoed** it
+  (`validator_did_not_authorize_stop`, `continue_harness_guided_search`), forcing one
+  more confirmation trial (DP2, which also failed to beat the baseline). Once the veto
+  budget was spent, the LLM's later, well-justified stop was honored
+  (`stop_authorized_by: llm_after_veto_budget`). The bounded veto behaved as designed.
+- **Auditable reason chain.** Every stop/veto carries a diagnosis grounded in the
+  measured evidence (e.g. *"increasing TP 1→2 lowers per-GPU efficiency even though
+  token latency improves … EP is explicitly blocked by launch-failure evidence"*).
+- **Launch-failure robustness.** Two EP configs (trial-0003, 0006) failed to launch
+  under vLLM 0.11.1; the harness recorded them as hard-negative evidence and the LLM
+  explicitly stopped proposing EP.
+
+## Notes / limitations
+
+- For this workload the baseline (TP1) is already per-GPU optimal, so iterations-to-
+  *best* = 1; the remaining trials are the loop *confirming* no config beats baseline
+  before stopping. A workload with an under-tuned default would show an improving
+  trajectory; this run validates the stop/no-regression machinery, not a tuning win.
+- The final stop came via `llm_after_veto_budget` (validator vetoed once, then
+  deferred), not a pure deterministic validator stop — because the deterministic
+  conditions (3-within-2%, saturation, validation-exhausted) did not cleanly fire
+  when every trial was a distinct config with a distinct per-GPU rate. The validator
+  acted as the *guard* (preventing premature stop), which is its designed role.
+- The run took 7 trials, above the paper's 3–6 average. The count is inflated by the
+  wider search range, the 2 EP launch failures, and the veto. Acceptable for a
+  validation run.
+- LLM credentials: the non-interactive shell does not have `OPENAI_API_KEY` set;
+  export it from the codex `auth.json` (`~/.codex/auth.json`) before the run.
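
The per-GPU accounting and the bounded veto described in the added document can be sketched in a few lines of Python. This is an illustrative reconstruction, not the harness code: `Trial`, `pick_incumbent`, and `resolve_stop` are hypothetical names; the GPU count is assumed to be TP × DP (which reproduces the table's per-GPU column); the veto budget of 1 is inferred from "vetoed once"; and the `"continue"` reason string is invented for the sketch. The other reason strings are taken from the document.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Trial:
    trial_id: str
    tp: int
    dp: int
    raw_req_s: Optional[float]  # None when the config failed to launch

    @property
    def gpus(self) -> int:
        # Assumption: a config occupies TP x DP GPUs; EP is not counted.
        return self.tp * self.dp

    @property
    def req_s_per_gpu(self) -> Optional[float]:
        if self.raw_req_s is None:
            return None
        return self.raw_req_s / self.gpus


def pick_incumbent(trials: List[Trial]) -> Trial:
    """Return the feasible trial with the highest per-GPU request rate."""
    feasible = [t for t in trials if t.req_s_per_gpu is not None]
    return max(feasible, key=lambda t: t.req_s_per_gpu)


def resolve_stop(
    llm_wants_stop: bool,
    validator_authorizes: bool,
    vetoes_used: int,
    veto_budget: int = 1,  # assumption: the document only says "vetoed once"
) -> Tuple[bool, str]:
    """Decide whether to stop and report who authorized the decision."""
    if validator_authorizes:
        return True, "validator"
    if not llm_wants_stop:
        return False, "continue"  # hypothetical label, not from the document
    if vetoes_used < veto_budget:
        return False, "validator_did_not_authorize_stop"
    return True, "llm_after_veto_budget"


# Run B rows, using the rounded raw rates from the table. Per-GPU values
# computed here can differ in the third decimal from the table, which was
# presumably derived from unrounded measurements.
RUN_B = [
    Trial("0001", tp=1, dp=1, raw_req_s=2.90),
    Trial("0002", tp=2, dp=1, raw_req_s=4.42),
    Trial("0003", tp=2, dp=1, raw_req_s=None),  # EP launch failure
    Trial("0004", tp=1, dp=2, raw_req_s=4.42),
    Trial("0005", tp=2, dp=2, raw_req_s=8.37),
    Trial("0006", tp=2, dp=2, raw_req_s=None),  # EP launch failure
    Trial("0007", tp=4, dp=1, raw_req_s=8.37),
]

if __name__ == "__main__":
    best = pick_incumbent(RUN_B)
    print(best.trial_id, best.req_s_per_gpu)
    print(resolve_stop(llm_wants_stop=True, validator_authorizes=False, vetoes_used=0))
    print(resolve_stop(llm_wants_stop=True, validator_authorizes=False, vetoes_used=1))
```

Under these assumptions the sketch keeps trial 0001 as incumbent, because launch failures are excluded from the comparison rather than scored as zero, and every scaled config has a lower per-GPU rate. The first `resolve_stop` call corresponds to the veto at trial 4, the second to the later stop that was honored.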