Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved

With the guard enabled the binary search recovers best sampling_u=0.078125 (rate 2.30 req/s), identical to the full-replay baseline. The guard fired on exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible); the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:57:53 +08:00
parent 03e556f0ab
commit f31e9ccfd5
1 changed files with 25 additions and 0 deletions
--- a/docs/harness-ablation/stop-a-validation-20260615.md
+++ b/docs/harness-ablation/stop-a-validation-20260615.md
@@ -79,6 +79,31 @@ L-C-A converges. It targets exactly this knee case at low extra cost (it only
 extends replay on probes sitting on the feasibility boundary). Recommend adding it
 as a small Stop-A enhancement before enabling Stop-A in production studies.
 ## 4. SLO-boundary guard (implemented + validated)
 Added `trace.adaptive_stop.boundary_delta` (default 0.02): when a truncated probe's
 measured pass-rate lands within ±δ of the SLO target, re-measure on the full window
 and use that verdict. Re-ran the same config with `adaptive_stop` enabled
 (τ=0.9, τ_c=0.90, δ=0.02):
 | threshold | feasible | pass | selected | replayed | boundary_extended |
 | --- | --- | --- | --- | --- | --- |
 | 0.06250 | True | 1.000 | 1086 | 487 (45%) | — |
 | 0.09375 | False | 0.444 | 1656 | 822 (50%) | — |
 | 0.07812 | True | 0.994 | 1378 | 682 (49%) | — |
 | 0.08594 | **False** | 0.947 | 1523 | **1523 (100%)** | **True** |
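The guard rule described above the table can be sketched roughly as follows. This is a minimal illustration, not the harness code: the function name `guarded_verdict`, the `Verdict` tuple, the `measure_full` callback, the inclusive `<=` comparison, and the "feasible means pass-rate >= target" rule are all assumptions. The SLO target of 0.95 and the truncated pass-rate of 0.96 in the usage lines are hypothetical values, since the document does not state them.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    feasible: bool
    pass_rate: float
    boundary_extended: bool


def guarded_verdict(
    truncated_pass_rate: float,
    slo_target: float,
    boundary_delta: float,
    measure_full: Callable[[], float],
) -> Verdict:
    """Return the probe verdict, re-measuring on the full window near the SLO boundary.

    Assumption: a probe is feasible when its pass-rate is >= slo_target.
    """
    if abs(truncated_pass_rate - slo_target) <= boundary_delta:
        # Too close to the boundary to trust the truncated estimate:
        # replay the full window and use that verdict instead.
        full_pass_rate = measure_full()
        return Verdict(full_pass_rate >= slo_target, full_pass_rate, True)
    # Clearly above or below the target: keep the truncated verdict.
    return Verdict(truncated_pass_rate >= slo_target, truncated_pass_rate, False)


# Hypothetical knee probe: the truncated estimate looks feasible (0.96),
# but the full-window measurement (0.947) falls below the assumed 0.95 target.
knee = guarded_verdict(0.96, 0.95, 0.02, measure_full=lambda: 0.947)

# Hypothetical clear-cut probe: far from the boundary, so no full replay happens.
clear = guarded_verdict(1.000, 0.95, 0.02, measure_full=lambda: 0.0)
```

Under these assumptions only the knee probe pays for a full replay, which matches the `boundary_extended` column above: one probe at 100%, the rest truncated.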
 Result: best feasible `sampling_u=0.078125` (rate 2.30 req/s), **identical to the
 full-replay baseline**. The guard fired on exactly the one knee probe (0.08594) and
 re-measured it on the full window, yielding the correct infeasible verdict; the other
 three probes truncated to ~45–50%. In total, 3514 of 5643 selected requests were
 replayed (≈62%), i.e. **≈38% of replay saved on this trial while recovering the
 correct peak rate** (no one-step overestimate).
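The savings figure can be re-derived from the `selected` and `replayed` columns of the table. The snippet below is only an arithmetic check on those published numbers; the variable names are illustrative.

```python
# (threshold, selected, replayed) copied from the table above.
probes = [
    (0.06250, 1086, 487),
    (0.09375, 1656, 822),
    (0.07812, 1378, 682),
    (0.08594, 1523, 1523),  # boundary-extended probe, replayed in full
]

selected = sum(sel for _, sel, _ in probes)
replayed = sum(rep for _, _, rep in probes)
saved = 1 - replayed / selected

print(f"{replayed}/{selected} replayed, {saved:.1%} saved")
# prints: 3514/5643 replayed, 37.7% saved
```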
 **Conclusion: Stop-A with the boundary guard is correct (verdict matches full
 replay) and still saves replay time. Safe to enable.** Configs:
 `dash0_qwen30b_a3b_stopA_fulldata.json` (OFF baseline) and
 `dash0_qwen30b_a3b_stopA_on.json` (ON).
 ## Repro
 ```