aituner/docs/harness-ablation/stop-a-validation-20260615.md
Gahow Wang f31e9ccfd5 Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved
With the guard enabled the binary search recovers best sampling_u=0.078125
(rate 2.30 req/s), identical to the full-replay baseline. The guard fired on
exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible);
the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial
with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:57:53 +08:00


Stop-A validation (Phase 3) — 2026-06-15

Branch feat/two-stop. Stop-A = truncate each probe's replay once the offered L-C-A of the replayed prefix converges to the full set (pure L-C-A criterion + C-gate). This note records the CPU calibration and the GPU fidelity check.

1. Calibration (CPU, no serving)

scripts/stop_a_calibration.py on the dash0 0321 10:00-10:10 windows:

dim          chat (19239 req, hit≈7%)                                       coder (2451 req, structured reuse)
A            ≥0.95 by frac 0.10                                             fast
L            ≥0.96 from frac 0.05                                           0.05=0.75 (heavy tail) → ≥0.94 by 0.20
C (slowest)  noisy, dips (0.50→0.885, 0.55→0.835), stable ≥0.92 only ~0.85  smooth, stable ≥0.92 by ~0.70

Stop fraction (τ_L=τ_A=0.90, W=3):

τ_c   chat         coder
0.85  0.45 (273s)  0.45 (255s)
0.90  0.70 (423s)  0.55 (318s)
0.92  0.85 (513s)  0.70 (411s)
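A minimal sketch of how a stop fraction could be derived from per-checkpoint similarity series. It assumes W means "all gates must hold for W consecutive checkpoints" and that the stop point is the checkpoint completing that streak; the function name, data layout, and toy series are hypothetical and are not taken from scripts/stop_a_calibration.py.

```python
from typing import Dict, List, Optional


def stop_fraction(
    fracs: List[float],
    sims: Dict[str, List[float]],
    taus: Dict[str, float],
    window: int = 3,
) -> Optional[float]:
    """Return the first checkpoint fraction at which every dimension has met
    its threshold for `window` consecutive checkpoints, or None if never."""
    streak = 0
    for i, frac in enumerate(fracs):
        ok = all(sims[dim][i] >= taus[dim] for dim in taus)
        streak = streak + 1 if ok else 0  # any dip resets the streak
        if streak >= window:
            return frac
    return None


# Toy similarity series (made up for illustration, NOT measured data).
fracs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sims = {
    "L": [0.96, 0.97, 0.97, 0.98, 0.98, 0.99],
    "A": [0.95, 0.96, 0.97, 0.97, 0.98, 0.98],
    "C": [0.80, 0.91, 0.88, 0.92, 0.93, 0.95],  # one dip at frac 0.3
}
taus = {"L": 0.90, "A": 0.90, "C": 0.90}

print(stop_fraction(fracs, sims, taus, window=3))  # 0.6
```

In the toy series the single C dip at frac 0.3 resets the streak, which is the same mechanism that pushes the chat stop fraction later as τ_c tightens.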

Findings:

  • C is the slowest dimension in both workloads — reproduces paper §5.2 / Fig 9.
  • What makes C hard to call converged is signal noise, not reuse magnitude. Low-reuse chat has a sparse/spiky ideal-hit-length series, so its C similarity oscillates and is harder to stabilize than the structured, higher-reuse coder. Consequence: a strict τ_c (0.92) gives chat only ~15% saving. A more robust C feature for the low-reuse regime is future work.

2. GPU fidelity check (Qwen3-30B-A3B, vLLM 0.11.1, H20)

One full-window run (adaptive_stop disabled, replay_time_scale=1.0, window chat_w20260311_1000, 0-8k, out=128), then scripts/stop_a_validate.py recomputes each probe's convergence prefix and compares the truncated verdict to the full verdict, so a single GPU run validates truncation fidelity (no second run).
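The single-run idea can be sketched as follows: given one probe's per-request outcomes from the full replay, recompute the pass rate on the prefix and compare the two verdicts. This is an illustration only. It assumes truncation by request count and a 0.95 pass-rate target; the helper names and the synthetic outcomes are hypothetical and do not reflect the internals of scripts/stop_a_validate.py.

```python
from typing import List, Tuple

SLO_TARGET = 0.95  # pass-rate target used in this note


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of requests that met the SLO."""
    return sum(outcomes) / len(outcomes)


def compare_verdicts(outcomes: List[bool], stop_frac: float) -> Tuple[bool, bool]:
    """Return (full_feasible, prefix_feasible) for one probe.

    `outcomes` holds per-request SLO pass flags in replay order; the prefix
    is the first `stop_frac` share of requests (an assumption of this sketch).
    """
    cut = max(1, int(len(outcomes) * stop_frac))
    full_feasible = pass_rate(outcomes) >= SLO_TARGET
    prefix_feasible = pass_rate(outcomes[:cut]) >= SLO_TARGET
    return full_feasible, prefix_feasible


# Synthetic probe: 100 requests, failures concentrated in the second half.
outcomes = [True] * 100
for i in (10, 60, 70, 80, 90, 95):
    outcomes[i] = False

print(compare_verdicts(outcomes, stop_frac=0.5))  # (False, True)
```

The synthetic probe passes on its prefix (0.98) but fails on the full window (0.94), which is the kind of disagreement the fidelity check is meant to surface.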

Trial result: best feasible sampling_u=0.078125, request_rate 2.30 req/s, pass_rate 0.973.

Per-probe verdict (τ=0.9):

τ_c   verdict matches  mean replay saved
0.85  3/4              54%
0.90  3/4              52%
0.92  3/4              38%

The mismatch is the same probe at every τ_c — the feasibility knee 0.08594:

thresh    full_pass  prefix_pass  full_feas  prefix_feas
0.08594   0.946      0.956-0.961  False      True   <- mismatch
0.07812   0.973      0.987-0.990  True       True
0.06250   0.986      1.000        True       True
0.09375   0.268      0.49-0.54    False      False
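As a quick check, the feasibility columns can be recomputed from the pass rates in the table, assuming a probe is feasible when its pass rate is at least the 0.95 SLO target and reading each prefix_pass range as the spread across the three τ_c settings (that reading is an assumption).

```python
SLO_TARGET = 0.95

# (thresh, full_pass, (prefix_pass_low, prefix_pass_high)) from the table above
probes = [
    (0.08594, 0.946, (0.956, 0.961)),
    (0.07812, 0.973, (0.987, 0.990)),
    (0.06250, 0.986, (1.000, 1.000)),
    (0.09375, 0.268, (0.49, 0.54)),
]


def feasible(pass_rate: float) -> bool:
    return pass_rate >= SLO_TARGET


mismatches = []
for thresh, full_pass, (lo, hi) in probes:
    # Both ends of each prefix range fall on the same side of the target,
    # so the prefix verdict does not depend on which tau_c is used.
    assert feasible(lo) == feasible(hi)
    if feasible(lo) != feasible(full_pass):
        mismatches.append(thresh)

print(f"{len(probes) - len(mismatches)}/{len(probes)} match; mismatches: {mismatches}")
# 3/4 match; mismatches: [0.08594]
```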

3. Interpretation

  • Stop-A works: it saves 38-54% of replay depending on τ_c (~50% at τ_c ≤ 0.90, measured against the full 600 s window) while preserving 3/4 probe verdicts. (The paper's ~70% is vs a 30-min fixed baseline; our baseline is the 600 s window, so the percentages are not directly comparable.)
  • The one failure is a boundary false-positive at the feasibility knee. At 0.08594 the full-window pass rate is 0.946 (just below the 0.95 SLO) but the prefix pass rate is 0.956-0.961 (just above): the second half of the window degraded because of engine-state drift (KV fill / fragmentation / later-arriving harder requests), which the offered L-C-A cannot see. The C-gate did not help because offered-C had converged; the divergence is in the measured pass-rate, not in C.
  • If Stop-A were enabled, the binary search would accept 0.08594, overestimating the peak sustainable rate by one binary step (~10%).
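The one-step overestimate can be reproduced with a small bisection sketch. It assumes a plain midpoint search over an initial bracket of [0, 0.125] with four probes, which happens to generate the four thresholds probed here, and it takes 0.0859375 as the exact value behind the rounded 0.08594. The bracket and the function name are assumptions, not the tuner's actual search code.

```python
from typing import Callable, Optional


def bisect_best(
    feasible: Callable[[float], bool],
    lo: float = 0.0,
    hi: float = 0.125,
    probes: int = 4,
) -> Optional[float]:
    """Midpoint search for the largest feasible sampling_u within `probes` steps."""
    best = None
    for _ in range(probes):
        mid = (lo + hi) / 2
        if feasible(mid):
            best, lo = mid, mid  # feasible: move the lower bound up
        else:
            hi = mid             # infeasible: move the upper bound down
    return best


# Verdicts from the full replay, keyed by exact probe value.
full_verdict = {0.0625: True, 0.09375: False, 0.078125: True, 0.0859375: False}
# Truncated replay flips only the knee probe.
prefix_verdict = dict(full_verdict)
prefix_verdict[0.0859375] = True

best_full = bisect_best(full_verdict.__getitem__)
best_prefix = bisect_best(prefix_verdict.__getitem__)
print(best_full, best_prefix, round(best_prefix / best_full, 3))
# 0.078125 0.0859375 1.1
```

The ratio of 1.1 is the ~10% overestimate quoted above: one wrong verdict at the knee shifts the search result by exactly one step.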

This is the boundary jitter we accepted when choosing the pure-L-C-A criterion. The data now argues for revisiting the previously-declined SLO-boundary guard: keep replaying while the measured pass-rate is within ±δ of the target, even after L-C-A converges. It targets exactly this knee case at low extra cost (it only extends replay on probes sitting on the feasibility boundary). Recommend adding it as a small Stop-A enhancement before enabling Stop-A in production studies.
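A minimal sketch of such a guard, assuming the 0.95 target, δ = 0.02, and an inclusive band; the function names are hypothetical. Applied to the low end of each prefix pass-rate range from Section 2, only the knee probe falls inside the band.

```python
def needs_full_replay(prefix_pass: float, target: float = 0.95, delta: float = 0.02) -> bool:
    """True when the truncated pass rate sits within +/- delta of the SLO target."""
    return abs(prefix_pass - target) <= delta


def guarded_verdict(prefix_pass: float, full_pass: float,
                    target: float = 0.95, delta: float = 0.02) -> bool:
    """Use the full-window pass rate only for probes on the boundary."""
    measured = full_pass if needs_full_replay(prefix_pass, target, delta) else prefix_pass
    return measured >= target


# (thresh, full_pass, prefix_pass) using the low end of each prefix range
probes = [
    (0.08594, 0.946, 0.956),
    (0.07812, 0.973, 0.987),
    (0.06250, 0.986, 1.000),
    (0.09375, 0.268, 0.49),
]

for thresh, full_pass, prefix_pass in probes:
    print(thresh, needs_full_replay(prefix_pass), guarded_verdict(prefix_pass, full_pass))
# 0.08594 True False
# 0.07812 False True
# 0.0625 False True
# 0.09375 False False
```

With the guard, all four verdicts agree with the full-window verdicts, and only one probe pays for a full replay.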

4. SLO-boundary guard (implemented + validated)

Added trace.adaptive_stop.boundary_delta (default 0.02): when a truncated probe's measured pass-rate lands within ±δ of the SLO target, re-measure on the full window and use that verdict. Re-ran the same config with adaptive_stop enabled (τ=0.9, τ_c=0.90, δ=0.02):

threshold  feasible  pass   selected  replayed     boundary_extended
0.06250    True      1.000  1086      487 (45%)
0.09375    False     0.444  1656      822 (50%)
0.07812    True      0.994  1378      682 (49%)
0.08594    False     0.947  1523      1523 (100%)  True

Result: best feasible sampling_u=0.078125 (rate 2.30 req/s), identical to the full-replay baseline. The guard fired on exactly the one knee probe and re-measured it to the correct infeasible verdict; the other three probes truncated to ~45-50%. In total, 3514 of 5643 selected requests were replayed, i.e. ≈38% replay saved on this trial, while recovering the correct peak rate (no one-step overestimate).
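The replay-saved figure follows directly from the selected and replayed counts in the table; the arithmetic, as a short check:

```python
# (threshold, selected, replayed) from the table above
probes = [
    (0.06250, 1086, 487),
    (0.09375, 1656, 822),
    (0.07812, 1378, 682),
    (0.08594, 1523, 1523),  # boundary-extended: replayed in full
]

for thresh, selected, replayed in probes:
    print(f"{thresh:.5f}  replayed {replayed / selected:.0%}")

total_selected = sum(s for _, s, _ in probes)
total_replayed = sum(r for _, _, r in probes)
saved = 1 - total_replayed / total_selected
print(f"{total_replayed}/{total_selected} replayed, {saved:.1%} saved")
# 3514/5643 replayed, 37.7% saved
```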

Conclusion: Stop-A with the boundary guard is correct (verdict matches full replay) and still saves replay time. Safe to enable. Configs: dash0_qwen30b_a3b_stopA_fulldata.json (OFF baseline) and dash0_qwen30b_a3b_stopA_on.json (ON).

Repro

# calibration
PYTHONPATH=src python3 scripts/stop_a_calibration.py \
  --jsonl <DIR>/qwen_chat_blksz_64_032109-032111.jsonl --block-size 64 \
  --window-start 3600 --window-end 4200 --gpu-count 8 --label chat
# GPU run + fidelity
PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec configs/examples/dash0_qwen30b_a3b_stopA_fulldata.json \
  --store-root .aituner/stopA-fulldata --max-trials 1
PYTHONPATH=src python3 scripts/stop_a_validate.py \
  --spec configs/examples/dash0_qwen30b_a3b_stopA_fulldata.json \
  --store-root .aituner/stopA-fulldata --tau 0.9 --tau-c 0.90