Commit Graph

181 Commits

Author SHA1 Message Date
4607711bb5 Add reusable clean pair runner 2026-06-22 00:05:31 +08:00
d23b69219b Add clean dash1 harness ablation runner 2026-06-21 00:51:08 +08:00
488fae7e63 Add tuning progress report for harness evaluation 2026-06-21 00:48:21 +08:00
426151bc9f Harness stop uses full state baseline 2026-06-20 22:48:27 +08:00
a9d237bbfd Show effective flags in ablation trajectory 2026-06-20 10:24:53 +08:00
5257fbc1a2 Improve harness incumbent follow-up search 2026-06-20 05:37:15 +08:00
b3156a382a Harness: gate gpu-mem-util/seqs-raise on 'no untested TP increase' (frontier-closed)
The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled =
cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1)
and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92
at iter 2 instead of climbing to TP4. The candidate path runs before the topology-
frontier check, so a score>=0.35 runtime candidate wins.

Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier
being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU
count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with
TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 13:33:29 +08:00
76cca89a43 Add harness-only dash1 driver to verify the gpu-mem-util fix recovers ~0.87 + stops
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:29:32 +08:00
83162e7a64 Ablation: pin gpt-5.5 @ ai.gahow.org (chat.completions); re-read token per arm
- Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions
  in both ablation specs so both arms uniformly use the current ~/.codex model (the
  prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off).
- run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm
  instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:27:47 +08:00
a3523f5601 Harness: explore gpu-memory-utilization (and raise max-num-seqs) before Stop-B
The harness defined a gpu-memory-utilization family but hard-coded active_now=False
and never generated a candidate for it, and only ever *lowered* max-num-seqs for
decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the
naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873
(+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real
coverage gap, not bad luck.

Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology
has moved off the baseline, so a baseline latency bottleneck still gets a TP change):
- Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe
  ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block
  a premature Stop-B until it is tried; the incumbent guard keeps the step only if
  per-GPU rate improves and the engine launches, and the tested signature terminates
  the climb (so 0.96 OOM/regression backs off to 0.94 automatically).
- Let max-num-seqs *rise* for decode_tpot (not only fall) to exploit decode parallelism.
- Activate the gpu-memory-utilization harness family for decode_tpot/admission.

Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a
gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass.
Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:25:47 +08:00
95c02d7dd9 Fig-18: chained driver for 2 extra naive runs (n=3 nondeterminism)
A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form
guess), which weakens the single-curve story. Run naive 2 more times on the same
real-output substrate to capture the fail/slow/lucky spread -- the actual finding.
Waits for ABLATION12_DONE so it never contends for GPUs with the main pair.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:06:05 +08:00
a1b804f879 Ablation: search.high 0.25 -> 0.15 (skip wildly-infeasible top probes)
Smoke on the real-output substrate measured feasible sampling_u = 0.0156 (TP2)
and 0.0742 (TP4, per-GPU 0.618 = 2.24x TP2). search.high=0.25 made the binary
search waste its two top probes (u=0.125/0.0625, always infeasible, admitting the
most long-output requests) on every trial. 0.15 keeps ~2x headroom over the TP4
boundary (0.0742) and trims ~15-20% of per-trial cost with identical feasibility
results; if a runtime-tuned config ever saturates 0.15 the harness search-high
saturation stop fires (informative, not silent).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 22:11:52 +08:00
0c23285f39 Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline
Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one:
- Use the trace's real output_length (drop completion_tokens_override=128). The
  0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode
  (TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap.
- replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest
  scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays
  >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis
  far below the tau bar used everywhere else. New calibrator:
  scripts/calibrate_time_scale.py.
- Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the
  wall-clock a *feasible* config needs to drain the LCA-admitted set
  (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real
  outputs decode dominates wall-clock, so the old fixed 320s cap would truncate
  the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a
  hard ceiling; the per-probe deadline governs. The lag cap still cuts overload.

12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound):
scripts/run_ablation_pair_d1.sh. 115 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 17:24:00 +08:00
816765071f Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic
Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6
iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime
detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and
crashed the engine. Refined conclusion (matches paper §7.3): a strong model can
sometimes find the right knob unaided, so the harness's value is reliability + speed +
stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4,
no regression. Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:03:26 +08:00
97d2ddabb1 Ablation driver: force direct LLM connection (codex proxy is dash0-local)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 10:05:44 +08:00
8e58b4033d Note dash1 lacks LLM gateway access (naive-completion deferred to dash0)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:55:39 +08:00
b779f6e56a Add dash1 naive-completion driver for the ablation
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:52:54 +08:00
e7d1b3ba01 Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders
Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag
differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse
refinements, premature LLM stop vetoed then honored -> converged, no regression.
Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5
trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible
config found. The bottleneck is compute; the harness steered to the knob family that
adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's
Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run
was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:51:56 +08:00
579dd86698 Ablation: --skip-baseline so loops climb from first proposal
The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time
compression, which tripped baseline_all_infeasible and terminated the loop before any
climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal
(harness steers to TP from the long-prompt profile) — the ablation is about the
proposal path, so an explicit TP1 row is not required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:59:46 +08:00
37342a5749 Add chained harness-vs-naive ablation driver (sequential runs + DONE marker)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:30:41 +08:00
5965f4fbbc Ablation substrate: scale=0.5 + out=128 + 6 probes (TP1 measurable, tractable)
scale=0.2 made TP1 uniformly infeasible (no baseline); bound decode to 128 tokens and
use mild 2x compression so TP1 registers a real, fast baseline, with 6 probes to span
TP1's low and TP4's high feasibility boundaries. Both configs identical except use_harness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:29:30 +08:00
a1cbab0e69 Document harness-vs-naive ablation: setup, substrate calibration, blocker
Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B:
- both configs committed and validated on dash0 (differ only in
  use_harness + study_id), LLM auth + clean engine launch confirmed;
- characterizes exactly what the harness toggles (Harnesses: prompt
  section with ranked bottleneck hypotheses + knob-family steering,
  deterministic guided/stop proposals, Stop-B validator/veto) vs naive;
- substrate calibration from a real harness-ON run: at scale=0.2 the
  180s elapsed cap fires correctly but TP1 is uniformly infeasible even
  at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a
  real baseline; comparability caveat documented.

Honest status: full two-run sweep NOT completed in-session (~5-6
GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM
teardown re-validated). Includes a precise continuation recipe and the
scripts/ablation_trajectory.py helper (validated against a prior store).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:16:27 +08:00
0794efa249 Reduce ablation probe budget to 3 per trial for tractability
First TP1 baseline probe under scale=0.2 ran ~6min (severe overload, 260
preemptions on the lighter half of the trace; TP1 is decode-bound and the
arrival-lag early-stop does not cut a decode-drain-bound probe). Cut
search.max_probes 5->3 to bound binary-search steps per trial. Caps stay
at elapsed=180/lag=30. Both configs still differ only in use_harness +
study_id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:01:19 +08:00
d975e57bb5 Scale ablation early-stop caps to the compressed window (scale=0.2)
At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so
the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn
~15min each (the tractability hazard the brief flagged). Scale the caps
proportionately to the time axis: early_stop_max_elapsed_s 900->180,
early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain)
finish well inside 180s; overloaded probes die in ~3min. Both configs
still differ only in use_harness + study_id. Adds the ablation doc
skeleton and a read-only trajectory-extraction helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:49:57 +08:00
a16016a876 Add harness vs naive ablation configs (27b, scale=0.2 substrate)
Two configs identical except llm.use_harness and study_id, for the
controlled harness-ON vs naive-OFF tuning-trajectory ablation on dense
Qwen3.5-27B. Faster substrate (replay_time_scale=0.2, search.high=0.25,
max_probes=5) keeps the ablation tractable; Stop-A stays enabled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:31:23 +08:00
07f5d92e1d Add consolidated two-stop summary doc
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:16:28 +08:00
f2ff0faebd Document Stop-B end-to-end on dense 27B: the improving climb + no-regression
Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x),
each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00
and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop
veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was
validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding:
at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical
loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven
to an explicit Stop-B firing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 18:07:00 +08:00
4a64196a99 Add 27B Stop-B agentic-loop config (harness-driven, GPUs 2-7)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:08:46 +08:00
b17b213575 Tear down the engine on SIGTERM instead of orphaning it
Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the
vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive
on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM
handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs,
ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the
prior handler afterward. Main-thread-guarded; unit-tested.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:08:06 +08:00
93ce339d61 Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE
Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput:
TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound
(one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move
TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was
best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated
measurements (TP1/TP2). TP4 saturated -> lower bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:54:40 +08:00
b1b74318f6 Pin 27B A/B to GPUs 2-7 (route around leaked GPU0/1 memory)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 23:01:22 +08:00
2fcaf80450 Wrap socket/timeout errors in HTTP client as HttpClientError
stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a
request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped
_run_one_request (which only catches HttpClientError), propagated through the probe,
and crashed the whole trial ("failed: timed out"). A timed-out request is a failed
request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError,
ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s
to 180s on the 27B run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:58:28 +08:00
3541065675 Speed up 27B TP A/B: request_timeout 180s, search.high 0.125
The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes,
and the 900s request timeout made overloaded probes drain hung requests for 15min
each. Cap drain at 180s and bound the search to where the boundaries actually are.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:40:42 +08:00
7678c7d5e8 Switch 27B TP A/B to length-aware TTFT SLO (4s + L_in/8k), widen search
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:35:23 +08:00
ed2bbe0323 Add linear_ms SLO rule (length-aware TTFT budget)
threshold_ms = intercept_ms + per_token_ms * input_tokens. Lets the TTFT target
scale with prefill work, e.g. "4s + L_in/8k" => intercept_ms=4000, per_token_ms=0.125
(4s base, +1s per 8k input tokens). slo + spec + test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:35:23 +08:00
77af4ded2a Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:40:38 +08:00
4f45b546a1 Add 27B TP A/B (deterministic ground-truth: does TP2 beat TP1 per-GPU)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:39:54 +08:00
90c3eb51c8 Document Stop-B end-to-end validation (Phase 5)
Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both
Stop-B paths: search-high-saturation (validator-authorized immediate stop) and
multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90
req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is
correctly never adopted (no regression). The Phase-4 authority model is exercised
live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later
justified stop is honored after the veto budget. EP launch-failures handled as
hard-negative evidence. Auditable reason chains throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:58:44 +08:00
0b6beafeb8 Phase 5: widen search.high to 1.0 to force multi-iteration Stop-B convergence
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:12:32 +08:00
d4aff81691 Add Stop-B end-to-end config (agentic loop, Stop-A enabled)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:05:39 +08:00
f31e9ccfd5 Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved
With the guard enabled the binary search recovers best sampling_u=0.078125
(rate 2.30 req/s), identical to the full-replay baseline. The guard fired on
exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible);
the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial
with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:57:53 +08:00
03e556f0ab Add Stop-A ON config (adaptive_stop enabled + boundary guard) for A/B
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:25:24 +08:00
dfc823f972 Add Stop-A SLO-boundary guard
When a truncated probe's measured pass-rate lands within trace.adaptive_stop.
boundary_delta of the SLO target, re-measure on the full window and use that
verdict. Offered-L-C-A convergence cannot see engine-state drift in the window
tail, so a near-knee truncated verdict is untrustworthy (validated: prefix 0.96
vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee
probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:25:24 +08:00
9f52812753 Document Stop-A validation: calibration + GPU fidelity check
CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and
shows C-convergence difficulty is driven by signal noise (low-reuse chat) not
reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A
convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts
preserved; the one mismatch is a boundary false-positive at the feasibility knee
(prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered
L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:03:16 +08:00
958739027a Fix Stop-A validation config: system vllm, cap max-model-len
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:22:48 +08:00
0f57ee96a9 Drop LLM endpoint from Stop-A full-data config (baseline-only run)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:19:46 +08:00
43125f48cf Address review of two-stop branch
- lca._prefix_profile: anchor the prefix window to the prefix's own first arrival
  so the A-rate is measured over the prefix span (matches the design intent;
  no-op for the 0-based canonical pipeline).
- cli study tune: label file-originated stops as file_proposal rather than
  llm_after_veto_budget (the veto never applies to file proposals).
- spec.AdaptiveStopSpec: reject stable_checks > max_checks (would make
  convergence undetectable and silently disable Stop-A).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:19:08 +08:00
3af1d84ac0 Add Stop-A full-data validation config (real-time replay, no cap)
A single-config baseline run with adaptive_stop disabled and replay_time_scale=1.0,
so per-request probe_details capture the full 600s window for offline analysis of
whether truncating at the L-C-A convergence prefix preserves the feasibility verdict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:15:12 +08:00
08e53fd897 Add Stop-A calibration script (CPU-only convergence curve)
Prints the offered-L-C-A convergence curve and the stop fraction at candidate
tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare
how late C converges across workloads. No serving required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:10:02 +08:00
a8f903498d Add Stop-B authority: deterministic validator overrides LLM stop
Phase 4 of the two-stop work. The harness already pre-empts the LLM with
deterministic stops and guided probes, but an LLM-originated should_stop could
still end the loop while the validator saw remaining opportunity.

Add harness._stop_authority, exposed as context["stop_authority"], whose
`authorized` mirrors the deterministic harness stop decision and whose
`opportunity_remains` flags an open topology frontier or a high-value planned
candidate. In study tune, an LLM-originated should_stop is now honored only when
the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so
the loop cannot converge prematurely on the agent's say-so. File- and
harness-originated stops are unaffected, and the stop reason chain is recorded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:45:14 +08:00