The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled =
cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1)
and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92
at iter 2 instead of climbing to TP4. The candidate path runs before the topology-
frontier check, so a score>=0.35 runtime candidate wins.
Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier
being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU
count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with
TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The harness defined a gpu-memory-utilization family but hard-coded active_now=False
and never generated a candidate for it, and only ever *lowered* max-num-seqs for
decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the
naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873
(+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real
coverage gap, not bad luck.
Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology
has moved off the baseline, so a baseline latency bottleneck still gets a TP change):
- Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe
ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block
a premature Stop-B until it is tried; the incumbent guard keeps the step only if
per-GPU rate improves and the engine launches, and the tested signature terminates
the climb (so 0.96 OOM/regression backs off to 0.94 automatically).
- Let max-num-seqs *rise* for decode_tpot (not only fall) to exploit decode parallelism.
- Activate the gpu-memory-utilization harness family for decode_tpot/admission.
Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a
gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass.
Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one:
- Use the trace's real output_length (drop completion_tokens_override=128). The
0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode
(TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap.
- replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest
scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays
>= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis
far below the tau bar used everywhere else. New calibrator:
scripts/calibrate_time_scale.py.
- Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the
wall-clock a *feasible* config needs to drain the LCA-admitted set
(last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real
outputs decode dominates wall-clock, so the old fixed 320s cap would truncate
the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a
hard ceiling; the per-probe deadline governs. The lag cap still cuts overload.
12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound):
scripts/run_ablation_pair_d1.sh. 115 tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the
vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive
on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM
handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs,
ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the
prior handler afterward. Main-thread-guarded; unit-tested.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a
request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped
_run_one_request (which only catches HttpClientError), propagated through the probe,
and crashed the whole trial ("failed: timed out"). A timed-out request is a failed
request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError,
ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s
to 180s on the 27B run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When a truncated probe's measured pass-rate lands within trace.adaptive_stop.
boundary_delta of the SLO target, re-measure on the full window and use that
verdict. Offered-L-C-A convergence cannot see engine-state drift in the window
tail, so a near-knee truncated verdict is untrustworthy (validated: prefix 0.96
vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee
probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 4 of the two-stop work. The harness already pre-empts the LLM with
deterministic stops and guided probes, but an LLM-originated should_stop could
still end the loop while the validator saw remaining opportunity.
Add harness._stop_authority, exposed as context["stop_authority"], whose
`authorized` mirrors the deterministic harness stop decision and whose
`opportunity_remains` flags an open topology frontier or a high-value planned
candidate. In study tune, an LLM-originated should_stop is now honored only when
the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so
the loop cannot converge prematurely on the agent's say-so. File- and
harness-originated stops are unaffected, and the stop reason chain is recorded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 2 of the two-stop work. The L-C-A vector is a deterministic function of the
trace's offered metadata, so the convergence of prefix-vs-full L-C-A (the paper's
Fig. 9 curve) can be computed up front rather than monitored live, with identical
result and no per-request overhead.
- lca.find_convergence_prefix: earliest arrival-ordered prefix whose L and A family
similarities reach tau and the slow C family reaches the stricter tau_c for
stable_checks consecutive checkpoints. Self-similarity uses the raw log-feature
vector (same window -> identical per-dim spread; RobustScaler is reserved for the
cross-window Stop-C). If C never converges it reports the full set, which is the
C-gate: no early stop on a cold/under-warmed cache. The checkpoint sims double as
Phase 3 calibration data.
- spec.AdaptiveStopSpec (trace.adaptive_stop), disabled by default until the
thresholds are calibrated, so existing studies are unaffected.
- worker._adaptive_replay_set truncates each probe's replay to the convergence
prefix and records a certificate (converged, fraction, family similarity) into
probe history and probe_details. Offered request_rate at the threshold is
unchanged; only wall-clock replay shrinks.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile`
previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging
from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block
authoritative: build_harness_context now accepts an optional workload_profile and
renders the canonical 10-dim vector + per-family stats when present, falling back
to the legacy rendering only when no profile is supplied (direct unit-test calls).
Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the
profile via lca.build_study_workload_profile and pass it through build_prompt. The
heuristic regime classifiers keep reading window_summary; that is the heuristic
layer, distinct from the similarity metric.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implement the paper's 10-dimensional L-C-A workload feature vector
(RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into
`aituner profile window` / `aituner profile similarity`. Covered by tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>