The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled =
cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1)
and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92
at iter 2 instead of climbing to TP4. The candidate path runs before the topology-
frontier check, so a score>=0.35 runtime candidate wins.
Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier
being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU
count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with
TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions
in both ablation specs so both arms uniformly use the current ~/.codex model (the
prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off).
- run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm
instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The harness defined a gpu-memory-utilization family but hard-coded active_now=False
and never generated a candidate for it, and only ever *lowered* max-num-seqs for
decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the
naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873
(+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real
coverage gap, not bad luck.
Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology
has moved off the baseline, so a baseline latency bottleneck still gets a TP change):
- Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe
ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block
a premature Stop-B until it is tried; the incumbent guard keeps the step only if
per-GPU rate improves and the engine launches, and the tested signature terminates
the climb (so 0.96 OOM/regression backs off to 0.94 automatically).
- Let max-num-seqs *rise* for decode_tpot (not only fall) to exploit decode parallelism.
- Activate the gpu-memory-utilization harness family for decode_tpot/admission.
Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a
gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass.
Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form
guess), which weakens the single-curve story. Run naive 2 more times on the same
real-output substrate to capture the fail/slow/lucky spread -- the actual finding.
Waits for ABLATION12_DONE so it never contends for GPUs with the main pair.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Smoke on the real-output substrate measured feasible sampling_u = 0.0156 (TP2)
and 0.0742 (TP4, per-GPU 0.618 = 2.24x TP2). search.high=0.25 made the binary
search waste its two top probes (u=0.125/0.0625, always infeasible, admitting the
most long-output requests) on every trial. 0.15 keeps ~2x headroom over the TP4
boundary (0.0742) and trims ~15-20% of per-trial cost with identical feasibility
results; if a runtime-tuned config ever saturates 0.15 the harness search-high
saturation stop fires (informative, not silent).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one:
- Use the trace's real output_length (drop completion_tokens_override=128). The
0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode
(TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap.
- replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest
scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays
>= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis
far below the tau bar used everywhere else. New calibrator:
scripts/calibrate_time_scale.py.
- Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the
wall-clock a *feasible* config needs to drain the LCA-admitted set
(last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real
outputs decode dominates wall-clock, so the old fixed 320s cap would truncate
the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a
hard ceiling; the per-probe deadline governs. The lag cap still cuts overload.
12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound):
scripts/run_ablation_pair_d1.sh. 115 tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6
iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime
detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and
crashed the engine. Refined conclusion (matches paper §7.3): a strong model can
sometimes find the right knob unaided, so the harness's value is reliability + speed +
stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4,
no regression. Naive: 3x slower at best, no stop, failed at worst.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag
differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse
refinements, premature LLM stop vetoed then honored -> converged, no regression.
Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5
trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible
config found. The bottleneck is compute; the harness steered to the knob family that
adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's
Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run
was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time
compression, which tripped baseline_all_infeasible and terminated the loop before any
climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal
(harness steers to TP from the long-prompt profile) — the ablation is about the
proposal path, so an explicit TP1 row is not required.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scale=0.2 made TP1 uniformly infeasible (no baseline); bound decode to 128 tokens and
use mild 2x compression so TP1 registers a real, fast baseline, with 6 probes to span
TP1's low and TP4's high feasibility boundaries. Both configs identical except use_harness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B:
- both configs committed and validated on dash0 (differ only in
use_harness + study_id), LLM auth + clean engine launch confirmed;
- characterizes exactly what the harness toggles (Harnesses: prompt
section with ranked bottleneck hypotheses + knob-family steering,
deterministic guided/stop proposals, Stop-B validator/veto) vs naive;
- substrate calibration from a real harness-ON run: at scale=0.2 the
180s elapsed cap fires correctly but TP1 is uniformly infeasible even
at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a
real baseline; comparability caveat documented.
Honest status: full two-run sweep NOT completed in-session (~5-6
GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM
teardown re-validated). Includes a precise continuation recipe and the
scripts/ablation_trajectory.py helper (validated against a prior store).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First TP1 baseline probe under scale=0.2 ran ~6min (severe overload, 260
preemptions on the lighter half of the trace; TP1 is decode-bound and the
arrival-lag early-stop does not cut a decode-drain-bound probe). Cut
search.max_probes 5->3 to bound binary-search steps per trial. Caps stay
at elapsed=180/lag=30. Both configs still differ only in use_harness +
study_id.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so
the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn
~15min each (the tractability hazard the brief flagged). Scale the caps
proportionately to the time axis: early_stop_max_elapsed_s 900->180,
early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain)
finish well inside 180s; overloaded probes die in ~3min. Both configs
still differ only in use_harness + study_id. Adds the ablation doc
skeleton and a read-only trajectory-extraction helper.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two configs identical except llm.use_harness and study_id, for the
controlled harness-ON vs naive-OFF tuning-trajectory ablation on dense
Qwen3.5-27B. Faster substrate (replay_time_scale=0.2, search.high=0.25,
max_probes=5) keeps the ablation tractable; Stop-A stays enabled.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x),
each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00
and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop
veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was
validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding:
at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical
loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven
to an explicit Stop-B firing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the
vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive
on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM
handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs,
ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the
prior handler afterward. Main-thread-guarded; unit-tested.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput:
TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound
(one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move
TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was
best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated
measurements (TP1/TP2). TP4 saturated -> lower bound.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a
request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped
_run_one_request (which only catches HttpClientError), propagated through the probe,
and crashed the whole trial ("failed: timed out"). A timed-out request is a failed
request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError,
ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s
to 180s on the 27B run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes,
and the 900s request timeout made overloaded probes drain hung requests for 15min
each. Cap drain at 180s and bound the search to where the boundaries actually are.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>