The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled =
cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1)
and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92
at iter 2 instead of climbing to TP4. The candidate path runs before the topology-
frontier check, so a score>=0.35 runtime candidate wins.
Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier
being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU
count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with
TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions
in both ablation specs so both arms uniformly use the current ~/.codex model (the
prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off).
- run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm
instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The harness defined a gpu-memory-utilization family but hard-coded active_now=False
and never generated a candidate for it, and only ever *lowered* max-num-seqs for
decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the
naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873
(+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real
coverage gap, not bad luck.
Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology
has moved off the baseline, so a baseline latency bottleneck still gets a TP change):
- Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe
ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block
a premature Stop-B until it is tried; the incumbent guard keeps the step only if
per-GPU rate improves and the engine launches, and the tested signature terminates
the climb (so 0.96 OOM/regression backs off to 0.94 automatically).
- Let max-num-seqs *rise* for decode_tpot (not only fall) to exploit decode parallelism.
- Activate the gpu-memory-utilization harness family for decode_tpot/admission.
Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a
gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass.
Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form
guess), which weakens the single-curve story. Run naive 2 more times on the same
real-output substrate to capture the fail/slow/lucky spread -- the actual finding.
Waits for ABLATION12_DONE so it never contends for GPUs with the main pair.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Smoke on the real-output substrate measured feasible sampling_u = 0.0156 (TP2)
and 0.0742 (TP4, per-GPU 0.618 = 2.24x TP2). search.high=0.25 made the binary
search waste its two top probes (u=0.125/0.0625, always infeasible, admitting the
most long-output requests) on every trial. 0.15 keeps ~2x headroom over the TP4
boundary (0.0742) and trims ~15-20% of per-trial cost with identical feasibility
results; if a runtime-tuned config ever saturates 0.15 the harness search-high
saturation stop fires (informative, not silent).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one:
- Use the trace's real output_length (drop completion_tokens_override=128). The
0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode
(TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap.
- replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest
scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays
>= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis
far below the tau bar used everywhere else. New calibrator:
scripts/calibrate_time_scale.py.
- Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the
wall-clock a *feasible* config needs to drain the LCA-admitted set
(last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real
outputs decode dominates wall-clock, so the old fixed 320s cap would truncate
the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a
hard ceiling; the per-probe deadline governs. The lag cap still cuts overload.
12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound):
scripts/run_ablation_pair_d1.sh. 115 tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6
iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime
detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and
crashed the engine. Refined conclusion (matches paper §7.3): a strong model can
sometimes find the right knob unaided, so the harness's value is reliability + speed +
stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4,
no regression. Naive: 3x slower at best, no stop, failed at worst.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag
differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse
refinements, premature LLM stop vetoed then honored -> converged, no regression.
Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5
trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible
config found. The bottleneck is compute; the harness steered to the knob family that
adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's
Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run
was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>