- Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions
in both ablation specs so both arms uniformly use the current ~/.codex model (the
prior runs used the stale ai.prism.uno/gpt-5.4 endpoint that config.toml has since moved away from).
- run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm
instead of capturing it once at launch (the token captured at launch was stale by the
time it was used, which 401'd naive 2/3).
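A minimal Python sketch of the read-at-use pattern described above (the driver itself is a shell script). The `"token"` key and the `read_token` / `run_arms` / `launch` names are hypothetical; the real layout of auth.json is not shown here.

```python
import json
from pathlib import Path


def read_token(auth_path: Path, key: str = "token") -> str:
    """Read the token from disk at call time; nothing is cached.

    The key name is an assumption, not the real auth.json schema.
    """
    with auth_path.open() as fh:
        return json.load(fh)[key]


def run_arms(auth_path: Path, arms, launch):
    """Launch each arm with a token read immediately before it starts.

    `launch(arm, token)` is a hypothetical callback that runs one arm.
    """
    results = []
    for arm in arms:
        token = read_token(auth_path)  # fresh read per arm, not once at launch
        results.append(launch(arm, token))
    return results
```

If the token is rotated while the first arm runs, the second arm picks up the new value because the read happens inside the loop.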
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form
guess), which weakens the single-curve story. Run naive 2 more times on the same
real-output substrate to capture the fail/slow/lucky spread -- the actual finding.
Waits for ABLATION12_DONE so it never contends for GPUs with the main pair.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one:
- Use the trace's real output_length (drop completion_tokens_override=128). The
0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode
(TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap.
- replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest
scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays
>= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis
far below the tau bar used everywhere else. New calibrator:
scripts/calibrate_time_scale.py.
- Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the
wall-clock a *feasible* config needs to drain the LCA-admitted set
(last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real
outputs decode dominates wall-clock, so the old fixed 320s cap would truncate
the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a
hard ceiling; the per-probe deadline governs. The lag cap still cuts overload.
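A rough sketch of the per-probe deadline formula above, not the actual worker._probe_drain_deadline code. Parameter names are assumptions, and clamping with `min` is one reading of "hard ceiling". Only p99_out=2436 comes from the trace stats above; the other example numbers are illustrative.

```python
def probe_drain_deadline(
    last_arrival_s: float,
    worst_ttft_s: float,
    p99_out_tokens: int,
    tpot_budget_s: float,
    margin_s: float,
    hard_ceiling_s: float = 1000.0,
) -> float:
    """Wall-clock a feasible config needs to drain the admitted set.

    deadline = last_arrival + worst-case TTFT + p99_out * TPOT budget + margin,
    clamped to the hard ceiling (early_stop_max_elapsed_s).
    """
    needed = last_arrival_s + worst_ttft_s + p99_out_tokens * tpot_budget_s + margin_s
    return min(needed, hard_ceiling_s)


# Illustrative inputs: the decode term alone is 2436 * 0.05 s, about 122 s.
deadline = probe_drain_deadline(
    last_arrival_s=500.0,
    worst_ttft_s=5.0,
    p99_out_tokens=2436,
    tpot_budget_s=0.05,
    margin_s=30.0,
)
print(f"per-probe deadline: {deadline:.1f}s")
```

With these inputs the deadline lands well above a fixed 320s cap, which is the truncation case the per-probe deadline is meant to avoid.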
12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound):
scripts/run_ablation_pair_d1.sh. 115 tests pass.
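The criterion-A search for replay_time_scale can be sketched as a bisection. This is not the code in scripts/calibrate_time_scale.py: it assumes similarity is non-decreasing in scale, and `toy_similarity` is a hypothetical stand-in for the A-family L-C-A similarity.

```python
from typing import Callable


def calibrate_scale(
    similarity: Callable[[float], float],
    tau: float = 0.90,
    lo: float = 0.0,
    hi: float = 1.0,
    tol: float = 1e-4,
) -> float:
    """Return (within tol) the smallest scale whose similarity stays >= tau.

    Assumes similarity(scale) is non-decreasing on [lo, hi].
    """
    if similarity(hi) < tau:
        raise ValueError("even the largest scale does not reach tau")
    # Invariant: similarity(hi) >= tau throughout the loop.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if similarity(mid) >= tau:
            hi = mid  # mid is feasible; try a smaller scale
        else:
            lo = mid  # mid is below the bar; move up
    return hi


def toy_similarity(scale: float) -> float:
    """Hypothetical monotone stand-in; NOT the real L-C-A similarity."""
    return scale ** 0.5


scale = calibrate_scale(toy_similarity, tau=0.90)
print(f"calibrated scale: {scale:.4f}")
```

Returning `hi` rather than the midpoint keeps the result on the feasible side of the tau bar.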
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time
compression, which tripped baseline_all_infeasible and terminated the loop before any
climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal
(harness steers to TP from the long-prompt profile) — the ablation is about the
proposal path, so an explicit TP1 row is not required.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so
the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn
~15min each (the tractability hazard the brief flagged). Scale the caps
roughly in proportion to the time axis: early_stop_max_elapsed_s 900->180,
early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain)
finish well inside 180s; overloaded probes die in ~3min. Both configs
still differ only in use_harness + study_id. Adds the ablation doc
skeleton and a read-only trajectory-extraction helper.
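The cap arithmetic above, as a short sketch. `scale_seconds` is a hypothetical helper, not part of the repo.

```python
def scale_seconds(value_s: float, replay_time_scale: float) -> int:
    """Scale a wall-clock duration by the replay time scale, rounded to whole seconds."""
    return round(value_s * replay_time_scale)


SCALE = 0.2

arrival_window_s = scale_seconds(600, SCALE)  # compressed arrival window
elapsed_cap_s = scale_seconds(900, SCALE)     # strict-proportional elapsed cap
lag_cap_s = scale_seconds(120, SCALE)         # strict-proportional lag cap

print(arrival_window_s, elapsed_cap_s, lag_cap_s)
```

Strict proportionality gives 24s for the lag cap. The 30s chosen in this commit sits slightly above that, leaving a little headroom.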
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prints the offered-L-C-A convergence curve and the stop fraction at candidate
tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare
how late C converges across workloads. No serving required.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score
broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the
realized KV-cache hit rate as offered load dropped — so the feasibility boundary
was measured on a workload with a different C than production, contradicting the
paper's scale-stationary L-C-A premise.
prepare_trace_windows now resolves each row's session root via the parent_chat_id
chain in a single streaming pass and assigns sampling_u per session, so thresholding
keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose
parent fell outside the span fall back to grouping under the parent id.
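A simplified sketch of the per-session scheme, not the actual prepare_trace_windows code. It assumes rows arrive with parents before children, uses a hypothetical `chat_id` field next to parent_chat_id, and substitutes a hash of the session root for whatever draw the real code uses for sampling_u.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Tuple


def assign_session_roots(rows: Iterable[dict]) -> Iterator[Tuple[str, str]]:
    """Yield (chat_id, session_root) in one streaming pass.

    A row whose parent was never seen (outside the span) is grouped
    under the parent id itself, as are its descendants.
    """
    root_of: Dict[str, str] = {}
    for row in rows:
        cid = row["chat_id"]
        parent = row.get("parent_chat_id")
        if parent is None:
            root = cid  # first turn of a session
        else:
            root = root_of.get(parent, parent)  # fall back to the parent id
        root_of[cid] = root
        yield cid, root


def sampling_u(session_root: str, seed: int = 0) -> float:
    """Deterministic uniform score in [0, 1) shared by a whole session."""
    digest = hashlib.sha256(f"{seed}:{session_root}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


rows = [
    {"chat_id": "a", "parent_chat_id": None},
    {"chat_id": "b", "parent_chat_id": "a"},
    {"chat_id": "c", "parent_chat_id": "b"},
    {"chat_id": "d", "parent_chat_id": "z"},  # parent outside the span
]
for cid, root in assign_session_roots(rows):
    print(cid, root, round(sampling_u(root), 4))
```

Because the score depends only on the session root, thresholding on it keeps or drops every turn of a session together.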
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile`
previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging
from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block
authoritative: build_harness_context now accepts an optional workload_profile and
renders the canonical 10-dim vector + per-family stats when present, falling back
to the legacy rendering only when no profile is supplied (direct unit-test calls).
Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the
profile via lca.build_study_workload_profile and pass it through build_prompt. The
heuristic regime classifiers keep reading window_summary; that is the heuristic
layer, distinct from the similarity metric.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>