stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a
request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped
_run_one_request (which only catches HttpClientError), propagated through the probe,
and crashed the whole trial ("failed: timed out"). A timed-out request is a failed
request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError,
ConnectionError) after HTTPError and wrap it as HttpClientError so _run_one_request
handles it. Exposed by lowering request_timeout_s to 180s on the 27B run.
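A minimal sketch of the catch order, assuming a urllib-based client. HttpClientError
is the name used above; `fetch` and its `opener` parameter are hypothetical and exist
only so the example can run without a network.

```python
import urllib.error
import urllib.request


class HttpClientError(Exception):
    """Stand-in for the project's client error type (assumed shape)."""


def fetch(url, timeout_s, opener=urllib.request.urlopen):
    """Read a response body, converting transport failures into HttpClientError."""
    try:
        with opener(url, timeout=timeout_s) as resp:
            # A timeout can fire here, mid-read, not only while connecting.
            return resp.read()
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError (and therefore OSError), so it must be
        # caught first if the status code is to be preserved.
        raise HttpClientError(f"HTTP {exc.code}") from exc
    except OSError as exc:
        # TimeoutError, URLError and ConnectionError are all OSError subclasses.
        raise HttpClientError(f"transport failure: {exc}") from exc
```

With this ordering a caller that only handles HttpClientError sees a timed-out
request as an ordinary failed request.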
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes,
and the 900s request timeout made overloaded probes drain hung requests for 15 min
each. Cap drain at 180s and bound the search to where the boundaries actually are.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both
Stop-B paths: search-high-saturation (validator-authorized immediate stop) and
multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90
req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is
correctly never adopted (no regression). The Phase-4 authority model is exercised
live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later
justified stop is honored after the veto budget. EP launch failures are handled as
hard-negative evidence. Reason chains are auditable throughout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With the guard enabled the binary search recovers best sampling_u=0.078125
(rate 2.30 req/s), identical to the full-replay baseline. The guard fired on
exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible);
the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial
with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When a truncated probe's measured pass-rate lands within
trace.adaptive_stop.boundary_delta of the SLO target, re-measure on the full window and use that
verdict. Offered-L-C-A convergence cannot see engine-state drift in the window
tail, so a near-knee truncated verdict is untrustworthy (validated: prefix 0.96
vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee
probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02.
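The guard rule can be sketched as below. This is an illustration, not the worker
code: `guarded_verdict` and `measure_full` are hypothetical names, the inclusive
comparison is an assumption, and the 0.95 SLO target in the usage is assumed from
the 0.96 vs 0.946 example.

```python
def guarded_verdict(prefix_pass_rate, slo_target, measure_full, boundary_delta=0.02):
    """Return (feasible, used_full_window) for one probe.

    measure_full is a zero-argument callable that replays the full window and
    returns its pass rate; it is only invoked for near-knee probes.
    """
    if abs(prefix_pass_rate - slo_target) <= boundary_delta:
        # Too close to the SLO target to trust the truncated verdict.
        full_pass_rate = measure_full()
        return full_pass_rate >= slo_target, True
    # Clear of the boundary: keep the truncated verdict and its replay saving.
    return prefix_pass_rate >= slo_target, False


# Knee probe: prefix says feasible, full window says infeasible.
print(guarded_verdict(0.96, 0.95, lambda: 0.946))
# Non-boundary probe: the full replay is never run.
print(guarded_verdict(0.99, 0.95, lambda: 0.0))
```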
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and
shows C-convergence difficulty is driven by signal noise (low-reuse chat), not
reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A
convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts
preserved; the one mismatch is a boundary false-positive at the feasibility knee
(prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered
L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- lca._prefix_profile: anchor the prefix window to the prefix's own first arrival
  so the A-rate is measured over the prefix span (matches the design intent;
  no-op for the 0-based canonical pipeline).
- cli study tune: label file-originated stops as file_proposal rather than
  llm_after_veto_budget (the veto never applies to file proposals).
- spec.AdaptiveStopSpec: reject stable_checks > max_checks (would make
  convergence undetectable and silently disable Stop-A).
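The stable_checks validation might look like the sketch below. Only the
stable_checks and max_checks names come from this change; the other field, the
defaults, and the dataclass layout are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptiveStopSpec:
    enabled: bool = False   # hypothetical field
    stable_checks: int = 2  # illustrative default
    max_checks: int = 20    # illustrative default

    def __post_init__(self):
        # A run of stable_checks consecutive passes cannot fit into fewer than
        # stable_checks checkpoints, so convergence could never be detected.
        if self.stable_checks > self.max_checks:
            raise ValueError(
                f"stable_checks ({self.stable_checks}) must not exceed "
                f"max_checks ({self.max_checks})"
            )
```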
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A single-config baseline run with adaptive_stop disabled and replay_time_scale=1.0,
so per-request probe_details capture the full 600s window for offline analysis of
whether truncating at the L-C-A convergence prefix preserves the feasibility verdict.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prints the offered-L-C-A convergence curve and the stop fraction at candidate
tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare
how late C converges across workloads. No serving required.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 4 of the two-stop work. The harness already pre-empts the LLM with
deterministic stops and guided probes, but an LLM-originated should_stop could
still end the loop while the validator saw remaining opportunity.

Add harness._stop_authority, exposed as context["stop_authority"], whose
`authorized` mirrors the deterministic harness stop decision and whose
`opportunity_remains` flags an open topology frontier or a high-value planned
candidate. In study tune, an LLM-originated should_stop is now honored only when
the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so
the loop cannot converge prematurely on the agent's say-so. File- and
harness-originated stops are unaffected, and the stop reason chain is recorded.
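A rough sketch of the veto rule for an LLM-originated stop. The function name, the
budget default, and the "validator_authorized" label are hypothetical; the other
two reason strings are the ones recorded in the stop reason chain.

```python
def resolve_llm_stop(authorized, vetoes_used, veto_budget=1):
    """Decide whether an LLM-originated should_stop ends the loop.

    Returns (stop, reason, vetoes_used_after).
    """
    if authorized:
        # The validator agrees: honor the stop immediately.
        return True, "validator_authorized", vetoes_used
    if vetoes_used < veto_budget:
        # Unauthorized and budget remains: veto and keep tuning.
        return False, "validator_did_not_authorize_stop", vetoes_used + 1
    # Budget exhausted: the veto is bounded, so the stop is finally honored.
    return True, "llm_after_veto_budget", vetoes_used
```

The bounded budget keeps the loop from stalling if the agent and the validator
never agree.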
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 2 of the two-stop work. The L-C-A vector is a deterministic function of the
trace's offered metadata, so the convergence of prefix-vs-full L-C-A (the paper's
Fig. 9 curve) can be computed up front rather than monitored live, with identical
result and no per-request overhead.
- lca.find_convergence_prefix: earliest arrival-ordered prefix whose L and A family
  similarities reach tau and the slow C family reaches the stricter tau_c for
  stable_checks consecutive checkpoints. Self-similarity uses the raw log-feature
  vector (same window -> identical per-dim spread; RobustScaler is reserved for the
  cross-window Stop-C). If C never converges it reports the full set, which is the
  C-gate: no early stop on a cold/under-warmed cache. The checkpoint sims double as
  Phase 3 calibration data.
- spec.AdaptiveStopSpec (trace.adaptive_stop), disabled by default until the
  thresholds are calibrated, so existing studies are unaffected.
- worker._adaptive_replay_set truncates each probe's replay to the convergence
  prefix and records a certificate (converged, fraction, family similarity) into
  probe history and probe_details. Offered request_rate at the threshold is
  unchanged; only wall-clock replay shrinks.
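The convergence scan can be illustrated as follows, assuming the per-checkpoint
family similarities are already computed. The `Checkpoint` type, the threshold
defaults, and the choice to return the checkpoint that completes the stable run
are assumptions of this sketch, not the lca.py implementation.

```python
from typing import NamedTuple, Sequence, Tuple


class Checkpoint(NamedTuple):
    n_requests: int  # size of the arrival-ordered prefix
    sim_l: float     # prefix-vs-full similarity, L family
    sim_c: float     # prefix-vs-full similarity, C family
    sim_a: float     # prefix-vs-full similarity, A family


def convergence_prefix(
    checkpoints: Sequence[Checkpoint],
    total: int,
    tau: float = 0.85,
    tau_c: float = 0.90,
    stable_checks: int = 2,
) -> Tuple[int, bool]:
    """Return (prefix_size, converged); the full set if C never converges."""
    run = 0
    for cp in checkpoints:
        ok = cp.sim_l >= tau and cp.sim_a >= tau and cp.sim_c >= tau_c
        run = run + 1 if ok else 0  # any miss resets the consecutive run
        if run >= stable_checks:
            return cp.n_requests, True
    # C-gate: no early stop when the thresholds are never held.
    return total, False
```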
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score
broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the
realized KV-cache hit rate as offered load dropped — so the feasibility boundary
was measured on a workload with a different C than production, contradicting the
paper's scale-stationary L-C-A premise.

prepare_trace_windows now resolves each row's session root via the parent_chat_id
chain in a single streaming pass and assigns sampling_u per session, so thresholding
keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose
parent fell outside the span fall back to grouping under the parent id.
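A simplified sketch of the per-session assignment, assuming rows arrive in order so
a parent is seen before its children. The function name, the `chat_id` field, and
the seeded RNG are assumptions; parent_chat_id and sampling_u are the names used
above.

```python
import random


def assign_session_sampling_u(rows, seed=0):
    """Attach session_root and a per-session sampling_u to each row in one pass."""
    rng = random.Random(seed)
    root_of = {}    # chat_id -> session root
    u_of_root = {}  # session root -> sampling_u shared by the whole session
    out = []
    for row in rows:
        chat_id, parent = row["chat_id"], row.get("parent_chat_id")
        if parent is None:
            root = chat_id
        else:
            # Known parent: inherit its root. Parent outside the span: fall
            # back to grouping under the parent id itself.
            root = root_of.get(parent, parent)
        root_of[chat_id] = root
        if root not in u_of_root:
            u_of_root[root] = rng.random()
        out.append({**row, "session_root": root, "sampling_u": u_of_root[root]})
    return out
```

Thresholding on sampling_u then keeps or drops every turn of a session together.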
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile`
previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging
from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block
authoritative: build_harness_context now accepts an optional workload_profile and
renders the canonical 10-dim vector + per-family stats when present, falling back
to the legacy rendering only when no profile is supplied (direct unit-test calls).
Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the
profile via lca.build_study_workload_profile and pass it through build_prompt. The
heuristic regime classifiers keep reading window_summary; that is the heuristic
layer, distinct from the similarity metric.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implement the paper's 10-dimensional L-C-A workload feature vector
(RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into
`aituner profile window` / `aituner profile similarity`. Covered by tests.
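For illustration, the similarity can be sketched in pure Python. This assumes a
Euclidean norm and approximates RobustScaler by per-dimension IQR scaling (the
median centering cancels in the difference); the function names are hypothetical
and the example uses 2 dimensions rather than 10.

```python
import math
import statistics


def robust_scales(reference_vectors):
    """Per-dimension IQR over a reference set of feature vectors."""
    scales = []
    for dim in zip(*reference_vectors):
        q1, _, q3 = statistics.quantiles(dim, n=4, method="inclusive")
        iqr = q3 - q1
        scales.append(iqr if iqr > 0 else 1.0)  # avoid dividing by zero
    return scales


def similarity(x, y, scales):
    """sim = exp(-||dz||), where dz is the IQR-scaled difference of x and y."""
    dz = [(a - b) / s for a, b, s in zip(x, y, scales)]
    return math.exp(-math.sqrt(sum(d * d for d in dz)))


reference = [[1, 10], [2, 20], [3, 30], [4, 40]]
scales = robust_scales(reference)
print(similarity([1, 10], [1, 10], scales))    # identical windows give 1.0
print(similarity([1, 10], [2.5, 10], scales))  # one IQR apart in a single dim
```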
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU)
Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation
log with completed run status.