Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag
differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse
refinements, premature LLM stop vetoed then honored -> converged, no regression.
Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5
trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible
config found. The bottleneck is compute; the harness steered to the knob family that
adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's
Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run
was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time
compression, which tripped baseline_all_infeasible and terminated the loop before any
climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal
(harness steers to TP from the long-prompt profile) — the ablation is about the
proposal path, so an explicit TP1 row is not required.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scale=0.2 made TP1 uniformly infeasible (no baseline); bound decode to 128 tokens and
use mild 2x compression so TP1 registers a real, fast baseline, with 6 probes to span
TP1's low and TP4's high feasibility boundaries. Both configs identical except use_harness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B:
- both configs committed and validated on dash0 (differ only in
use_harness + study_id), LLM auth + clean engine launch confirmed;
- characterizes exactly what the harness toggles (Harnesses: prompt
section with ranked bottleneck hypotheses + knob-family steering,
deterministic guided/stop proposals, Stop-B validator/veto) vs naive;
- substrate calibration from a real harness-ON run: at scale=0.2 the
180s elapsed cap fires correctly but TP1 is uniformly infeasible even
at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a
real baseline; comparability caveat documented.
Honest status: full two-run sweep NOT completed in-session (~5-6
GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM
teardown re-validated). Includes a precise continuation recipe and the
scripts/ablation_trajectory.py helper (validated against a prior store).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
First TP1 baseline probe under scale=0.2 ran ~6min (severe overload, 260
preemptions on the lighter half of the trace; TP1 is decode-bound and the
arrival-lag early-stop does not cut a decode-drain-bound probe). Cut
search.max_probes 5->3 to bound binary-search steps per trial. Caps stay
at elapsed=180/lag=30. Both configs still differ only in use_harness +
study_id.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so
the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn
~15min each (the tractability hazard the brief flagged). Scale the caps
proportionately to the time axis: early_stop_max_elapsed_s 900->180,
early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain)
finish well inside 180s; overloaded probes die in ~3min. Both configs
still differ only in use_harness + study_id. Adds the ablation doc
skeleton and a read-only trajectory-extraction helper.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two configs identical except llm.use_harness and study_id, for the
controlled harness-ON vs naive-OFF tuning-trajectory ablation on dense
Qwen3.5-27B. Faster substrate (replay_time_scale=0.2, search.high=0.25,
max_probes=5) keeps the ablation tractable; Stop-A stays enabled.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x),
each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00
and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop
veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was
validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding:
at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical
loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven
to an explicit Stop-B firing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the
vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive
on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM
handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs,
ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the
prior handler afterward. Main-thread-guarded; unit-tested.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput:
TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound
(one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move
TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was
best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated
measurements (TP1/TP2). TP4 saturated -> lower bound.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a
request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped
_run_one_request (which only catches HttpClientError), propagated through the probe,
and crashed the whole trial ("failed: timed out"). A timed-out request is a failed
request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError,
ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s
to 180s on the 27B run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes,
and the 900s request timeout made overloaded probes drain hung requests for 15min
each. Cap drain at 180s and bound the search to where the boundaries actually are.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both
Stop-B paths: search-high-saturation (validator-authorized immediate stop) and
multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90
req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is
correctly never adopted (no regression). The Phase-4 authority model is exercised
live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later
justified stop is honored after the veto budget. EP launch-failures handled as
hard-negative evidence. Auditable reason chains throughout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With the guard enabled the binary search recovers best sampling_u=0.078125
(rate 2.30 req/s), identical to the full-replay baseline. The guard fired on
exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible);
the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial
with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When a truncated probe's measured pass-rate lands within trace.adaptive_stop.
boundary_delta of the SLO target, re-measure on the full window and use that
verdict. Offered-L-C-A convergence cannot see engine-state drift in the window
tail, so a near-knee truncated verdict is untrustworthy (validated: prefix 0.96
vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee
probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and
shows C-convergence difficulty is driven by signal noise (low-reuse chat) not
reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A
convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts
preserved; the one mismatch is a boundary false-positive at the feasibility knee
(prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered
L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- lca._prefix_profile: anchor the prefix window to the prefix's own first arrival
so the A-rate is measured over the prefix span (matches the design intent;
no-op for the 0-based canonical pipeline).
- cli study tune: label file-originated stops as file_proposal rather than
llm_after_veto_budget (the veto never applies to file proposals).
- spec.AdaptiveStopSpec: reject stable_checks > max_checks (would make
convergence undetectable and silently disable Stop-A).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A single-config baseline run with adaptive_stop disabled and replay_time_scale=1.0,
so per-request probe_details capture the full 600s window for offline analysis of
whether truncating at the L-C-A convergence prefix preserves the feasibility verdict.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prints the offered-L-C-A convergence curve and the stop fraction at candidate
tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare
how late C converges across workloads. No serving required.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 4 of the two-stop work. The harness already pre-empts the LLM with
deterministic stops and guided probes, but an LLM-originated should_stop could
still end the loop while the validator saw remaining opportunity.
Add harness._stop_authority, exposed as context["stop_authority"], whose
`authorized` mirrors the deterministic harness stop decision and whose
`opportunity_remains` flags an open topology frontier or a high-value planned
candidate. In study tune, an LLM-originated should_stop is now honored only when
the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so
the loop cannot converge prematurely on the agent's say-so. File- and
harness-originated stops are unaffected, and the stop reason chain is recorded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 2 of the two-stop work. The L-C-A vector is a deterministic function of the
trace's offered metadata, so the convergence of prefix-vs-full L-C-A (the paper's
Fig. 9 curve) can be computed up front rather than monitored live, with identical
result and no per-request overhead.
- lca.find_convergence_prefix: earliest arrival-ordered prefix whose L and A family
similarities reach tau and the slow C family reaches the stricter tau_c for
stable_checks consecutive checkpoints. Self-similarity uses the raw log-feature
vector (same window -> identical per-dim spread; RobustScaler is reserved for the
cross-window Stop-C). If C never converges it reports the full set, which is the
C-gate: no early stop on a cold/under-warmed cache. The checkpoint sims double as
Phase 3 calibration data.
- spec.AdaptiveStopSpec (trace.adaptive_stop), disabled by default until the
thresholds are calibrated, so existing studies are unaffected.
- worker._adaptive_replay_set truncates each probe's replay to the convergence
prefix and records a certificate (converged, fraction, family similarity) into
probe history and probe_details. Offered request_rate at the threshold is
unchanged; only wall-clock replay shrinks.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score
broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the
realized KV-cache hit rate as offered load dropped — so the feasibility boundary
was measured on a workload with a different C than production, contradicting the
paper's scale-stationary L-C-A premise.
prepare_trace_windows now resolves each row's session root via the parent_chat_id
chain in a single streaming pass and assigns sampling_u per session, so thresholding
keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose
parent fell outside the span fall back to grouping under the parent id.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile`
previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging
from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block
authoritative: build_harness_context now accepts an optional workload_profile and
renders the canonical 10-dim vector + per-family stats when present, falling back
to the legacy rendering only when no profile is supplied (direct unit-test calls).
Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the
profile via lca.build_study_workload_profile and pass it through build_prompt. The
heuristic regime classifiers keep reading window_summary; that is the heuristic
layer, distinct from the similarity metric.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implement the paper's 10-dimensional L-C-A workload feature vector
(RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into
`aituner profile window` / `aituner profile similarity`. Covered by tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU)
Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation
log with completed run status.