Commit Graph

165 Commits

Author SHA1 Message Date
b779f6e56a Add dash1 naive-completion driver for the ablation
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:52:54 +08:00
e7d1b3ba01 Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders
Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag
differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse
refinements, premature LLM stop vetoed then honored -> converged, no regression.
Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5
trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible
config found. The bottleneck is compute; the harness steered to the knob family that
adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's
Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run
was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:51:56 +08:00
579dd86698 Ablation: --skip-baseline so loops climb from first proposal
The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time
compression, which tripped baseline_all_infeasible and terminated the loop before any
climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal
(harness steers to TP from the long-prompt profile) — the ablation is about the
proposal path, so an explicit TP1 row is not required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:59:46 +08:00
37342a5749 Add chained harness-vs-naive ablation driver (sequential runs + DONE marker)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:30:41 +08:00
5965f4fbbc Ablation substrate: scale=0.5 + out=128 + 6 probes (TP1 measurable, tractable)
scale=0.2 made TP1 uniformly infeasible (no baseline); bound decode to 128 tokens and
use mild 2x compression so TP1 registers a real, fast baseline, with 6 probes to span
TP1's low and TP4's high feasibility boundaries. Both configs identical except use_harness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:29:30 +08:00
a1cbab0e69 Document harness-vs-naive ablation: setup, substrate calibration, blocker
Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B:
- both configs committed and validated on dash0 (differ only in
  use_harness + study_id), LLM auth + clean engine launch confirmed;
- characterizes exactly what the harness toggles (Harnesses: prompt
  section with ranked bottleneck hypotheses + knob-family steering,
  deterministic guided/stop proposals, Stop-B validator/veto) vs naive;
- substrate calibration from a real harness-ON run: at scale=0.2 the
  180s elapsed cap fires correctly but TP1 is uniformly infeasible even
  at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a
  real baseline; comparability caveat documented.

Honest status: full two-run sweep NOT completed in-session (~5-6
GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM
teardown re-validated). Includes a precise continuation recipe and the
scripts/ablation_trajectory.py helper (validated against a prior store).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:16:27 +08:00
0794efa249 Reduce ablation probe budget to 3 per trial for tractability
First TP1 baseline probe under scale=0.2 ran ~6min (severe overload, 260
preemptions on the lighter half of the trace; TP1 is decode-bound and the
arrival-lag early-stop does not cut a decode-drain-bound probe). Cut
search.max_probes 5->3 to bound binary-search steps per trial. Caps stay
at elapsed=180/lag=30. Both configs still differ only in use_harness +
study_id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:01:19 +08:00
d975e57bb5 Scale ablation early-stop caps to the compressed window (scale=0.2)
At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so
the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn
~15min each (the tractability hazard the brief flagged). Scale the caps
proportionately to the time axis: early_stop_max_elapsed_s 900->180,
early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain)
finish well inside 180s; overloaded probes die in ~3min. Both configs
still differ only in use_harness + study_id. Adds the ablation doc
skeleton and a read-only trajectory-extraction helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:49:57 +08:00
a16016a876 Add harness vs naive ablation configs (27b, scale=0.2 substrate)
Two configs identical except llm.use_harness and study_id, for the
controlled harness-ON vs naive-OFF tuning-trajectory ablation on dense
Qwen3.5-27B. Faster substrate (replay_time_scale=0.2, search.high=0.25,
max_probes=5) keeps the ablation tractable; Stop-A stays enabled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:31:23 +08:00
07f5d92e1d Add consolidated two-stop summary doc
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:16:28 +08:00
f2ff0faebd Document Stop-B end-to-end on dense 27B: the improving climb + no-regression
Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x),
each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00
and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop
veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was
validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding:
at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical
loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven
to an explicit Stop-B firing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 18:07:00 +08:00
4a64196a99 Add 27B Stop-B agentic-loop config (harness-driven, GPUs 2-7)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:08:46 +08:00
b17b213575 Tear down the engine on SIGTERM instead of orphaning it
Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the
vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive
on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM
handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs,
ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the
prior handler afterward. Main-thread-guarded; unit-tested.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:08:06 +08:00
93ce339d61 Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE
Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput:
TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound
(one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move
TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was
best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated
measurements (TP1/TP2). TP4 saturated -> lower bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:54:40 +08:00
b1b74318f6 Pin 27B A/B to GPUs 2-7 (route around leaked GPU0/1 memory)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 23:01:22 +08:00
2fcaf80450 Wrap socket/timeout errors in HTTP client as HttpClientError
stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a
request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped
_run_one_request (which only catches HttpClientError), propagated through the probe,
and crashed the whole trial ("failed: timed out"). A timed-out request is a failed
request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError,
ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s
to 180s on the 27B run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:58:28 +08:00
3541065675 Speed up 27B TP A/B: request_timeout 180s, search.high 0.125
The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes,
and the 900s request timeout made overloaded probes drain hung requests for 15min
each. Cap drain at 180s and bound the search to where the boundaries actually are.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:40:42 +08:00
7678c7d5e8 Switch 27B TP A/B to length-aware TTFT SLO (4s + L_in/8k), widen search
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:35:23 +08:00
ed2bbe0323 Add linear_ms SLO rule (length-aware TTFT budget)
threshold_ms = intercept_ms + per_token_ms * input_tokens. Lets the TTFT target
scale with prefill work, e.g. "4s + L_in/8k" => intercept_ms=4000, per_token_ms=0.125
(4s base, +1s per 8k input tokens). slo + spec + test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:35:23 +08:00
77af4ded2a Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:40:38 +08:00
4f45b546a1 Add 27B TP A/B (deterministic ground-truth: does TP2 beat TP1 per-GPU)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:39:54 +08:00
90c3eb51c8 Document Stop-B end-to-end validation (Phase 5)
Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both
Stop-B paths: search-high-saturation (validator-authorized immediate stop) and
multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90
req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is
correctly never adopted (no regression). The Phase-4 authority model is exercised
live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later
justified stop is honored after the veto budget. EP launch-failures handled as
hard-negative evidence. Auditable reason chains throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:58:44 +08:00
0b6beafeb8 Phase 5: widen search.high to 1.0 to force multi-iteration Stop-B convergence
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:12:32 +08:00
d4aff81691 Add Stop-B end-to-end config (agentic loop, Stop-A enabled)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:05:39 +08:00
f31e9ccfd5 Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved
With the guard enabled the binary search recovers best sampling_u=0.078125
(rate 2.30 req/s), identical to the full-replay baseline. The guard fired on
exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible);
the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial
with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:57:53 +08:00
03e556f0ab Add Stop-A ON config (adaptive_stop enabled + boundary guard) for A/B
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:25:24 +08:00
dfc823f972 Add Stop-A SLO-boundary guard
When a truncated probe's measured pass-rate lands within trace.adaptive_stop.
boundary_delta of the SLO target, re-measure on the full window and use that
verdict. Offered-L-C-A convergence cannot see engine-state drift in the window
tail, so a near-knee truncated verdict is untrustworthy (validated: prefix 0.96
vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee
probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:25:24 +08:00
9f52812753 Document Stop-A validation: calibration + GPU fidelity check
CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and
shows C-convergence difficulty is driven by signal noise (low-reuse chat) not
reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A
convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts
preserved; the one mismatch is a boundary false-positive at the feasibility knee
(prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered
L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:03:16 +08:00
958739027a Fix Stop-A validation config: system vllm, cap max-model-len
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:22:48 +08:00
0f57ee96a9 Drop LLM endpoint from Stop-A full-data config (baseline-only run)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:19:46 +08:00
43125f48cf Address review of two-stop branch
- lca._prefix_profile: anchor the prefix window to the prefix's own first arrival
  so the A-rate is measured over the prefix span (matches the design intent;
  no-op for the 0-based canonical pipeline).
- cli study tune: label file-originated stops as file_proposal rather than
  llm_after_veto_budget (the veto never applies to file proposals).
- spec.AdaptiveStopSpec: reject stable_checks > max_checks (would make
  convergence undetectable and silently disable Stop-A).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:19:08 +08:00
3af1d84ac0 Add Stop-A full-data validation config (real-time replay, no cap)
A single-config baseline run with adaptive_stop disabled and replay_time_scale=1.0,
so per-request probe_details capture the full 600s window for offline analysis of
whether truncating at the L-C-A convergence prefix preserves the feasibility verdict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:15:12 +08:00
08e53fd897 Add Stop-A calibration script (CPU-only convergence curve)
Prints the offered-L-C-A convergence curve and the stop fraction at candidate
tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare
how late C converges across workloads. No serving required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:10:02 +08:00
a8f903498d Add Stop-B authority: deterministic validator overrides LLM stop
Phase 4 of the two-stop work. The harness already pre-empts the LLM with
deterministic stops and guided probes, but an LLM-originated should_stop could
still end the loop while the validator saw remaining opportunity.

Add harness._stop_authority, exposed as context["stop_authority"], whose
`authorized` mirrors the deterministic harness stop decision and whose
`opportunity_remains` flags an open topology frontier or a high-value planned
candidate. In study tune, an LLM-originated should_stop is now honored only when
the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so
the loop cannot converge prematurely on the agent's say-so. File- and
harness-originated stops are unaffected, and the stop reason chain is recorded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:45:14 +08:00
51a9e4a007 Add Stop-A: offered-L-C-A convergence early-stop for replay
Phase 2 of the two-stop work. The L-C-A vector is a deterministic function of the
trace's offered metadata, so the convergence of prefix-vs-full L-C-A (the paper's
Fig. 9 curve) can be computed up front rather than monitored live, with identical
result and no per-request overhead.

- lca.find_convergence_prefix: earliest arrival-ordered prefix whose L and A family
  similarities reach tau and the slow C family reaches the stricter tau_c for
  stable_checks consecutive checkpoints. Self-similarity uses the raw log-feature
  vector (same window -> identical per-dim spread; RobustScaler is reserved for the
  cross-window Stop-C). If C never converges it reports the full set, which is the
  C-gate: no early stop on a cold/under-warmed cache. The checkpoint sims double as
  Phase 3 calibration data.
- spec.AdaptiveStopSpec (trace.adaptive_stop), disabled by default until the
  thresholds are calibrated, so existing studies are unaffected.
- worker._adaptive_replay_set truncates each probe's replay to the convergence
  prefix and records a certificate (converged, fraction, family similarity) into
  probe history and probe_details. Offered request_rate at the threshold is
  unchanged; only wall-clock replay shrinks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:23:49 +08:00
0f15bbc3f1 Make the offered-load axis session-coherent
Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score
broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the
realized KV-cache hit rate as offered load dropped — so the feasibility boundary
was measured on a workload with a different C than production, contradicting the
paper's scale-stationary L-C-A premise.

prepare_trace_windows now resolves each row's session root via the parent_chat_id
chain in a single streaming pass and assigns sampling_u per session, so thresholding
keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose
parent fell outside the span fall back to grouping under the parent id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:16:06 +08:00
6f8e3c95c1 Unify harness L-C-A on the canonical lca.WorkloadProfile
Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile`
previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging
from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block
authoritative: build_harness_context now accepts an optional workload_profile and
renders the canonical 10-dim vector + per-family stats when present, falling back
to the legacy rendering only when no profile is supplied (direct unit-test calls).

Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the
profile via lca.build_study_workload_profile and pass it through build_prompt. The
heuristic regime classifiers keep reading window_summary; that is the heuristic
layer, distinct from the similarity metric.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:12:17 +08:00
8b4116fad0 Add reference paper and qwen27b tpot25 16-iter notes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:30 +08:00
27d1c8fa92 Add L-C-A workload profile metric and CLI profile commands
Implement the paper's 10-dimensional L-C-A workload feature vector
(RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into
`aituner profile window` / `aituner profile similarity`. Covered by tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:24 +08:00
984eb1f325 Document 8-GPU harness ablation results for qwen27b and qwen235b prefill
Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU)

Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation
log with completed run status.
2026-05-16 21:23:16 +08:00
d0c89dac48 Clean marked trial engine processes 2026-05-16 15:51:04 +08:00
cf9b8b3f68 Clean vLLM process groups after parent exit 2026-05-16 14:52:05 +08:00
5a879a8592 Fix decode harness partial probe handling 2026-05-16 14:18:07 +08:00
f18765b235 Document eight-GPU harness rerun 2026-05-13 09:04:14 +08:00
5c2958e6c1 Constrain harness topology by visible GPUs 2026-05-13 01:25:31 +08:00
fb6d74a18c Document harness v2 rerun criteria 2026-05-12 22:23:12 +08:00
e3ed775afd Fix harness SLO early-stop diagnosis 2026-05-12 22:20:01 +08:00
ef359c8eea Document profile-driven harness run 2026-05-12 21:40:19 +08:00
17e9681ca0 Add profile-driven harness planner 2026-05-12 21:28:44 +08:00
63d6a111f4 Document profile-driven harness design 2026-05-12 21:09:29 +08:00