aituner

Author	SHA1	Message	Date
Gahow Wang	b779f6e56a	Add dash1 naive-completion driver for the ablation Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:52:54 +08:00
Gahow Wang	e7d1b3ba01	Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse refinements, premature LLM stop vetoed then honored -> converged, no regression. Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5 trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible config found. The bottleneck is compute; the harness steered to the knob family that adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:51:56 +08:00
Gahow Wang	579dd86698	Ablation: --skip-baseline so loops climb from first proposal The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time compression, which tripped baseline_all_infeasible and terminated the loop before any climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal (harness steers to TP from the long-prompt profile) — the ablation is about the proposal path, so an explicit TP1 row is not required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:59:46 +08:00
Gahow Wang	37342a5749	Add chained harness-vs-naive ablation driver (sequential runs + DONE marker) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:30:41 +08:00
Gahow Wang	5965f4fbbc	Ablation substrate: scale=0.5 + out=128 + 6 probes (TP1 measurable, tractable) scale=0.2 made TP1 uniformly infeasible (no baseline); bound decode to 128 tokens and use mild 2x compression so TP1 registers a real, fast baseline, with 6 probes to span TP1's low and TP4's high feasibility boundaries. Both configs identical except use_harness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:29:30 +08:00
Gahow Wang	a1cbab0e69	Document harness-vs-naive ablation: setup, substrate calibration, blocker Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B: - both configs committed and validated on dash0 (differ only in use_harness + study_id), LLM auth + clean engine launch confirmed; - characterizes exactly what the harness toggles (Harnesses: prompt section with ranked bottleneck hypotheses + knob-family steering, deterministic guided/stop proposals, Stop-B validator/veto) vs naive; - substrate calibration from a real harness-ON run: at scale=0.2 the 180s elapsed cap fires correctly but TP1 is uniformly infeasible even at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a real baseline; comparability caveat documented. Honest status: full two-run sweep NOT completed in-session (~5-6 GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM teardown re-validated). Includes a precise continuation recipe and the scripts/ablation_trajectory.py helper (validated against a prior store). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:16:27 +08:00
Gahow Wang	0794efa249	Reduce ablation probe budget to 3 per trial for tractability First TP1 baseline probe under scale=0.2 ran ~6min (severe overload, 260 preemptions on the lighter half of the trace; TP1 is decode-bound and the arrival-lag early-stop does not cut a decode-drain-bound probe). Cut search.max_probes 5->3 to bound binary-search steps per trial. Caps stay at elapsed=180/lag=30. Both configs still differ only in use_harness + study_id. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:01:19 +08:00
Gahow Wang	d975e57bb5	Scale ablation early-stop caps to the compressed window (scale=0.2) At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn ~15min each (the tractability hazard the brief flagged). Scale the caps proportionately to the time axis: early_stop_max_elapsed_s 900->180, early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain) finish well inside 180s; overloaded probes die in ~3min. Both configs still differ only in use_harness + study_id. Adds the ablation doc skeleton and a read-only trajectory-extraction helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:49:57 +08:00
Gahow Wang	a16016a876	Add harness vs naive ablation configs (27b, scale=0.2 substrate) Two configs identical except llm.use_harness and study_id, for the controlled harness-ON vs naive-OFF tuning-trajectory ablation on dense Qwen3.5-27B. Faster substrate (replay_time_scale=0.2, search.high=0.25, max_probes=5) keeps the ablation tractable; Stop-A stays enabled. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:31:23 +08:00
Gahow Wang	07f5d92e1d	Add consolidated two-stop summary doc Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:16:28 +08:00
Gahow Wang	f2ff0faebd	Document Stop-B end-to-end on dense 27B: the improving climb + no-regression Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x), each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00 and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding: at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven to an explicit Stop-B firing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 18:07:00 +08:00
Gahow Wang	4a64196a99	Add 27B Stop-B agentic-loop config (harness-driven, GPUs 2-7) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:08:46 +08:00
Gahow Wang	b17b213575	Tear down the engine on SIGTERM instead of orphaning it Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs, ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the prior handler afterward. Main-thread-guarded; unit-tested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:08:06 +08:00
Gahow Wang	93ce339d61	Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput: TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound (one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated measurements (TP1/TP2). TP4 saturated -> lower bound. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 01:54:40 +08:00
Gahow Wang	b1b74318f6	Pin 27B A/B to GPUs 2-7 (route around leaked GPU0/1 memory) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 23:01:22 +08:00
Gahow Wang	2fcaf80450	Wrap socket/timeout errors in HTTP client as HttpClientError stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped _run_one_request (which only catches HttpClientError), propagated through the probe, and crashed the whole trial ("failed: timed out"). A timed-out request is a failed request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError, ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s to 180s on the 27B run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:58:28 +08:00
Gahow Wang	3541065675	Speed up 27B TP A/B: request_timeout 180s, search.high 0.125 The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes, and the 900s request timeout made overloaded probes drain hung requests for 15min each. Cap drain at 180s and bound the search to where the boundaries actually are. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:40:42 +08:00
Gahow Wang	7678c7d5e8	Switch 27B TP A/B to length-aware TTFT SLO (4s + L_in/8k), widen search Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:35:23 +08:00
Gahow Wang	ed2bbe0323	Add linear_ms SLO rule (length-aware TTFT budget) threshold_ms = intercept_ms + per_token_ms * input_tokens. Lets the TTFT target scale with prefill work, e.g. "4s + L_in/8k" => intercept_ms=4000, per_token_ms=0.125 (4s base, +1s per 8k input tokens). slo + spec + test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:35:23 +08:00
Gahow Wang	77af4ded2a	Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime) The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:40:38 +08:00
Gahow Wang	4f45b546a1	Add 27B TP A/B (deterministic ground-truth: does TP2 beat TP1 per-GPU) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:39:54 +08:00
Gahow Wang	90c3eb51c8	Document Stop-B end-to-end validation (Phase 5) Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both Stop-B paths: search-high-saturation (validator-authorized immediate stop) and multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90 req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is correctly never adopted (no regression). The Phase-4 authority model is exercised live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later justified stop is honored after the veto budget. EP launch-failures handled as hard-negative evidence. Auditable reason chains throughout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:58:44 +08:00
Gahow Wang	0b6beafeb8	Phase 5: widen search.high to 1.0 to force multi-iteration Stop-B convergence Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:12:32 +08:00
Gahow Wang	d4aff81691	Add Stop-B end-to-end config (agentic loop, Stop-A enabled) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:05:39 +08:00
Gahow Wang	f31e9ccfd5	Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved With the guard enabled the binary search recovers best sampling_u=0.078125 (rate 2.30 req/s), identical to the full-replay baseline. The guard fired on exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible); the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial with no peak-rate overestimate. Stop-A + boundary guard is safe to enable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:57:53 +08:00
Gahow Wang	03e556f0ab	Add Stop-A ON config (adaptive_stop enabled + boundary guard) for A/B Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:25:24 +08:00
Gahow Wang	dfc823f972	Add Stop-A SLO-boundary guard When a truncated probe's measured pass-rate lands within trace.adaptive_stop.boundary_delta of the SLO target, re-measure on the full window and use that verdict. Offered-L-C-A convergence cannot see engine-state drift in the window tail, so a near-knee truncated verdict is untrustworthy (validated: prefix 0.96 vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:25:24 +08:00
Gahow Wang	9f52812753	Document Stop-A validation: calibration + GPU fidelity check CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and shows C-convergence difficulty is driven by signal noise (low-reuse chat) not reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts preserved; the one mismatch is a boundary false-positive at the feasibility knee (prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:03:16 +08:00
Gahow Wang	958739027a	Fix Stop-A validation config: system vllm, cap max-model-len Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:22:48 +08:00
Gahow Wang	0f57ee96a9	Drop LLM endpoint from Stop-A full-data config (baseline-only run) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:19:46 +08:00
Gahow Wang	43125f48cf	Address review of two-stop branch - lca._prefix_profile: anchor the prefix window to the prefix's own first arrival so the A-rate is measured over the prefix span (matches the design intent; no-op for the 0-based canonical pipeline). - cli study tune: label file-originated stops as file_proposal rather than llm_after_veto_budget (the veto never applies to file proposals). - spec.AdaptiveStopSpec: reject stable_checks > max_checks (would make convergence undetectable and silently disable Stop-A). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:19:08 +08:00
Gahow Wang	3af1d84ac0	Add Stop-A full-data validation config (real-time replay, no cap) A single-config baseline run with adaptive_stop disabled and replay_time_scale=1.0, so per-request probe_details capture the full 600s window for offline analysis of whether truncating at the L-C-A convergence prefix preserves the feasibility verdict. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:15:12 +08:00
Gahow Wang	08e53fd897	Add Stop-A calibration script (CPU-only convergence curve) Prints the offered-L-C-A convergence curve and the stop fraction at candidate tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare how late C converges across workloads. No serving required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:10:02 +08:00
Gahow Wang	a8f903498d	Add Stop-B authority: deterministic validator overrides LLM stop Phase 4 of the two-stop work. The harness already pre-empts the LLM with deterministic stops and guided probes, but an LLM-originated should_stop could still end the loop while the validator saw remaining opportunity. Add harness._stop_authority, exposed as context["stop_authority"], whose `authorized` mirrors the deterministic harness stop decision and whose `opportunity_remains` flags an open topology frontier or a high-value planned candidate. In study tune, an LLM-originated should_stop is now honored only when the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so the loop cannot converge prematurely on the agent's say-so. File- and harness-originated stops are unaffected, and the stop reason chain is recorded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:45:14 +08:00
Gahow Wang	51a9e4a007	Add Stop-A: offered-L-C-A convergence early-stop for replay Phase 2 of the two-stop work. The L-C-A vector is a deterministic function of the trace's offered metadata, so the convergence of prefix-vs-full L-C-A (the paper's Fig. 9 curve) can be computed up front rather than monitored live, with identical result and no per-request overhead. - lca.find_convergence_prefix: earliest arrival-ordered prefix whose L and A family similarities reach tau and the slow C family reaches the stricter tau_c for stable_checks consecutive checkpoints. Self-similarity uses the raw log-feature vector (same window -> identical per-dim spread; RobustScaler is reserved for the cross-window Stop-C). If C never converges it reports the full set, which is the C-gate: no early stop on a cold/under-warmed cache. The checkpoint sims double as Phase 3 calibration data. - spec.AdaptiveStopSpec (trace.adaptive_stop), disabled by default until the thresholds are calibrated, so existing studies are unaffected. - worker._adaptive_replay_set truncates each probe's replay to the convergence prefix and records a certificate (converged, fraction, family similarity) into probe history and probe_details. Offered request_rate at the threshold is unchanged; only wall-clock replay shrinks. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:23:49 +08:00
Gahow Wang	0f15bbc3f1	Make the offered-load axis session-coherent Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the realized KV-cache hit rate as offered load dropped — so the feasibility boundary was measured on a workload with a different C than production, contradicting the paper's scale-stationary L-C-A premise. prepare_trace_windows now resolves each row's session root via the parent_chat_id chain in a single streaming pass and assigns sampling_u per session, so thresholding keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose parent fell outside the span fall back to grouping under the parent id. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:16:06 +08:00
Gahow Wang	6f8e3c95c1	Unify harness L-C-A on the canonical lca.WorkloadProfile Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile` previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block authoritative: build_harness_context now accepts an optional workload_profile and renders the canonical 10-dim vector + per-family stats when present, falling back to the legacy rendering only when no profile is supplied (direct unit-test calls). Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the profile via lca.build_study_workload_profile and pass it through build_prompt. The heuristic regime classifiers keep reading window_summary; that is the heuristic layer, distinct from the similarity metric. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:12:17 +08:00
Gahow Wang	8b4116fad0	Add reference paper and qwen27b tpot25 16-iter notes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:02:30 +08:00
Gahow Wang	27d1c8fa92	Add L-C-A workload profile metric and CLI profile commands Implement the paper's 10-dimensional L-C-A workload feature vector (RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into `aituner profile window` / `aituner profile similarity`. Covered by tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:02:24 +08:00
Gahow Wang	984eb1f325	Document 8-GPU harness ablation results for qwen27b and qwen235b prefill Add completed experiment results from dash0 runs after 2026-05-13: - qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU) - qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU) Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation log with completed run status.	2026-05-16 21:23:16 +08:00
Gahow Wang	d0c89dac48	Clean marked trial engine processes	2026-05-16 15:51:04 +08:00
Gahow Wang	cf9b8b3f68	Clean vLLM process groups after parent exit	2026-05-16 14:52:05 +08:00
Gahow Wang	5a879a8592	Fix decode harness partial probe handling	2026-05-16 14:18:07 +08:00
Gahow Wang	f18765b235	Document eight-GPU harness rerun	2026-05-13 09:04:14 +08:00
Gahow Wang	5c2958e6c1	Constrain harness topology by visible GPUs	2026-05-13 01:25:31 +08:00
Gahow Wang	fb6d74a18c	Document harness v2 rerun criteria	2026-05-12 22:23:12 +08:00
Gahow Wang	e3ed775afd	Fix harness SLO early-stop diagnosis	2026-05-12 22:20:01 +08:00
Gahow Wang	ef359c8eea	Document profile-driven harness run	2026-05-12 21:40:19 +08:00
Gahow Wang	17e9681ca0	Add profile-driven harness planner	2026-05-12 21:28:44 +08:00
Gahow Wang	63d6a111f4	Document profile-driven harness design	2026-05-12 21:09:29 +08:00
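
Note on commit ed2bbe0323 (linear_ms SLO rule): the message gives the formula threshold_ms = intercept_ms + per_token_ms * input_tokens and the example "4s + L_in/8k" with intercept_ms=4000 and per_token_ms=0.125. The arithmetic can be sketched as below. This is only an illustration of the stated formula, not the repository's code: the class name `LinearMsRule`, the method names, and the inclusive `<=` comparison are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearMsRule:
    """Hypothetical stand-in for a length-aware TTFT budget (not aituner's class)."""

    intercept_ms: float   # fixed base budget in milliseconds
    per_token_ms: float   # extra budget per input token in milliseconds

    def threshold_ms(self, input_tokens: int) -> float:
        # Formula quoted in the commit message:
        # threshold_ms = intercept_ms + per_token_ms * input_tokens
        return self.intercept_ms + self.per_token_ms * input_tokens

    def passes(self, ttft_ms: float, input_tokens: int) -> bool:
        # Assumption: a request meets the SLO when its TTFT is at or below
        # the budget; the real rule may use a strict comparison instead.
        return ttft_ms <= self.threshold_ms(input_tokens)


# "4s + L_in/8k": 4 s base, plus 1 s for every 8000 input tokens.
ttft_rule = LinearMsRule(intercept_ms=4000.0, per_token_ms=0.125)
```

With these values an empty prompt gets a 4000 ms budget and an 8000-token prompt gets 5000 ms, so the TTFT target grows with prefill work instead of penalizing long prompts against a flat cap.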

165 Commits