aituner

Author	SHA1	Message	Date
Gahow Wang	1b8f5a3af1	Integrate descriptor runtime candidates into harness	2026-06-30 14:10:19 +08:00
Gahow Wang	adb5356c4b	Add advisory harness attribution and descriptor planner MVP	2026-06-30 12:05:03 +08:00
Gahow Wang	6ea259a0a3	Keep target topology explicit in delta projections	2026-06-29 19:56:50 +08:00
Gahow Wang	6b4efdad82	Relax lower-frontier delta projection gate	2026-06-29 17:57:29 +08:00
Gahow Wang	9ef9550214	Use full state for frontier projection	2026-06-29 16:22:09 +08:00
Gahow Wang	8dd9ada194	Add frontier delta projection harness candidates	2026-06-29 16:15:06 +08:00
Gahow Wang	6b25d56c1f	Gate GMU climb on measured improvement	2026-06-29 02:00:41 +08:00
Gahow Wang	ee101a7c24	Harden prefill scheduler harness	2026-06-29 01:54:02 +08:00
Gahow Wang	bfd85793f3	Prioritize uncovered prefill scheduler candidates	2026-06-29 01:30:34 +08:00
Gahow Wang	36c301c128	Add normalized prefill scheduler harness	2026-06-29 01:12:19 +08:00
Gahow Wang	7ad439730e	Add llm-first tuning proposal policy	2026-06-27 12:21:51 +08:00
Gahow Wang	2937539b49	Persist harness candidate set snapshots	2026-06-26 22:17:47 +08:00
Gahow Wang	5080b50315	Veto repeated materialized configs	2026-06-26 22:15:47 +08:00
Gahow Wang	825d3e03e9	Add harness candidate set audit	2026-06-26 22:02:09 +08:00
Gahow Wang	48911b658b	Use normalized full config signatures	2026-06-26 21:28:10 +08:00
Gahow Wang	c8a0f9870e	Tighten topology and auto-high validation	2026-06-26 20:07:23 +08:00
Gahow Wang	1dd3eaebaa	Add auto search high measurement policy	2026-06-26 20:05:22 +08:00
Gahow Wang	92eb186006	Add bad-start harness recovery planning	2026-06-26 16:44:24 +08:00
Gahow Wang	013b01baa1	Stop after gmu ceiling validation is exhausted	2026-06-24 22:45:42 +08:00
Gahow Wang	b075afe6f2	Continue gmu hill-climb after topology validation	2026-06-24 19:09:35 +08:00
Gahow Wang	8fa758797e	Guard generic topology search from introducing EP	2026-06-24 15:21:22 +08:00
Gahow Wang	e67bc86240	Probe coupled prefill runtime knobs before stop	2026-06-22 19:30:23 +08:00
Gahow Wang	fd94ab9f3b	Prevent prefill convergence stop before seq probe	2026-06-22 14:43:55 +08:00
Gahow Wang	4607711bb5	Add reusable clean pair runner	2026-06-22 00:05:31 +08:00
Gahow Wang	426151bc9f	Harness stop uses full state baseline	2026-06-20 22:48:27 +08:00
Gahow Wang	5257fbc1a2	Improve harness incumbent follow-up search	2026-06-20 05:37:15 +08:00
Gahow Wang	b3156a382a	Harness: gate gpu-mem-util/seqs-raise on 'no untested TP increase' (frontier-closed) The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled = cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1) and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92 at iter 2 instead of climbing to TP4. The candidate path runs before the topology-frontier check, so a score>=0.35 runtime candidate wins. Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 13:33:29 +08:00
Gahow Wang	a3523f5601	Harness: explore gpu-memory-utilization (and raise max-num-seqs) before Stop-B The harness defined a gpu-memory-utilization family but hard-coded active_now=False and never generated a candidate for it, and only ever lowered max-num-seqs for decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873 (+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real coverage gap, not bad luck. Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology has moved off the baseline, so a baseline latency bottleneck still gets a TP change): - Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block a premature Stop-B until it is tried; the incumbent guard keeps the step only if per-GPU rate improves and the engine launches, and the tested signature terminates the climb (so 0.96 OOM/regression backs off to 0.94 automatically). - Let max-num-seqs rise for decode_tpot (not only fall) to exploit decode parallelism. - Activate the gpu-memory-utilization harness family for decode_tpot/admission. Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass. Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 10:25:47 +08:00
Gahow Wang	0c23285f39	Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one: - Use the trace's real output_length (drop completion_tokens_override=128). The 0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode (TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap. - replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis far below the tau bar used everywhere else. New calibrator: scripts/calibrate_time_scale.py. - Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the wall-clock a feasible config needs to drain the LCA-admitted set (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real outputs decode dominates wall-clock, so the old fixed 320s cap would truncate the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a hard ceiling; the per-probe deadline governs. The lag cap still cuts overload. 12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound): scripts/run_ablation_pair_d1.sh. 115 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 17:24:00 +08:00
Gahow Wang	b17b213575	Tear down the engine on SIGTERM instead of orphaning it Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs, ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the prior handler afterward. Main-thread-guarded; unit-tested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:08:06 +08:00
Gahow Wang	2fcaf80450	Wrap socket/timeout errors in HTTP client as HttpClientError stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped _run_one_request (which only catches HttpClientError), propagated through the probe, and crashed the whole trial ("failed: timed out"). A timed-out request is a failed request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError, ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s to 180s on the 27B run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:58:28 +08:00
Gahow Wang	ed2bbe0323	Add linear_ms SLO rule (length-aware TTFT budget) threshold_ms = intercept_ms + per_token_ms * input_tokens. Lets the TTFT target scale with prefill work, e.g. "4s + L_in/8k" => intercept_ms=4000, per_token_ms=0.125 (4s base, +1s per 8k input tokens). slo + spec + test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:35:23 +08:00
Gahow Wang	dfc823f972	Add Stop-A SLO-boundary guard When a truncated probe's measured pass-rate lands within trace.adaptive_stop.boundary_delta of the SLO target, re-measure on the full window and use that verdict. Offered-L-C-A convergence cannot see engine-state drift in the window tail, so a near-knee truncated verdict is untrustworthy (validated: prefix 0.96 vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:25:24 +08:00
Gahow Wang	a8f903498d	Add Stop-B authority: deterministic validator overrides LLM stop Phase 4 of the two-stop work. The harness already pre-empts the LLM with deterministic stops and guided probes, but an LLM-originated should_stop could still end the loop while the validator saw remaining opportunity. Add harness._stop_authority, exposed as context["stop_authority"], whose `authorized` mirrors the deterministic harness stop decision and whose `opportunity_remains` flags an open topology frontier or a high-value planned candidate. In study tune, an LLM-originated should_stop is now honored only when the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so the loop cannot converge prematurely on the agent's say-so. File- and harness-originated stops are unaffected, and the stop reason chain is recorded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:45:14 +08:00
Gahow Wang	51a9e4a007	Add Stop-A: offered-L-C-A convergence early-stop for replay Phase 2 of the two-stop work. The L-C-A vector is a deterministic function of the trace's offered metadata, so the convergence of prefix-vs-full L-C-A (the paper's Fig. 9 curve) can be computed up front rather than monitored live, with identical result and no per-request overhead. - lca.find_convergence_prefix: earliest arrival-ordered prefix whose L and A family similarities reach tau and the slow C family reaches the stricter tau_c for stable_checks consecutive checkpoints. Self-similarity uses the raw log-feature vector (same window -> identical per-dim spread; RobustScaler is reserved for the cross-window Stop-C). If C never converges it reports the full set, which is the C-gate: no early stop on a cold/under-warmed cache. The checkpoint sims double as Phase 3 calibration data. - spec.AdaptiveStopSpec (trace.adaptive_stop), disabled by default until the thresholds are calibrated, so existing studies are unaffected. - worker._adaptive_replay_set truncates each probe's replay to the convergence prefix and records a certificate (converged, fraction, family similarity) into probe history and probe_details. Offered request_rate at the threshold is unchanged; only wall-clock replay shrinks. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:23:49 +08:00
Gahow Wang	6f8e3c95c1	Unify harness L-C-A on the canonical lca.WorkloadProfile Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile` previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block authoritative: build_harness_context now accepts an optional workload_profile and renders the canonical 10-dim vector + per-family stats when present, falling back to the legacy rendering only when no profile is supplied (direct unit-test calls). Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the profile via lca.build_study_workload_profile and pass it through build_prompt. The heuristic regime classifiers keep reading window_summary; that is the heuristic layer, distinct from the similarity metric. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:12:17 +08:00
Gahow Wang	27d1c8fa92	Add L-C-A workload profile metric and CLI profile commands Implement the paper's 10-dimensional L-C-A workload feature vector (RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into `aituner profile window` / `aituner profile similarity`. Covered by tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:02:24 +08:00
Gahow Wang	d0c89dac48	Clean marked trial engine processes	2026-05-16 15:51:04 +08:00
Gahow Wang	cf9b8b3f68	Clean vLLM process groups after parent exit	2026-05-16 14:52:05 +08:00
Gahow Wang	5a879a8592	Fix decode harness partial probe handling	2026-05-16 14:18:07 +08:00
Gahow Wang	5c2958e6c1	Constrain harness topology by visible GPUs	2026-05-13 01:25:31 +08:00
Gahow Wang	e3ed775afd	Fix harness SLO early-stop diagnosis	2026-05-12 22:20:01 +08:00
Gahow Wang	17e9681ca0	Add profile-driven harness planner	2026-05-12 21:28:44 +08:00
Gahow Wang	2d03b1cd4c	Add SLO-driven topology frontier harness guard	2026-05-12 21:00:49 +08:00
Gahow Wang	e1125475ae	Minimize no-harness ablation prompt	2026-05-12 09:42:53 +08:00
Gahow Wang	ae756600ce	Support full-range and incumbent-floor search modes	2026-05-11 12:58:46 +08:00
Gahow Wang	8516cd88c0	Use full search range for every trial	2026-05-11 12:50:22 +08:00
Gahow Wang	14259fcec9	Measure lower-range performance for infeasible trials	2026-05-10 14:30:34 +08:00
Gahow Wang	bdb08f6edc	Handle missing streamed token metrics	2026-05-10 02:40:00 +08:00
Gahow Wang	adc4351e5d	Report latency stats for infeasible baseline	2026-05-08 11:10:34 +08:00
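
Note on commit ed2bbe0323 (linear_ms SLO rule): the formula in that message, threshold_ms = intercept_ms + per_token_ms * input_tokens, can be checked with a few lines of Python. This is a minimal sketch, not the repository's code: the function names `linear_ms_threshold` and `ttft_meets_slo` are hypothetical, and treating a TTFT equal to the threshold as a pass is an assumption.

```python
def linear_ms_threshold(intercept_ms: float, per_token_ms: float, input_tokens: int) -> float:
    """Length-aware TTFT budget in milliseconds (formula from the commit message)."""
    return intercept_ms + per_token_ms * input_tokens


def ttft_meets_slo(ttft_ms: float, input_tokens: int,
                   intercept_ms: float = 4000.0, per_token_ms: float = 0.125) -> bool:
    """Hypothetical check; the inclusive comparison (<=) is an assumption."""
    return ttft_ms <= linear_ms_threshold(intercept_ms, per_token_ms, input_tokens)


# The commit's example "4s + L_in/8k": 4 s base, plus 1 s per 8000 input tokens.
base_budget = linear_ms_threshold(4000.0, 0.125, 0)
budget_8k = linear_ms_threshold(4000.0, 0.125, 8000)
```

With these parameters an empty prompt gets a 4000 ms budget and an 8000-token prompt gets 5000 ms, matching the "+1s per 8k input tokens" wording in the commit.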


108 Commits