aituner

Author	SHA1	Message	Date
Gahow Wang	4075c7abf0	Design declarative intervention harness	2026-06-26 17:15:06 +08:00
Gahow Wang	92eb186006	Add bad-start harness recovery planning	2026-06-26 16:44:24 +08:00
Gahow Wang	ce36cd79af	Document no-LLM harness mechanism	2026-06-25 10:32:29 +08:00
Gahow Wang	013b01baa1	Stop after gmu ceiling validation is exhausted	2026-06-24 22:45:42 +08:00
Gahow Wang	b075afe6f2	Continue gmu hill-climb after topology validation	2026-06-24 19:09:35 +08:00
Gahow Wang	8fa758797e	Guard generic topology search from introducing EP	2026-06-24 15:21:22 +08:00
Gahow Wang	c245774d76	Ignore generated run configs	2026-06-24 11:48:21 +08:00
Gahow Wang	d85572e7b5	Update AITuner roadmap framing	2026-06-24 11:45:42 +08:00
Gahow Wang	c0a9235b80	Document vLLM-first harness roadmap	2026-06-24 11:23:39 +08:00
Gahow Wang	c4173b2b3b	Document remote proxy setup	2026-06-23 20:12:53 +08:00
Gahow Wang	6d874ecbff	Update Qwen235B progress snapshot	2026-06-23 18:24:57 +08:00
Gahow Wang	403ae2e2b7	Document Qwen235B 2x2 progress	2026-06-23 18:23:56 +08:00
Gahow Wang	861d754f29	Localize Qwen27B harness ablation doc	2026-06-23 18:14:35 +08:00
Gahow Wang	76ec19224c	Document Qwen27B 2x2 harness ablation	2026-06-23 10:08:46 +08:00
Gahow Wang	e67bc86240	Probe coupled prefill runtime knobs before stop	2026-06-22 19:30:23 +08:00
Gahow Wang	fd94ab9f3b	Prevent prefill convergence stop before seq probe	2026-06-22 14:43:55 +08:00
Gahow Wang	4607711bb5	Add reusable clean pair runner	2026-06-22 00:05:31 +08:00
Gahow Wang	d23b69219b	Add clean dash1 harness ablation runner	2026-06-21 00:51:08 +08:00
Gahow Wang	488fae7e63	Add tuning progress report for harness evaluation	2026-06-21 00:48:21 +08:00
Gahow Wang	426151bc9f	Harness stop uses full state baseline	2026-06-20 22:48:27 +08:00
Gahow Wang	a9d237bbfd	Show effective flags in ablation trajectory	2026-06-20 10:24:53 +08:00
Gahow Wang	5257fbc1a2	Improve harness incumbent follow-up search	2026-06-20 05:37:15 +08:00
Gahow Wang	b3156a382a	Harness: gate gpu-mem-util/seqs-raise on 'no untested TP increase' (frontier-closed) The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled = cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1) and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92 at iter 2 instead of climbing to TP4. The candidate path runs before the topology- frontier check, so a score>=0.35 runtime candidate wins. Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 13:33:29 +08:00
Gahow Wang	76cca89a43	Add harness-only dash1 driver to verify the gpu-mem-util fix recovers ~0.87 + stops Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:29:32 +08:00
Gahow Wang	83162e7a64	Ablation: pin gpt-5.5 @ ai.gahow.org (chat.completions); re-read token per arm - Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions in both ablation specs so both arms uniformly use the current ~/.codex model (the prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off). - run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:27:47 +08:00
Gahow Wang	a3523f5601	Harness: explore gpu-memory-utilization (and raise max-num-seqs) before Stop-B The harness defined a gpu-memory-utilization family but hard-coded active_now=False and never generated a candidate for it, and only ever lowered max-num-seqs for decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873 (+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real coverage gap, not bad luck. Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology has moved off the baseline, so a baseline latency bottleneck still gets a TP change): - Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block a premature Stop-B until it is tried; the incumbent guard keeps the step only if per-GPU rate improves and the engine launches, and the tested signature terminates the climb (so 0.96 OOM/regression backs off to 0.94 automatically). - Let max-num-seqs rise for decode_tpot (not only fall) to exploit decode parallelism. - Activate the gpu-memory-utilization harness family for decode_tpot/admission. Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass. Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 10:25:47 +08:00
Gahow Wang	95c02d7dd9	Fig-18: chained driver for 2 extra naive runs (n=3 nondeterminism) A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form guess), which weakens the single-curve story. Run naive 2 more times on the same real-output substrate to capture the fail/slow/lucky spread -- the actual finding. Waits for ABLATION12_DONE so it never contends for GPUs with the main pair. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:06:05 +08:00
Gahow Wang	a1b804f879	Ablation: search.high 0.25 -> 0.15 (skip wildly-infeasible top probes) Smoke on the real-output substrate measured feasible sampling_u = 0.0156 (TP2) and 0.0742 (TP4, per-GPU 0.618 = 2.24x TP2). search.high=0.25 made the binary search waste its two top probes (u=0.125/0.0625, always infeasible, admitting the most long-output requests) on every trial. 0.15 keeps ~2x headroom over the TP4 boundary (0.0742) and trims ~15-20% of per-trial cost with identical feasibility results; if a runtime-tuned config ever saturates 0.15 the harness search-high saturation stop fires (informative, not silent). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 22:11:52 +08:00
Gahow Wang	0c23285f39	Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one: - Use the trace's real output_length (drop completion_tokens_override=128). The 0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode (TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap. - replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis far below the tau bar used everywhere else. New calibrator: scripts/calibrate_time_scale.py. - Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the wall-clock a feasible config needs to drain the LCA-admitted set (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real outputs decode dominates wall-clock, so the old fixed 320s cap would truncate the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a hard ceiling; the per-probe deadline governs. The lag cap still cuts overload. 12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound): scripts/run_ablation_pair_d1.sh. 115 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 17:24:00 +08:00
Gahow Wang	816765071f	Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6 iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and crashed the engine. Refined conclusion (matches paper §7.3): a strong model can sometimes find the right knob unaided, so the harness's value is reliability + speed + stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4, no regression. Naive: 3x slower at best, no stop, failed at worst. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:03:26 +08:00
Gahow Wang	97d2ddabb1	Ablation driver: force direct LLM connection (codex proxy is dash0-local) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 10:05:44 +08:00
Gahow Wang	8e58b4033d	Note dash1 lacks LLM gateway access (naive-completion deferred to dash0) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:55:39 +08:00
Gahow Wang	b779f6e56a	Add dash1 naive-completion driver for the ablation Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:52:54 +08:00
Gahow Wang	e7d1b3ba01	Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse refinements, premature LLM stop vetoed then honored -> converged, no regression. Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5 trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible config found. The bottleneck is compute; the harness steered to the knob family that adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:51:56 +08:00
Gahow Wang	579dd86698	Ablation: --skip-baseline so loops climb from first proposal The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time compression, which tripped baseline_all_infeasible and terminated the loop before any climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal (harness steers to TP from the long-prompt profile) — the ablation is about the proposal path, so an explicit TP1 row is not required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:59:46 +08:00
Gahow Wang	37342a5749	Add chained harness-vs-naive ablation driver (sequential runs + DONE marker) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:30:41 +08:00
Gahow Wang	5965f4fbbc	Ablation substrate: scale=0.5 + out=128 + 6 probes (TP1 measurable, tractable) scale=0.2 made TP1 uniformly infeasible (no baseline); bound decode to 128 tokens and use mild 2x compression so TP1 registers a real, fast baseline, with 6 probes to span TP1's low and TP4's high feasibility boundaries. Both configs identical except use_harness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:29:30 +08:00
Gahow Wang	a1cbab0e69	Document harness-vs-naive ablation: setup, substrate calibration, blocker Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B: - both configs committed and validated on dash0 (differ only in use_harness + study_id), LLM auth + clean engine launch confirmed; - characterizes exactly what the harness toggles (Harnesses: prompt section with ranked bottleneck hypotheses + knob-family steering, deterministic guided/stop proposals, Stop-B validator/veto) vs naive; - substrate calibration from a real harness-ON run: at scale=0.2 the 180s elapsed cap fires correctly but TP1 is uniformly infeasible even at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a real baseline; comparability caveat documented. Honest status: full two-run sweep NOT completed in-session (~5-6 GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM teardown re-validated). Includes a precise continuation recipe and the scripts/ablation_trajectory.py helper (validated against a prior store). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:16:27 +08:00
Gahow Wang	0794efa249	Reduce ablation probe budget to 3 per trial for tractability First TP1 baseline probe under scale=0.2 ran ~6min (severe overload, 260 preemptions on the lighter half of the trace; TP1 is decode-bound and the arrival-lag early-stop does not cut a decode-drain-bound probe). Cut search.max_probes 5->3 to bound binary-search steps per trial. Caps stay at elapsed=180/lag=30. Both configs still differ only in use_harness + study_id. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:01:19 +08:00
Gahow Wang	d975e57bb5	Scale ablation early-stop caps to the compressed window (scale=0.2) At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn ~15min each (the tractability hazard the brief flagged). Scale the caps proportionately to the time axis: early_stop_max_elapsed_s 900->180, early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain) finish well inside 180s; overloaded probes die in ~3min. Both configs still differ only in use_harness + study_id. Adds the ablation doc skeleton and a read-only trajectory-extraction helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:49:57 +08:00
Gahow Wang	a16016a876	Add harness vs naive ablation configs (27b, scale=0.2 substrate) Two configs identical except llm.use_harness and study_id, for the controlled harness-ON vs naive-OFF tuning-trajectory ablation on dense Qwen3.5-27B. Faster substrate (replay_time_scale=0.2, search.high=0.25, max_probes=5) keeps the ablation tractable; Stop-A stays enabled. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:31:23 +08:00
Gahow Wang	07f5d92e1d	Add consolidated two-stop summary doc Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:16:28 +08:00
Gahow Wang	f2ff0faebd	Document Stop-B end-to-end on dense 27B: the improving climb + no-regression Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x), each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00 and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding: at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven to an explicit Stop-B firing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 18:07:00 +08:00
Gahow Wang	4a64196a99	Add 27B Stop-B agentic-loop config (harness-driven, GPUs 2-7) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:08:46 +08:00
Gahow Wang	b17b213575	Tear down the engine on SIGTERM instead of orphaning it Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs, ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the prior handler afterward. Main-thread-guarded; unit-tested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:08:06 +08:00
Gahow Wang	93ce339d61	Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput: TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound (one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated measurements (TP1/TP2). TP4 saturated -> lower bound. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 01:54:40 +08:00
Gahow Wang	b1b74318f6	Pin 27B A/B to GPUs 2-7 (route around leaked GPU0/1 memory) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 23:01:22 +08:00
Gahow Wang	2fcaf80450	Wrap socket/timeout errors in HTTP client as HttpClientError stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped _run_one_request (which only catches HttpClientError), propagated through the probe, and crashed the whole trial ("failed: timed out"). A timed-out request is a failed request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError, ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s to 180s on the 27B run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:58:28 +08:00
Gahow Wang	3541065675	Speed up 27B TP A/B: request_timeout 180s, search.high 0.125 The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes, and the 900s request timeout made overloaded probes drain hung requests for 15min each. Cap drain at 180s and bound the search to where the boundaries actually are. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:40:42 +08:00
Gahow Wang	7678c7d5e8	Switch 27B TP A/B to length-aware TTFT SLO (4s + L_in/8k), widen search Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:35:23 +08:00

197 Commits
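
The error-wrapping fix described in commit 2fcaf80450 can be illustrated with a minimal sketch. `HttpClientError` is the name used in the commit message; everything else here (the `fetch` function, its `opener` parameter, the use of `urllib`) is an assumption for illustration and is not the repository's `stream_chat_completion`. The point is the ordering: `HTTPError` is itself an `OSError` subclass, so it must be caught first, and the broader `OSError` clause then covers `TimeoutError`, `URLError`, and `ConnectionError`.

```python
import urllib.error
import urllib.request


class HttpClientError(Exception):
    """Request-level failure that callers count as a failed request."""


def fetch(url, timeout_s, opener=urllib.request.urlopen):
    """Fetch url; raise HttpClientError on any HTTP or transport failure.

    `opener` is injectable (hypothetical parameter) so the behaviour can be
    exercised without a network.
    """
    try:
        with opener(url, timeout=timeout_s) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:
        # Must precede OSError: HTTPError -> URLError -> OSError.
        raise HttpClientError(f"HTTP status {exc.code}") from exc
    except OSError as exc:
        # Covers TimeoutError, URLError and ConnectionError alike.
        raise HttpClientError(f"transport error: {exc}") from exc
```

With this shape, a request that exceeds its timeout surfaces as `HttpClientError`, which a per-request wrapper can record as an SLO miss instead of letting a raw `TimeoutError` abort the whole trial.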