aituner

Author	SHA1	Message	Date
Gahow Wang	1dd3eaebaa	Add auto search high measurement policy	2026-06-26 20:05:22 +08:00
Gahow Wang	95ad124a1b	Document auto search high policy	2026-06-26 19:53:30 +08:00
Gahow Wang	384cb58f1f	Add declarative harness prototype	2026-06-26 18:07:02 +08:00
Gahow Wang	4075c7abf0	Design declarative intervention harness	2026-06-26 17:15:06 +08:00
Gahow Wang	92eb186006	Add bad-start harness recovery planning	2026-06-26 16:44:24 +08:00
Gahow Wang	ce36cd79af	Document no-LLM harness mechanism	2026-06-25 10:32:29 +08:00
Gahow Wang	013b01baa1	Stop after gmu ceiling validation is exhausted	2026-06-24 22:45:42 +08:00
Gahow Wang	b075afe6f2	Continue gmu hill-climb after topology validation	2026-06-24 19:09:35 +08:00
Gahow Wang	8fa758797e	Guard generic topology search from introducing EP	2026-06-24 15:21:22 +08:00
Gahow Wang	c245774d76	Ignore generated run configs	2026-06-24 11:48:21 +08:00
Gahow Wang	d85572e7b5	Update AITuner roadmap framing	2026-06-24 11:45:42 +08:00
Gahow Wang	c0a9235b80	Document vLLM-first harness roadmap	2026-06-24 11:23:39 +08:00
Gahow Wang	c4173b2b3b	Document remote proxy setup	2026-06-23 20:12:53 +08:00
Gahow Wang	6d874ecbff	Update Qwen235B progress snapshot	2026-06-23 18:24:57 +08:00
Gahow Wang	403ae2e2b7	Document Qwen235B 2x2 progress	2026-06-23 18:23:56 +08:00
Gahow Wang	861d754f29	Localize Qwen27B harness ablation doc	2026-06-23 18:14:35 +08:00
Gahow Wang	76ec19224c	Document Qwen27B 2x2 harness ablation	2026-06-23 10:08:46 +08:00
Gahow Wang	e67bc86240	Probe coupled prefill runtime knobs before stop	2026-06-22 19:30:23 +08:00
Gahow Wang	fd94ab9f3b	Prevent prefill convergence stop before seq probe	2026-06-22 14:43:55 +08:00
Gahow Wang	4607711bb5	Add reusable clean pair runner	2026-06-22 00:05:31 +08:00
Gahow Wang	d23b69219b	Add clean dash1 harness ablation runner	2026-06-21 00:51:08 +08:00
Gahow Wang	488fae7e63	Add tuning progress report for harness evaluation	2026-06-21 00:48:21 +08:00
Gahow Wang	426151bc9f	Harness stop uses full state baseline	2026-06-20 22:48:27 +08:00
Gahow Wang	a9d237bbfd	Show effective flags in ablation trajectory	2026-06-20 10:24:53 +08:00
Gahow Wang	5257fbc1a2	Improve harness incumbent follow-up search	2026-06-20 05:37:15 +08:00
Gahow Wang	b3156a382a	Harness: gate gpu-mem-util/seqs-raise on 'no untested TP increase' (frontier-closed) The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled = cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1) and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92 at iter 2 instead of climbing to TP4. The candidate path runs before the topology-frontier check, so a score>=0.35 runtime candidate wins. Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 13:33:29 +08:00
Gahow Wang	76cca89a43	Add harness-only dash1 driver to verify the gpu-mem-util fix recovers ~0.87 + stops Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:29:32 +08:00
Gahow Wang	83162e7a64	Ablation: pin gpt-5.5 @ ai.gahow.org (chat.completions); re-read token per arm - Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions in both ablation specs so both arms uniformly use the current ~/.codex model (the prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off). - run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:27:47 +08:00
Gahow Wang	a3523f5601	Harness: explore gpu-memory-utilization (and raise max-num-seqs) before Stop-B The harness defined a gpu-memory-utilization family but hard-coded active_now=False and never generated a candidate for it, and only ever lowered max-num-seqs for decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873 (+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real coverage gap, not bad luck. Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology has moved off the baseline, so a baseline latency bottleneck still gets a TP change): - Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block a premature Stop-B until it is tried; the incumbent guard keeps the step only if per-GPU rate improves and the engine launches, and the tested signature terminates the climb (so 0.96 OOM/regression backs off to 0.94 automatically). - Let max-num-seqs rise for decode_tpot (not only fall) to exploit decode parallelism. - Activate the gpu-memory-utilization harness family for decode_tpot/admission. Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass. Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 10:25:47 +08:00
Gahow Wang	95c02d7dd9	Fig-18: chained driver for 2 extra naive runs (n=3 nondeterminism) A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form guess), which weakens the single-curve story. Run naive 2 more times on the same real-output substrate to capture the fail/slow/lucky spread -- the actual finding. Waits for ABLATION12_DONE so it never contends for GPUs with the main pair. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:06:05 +08:00
Gahow Wang	a1b804f879	Ablation: search.high 0.25 -> 0.15 (skip wildly-infeasible top probes) Smoke on the real-output substrate measured feasible sampling_u = 0.0156 (TP2) and 0.0742 (TP4, per-GPU 0.618 = 2.24x TP2). search.high=0.25 made the binary search waste its two top probes (u=0.125/0.0625, always infeasible, admitting the most long-output requests) on every trial. 0.15 keeps ~2x headroom over the TP4 boundary (0.0742) and trims ~15-20% of per-trial cost with identical feasibility results; if a runtime-tuned config ever saturates 0.15 the harness search-high saturation stop fires (informative, not silent). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 22:11:52 +08:00
Gahow Wang	0c23285f39	Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one: - Use the trace's real output_length (drop completion_tokens_override=128). The 0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode (TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap. - replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis far below the tau bar used everywhere else. New calibrator: scripts/calibrate_time_scale.py. - Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the wall-clock a feasible config needs to drain the LCA-admitted set (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real outputs decode dominates wall-clock, so the old fixed 320s cap would truncate the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a hard ceiling; the per-probe deadline governs. The lag cap still cuts overload. 12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound): scripts/run_ablation_pair_d1.sh. 115 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 17:24:00 +08:00
Gahow Wang	816765071f	Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6 iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and crashed the engine. Refined conclusion (matches paper §7.3): a strong model can sometimes find the right knob unaided, so the harness's value is reliability + speed + stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4, no regression. Naive: 3x slower at best, no stop, failed at worst. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:03:26 +08:00
Gahow Wang	97d2ddabb1	Ablation driver: force direct LLM connection (codex proxy is dash0-local) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 10:05:44 +08:00
Gahow Wang	8e58b4033d	Note dash1 lacks LLM gateway access (naive-completion deferred to dash0) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:55:39 +08:00
Gahow Wang	b779f6e56a	Add dash1 naive-completion driver for the ablation Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:52:54 +08:00
Gahow Wang	e7d1b3ba01	Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse refinements, premature LLM stop vetoed then honored -> converged, no regression. Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5 trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible config found. The bottleneck is compute; the harness steered to the knob family that adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:51:56 +08:00
Gahow Wang	579dd86698	Ablation: --skip-baseline so loops climb from first proposal The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time compression, which tripped baseline_all_infeasible and terminated the loop before any climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal (harness steers to TP from the long-prompt profile) — the ablation is about the proposal path, so an explicit TP1 row is not required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:59:46 +08:00
Gahow Wang	37342a5749	Add chained harness-vs-naive ablation driver (sequential runs + DONE marker) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:30:41 +08:00
Gahow Wang	5965f4fbbc	Ablation substrate: scale=0.5 + out=128 + 6 probes (TP1 measurable, tractable) scale=0.2 made TP1 uniformly infeasible (no baseline); bound decode to 128 tokens and use mild 2x compression so TP1 registers a real, fast baseline, with 6 probes to span TP1's low and TP4's high feasibility boundaries. Both configs identical except use_harness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:29:30 +08:00
Gahow Wang	a1cbab0e69	Document harness-vs-naive ablation: setup, substrate calibration, blocker Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B: - both configs committed and validated on dash0 (differ only in use_harness + study_id), LLM auth + clean engine launch confirmed; - characterizes exactly what the harness toggles (Harnesses: prompt section with ranked bottleneck hypotheses + knob-family steering, deterministic guided/stop proposals, Stop-B validator/veto) vs naive; - substrate calibration from a real harness-ON run: at scale=0.2 the 180s elapsed cap fires correctly but TP1 is uniformly infeasible even at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a real baseline; comparability caveat documented. Honest status: full two-run sweep NOT completed in-session (~5-6 GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM teardown re-validated). Includes a precise continuation recipe and the scripts/ablation_trajectory.py helper (validated against a prior store). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:16:27 +08:00
Gahow Wang	0794efa249	Reduce ablation probe budget to 3 per trial for tractability First TP1 baseline probe under scale=0.2 ran ~6min (severe overload, 260 preemptions on the lighter half of the trace; TP1 is decode-bound and the arrival-lag early-stop does not cut a decode-drain-bound probe). Cut search.max_probes 5->3 to bound binary-search steps per trial. Caps stay at elapsed=180/lag=30. Both configs still differ only in use_harness + study_id. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:01:19 +08:00
Gahow Wang	d975e57bb5	Scale ablation early-stop caps to the compressed window (scale=0.2) At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn ~15min each (the tractability hazard the brief flagged). Scale the caps proportionately to the time axis: early_stop_max_elapsed_s 900->180, early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain) finish well inside 180s; overloaded probes die in ~3min. Both configs still differ only in use_harness + study_id. Adds the ablation doc skeleton and a read-only trajectory-extraction helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:49:57 +08:00
Gahow Wang	a16016a876	Add harness vs naive ablation configs (27b, scale=0.2 substrate) Two configs identical except llm.use_harness and study_id, for the controlled harness-ON vs naive-OFF tuning-trajectory ablation on dense Qwen3.5-27B. Faster substrate (replay_time_scale=0.2, search.high=0.25, max_probes=5) keeps the ablation tractable; Stop-A stays enabled. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:31:23 +08:00
Gahow Wang	07f5d92e1d	Add consolidated two-stop summary doc Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:16:28 +08:00
Gahow Wang	f2ff0faebd	Document Stop-B end-to-end on dense 27B: the improving climb + no-regression Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x), each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00 and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding: at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven to an explicit Stop-B firing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 18:07:00 +08:00
Gahow Wang	4a64196a99	Add 27B Stop-B agentic-loop config (harness-driven, GPUs 2-7) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:08:46 +08:00
Gahow Wang	b17b213575	Tear down the engine on SIGTERM instead of orphaning it Killing `study tune` with a default SIGTERM skipped the finally blocks, leaving the vLLM engine and its EngineCore workers (which inherit the AITUNER_* marker env) alive on the GPUs — twice leaking GPU memory that needed a root reset. Install a SIGTERM handler in run_trial that raises KeyboardInterrupt so _terminate_process_tree runs, ignore SIGTERM during teardown so a second signal can't re-orphan it, and restore the prior handler afterward. Main-thread-guarded; unit-tested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:08:06 +08:00
Gahow Wang	93ce339d61	Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput: TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound (one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated measurements (TP1/TP2). TP4 saturated -> lower bound. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 01:54:40 +08:00
Gahow Wang	b1b74318f6	Pin 27B A/B to GPUs 2-7 (route around leaked GPU0/1 memory) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 23:01:22 +08:00
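
Illustration for commits a3523f5601 and b3156a382a: a minimal sketch of the gpu-memory-utilization hill-climb (+0.02 per step toward a 0.97 ceiling, terminated by an already-tested value) and of the "frontier closed" gate. The function names, the `tested` sets, and the assumption that the next TP is reached by doubling are hypothetical; the repository's actual `_next_allowed_tp` and candidate scoring are not shown in this log.

```python
def next_gpu_mem_util(current, tested, step=0.02, ceiling=0.97):
    """Return the next untested value one step up, or None if the climb is over.

    `tested` holds values already tried (including ones that OOMed or regressed).
    """
    candidate = round(min(current + step, ceiling), 2)
    if candidate <= current or candidate in tested:
        return None  # at the ceiling, or the next step was already tried
    return candidate


def topology_settled(current_tp, tested_tps, gpu_count):
    """True when no untested TP increase remains within the GPU count.

    Assumption: the next TP is double the current one (TP1 -> TP2 -> TP4).
    """
    next_tp = current_tp * 2
    return next_tp > gpu_count or next_tp in tested_tps


def propose(current_tp, tested_tps, gpu_count, mem_util, tested_mem_utils):
    """Climb TP first; tune gpu-memory-utilization only once the frontier is closed."""
    if not topology_settled(current_tp, tested_tps, gpu_count):
        return ("tensor-parallel-size", current_tp * 2)
    nxt = next_gpu_mem_util(mem_util, tested_mem_utils)
    if nxt is not None:
        return ("gpu-memory-utilization", nxt)
    return ("stop", None)
```

Under these assumptions, a TP2 incumbent on 6 GPUs with TP4 untested yields a TP proposal, and a settled TP4 incumbent at 0.90 yields a raise to 0.92, matching the two regression tests the commit messages describe.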


200 Commits
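
Illustration for commit b17b213575: a minimal sketch of the SIGTERM pattern that message describes (raise KeyboardInterrupt from a SIGTERM handler so `finally` blocks run, ignore SIGTERM during teardown, restore the prior handler afterward, main-thread-guarded). `run_with_teardown`, `work`, and `teardown` are hypothetical names; the real `run_trial` and `_terminate_process_tree` are not reproduced here.

```python
import signal
import threading


def _raise_interrupt(signum, frame):
    # Convert SIGTERM into an exception so finally blocks get a chance to run.
    raise KeyboardInterrupt(f"received signal {signum}")


def run_with_teardown(work, teardown):
    """Run work(); always run teardown(), shielded from a second SIGTERM."""
    # signal.signal() may only be called from the main thread.
    in_main = threading.current_thread() is threading.main_thread()
    previous = None
    if in_main:
        previous = signal.signal(signal.SIGTERM, _raise_interrupt)
    try:
        return work()
    finally:
        if in_main:
            # A second SIGTERM during teardown must not orphan the child again.
            signal.signal(signal.SIGTERM, signal.SIG_IGN)
        try:
            teardown()
        finally:
            if in_main:
                # signal.signal() returns None for handlers not set from Python.
                restored = previous if previous is not None else signal.SIG_DFL
                signal.signal(signal.SIGTERM, restored)
```

In this sketch the KeyboardInterrupt still propagates to the caller after teardown has run, so the process exits through its normal interrupt path instead of dying with the engine still attached to the GPUs.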