aituner

Author	SHA1	Message	Date
Gahow Wang	ee101a7c24	Harden prefill scheduler harness	2026-06-29 01:54:02 +08:00
Gahow Wang	bfd85793f3	Prioritize uncovered prefill scheduler candidates	2026-06-29 01:30:34 +08:00
Gahow Wang	36c301c128	Add normalized prefill scheduler harness	2026-06-29 01:12:19 +08:00
Gahow Wang	7ad439730e	Add llm-first tuning proposal policy	2026-06-27 12:21:51 +08:00
Gahow Wang	9accf2575e	Require harness proposals from candidate sets	2026-06-27 01:03:30 +08:00
Gahow Wang	bef260f183	Document bad-start robustness suite	2026-06-26 22:19:46 +08:00
Gahow Wang	2937539b49	Persist harness candidate set snapshots	2026-06-26 22:17:47 +08:00
Gahow Wang	5080b50315	Veto repeated materialized configs	2026-06-26 22:15:47 +08:00
Gahow Wang	825d3e03e9	Add harness candidate set audit	2026-06-26 22:02:09 +08:00
Gahow Wang	42f75553a6	Document full config signature validation	2026-06-26 21:52:18 +08:00
Gahow Wang	48911b658b	Use normalized full config signatures	2026-06-26 21:28:10 +08:00
Gahow Wang	7f50b8b8ea	Document bad-start validation results	2026-06-26 20:50:20 +08:00
Gahow Wang	c8a0f9870e	Tighten topology and auto-high validation	2026-06-26 20:07:23 +08:00
Gahow Wang	1dd3eaebaa	Add auto search high measurement policy	2026-06-26 20:05:22 +08:00
Gahow Wang	95ad124a1b	Document auto search high policy	2026-06-26 19:53:30 +08:00
Gahow Wang	384cb58f1f	Add declarative harness prototype	2026-06-26 18:07:02 +08:00
Gahow Wang	4075c7abf0	Design declarative intervention harness	2026-06-26 17:15:06 +08:00
Gahow Wang	92eb186006	Add bad-start harness recovery planning	2026-06-26 16:44:24 +08:00
Gahow Wang	ce36cd79af	Document no-LLM harness mechanism	2026-06-25 10:32:29 +08:00
Gahow Wang	013b01baa1	Stop after gmu ceiling validation is exhausted	2026-06-24 22:45:42 +08:00
Gahow Wang	b075afe6f2	Continue gmu hill-climb after topology validation	2026-06-24 19:09:35 +08:00
Gahow Wang	8fa758797e	Guard generic topology search from introducing EP	2026-06-24 15:21:22 +08:00
Gahow Wang	c245774d76	Ignore generated run configs	2026-06-24 11:48:21 +08:00
Gahow Wang	d85572e7b5	Update AITuner roadmap framing	2026-06-24 11:45:42 +08:00
Gahow Wang	c0a9235b80	Document vLLM-first harness roadmap	2026-06-24 11:23:39 +08:00
Gahow Wang	c4173b2b3b	Document remote proxy setup	2026-06-23 20:12:53 +08:00
Gahow Wang	6d874ecbff	Update Qwen235B progress snapshot	2026-06-23 18:24:57 +08:00
Gahow Wang	403ae2e2b7	Document Qwen235B 2x2 progress	2026-06-23 18:23:56 +08:00
Gahow Wang	861d754f29	Localize Qwen27B harness ablation doc	2026-06-23 18:14:35 +08:00
Gahow Wang	76ec19224c	Document Qwen27B 2x2 harness ablation	2026-06-23 10:08:46 +08:00
Gahow Wang	e67bc86240	Probe coupled prefill runtime knobs before stop	2026-06-22 19:30:23 +08:00
Gahow Wang	fd94ab9f3b	Prevent prefill convergence stop before seq probe	2026-06-22 14:43:55 +08:00
Gahow Wang	4607711bb5	Add reusable clean pair runner	2026-06-22 00:05:31 +08:00
Gahow Wang	d23b69219b	Add clean dash1 harness ablation runner	2026-06-21 00:51:08 +08:00
Gahow Wang	488fae7e63	Add tuning progress report for harness evaluation	2026-06-21 00:48:21 +08:00
Gahow Wang	426151bc9f	Harness stop uses full state baseline	2026-06-20 22:48:27 +08:00
Gahow Wang	a9d237bbfd	Show effective flags in ablation trajectory	2026-06-20 10:24:53 +08:00
Gahow Wang	5257fbc1a2	Improve harness incumbent follow-up search	2026-06-20 05:37:15 +08:00
Gahow Wang	b3156a382a	Harness: gate gpu-mem-util/seqs-raise on 'no untested TP increase' (frontier-closed) The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled = cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1) and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92 at iter 2 instead of climbing to TP4. The candidate path runs before the topology-frontier check, so a score>=0.35 runtime candidate wins. Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 13:33:29 +08:00
Gahow Wang	76cca89a43	Add harness-only dash1 driver to verify the gpu-mem-util fix recovers ~0.87 + stops Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:29:32 +08:00
Gahow Wang	83162e7a64	Ablation: pin gpt-5.5 @ ai.gahow.org (chat.completions); re-read token per arm - Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions in both ablation specs so both arms uniformly use the current ~/.codex model (the prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off). - run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:27:47 +08:00
Gahow Wang	a3523f5601	Harness: explore gpu-memory-utilization (and raise max-num-seqs) before Stop-B The harness defined a gpu-memory-utilization family but hard-coded active_now=False and never generated a candidate for it, and only ever lowered max-num-seqs for decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873 (+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real coverage gap, not bad luck. Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology has moved off the baseline, so a baseline latency bottleneck still gets a TP change): - Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block a premature Stop-B until it is tried; the incumbent guard keeps the step only if per-GPU rate improves and the engine launches, and the tested signature terminates the climb (so 0.96 OOM/regression backs off to 0.94 automatically). - Let max-num-seqs rise for decode_tpot (not only fall) to exploit decode parallelism. - Activate the gpu-memory-utilization harness family for decode_tpot/admission. Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass. Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 10:25:47 +08:00
Gahow Wang	95c02d7dd9	Fig-18: chained driver for 2 extra naive runs (n=3 nondeterminism) A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form guess), which weakens the single-curve story. Run naive 2 more times on the same real-output substrate to capture the fail/slow/lucky spread -- the actual finding. Waits for ABLATION12_DONE so it never contends for GPUs with the main pair. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:06:05 +08:00
Gahow Wang	a1b804f879	Ablation: search.high 0.25 -> 0.15 (skip wildly-infeasible top probes) Smoke on the real-output substrate measured feasible sampling_u = 0.0156 (TP2) and 0.0742 (TP4, per-GPU 0.618 = 2.24x TP2). search.high=0.25 made the binary search waste its two top probes (u=0.125/0.0625, always infeasible, admitting the most long-output requests) on every trial. 0.15 keeps ~2x headroom over the TP4 boundary (0.0742) and trims ~15-20% of per-trial cost with identical feasibility results; if a runtime-tuned config ever saturates 0.15 the harness search-high saturation stop fires (informative, not silent). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 22:11:52 +08:00
Gahow Wang	0c23285f39	Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one: - Use the trace's real output_length (drop completion_tokens_override=128). The 0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode (TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap. - replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis far below the tau bar used everywhere else. New calibrator: scripts/calibrate_time_scale.py. - Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the wall-clock a feasible config needs to drain the LCA-admitted set (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real outputs decode dominates wall-clock, so the old fixed 320s cap would truncate the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a hard ceiling; the per-probe deadline governs. The lag cap still cuts overload. 12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound): scripts/run_ablation_pair_d1.sh. 115 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 17:24:00 +08:00
Gahow Wang	816765071f	Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6 iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and crashed the engine. Refined conclusion (matches paper §7.3): a strong model can sometimes find the right knob unaided, so the harness's value is reliability + speed + stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4, no regression. Naive: 3x slower at best, no stop, failed at worst. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:03:26 +08:00
Gahow Wang	97d2ddabb1	Ablation driver: force direct LLM connection (codex proxy is dash0-local) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 10:05:44 +08:00
Gahow Wang	8e58b4033d	Note dash1 lacks LLM gateway access (naive-completion deferred to dash0) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:55:39 +08:00
Gahow Wang	b779f6e56a	Add dash1 naive-completion driver for the ablation Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:52:54 +08:00
Gahow Wang	e7d1b3ba01	Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse refinements, premature LLM stop vetoed then honored -> converged, no regression. Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5 trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible config found. The bottleneck is compute; the harness steered to the knob family that adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:51:56 +08:00
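
Commits a3523f5601 and b3156a382a describe a gate that keeps runtime micro-tuning (the gpu-memory-utilization hill-climb) from firing while an untested TP increase is still reachable. The rule might be sketched roughly as below. This is an illustration only, not the repository's code: the function names, the power-of-two TP progression, and the return shape are assumptions, while the +0.02 step and the 0.97 ceiling come from the commit messages.

```python
from typing import Optional, Set, Tuple

GMU_STEP = 0.02      # step size quoted in commit a3523f5601
GMU_CEILING = 0.97   # "safe ceiling" quoted in the same commit


def next_allowed_tp(cur_tp: int, gpu_count: int, tested_tps: Set[int]) -> Optional[int]:
    """Hypothetical helper: next untested TP that fits the GPU count.

    Assumes TP sizes double (1, 2, 4, ...); the real harness may differ.
    """
    tp = cur_tp * 2
    while tp <= gpu_count:
        if tp not in tested_tps:
            return tp
        tp *= 2
    return None


def propose(cur_tp: int, cur_gmu: float, gpu_count: int,
            tested_tps: Set[int], tested_gmus: Set[float]) -> Tuple[str, Optional[float]]:
    """Return ("tp", value), ("gmu", value), or ("stop", None)."""
    tp = next_allowed_tp(cur_tp, gpu_count, tested_tps)
    if tp is not None:
        # Frontier still open: climb topology before any runtime tuning.
        return ("tp", tp)
    nxt = round(cur_gmu + GMU_STEP, 2)
    if nxt <= GMU_CEILING and nxt not in tested_gmus:
        # Frontier closed: take one hill-climb step on gpu-memory-utilization.
        return ("gmu", nxt)
    # Ceiling reached or the next step was already tried (e.g. it regressed).
    return ("stop", None)
```

With 6 GPUs, a TP2 incumbent yields `("tp", 4)` under these assumptions, and only a TP4 incumbent gets a gpu-memory-utilization raise, which mirrors the regression test described in b3156a382a.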

213 Commits
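
Commit 0c23285f39 describes a per-probe drain deadline computed as last arrival plus worst-case TTFT plus p99 output tokens times the TPOT budget plus a margin, with early_stop_max_elapsed_s (1000s) acting as a hard ceiling. A minimal sketch of that arithmetic follows. The function name, parameter names, and the sample inputs other than the p99 of 2436 tokens are hypothetical, and the actual worker._probe_drain_deadline may be structured differently.

```python
def probe_drain_deadline(last_arrival_s: float,
                         worst_ttft_s: float,
                         p99_output_tokens: int,
                         tpot_budget_s: float,
                         margin_s: float,
                         hard_ceiling_s: float = 1000.0) -> float:
    """Wall-clock seconds a feasible config needs to drain the admitted set.

    Mirrors the formula in the commit message; capped by the hard ceiling.
    """
    needed = (last_arrival_s
              + worst_ttft_s
              + p99_output_tokens * tpot_budget_s
              + margin_s)
    return min(needed, hard_ceiling_s)


# Illustrative inputs only: 300 s arrival window, 5 s TTFT bound,
# p99 = 2436 output tokens (from the commit), 50 ms TPOT budget, 30 s margin.
deadline = probe_drain_deadline(300.0, 5.0, 2436, 0.05, 30.0)
```

With these sample numbers the decode term (2436 tokens at 50 ms each, about 122 s) is much larger than the TTFT term, which is consistent with the commit's remark that decode dominates wall-clock once real output lengths are used.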