aituner

Author	SHA1	Message	Date
Gahow Wang	92eb186006	Add bad-start harness recovery planning	2026-06-26 16:44:24 +08:00
Gahow Wang	ce36cd79af	Document no-LLM harness mechanism	2026-06-25 10:32:29 +08:00
Gahow Wang	d85572e7b5	Update AITuner roadmap framing	2026-06-24 11:45:42 +08:00
Gahow Wang	c0a9235b80	Document vLLM-first harness roadmap	2026-06-24 11:23:39 +08:00
Gahow Wang	6d874ecbff	Update Qwen235B progress snapshot	2026-06-23 18:24:57 +08:00
Gahow Wang	403ae2e2b7	Document Qwen235B 2x2 progress	2026-06-23 18:23:56 +08:00
Gahow Wang	861d754f29	Localize Qwen27B harness ablation doc	2026-06-23 18:14:35 +08:00
Gahow Wang	76ec19224c	Document Qwen27B 2x2 harness ablation	2026-06-23 10:08:46 +08:00
Gahow Wang	816765071f	Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6 iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and crashed the engine. Refined conclusion (matches paper §7.3): a strong model can sometimes find the right knob unaided, so the harness's value is reliability + speed + stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4, no regression. Naive: 3x slower at best, no stop, failed at worst. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:03:26 +08:00
Gahow Wang	8e58b4033d	Note dash1 lacks LLM gateway access (naive-completion deferred to dash0) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:55:39 +08:00
Gahow Wang	e7d1b3ba01	Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse refinements, premature LLM stop vetoed then honored -> converged, no regression. Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5 trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible config found. The bottleneck is compute; the harness steered to the knob family that adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:51:56 +08:00
Gahow Wang	a1cbab0e69	Document harness-vs-naive ablation: setup, substrate calibration, blocker Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B: - both configs committed and validated on dash0 (differ only in use_harness + study_id), LLM auth + clean engine launch confirmed; - characterizes exactly what the harness toggles (Harnesses: prompt section with ranked bottleneck hypotheses + knob-family steering, deterministic guided/stop proposals, Stop-B validator/veto) vs naive; - substrate calibration from a real harness-ON run: at scale=0.2 the 180s elapsed cap fires correctly but TP1 is uniformly infeasible even at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a real baseline; comparability caveat documented. Honest status: full two-run sweep NOT completed in-session (~5-6 GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM teardown re-validated). Includes a precise continuation recipe and the scripts/ablation_trajectory.py helper (validated against a prior store). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:16:27 +08:00
Gahow Wang	d975e57bb5	Scale ablation early-stop caps to the compressed window (scale=0.2) At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn ~15min each (the tractability hazard the brief flagged). Scale the caps proportionately to the time axis: early_stop_max_elapsed_s 900->180, early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain) finish well inside 180s; overloaded probes die in ~3min. Both configs still differ only in use_harness + study_id. Adds the ablation doc skeleton and a read-only trajectory-extraction helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:49:57 +08:00
Gahow Wang	07f5d92e1d	Add consolidated two-stop summary doc Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:16:28 +08:00
Gahow Wang	f2ff0faebd	Document Stop-B end-to-end on dense 27B: the improving climb + no-regression Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x), each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00 and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding: at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven to an explicit Stop-B firing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 18:07:00 +08:00
Gahow Wang	93ce339d61	Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput: TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound (one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated measurements (TP1/TP2). TP4 saturated -> lower bound. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 01:54:40 +08:00
Gahow Wang	77af4ded2a	Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime) The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:40:38 +08:00
Gahow Wang	90c3eb51c8	Document Stop-B end-to-end validation (Phase 5) Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both Stop-B paths: search-high-saturation (validator-authorized immediate stop) and multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90 req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is correctly never adopted (no regression). The Phase-4 authority model is exercised live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later justified stop is honored after the veto budget. EP launch-failures handled as hard-negative evidence. Auditable reason chains throughout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:58:44 +08:00
Gahow Wang	f31e9ccfd5	Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved With the guard enabled the binary search recovers best sampling_u=0.078125 (rate 2.30 req/s), identical to the full-replay baseline. The guard fired on exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible); the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial with no peak-rate overestimate. Stop-A + boundary guard is safe to enable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:57:53 +08:00
Gahow Wang	9f52812753	Document Stop-A validation: calibration + GPU fidelity check CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and shows C-convergence difficulty is driven by signal noise (low-reuse chat) not reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts preserved; the one mismatch is a boundary false-positive at the feasibility knee (prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:03:16 +08:00
Gahow Wang	8b4116fad0	Add reference paper and qwen27b tpot25 16-iter notes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:02:30 +08:00
Gahow Wang	984eb1f325	Document 8-GPU harness ablation results for qwen27b and qwen235b prefill Add completed experiment results from dash0 runs after 2026-05-13: - qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU) - qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU) Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation log with completed run status.	2026-05-16 21:23:16 +08:00
Gahow Wang	f18765b235	Document eight-GPU harness rerun	2026-05-13 09:04:14 +08:00
Gahow Wang	5c2958e6c1	Constrain harness topology by visible GPUs	2026-05-13 01:25:31 +08:00
Gahow Wang	fb6d74a18c	Document harness v2 rerun criteria	2026-05-12 22:23:12 +08:00
Gahow Wang	ef359c8eea	Document profile-driven harness run	2026-05-12 21:40:19 +08:00
Gahow Wang	17e9681ca0	Add profile-driven harness planner	2026-05-12 21:28:44 +08:00
Gahow Wang	63d6a111f4	Document profile-driven harness design	2026-05-12 21:09:29 +08:00
Gahow Wang	14259fcec9	Measure lower-range performance for infeasible trials	2026-05-10 14:30:34 +08:00
Gahow Wang	bf7c02e721	Clarify qwen27b raw per-iteration performance	2026-05-10 14:24:10 +08:00
Gahow Wang	b0325ecfd9	Clarify qwen235b raw per-iteration performance	2026-05-10 14:21:49 +08:00
Gahow Wang	4cfd3757b6	Document qwen235b prefill harness ablation	2026-05-10 13:05:49 +08:00
Gahow Wang	307e2eb0e8	Document qwen27b harness ablation	2026-05-10 01:12:21 +08:00
Gahow Wang	adc4351e5d	Report latency stats for infeasible baseline	2026-05-08 11:10:34 +08:00
Gahow Wang	eb137a0b62	Document TPOT40 baseline infeasible run	2026-05-08 02:57:03 +08:00
Gahow Wang	d7df1ebdac	Add open source project metadata	2026-05-06 21:18:21 +08:00
Gahow Wang	871c4cfc02	Document qwen27b chat setup audit	2026-05-06 20:32:09 +08:00
Gahow Wang	98cd6dd81a	Document qwen27b current config harness curve	2026-05-06 18:00:43 +08:00
Gahow Wang	5d96689ea6	Make harness runtime refinement memory safe	2026-05-06 17:37:31 +08:00
Gahow Wang	cf2e741550	Document high search rerun	2026-05-06 03:19:51 +08:00
Gahow Wang	915861b706	Document community vllm harness ablation	2026-05-02 11:17:24 +08:00
Gahow Wang	ccbf24ac47	Use time-compressed community vllm ablation	2026-05-02 10:03:59 +08:00
Gahow Wang	d3d4c234f6	Bound community vllm ablation replay	2026-05-02 09:58:56 +08:00
Gahow Wang	4ef69cce78	Make harness stop conservative for ablation	2026-05-02 09:47:16 +08:00
Gahow Wang	664aeb49b2	Use local cache for qwen30b vllm runs	2026-05-02 08:47:16 +08:00
Gahow Wang	1880e859b5	Use vllm cu129 wheel on dash0	2026-05-02 08:28:23 +08:00
Gahow Wang	e215827503	Use uv auto torch backend for vllm 0.20	2026-05-02 08:21:27 +08:00
Gahow Wang	a7c9518ef6	Use local vllm venv for dash0 community run	2026-05-02 08:17:04 +08:00
Gahow Wang	1a3d628268	Add harness early stop ablation	2026-05-02 08:08:14 +08:00
Gahow Wang	6d3459c82d	Document decode harness one-shot mechanism	2026-05-02 06:25:06 +08:00


78 Commits