aituner

Author	SHA1	Message	Date
Gahow Wang	4607711bb5	Add reusable clean pair runner	2026-06-22 00:05:31 +08:00
Gahow Wang	d23b69219b	Add clean dash1 harness ablation runner	2026-06-21 00:51:08 +08:00
Gahow Wang	488fae7e63	Add tuning progress report for harness evaluation	2026-06-21 00:48:21 +08:00
Gahow Wang	a9d237bbfd	Show effective flags in ablation trajectory	2026-06-20 10:24:53 +08:00
Gahow Wang	76cca89a43	Add harness-only dash1 driver to verify the gpu-mem-util fix recovers ~0.87 + stops Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:29:32 +08:00
Gahow Wang	83162e7a64	Ablation: pin gpt-5.5 @ ai.gahow.org (chat.completions); re-read token per arm. (1) Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions in both ablation specs so both arms uniformly use the current ~/.codex model (the prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off). (2) run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:27:47 +08:00
Gahow Wang	95c02d7dd9	Fig-18: chained driver for 2 extra naive runs (n=3 nondeterminism). A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form guess), which weakens the single-curve story. Run naive 2 more times on the same real-output substrate to capture the fail/slow/lucky spread, which is the actual finding. Waits for ABLATION12_DONE so it never contends for GPUs with the main pair. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:06:05 +08:00
Gahow Wang	0c23285f39	Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline. Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one: (1) Use the trace's real output_length (drop completion_tokens_override=128). The 0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode (TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap. (2) replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis far below the tau bar used everywhere else. New calibrator: scripts/calibrate_time_scale.py. (3) Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the wall-clock a feasible config needs to drain the LCA-admitted set (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real outputs decode dominates wall-clock, so the old fixed 320s cap would truncate the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a hard ceiling; the per-probe deadline governs. The lag cap still cuts overload. 12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound): scripts/run_ablation_pair_d1.sh. 115 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 17:24:00 +08:00
Gahow Wang	97d2ddabb1	Ablation driver: force direct LLM connection (codex proxy is dash0-local) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 10:05:44 +08:00
Gahow Wang	b779f6e56a	Add dash1 naive-completion driver for the ablation Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:52:54 +08:00
Gahow Wang	579dd86698	Ablation: --skip-baseline so loops climb from first proposal. The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time compression, which tripped baseline_all_infeasible and terminated the loop before any climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal (harness steers to TP from the long-prompt profile); the ablation is about the proposal path, so an explicit TP1 row is not required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:59:46 +08:00
Gahow Wang	37342a5749	Add chained harness-vs-naive ablation driver (sequential runs + DONE marker) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:30:41 +08:00
Gahow Wang	d975e57bb5	Scale ablation early-stop caps to the compressed window (scale=0.2). At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn ~15min each (the tractability hazard the brief flagged). Scale the caps proportionately to the time axis: early_stop_max_elapsed_s 900->180, early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain) finish well inside 180s; overloaded probes die in ~3min. Both configs still differ only in use_harness + study_id. Adds the ablation doc skeleton and a read-only trajectory-extraction helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:49:57 +08:00
Gahow Wang	958739027a	Fix Stop-A validation config: system vllm, cap max-model-len Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:22:48 +08:00
Gahow Wang	08e53fd897	Add Stop-A calibration script (CPU-only convergence curve). Prints the offered-L-C-A convergence curve and the stop fraction at candidate tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare how late C converges across workloads. No serving required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:10:02 +08:00
Gahow Wang	0f15bbc3f1	Make the offered-load axis session-coherent. Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the realized KV-cache hit rate as offered load dropped, so the feasibility boundary was measured on a workload with a different C than production, contradicting the paper's scale-stationary L-C-A premise. prepare_trace_windows now resolves each row's session root via the parent_chat_id chain in a single streaming pass and assigns sampling_u per session, so thresholding keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose parent fell outside the span fall back to grouping under the parent id. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:16:06 +08:00
Gahow Wang	6f8e3c95c1	Unify harness L-C-A on the canonical lca.WorkloadProfile. Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile` previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block authoritative: build_harness_context now accepts an optional workload_profile and renders the canonical 10-dim vector + per-family stats when present, falling back to the legacy rendering only when no profile is supplied (direct unit-test calls). Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the profile via lca.build_study_workload_profile and pass it through build_prompt. The heuristic regime classifiers keep reading window_summary; that is the heuristic layer, distinct from the similarity metric. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:12:17 +08:00
Gahow Wang	c1ff64381d	Harden trial measurement accounting	2026-05-06 21:18:09 +08:00
Gahow Wang	26f3b46966	compare: add multi-candidate runner	2026-04-13 20:50:39 +08:00
Gahow Wang	4625fba487	trace: make window materialization atomic	2026-04-12 23:09:30 +08:00
Gahow Wang	631a076498	trace: include weekend legacy windows	2026-04-12 22:43:02 +08:00
Gahow Wang	edfd61a696	Add qwen235b prefill docs and tight TTFT spec	2026-04-12 11:24:23 +08:00
Gahow Wang	7b7eaafd78	Use time-based trace window ids	2026-04-04 22:09:43 +08:00
Gahow Wang	4e1401f50c	Stream trace window materialization	2026-04-04 21:49:03 +08:00
Gahow Wang	69f666593e	Speed up raw trace window extraction	2026-04-04 21:42:02 +08:00
Gahow Wang	65b122fd4b	Add raw trace window preparation script	2026-04-04 21:37:51 +08:00
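
Commit 0c23285f39 describes the per-probe drain deadline as last_arrival + worst-case TTFT + p99_out * TPOT budget + margin, with early_stop_max_elapsed_s acting as a hard ceiling. The sketch below only illustrates that arithmetic. The function name, parameter names, and the numbers in the usage lines are assumptions made for this example; it is not the actual worker._probe_drain_deadline implementation.

```python
def probe_drain_deadline(
    last_arrival_s: float,
    ttft_budget_s: float,
    p99_output_tokens: float,
    tpot_budget_s: float,
    margin_s: float,
    hard_ceiling_s: float = 1000.0,
) -> float:
    """Wall-clock seconds a feasible config may need to drain the admitted set.

    All names are hypothetical. The formula follows the commit message:
    last arrival + worst-case TTFT + p99 output length * per-token budget + margin,
    clamped by the hard ceiling (1000s in the commit message).
    """
    decode_s = p99_output_tokens * tpot_budget_s
    deadline = last_arrival_s + ttft_budget_s + decode_s + margin_s
    return min(deadline, hard_ceiling_s)


# Illustrative inputs only: p99=2436 output tokens is quoted in the commit,
# the other values are made up for the example.
typical = probe_drain_deadline(100.0, 2.0, 2436, 0.05, 10.0)
clamped = probe_drain_deadline(900.0, 5.0, 2436, 0.10, 10.0)
```

With these made-up inputs the second call exceeds the ceiling and is clamped, which mirrors the commit's statement that the per-probe deadline governs while the elapsed cap remains a hard upper bound.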

26 Commits
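
Commit 0f15bbc3f1 describes session-coherent subsampling: each row's session root is resolved through the parent_chat_id chain, and sampling_u is assigned per session so that thresholding keeps or drops whole sessions. A minimal sketch of that idea follows. It is not the code in prepare_trace_windows: the row layout (dicts with chat_id and parent_chat_id), the hash-based score, the seed, and the helper names are assumptions, and it builds an in-memory parent map instead of the single streaming pass the commit mentions.

```python
import hashlib


def session_root(chat_id, parent_of):
    """Follow parent links until a row with no parent is reached.

    If a parent id is missing from the span, that parent id itself is
    returned, so orphaned rows still group under their parent id.
    """
    seen = set()
    current = chat_id
    while (
        current in parent_of
        and parent_of[current] is not None
        and current not in seen  # guard against accidental cycles
    ):
        seen.add(current)
        current = parent_of[current]
    return current


def sampling_u(session_key, seed="0"):
    """Deterministic score in [0, 1) derived from the session key (assumed scheme)."""
    digest = hashlib.sha256(f"{seed}:{session_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_rows(rows, fraction):
    """Keep every row whose session score falls below the offered-load fraction."""
    parent_of = {r["chat_id"]: r.get("parent_chat_id") for r in rows}
    return [
        r for r in rows
        if sampling_u(session_root(r["chat_id"], parent_of)) < fraction
    ]


rows = [
    {"chat_id": "a", "parent_chat_id": None},
    {"chat_id": "b", "parent_chat_id": "a"},
    {"chat_id": "c", "parent_chat_id": "b"},
    {"chat_id": "d", "parent_chat_id": "x"},  # parent outside the span
]
kept_half = keep_rows(rows, 0.5)
```

Because the score depends only on the session key, turns a, b, and c are always kept or dropped together at any threshold, which is the property the commit relies on to preserve intra-session prefix reuse.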