aituner

Author	SHA1	Message	Date
Gahow Wang	2fcaf80450	Wrap socket/timeout errors in HTTP client as HttpClientError stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped _run_one_request (which only catches HttpClientError), propagated through the probe, and crashed the whole trial ("failed: timed out"). A timed-out request is a failed request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError, ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s to 180s on the 27B run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:58:28 +08:00
Gahow Wang	3541065675	Speed up 27B TP A/B: request_timeout 180s, search.high 0.125 The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes, and the 900s request timeout made overloaded probes drain hung requests for 15min each. Cap drain at 180s and bound the search to where the boundaries actually are. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:40:42 +08:00
Gahow Wang	7678c7d5e8	Switch 27B TP A/B to length-aware TTFT SLO (4s + L_in/8k), widen search Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:35:23 +08:00
Gahow Wang	ed2bbe0323	Add linear_ms SLO rule (length-aware TTFT budget) threshold_ms = intercept_ms + per_token_ms * input_tokens. Lets the TTFT target scale with prefill work, e.g. "4s + L_in/8k" => intercept_ms=4000, per_token_ms=0.125 (4s base, +1s per 8k input tokens). slo + spec + test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 20:35:23 +08:00
Gahow Wang	77af4ded2a	Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime) The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid follow-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:40:38 +08:00
Gahow Wang	4f45b546a1	Add 27B TP A/B (deterministic ground-truth: does TP2 beat TP1 per-GPU) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:39:54 +08:00
Gahow Wang	90c3eb51c8	Document Stop-B end-to-end validation (Phase 5) Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both Stop-B paths: search-high-saturation (validator-authorized immediate stop) and multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90 req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is correctly never adopted (no regression). The Phase-4 authority model is exercised live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later justified stop is honored after the veto budget. EP launch-failures handled as hard-negative evidence. Auditable reason chains throughout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:58:44 +08:00
Gahow Wang	0b6beafeb8	Phase 5: widen search.high to 1.0 to force multi-iteration Stop-B convergence Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:12:32 +08:00
Gahow Wang	d4aff81691	Add Stop-B end-to-end config (agentic loop, Stop-A enabled) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:05:39 +08:00
Gahow Wang	f31e9ccfd5	Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved With the guard enabled the binary search recovers best sampling_u=0.078125 (rate 2.30 req/s), identical to the full-replay baseline. The guard fired on exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible); the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial with no peak-rate overestimate. Stop-A + boundary guard is safe to enable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:57:53 +08:00
Gahow Wang	03e556f0ab	Add Stop-A ON config (adaptive_stop enabled + boundary guard) for A/B Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:25:24 +08:00
Gahow Wang	dfc823f972	Add Stop-A SLO-boundary guard When a truncated probe's measured pass-rate lands within trace.adaptive_stop.boundary_delta of the SLO target, re-measure on the full window and use that verdict. Offered-L-C-A convergence cannot see engine-state drift in the window tail, so a near-knee truncated verdict is untrustworthy (validated: prefix 0.96 vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:25:24 +08:00
Gahow Wang	9f52812753	Document Stop-A validation: calibration + GPU fidelity check CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and shows C-convergence difficulty is driven by signal noise (low-reuse chat) not reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts preserved; the one mismatch is a boundary false-positive at the feasibility knee (prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:03:16 +08:00
Gahow Wang	958739027a	Fix Stop-A validation config: system vllm, cap max-model-len Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:22:48 +08:00
Gahow Wang	0f57ee96a9	Drop LLM endpoint from Stop-A full-data config (baseline-only run) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:19:46 +08:00
Gahow Wang	43125f48cf	Address review of two-stop branch - lca._prefix_profile: anchor the prefix window to the prefix's own first arrival so the A-rate is measured over the prefix span (matches the design intent; no-op for the 0-based canonical pipeline). - cli study tune: label file-originated stops as file_proposal rather than llm_after_veto_budget (the veto never applies to file proposals). - spec.AdaptiveStopSpec: reject stable_checks > max_checks (would make convergence undetectable and silently disable Stop-A). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:19:08 +08:00
Gahow Wang	3af1d84ac0	Add Stop-A full-data validation config (real-time replay, no cap) A single-config baseline run with adaptive_stop disabled and replay_time_scale=1.0, so per-request probe_details capture the full 600s window for offline analysis of whether truncating at the L-C-A convergence prefix preserves the feasibility verdict. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:15:12 +08:00
Gahow Wang	08e53fd897	Add Stop-A calibration script (CPU-only convergence curve) Prints the offered-L-C-A convergence curve and the stop fraction at candidate tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare how late C converges across workloads. No serving required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:10:02 +08:00
Gahow Wang	a8f903498d	Add Stop-B authority: deterministic validator overrides LLM stop Phase 4 of the two-stop work. The harness already pre-empts the LLM with deterministic stops and guided probes, but an LLM-originated should_stop could still end the loop while the validator saw remaining opportunity. Add harness._stop_authority, exposed as context["stop_authority"], whose `authorized` mirrors the deterministic harness stop decision and whose `opportunity_remains` flags an open topology frontier or a high-value planned candidate. In study tune, an LLM-originated should_stop is now honored only when the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so the loop cannot converge prematurely on the agent's say-so. File- and harness-originated stops are unaffected, and the stop reason chain is recorded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:45:14 +08:00
Gahow Wang	51a9e4a007	Add Stop-A: offered-L-C-A convergence early-stop for replay Phase 2 of the two-stop work. The L-C-A vector is a deterministic function of the trace's offered metadata, so the convergence of prefix-vs-full L-C-A (the paper's Fig. 9 curve) can be computed up front rather than monitored live, with identical result and no per-request overhead. - lca.find_convergence_prefix: earliest arrival-ordered prefix whose L and A family similarities reach tau and the slow C family reaches the stricter tau_c for stable_checks consecutive checkpoints. Self-similarity uses the raw log-feature vector (same window -> identical per-dim spread; RobustScaler is reserved for the cross-window Stop-C). If C never converges it reports the full set, which is the C-gate: no early stop on a cold/under-warmed cache. The checkpoint sims double as Phase 3 calibration data. - spec.AdaptiveStopSpec (trace.adaptive_stop), disabled by default until the thresholds are calibrated, so existing studies are unaffected. - worker._adaptive_replay_set truncates each probe's replay to the convergence prefix and records a certificate (converged, fraction, family similarity) into probe history and probe_details. Offered request_rate at the threshold is unchanged; only wall-clock replay shrinks. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:23:49 +08:00
Gahow Wang	0f15bbc3f1	Make the offered-load axis session-coherent Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the realized KV-cache hit rate as offered load dropped — so the feasibility boundary was measured on a workload with a different C than production, contradicting the paper's scale-stationary L-C-A premise. prepare_trace_windows now resolves each row's session root via the parent_chat_id chain in a single streaming pass and assigns sampling_u per session, so thresholding keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose parent fell outside the span fall back to grouping under the parent id. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:16:06 +08:00
Gahow Wang	6f8e3c95c1	Unify harness L-C-A on the canonical lca.WorkloadProfile Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile` previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block authoritative: build_harness_context now accepts an optional workload_profile and renders the canonical 10-dim vector + per-family stats when present, falling back to the legacy rendering only when no profile is supplied (direct unit-test calls). Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the profile via lca.build_study_workload_profile and pass it through build_prompt. The heuristic regime classifiers keep reading window_summary; that is the heuristic layer, distinct from the similarity metric. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:12:17 +08:00
Gahow Wang	8b4116fad0	Add reference paper and qwen27b tpot25 16-iter notes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:02:30 +08:00
Gahow Wang	27d1c8fa92	Add L-C-A workload profile metric and CLI profile commands Implement the paper's 10-dimensional L-C-A workload feature vector (RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into `aituner profile window` / `aituner profile similarity`. Covered by tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:02:24 +08:00
Gahow Wang	984eb1f325	Document 8-GPU harness ablation results for qwen27b and qwen235b prefill Add completed experiment results from dash0 runs after 2026-05-13: - qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU) - qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU) Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation log with completed run status.	2026-05-16 21:23:16 +08:00
Gahow Wang	d0c89dac48	Clean marked trial engine processes	2026-05-16 15:51:04 +08:00
Gahow Wang	cf9b8b3f68	Clean vLLM process groups after parent exit	2026-05-16 14:52:05 +08:00
Gahow Wang	5a879a8592	Fix decode harness partial probe handling	2026-05-16 14:18:07 +08:00
Gahow Wang	f18765b235	Document eight-GPU harness rerun	2026-05-13 09:04:14 +08:00
Gahow Wang	5c2958e6c1	Constrain harness topology by visible GPUs	2026-05-13 01:25:31 +08:00
Gahow Wang	fb6d74a18c	Document harness v2 rerun criteria	2026-05-12 22:23:12 +08:00
Gahow Wang	e3ed775afd	Fix harness SLO early-stop diagnosis	2026-05-12 22:20:01 +08:00
Gahow Wang	ef359c8eea	Document profile-driven harness run	2026-05-12 21:40:19 +08:00
Gahow Wang	17e9681ca0	Add profile-driven harness planner	2026-05-12 21:28:44 +08:00
Gahow Wang	63d6a111f4	Document profile-driven harness design	2026-05-12 21:09:29 +08:00
Gahow Wang	2d03b1cd4c	Add SLO-driven topology frontier harness guard	2026-05-12 21:00:49 +08:00
Gahow Wang	e1125475ae	Minimize no-harness ablation prompt	2026-05-12 09:42:53 +08:00
Gahow Wang	ae756600ce	Support full-range and incumbent-floor search modes	2026-05-11 12:58:46 +08:00
Gahow Wang	8516cd88c0	Use full search range for every trial	2026-05-11 12:50:22 +08:00
Gahow Wang	14259fcec9	Measure lower-range performance for infeasible trials	2026-05-10 14:30:34 +08:00
Gahow Wang	bf7c02e721	Clarify qwen27b raw per-iteration performance	2026-05-10 14:24:10 +08:00
Gahow Wang	b0325ecfd9	Clarify qwen235b raw per-iteration performance	2026-05-10 14:21:49 +08:00
Gahow Wang	4cfd3757b6	Document qwen235b prefill harness ablation	2026-05-10 13:05:49 +08:00
Gahow Wang	bdb08f6edc	Handle missing streamed token metrics	2026-05-10 02:40:00 +08:00
Gahow Wang	307e2eb0e8	Document qwen27b harness ablation	2026-05-10 01:12:21 +08:00
Gahow Wang	adc4351e5d	Report latency stats for infeasible baseline	2026-05-08 11:10:34 +08:00
Gahow Wang	eb137a0b62	Document TPOT40 baseline infeasible run	2026-05-08 02:57:03 +08:00
Gahow Wang	f212673f44	Stop tuning when baseline is infeasible	2026-05-08 01:07:36 +08:00
Gahow Wang	a7a5e9ad80	Make tune trial budget resumable	2026-05-07 17:18:06 +08:00
Gahow Wang	7263587cb6	clean: ci	2026-05-06 22:56:53 +08:00
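
Note on ed2bbe0323 (linear_ms SLO rule): the commit message gives the formula threshold_ms = intercept_ms + per_token_ms * input_tokens and the example "4s + L_in/8k". The sketch below only works through that arithmetic. The class name LinearMsRule, the passes() helper, and the "less than or equal" comparison are assumptions made for illustration; they are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearMsRule:
    """Hypothetical stand-in for the linear_ms rule; field names follow the commit message."""

    intercept_ms: float   # fixed part of the TTFT budget
    per_token_ms: float   # extra budget per input token

    def threshold_ms(self, input_tokens: int) -> float:
        # Formula quoted from the commit message.
        return self.intercept_ms + self.per_token_ms * input_tokens

    def passes(self, ttft_ms: float, input_tokens: int) -> bool:
        # Assumption: a request meets the SLO when TTFT is at or under the budget.
        return ttft_ms <= self.threshold_ms(input_tokens)


# "4s + L_in/8k": 4 s base, plus 1 s for every 8000 input tokens.
rule = LinearMsRule(intercept_ms=4000.0, per_token_ms=0.125)
budget_8k = rule.threshold_ms(8000)     # 4000 + 0.125 * 8000 = 5000.0 ms
budget_32k = rule.threshold_ms(32000)   # 4000 + 0.125 * 32000 = 8000.0 ms
```

With these values a request with 8000 input tokens gets a 5 s TTFT budget, and one with 32000 input tokens gets 8 s, so longer prompts are not penalized for their extra prefill work.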

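Note on 2fcaf80450 (wrapping socket/timeout errors): the commit message describes catching OSError after HTTPError so that a timed-out request counts as a failed request rather than crashing the trial. The sketch below shows that ordering using only the Python standard library. HttpClientError is named in the commit message, but its definition here, and the helpers call_wrapped and outcome, are hypothetical stand-ins rather than the project's code.

```python
import urllib.error


class HttpClientError(Exception):
    """Stand-in for the client's error type named in the commit message."""


def call_wrapped(send):
    """Run send() and convert transport-level failures into HttpClientError.

    `send` is a hypothetical zero-argument callable that performs the request.
    """
    try:
        return send()
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError (and so OSError); catch it first
        # to keep the status code in the message.
        raise HttpClientError(f"HTTP {exc.code}") from exc
    except OSError as exc:
        # Covers TimeoutError, URLError, and ConnectionError, which are
        # all OSError subclasses in Python 3.
        raise HttpClientError(f"transport failure: {exc!r}") from exc


def outcome(send) -> str:
    """Hypothetical caller that only handles HttpClientError."""
    try:
        call_wrapped(send)
        return "ok"
    except HttpClientError:
        return "failed_request"


def _timed_out():
    raise TimeoutError("timed out")


result = outcome(_timed_out)  # the timeout is reported as a failed request
```

Because the caller in this sketch catches only HttpClientError, an unwrapped TimeoutError would propagate past it, which is the failure mode the commit message describes.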

200 Commits