200 Commits

Author SHA1 Message Date
2fcaf80450 Wrap socket/timeout errors in HTTP client as HttpClientError
stream_chat_completion (and the LLM stream/chat paths) only caught HTTPError, so a
request exceeding request_timeout_s raised a raw TimeoutError mid-stream that escaped
_run_one_request (which only catches HttpClientError), propagated through the probe,
and crashed the whole trial ("failed: timed out"). A timed-out request is a failed
request (SLO miss), not a trial crash. Catch OSError (covers TimeoutError, URLError,
ConnectionError) after HTTPError and wrap it. Exposed by lowering request_timeout_s
to 180s on the 27B run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:58:28 +08:00
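The wrapping order described in 2fcaf80450 can be sketched as follows. This is a minimal illustration assuming a urllib-based client, not the project's code: `HttpClientError` is named in the commit, but its definition here, the `fetch` helper, and the injectable `opener` parameter are hypothetical.

```python
import urllib.error
import urllib.request


class HttpClientError(Exception):
    """Stand-in for the project's error type (its real definition is not in the log)."""


def fetch(url: str, timeout_s: float, opener=urllib.request.urlopen) -> bytes:
    """Hypothetical helper: return the response body or raise HttpClientError."""
    try:
        with opener(url, timeout=timeout_s) as resp:
            return resp.read()
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError (and therefore OSError), so it must be
        # caught first to keep the status code in the message.
        raise HttpClientError(f"HTTP {exc.code}: {exc.reason}") from exc
    except OSError as exc:
        # Covers TimeoutError, URLError and ConnectionError, including a
        # timeout raised mid-stream while reading the body.
        raise HttpClientError(f"transport error: {exc!r}") from exc
```

With this shape, a caller that only catches `HttpClientError` sees a timed-out request as a failed request instead of an unhandled exception.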
3541065675 Speed up 27B TP A/B: request_timeout 180s, search.high 0.125
The wide 0.5 range made TP1 (low-capacity) waste many infeasible high-theta probes,
and the 900s request timeout made overloaded probes drain hung requests for 15min
each. Cap drain at 180s and bound the search to where the boundaries actually are.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:40:42 +08:00
7678c7d5e8 Switch 27B TP A/B to length-aware TTFT SLO (4s + L_in/8k), widen search
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:35:23 +08:00
ed2bbe0323 Add linear_ms SLO rule (length-aware TTFT budget)
threshold_ms = intercept_ms + per_token_ms * input_tokens. Lets the TTFT target
scale with prefill work, e.g. "4s + L_in/8k" => intercept_ms=4000, per_token_ms=0.125
(4s base, +1s per 8k input tokens). Changes cover slo, spec, and tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 20:35:23 +08:00
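The linear_ms formula from ed2bbe0323 works out as below. The function names are hypothetical, and treating a TTFT exactly at the threshold as a pass is an assumption; only the formula and the two parameter values come from the commit message.

```python
def linear_ms_threshold(intercept_ms: float, per_token_ms: float, input_tokens: int) -> float:
    """threshold_ms = intercept_ms + per_token_ms * input_tokens."""
    return intercept_ms + per_token_ms * input_tokens


def meets_ttft_slo(ttft_ms: float, input_tokens: int,
                   intercept_ms: float = 4000.0, per_token_ms: float = 0.125) -> bool:
    """Hypothetical check; '<=' (boundary counts as a pass) is an assumption."""
    return ttft_ms <= linear_ms_threshold(intercept_ms, per_token_ms, input_tokens)


# "4s + L_in/8k": an 8000-token prompt gets a 5 s budget.
print(linear_ms_threshold(4000.0, 0.125, 8000))  # 5000.0
```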
77af4ded2a Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:40:38 +08:00
4f45b546a1 Add 27B TP A/B (deterministic ground-truth: does TP2 beat TP1 per-GPU)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:39:54 +08:00
90c3eb51c8 Document Stop-B end-to-end validation (Phase 5)
Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both
Stop-B paths: search-high-saturation (validator-authorized immediate stop) and
multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90
req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is
correctly never adopted (no regression). The Phase-4 authority model is exercised
live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later
justified stop is honored after the veto budget. EP launch failures are handled as
hard-negative evidence, and reason chains are auditable throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:58:44 +08:00
0b6beafeb8 Phase 5: widen search.high to 1.0 to force multi-iteration Stop-B convergence
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:12:32 +08:00
d4aff81691 Add Stop-B end-to-end config (agentic loop, Stop-A enabled)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:05:39 +08:00
f31e9ccfd5 Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved
With the guard enabled the binary search recovers best sampling_u=0.078125
(rate 2.30 req/s), identical to the full-replay baseline. The guard fired on
exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible);
the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial
with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:57:53 +08:00
03e556f0ab Add Stop-A ON config (adaptive_stop enabled + boundary guard) for A/B
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:25:24 +08:00
dfc823f972 Add Stop-A SLO-boundary guard
When a truncated probe's measured pass-rate lands within
trace.adaptive_stop.boundary_delta of the SLO target, re-measure on the full window
and use that verdict. Offered-L-C-A convergence cannot see engine-state drift in the
window tail, so a near-knee truncated verdict is untrustworthy (validated: prefix
0.96 vs full 0.946 at threshold 0.08594). The guard fires only on feasibility-knee
probes, so non-boundary probes keep the Stop-A saving. Default delta=0.02.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:25:24 +08:00
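A rough sketch of the guard decision in dfc823f972, under stated assumptions: the function name, the callback shape, and the SLO target of 0.95 used in the example are hypothetical (the log gives the pass rates 0.96 and 0.946 and the default delta 0.02, but not the target itself).

```python
from typing import Callable, Tuple


def guarded_verdict(prefix_pass_rate: float, slo_target: float,
                    measure_full: Callable[[], float],
                    boundary_delta: float = 0.02) -> Tuple[bool, bool]:
    """Return (feasible, remeasured) for one truncated probe."""
    if abs(prefix_pass_rate - slo_target) <= boundary_delta:
        # Near the feasibility knee: the truncated verdict is not trusted,
        # so the full window is measured and its pass rate decides.
        return measure_full() >= slo_target, True
    # Away from the knee the truncated verdict stands and the saving is kept.
    return prefix_pass_rate >= slo_target, False


# Pass rates from the commit message; the 0.95 target is an assumed value.
print(guarded_verdict(0.96, 0.95, lambda: 0.946))  # (False, True)
```

A probe far from the target never invokes `measure_full`, which is why only knee probes pay for a full replay.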
9f52812753 Document Stop-A validation: calibration + GPU fidelity check
CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and
shows C-convergence difficulty is driven by signal noise (low-reuse chat) not
reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A
convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts
preserved; the one mismatch is a boundary false-positive at the feasibility knee
(prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered
L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:03:16 +08:00
958739027a Fix Stop-A validation config: system vllm, cap max-model-len
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:22:48 +08:00
0f57ee96a9 Drop LLM endpoint from Stop-A full-data config (baseline-only run)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:19:46 +08:00
43125f48cf Address review of two-stop branch
- lca._prefix_profile: anchor the prefix window to the prefix's own first arrival
  so the A-rate is measured over the prefix span (matches the design intent;
  no-op for the 0-based canonical pipeline).
- cli study tune: label file-originated stops as file_proposal rather than
  llm_after_veto_budget (the veto never applies to file proposals).
- spec.AdaptiveStopSpec: reject stable_checks > max_checks (would make
  convergence undetectable and silently disable Stop-A).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:19:08 +08:00
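The spec check from the third bullet of 43125f48cf could look roughly like this. The class name, field defaults, and error message are illustrative assumptions; only the `stable_checks > max_checks` rejection and the disabled-by-default behavior are taken from the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptiveStopSpecSketch:
    """Hypothetical stand-in for spec.AdaptiveStopSpec; defaults are made up."""
    enabled: bool = False   # the log says Stop-A is disabled by default
    stable_checks: int = 3  # assumed default
    max_checks: int = 20    # assumed default

    def __post_init__(self) -> None:
        # With more required stable checkpoints than checkpoints available,
        # convergence could never be detected and Stop-A would silently do nothing.
        if self.stable_checks > self.max_checks:
            raise ValueError(
                f"stable_checks ({self.stable_checks}) must not exceed "
                f"max_checks ({self.max_checks})"
            )
```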
3af1d84ac0 Add Stop-A full-data validation config (real-time replay, no cap)
A single-config baseline run with adaptive_stop disabled and replay_time_scale=1.0,
so per-request probe_details capture the full 600s window for offline analysis of
whether truncating at the L-C-A convergence prefix preserves the feasibility verdict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:15:12 +08:00
08e53fd897 Add Stop-A calibration script (CPU-only convergence curve)
Prints the offered-L-C-A convergence curve and the stop fraction at candidate
tau_c values for a raw trace window, to calibrate Stop-A thresholds and compare
how late C converges across workloads. No serving required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:10:02 +08:00
a8f903498d Add Stop-B authority: deterministic validator overrides LLM stop
Phase 4 of the two-stop work. The harness already pre-empts the LLM with
deterministic stops and guided probes, but an LLM-originated should_stop could
still end the loop while the validator saw remaining opportunity.

Add harness._stop_authority, exposed as context["stop_authority"], whose
`authorized` mirrors the deterministic harness stop decision and whose
`opportunity_remains` flags an open topology frontier or a high-value planned
candidate. In study tune, an LLM-originated should_stop is now honored only when
the validator authorizes it; an unauthorized stop is vetoed (bounded budget) so
the loop cannot converge prematurely on the agent's say-so. File- and
harness-originated stops are unaffected, and the stop reason chain is recorded.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:45:14 +08:00
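The authority rule in a8f903498d can be sketched as a small decision function. The function, its parameters, and the reason strings `harness_stop` and `validator_authorized` are hypothetical; `validator_did_not_authorize_stop`, `llm_after_veto_budget`, and `file_proposal` appear elsewhere in this log.

```python
from dataclasses import dataclass


@dataclass
class StopDecision:
    stop: bool
    reason: str


def resolve_stop(origin: str, authorized: bool,
                 vetoes_used: int, veto_budget: int) -> StopDecision:
    """Decide whether a proposed stop ends the loop (illustrative only)."""
    if origin == "file":
        return StopDecision(True, "file_proposal")   # the veto never applies here
    if origin == "harness":
        return StopDecision(True, "harness_stop")    # hypothetical label
    # LLM-originated stop: honored only with validator authorization ...
    if authorized:
        return StopDecision(True, "validator_authorized")  # hypothetical label
    # ... otherwise vetoed, but only while the bounded veto budget lasts.
    if vetoes_used < veto_budget:
        return StopDecision(False, "validator_did_not_authorize_stop")
    return StopDecision(True, "llm_after_veto_budget")
```

Bounding the veto keeps the loop from running forever when the agent and the validator keep disagreeing.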
51a9e4a007 Add Stop-A: offered-L-C-A convergence early-stop for replay
Phase 2 of the two-stop work. The L-C-A vector is a deterministic function of the
trace's offered metadata, so the convergence of prefix-vs-full L-C-A (the paper's
Fig. 9 curve) can be computed up front rather than monitored live, with identical
result and no per-request overhead.

- lca.find_convergence_prefix: earliest arrival-ordered prefix whose L and A family
  similarities reach tau and the slow C family reaches the stricter tau_c for
  stable_checks consecutive checkpoints. Self-similarity uses the raw log-feature
  vector (same window -> identical per-dim spread; RobustScaler is reserved for the
  cross-window Stop-C). If C never converges it reports the full set, which is the
  C-gate: no early stop on a cold/under-warmed cache. The checkpoint sims double as
  Phase 3 calibration data.
- spec.AdaptiveStopSpec (trace.adaptive_stop), disabled by default until the
  thresholds are calibrated, so existing studies are unaffected.
- worker._adaptive_replay_set truncates each probe's replay to the convergence
  prefix and records a certificate (converged, fraction, family similarity) into
  probe history and probe_details. Offered request_rate at the threshold is
  unchanged; only wall-clock replay shrinks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:23:49 +08:00
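The convergence rule described for lca.find_convergence_prefix in 51a9e4a007 can be illustrated on precomputed checkpoint similarities. This is a sketch, not the project's function: the `Checkpoint` type and `find_stop_fraction` are hypothetical, and returning the checkpoint that completes the stable run (rather than the one that starts it) is an assumption.

```python
from typing import NamedTuple, Sequence, Tuple


class Checkpoint(NamedTuple):
    fraction: float  # share of the arrival-ordered window covered by the prefix
    sim_l: float     # prefix-vs-full similarity, L family
    sim_c: float     # prefix-vs-full similarity, C family (slowest to converge)
    sim_a: float     # prefix-vs-full similarity, A family


def find_stop_fraction(checkpoints: Sequence[Checkpoint], tau: float,
                       tau_c: float, stable_checks: int) -> Tuple[bool, float]:
    """Return (converged, fraction) for the earliest stable prefix."""
    streak = 0
    for cp in checkpoints:
        ok = cp.sim_l >= tau and cp.sim_a >= tau and cp.sim_c >= tau_c
        streak = streak + 1 if ok else 0
        if streak >= stable_checks:
            return True, cp.fraction
    # C-gate: if the thresholds are never held long enough, replay the full set.
    return False, 1.0
```

Because the similarities depend only on the trace's offered metadata, this scan can run once before replay starts.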
0f15bbc3f1 Make the offered-load axis session-coherent
Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score
broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the
realized KV-cache hit rate as offered load dropped — so the feasibility boundary
was measured on a workload with a different C than production, contradicting the
paper's scale-stationary L-C-A premise.

prepare_trace_windows now resolves each row's session root via the parent_chat_id
chain in a single streaming pass and assigns sampling_u per session, so thresholding
keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose
parent fell outside the span fall back to grouping under the parent id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:16:06 +08:00
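The per-session assignment in 0f15bbc3f1 can be sketched as a single pass over (chat_id, parent_chat_id) pairs. Everything below is illustrative: the row shape, the function names, and deriving sampling_u from a hash of the session root are assumptions, and the sketch assumes parents arrive before their children.

```python
import hashlib
from typing import Dict, Iterable, Optional, Tuple


def session_u(root_id: str, seed: str = "0") -> float:
    """Deterministic score in [0, 1) for a session root (hash-based, assumed)."""
    digest = hashlib.sha256(f"{seed}:{root_id}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def assign_sampling_u(rows: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, float]:
    """Map each chat_id to the sampling_u of its session root."""
    root_of: Dict[str, str] = {}
    scores: Dict[str, float] = {}
    for chat_id, parent_id in rows:
        if parent_id is None:
            root = chat_id
        else:
            # Parent seen earlier: inherit its root. Parent outside the span:
            # fall back to grouping under the parent id itself.
            root = root_of.get(parent_id, parent_id)
        root_of[chat_id] = root
        scores[chat_id] = session_u(root)
    return scores
```

Thresholding on these scores keeps or drops every turn of a session together, so a kept turn 2 never loses its turn 1.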
6f8e3c95c1 Unify harness L-C-A on the canonical lca.WorkloadProfile
Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile`
previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging
from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block
authoritative: build_harness_context now accepts an optional workload_profile and
renders the canonical 10-dim vector + per-family stats when present, falling back
to the legacy rendering only when no profile is supplied (direct unit-test calls).

Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the
profile via lca.build_study_workload_profile and pass it through build_prompt. The
heuristic regime classifiers keep reading window_summary; that is the heuristic
layer, distinct from the similarity metric.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:12:17 +08:00
8b4116fad0 Add reference paper and qwen27b tpot25 16-iter notes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:30 +08:00
27d1c8fa92 Add L-C-A workload profile metric and CLI profile commands
Implement the paper's 10-dimensional L-C-A workload feature vector
(RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into
`aituner profile window` / `aituner profile similarity`. Covered by tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:24 +08:00
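The similarity in 27d1c8fa92, sim=exp(-||dz||), can be illustrated with the standard library alone. This sketch assumes an L2 norm and median/IQR scaling fitted on reference samples; the function names are hypothetical and the ten actual feature dimensions are not listed in the log.

```python
import math
import statistics
from typing import List, Sequence


def robust_scales(samples: Sequence[Sequence[float]]) -> List[float]:
    """Per-dimension interquartile range (1.0 when the range is zero)."""
    scales = []
    for dim in zip(*samples):
        q1, _median, q3 = statistics.quantiles(dim, n=4, method="inclusive")
        scales.append((q3 - q1) or 1.0)
    return scales


def similarity(a: Sequence[float], b: Sequence[float], scales: Sequence[float]) -> float:
    """exp(-||dz||): the median offset cancels in the difference, so only
    the per-dimension scale matters."""
    dz = [(x - y) / s for x, y, s in zip(a, b, scales)]
    return math.exp(-math.sqrt(sum(d * d for d in dz)))
```

Identical profiles score 1.0, and the score decays toward 0 as the scaled distance grows.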
984eb1f325 Document 8-GPU harness ablation results for qwen27b and qwen235b prefill
Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU)

Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation
log with completed run status.
2026-05-16 21:23:16 +08:00
d0c89dac48 Clean marked trial engine processes 2026-05-16 15:51:04 +08:00
cf9b8b3f68 Clean vLLM process groups after parent exit 2026-05-16 14:52:05 +08:00
5a879a8592 Fix decode harness partial probe handling 2026-05-16 14:18:07 +08:00
f18765b235 Document eight-GPU harness rerun 2026-05-13 09:04:14 +08:00
5c2958e6c1 Constrain harness topology by visible GPUs 2026-05-13 01:25:31 +08:00
fb6d74a18c Document harness v2 rerun criteria 2026-05-12 22:23:12 +08:00
e3ed775afd Fix harness SLO early-stop diagnosis 2026-05-12 22:20:01 +08:00
ef359c8eea Document profile-driven harness run 2026-05-12 21:40:19 +08:00
17e9681ca0 Add profile-driven harness planner 2026-05-12 21:28:44 +08:00
63d6a111f4 Document profile-driven harness design 2026-05-12 21:09:29 +08:00
2d03b1cd4c Add SLO-driven topology frontier harness guard 2026-05-12 21:00:49 +08:00
e1125475ae Minimize no-harness ablation prompt 2026-05-12 09:42:53 +08:00
ae756600ce Support full-range and incumbent-floor search modes 2026-05-11 12:58:46 +08:00
8516cd88c0 Use full search range for every trial 2026-05-11 12:50:22 +08:00
14259fcec9 Measure lower-range performance for infeasible trials 2026-05-10 14:30:34 +08:00
bf7c02e721 Clarify qwen27b raw per-iteration performance 2026-05-10 14:24:10 +08:00
b0325ecfd9 Clarify qwen235b raw per-iteration performance 2026-05-10 14:21:49 +08:00
4cfd3757b6 Document qwen235b prefill harness ablation 2026-05-10 13:05:49 +08:00
bdb08f6edc Handle missing streamed token metrics 2026-05-10 02:40:00 +08:00
307e2eb0e8 Document qwen27b harness ablation 2026-05-10 01:12:21 +08:00
adc4351e5d Report latency stats for infeasible baseline 2026-05-08 11:10:34 +08:00
eb137a0b62 Document TPOT40 baseline infeasible run 2026-05-08 02:57:03 +08:00
f212673f44 Stop tuning when baseline is infeasible 2026-05-08 01:07:36 +08:00
a7a5e9ad80 Make tune trial budget resumable 2026-05-07 17:18:06 +08:00
7263587cb6 clean: ci 2026-05-06 22:56:53 +08:00