Commit Graph

74 Commits

Author SHA1 Message Date
0f15bbc3f1 Make the offered-load axis session-coherent
Phase 1 of the two-stop work. Subsampling the trace by per-request uniform score
broke multi-turn sessions (a kept turn-2 could lose its turn-1), which lowered the
realized KV-cache hit rate as offered load dropped — so the feasibility boundary
was measured on a workload with a different C than production, contradicting the
paper's scale-stationary L-C-A premise.

prepare_trace_windows now resolves each row's session root via the parent_chat_id
chain in a single streaming pass and assigns sampling_u per session, so thresholding
keeps or drops whole sessions and preserves intra-session prefix reuse. Rows whose
parent fell outside the span fall back to grouping under the parent id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:16:06 +08:00
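The session-coherent sampling described in 0f15bbc3f1 can be sketched roughly as follows. This is a minimal illustration, not the code in prepare_trace_windows: the function names, the `(chat_id, parent_chat_id)` row shape, the hash-based score, and the `seed` parameter are all assumptions, and it assumes parents appear before their children in the trace so that one streaming pass suffices.

```python
import hashlib


def resolve_session_roots(rows):
    """Map each chat_id to its session root in one pass.

    rows: iterable of (chat_id, parent_chat_id or None), assumed to be in
    trace order so that a parent is seen before its children.
    """
    root_of = {}
    for chat_id, parent_id in rows:
        if parent_id is None:
            # A row without a parent starts its own session.
            root_of[chat_id] = chat_id
        else:
            # Inherit the parent's root when the parent was seen; otherwise
            # the parent fell outside the span, so group under the parent id.
            root_of[chat_id] = root_of.get(parent_id, parent_id)
    return root_of


def sampling_u(root_id, seed=0):
    """Deterministic score in [0, 1) shared by every row of a session.

    Hashing the root id is an assumption made for this sketch.
    """
    digest = hashlib.sha256(f"{seed}:{root_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_rows(rows, threshold, seed=0):
    """Keep the rows whose session score falls below the threshold."""
    rows = list(rows)
    root_of = resolve_session_roots(rows)
    return [cid for cid, _ in rows if sampling_u(root_of[cid], seed) < threshold]
```

Because the score depends only on the session root, lowering the threshold drops whole sessions, so a kept turn-2 never loses its turn-1.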
6f8e3c95c1 Unify harness L-C-A on the canonical lca.WorkloadProfile
Phase 0 of the two-stop work. The prompt block labeled `workload_lca_profile`
previously re-derived L-C-A from summarize_window's ad-hoc percentiles, diverging
from the paper's 10-dim RobustScaler vector implemented in lca.py. Make that block
authoritative: build_harness_context now accepts an optional workload_profile and
renders the canonical 10-dim vector + per-family stats when present, falling back
to the legacy rendering only when no profile is supplied (direct unit-test calls).

Real call sites (study prompt/llm-propose/tune, run_baseline_then_llm) build the
profile via lca.build_study_workload_profile and pass it through build_prompt. The
heuristic regime classifiers keep reading window_summary; that is the heuristic
layer, distinct from the similarity metric.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:12:17 +08:00
27d1c8fa92 Add L-C-A workload profile metric and CLI profile commands
Implement the paper's 10-dimensional L-C-A workload feature vector
(RobustScaler-normalized, sim=exp(-||dz||)) in lca.py, and wire it into
`aituner profile window` / `aituner profile similarity`. Covered by tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:24 +08:00
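The similarity named in 27d1c8fa92, sim=exp(-||dz||) over RobustScaler-normalized features, can be illustrated with a small standalone sketch. This is not the contents of lca.py: the function names are hypothetical, the norm is assumed to be Euclidean (the commit message does not say which norm is used), and the scaler is a hand-rolled median/IQR normalization standing in for a RobustScaler with its default 25th to 75th percentile range.

```python
import math
import statistics


def fit_robust_scaler(reference):
    """Return per-dimension (median, IQR) from a list of feature vectors.

    Needs at least two reference vectors. A zero IQR is replaced by 1.0 so
    that constant dimensions do not cause a division by zero.
    """
    params = []
    for column in zip(*reference):
        q1, med, q3 = statistics.quantiles(column, n=4, method="inclusive")
        iqr = q3 - q1
        params.append((med, iqr if iqr > 0 else 1.0))
    return params


def normalize(vector, params):
    """Center each feature on its median and scale it by its IQR."""
    return [(x - med) / iqr for x, (med, iqr) in zip(vector, params)]


def similarity(a, b, params):
    """exp(-||dz||), with dz the difference of the normalized vectors."""
    za, zb = normalize(a, params), normalize(b, params)
    return math.exp(-math.dist(za, zb))  # Euclidean norm: an assumption
```

Identical profiles score 1.0 and the score decays toward 0 as the normalized distance grows; the real feature vector has 10 dimensions, but the sketch works for any length.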
d0c89dac48 Clean marked trial engine processes 2026-05-16 15:51:04 +08:00
cf9b8b3f68 Clean vLLM process groups after parent exit 2026-05-16 14:52:05 +08:00
5a879a8592 Fix decode harness partial probe handling 2026-05-16 14:18:07 +08:00
5c2958e6c1 Constrain harness topology by visible GPUs 2026-05-13 01:25:31 +08:00
e3ed775afd Fix harness SLO early-stop diagnosis 2026-05-12 22:20:01 +08:00
17e9681ca0 Add profile-driven harness planner 2026-05-12 21:28:44 +08:00
2d03b1cd4c Add SLO-driven topology frontier harness guard 2026-05-12 21:00:49 +08:00
e1125475ae Minimize no-harness ablation prompt 2026-05-12 09:42:53 +08:00
ae756600ce Support full-range and incumbent-floor search modes 2026-05-11 12:58:46 +08:00
8516cd88c0 Use full search range for every trial 2026-05-11 12:50:22 +08:00
14259fcec9 Measure lower-range performance for infeasible trials 2026-05-10 14:30:34 +08:00
bdb08f6edc Handle missing streamed token metrics 2026-05-10 02:40:00 +08:00
adc4351e5d Report latency stats for infeasible baseline 2026-05-08 11:10:34 +08:00
f212673f44 Stop tuning when baseline is infeasible 2026-05-08 01:07:36 +08:00
a7a5e9ad80 Make tune trial budget resumable 2026-05-07 17:18:06 +08:00
c1ff64381d Harden trial measurement accounting 2026-05-06 21:18:09 +08:00
f653af09a8 Stop harness when feasible probe reaches search high 2026-05-06 17:59:09 +08:00
5d96689ea6 Make harness runtime refinement memory safe 2026-05-06 17:37:31 +08:00
0622e23817 Guide harness runtime refinement after TP 2026-05-06 02:46:07 +08:00
50067c926d Add harness guided first topology probe 2026-05-06 02:28:46 +08:00
4c066c4e4e Stop harness when search high is saturated 2026-05-02 11:04:59 +08:00
4ef69cce78 Make harness stop conservative for ablation 2026-05-02 09:47:16 +08:00
1a3d628268 Add harness early stop ablation 2026-05-02 08:08:14 +08:00
6d3459c82d Document decode harness one-shot mechanism 2026-05-02 06:25:06 +08:00
9e5394b557 Inherit incumbent topology for runtime validation 2026-04-30 09:33:49 +08:00
f59919e21c Clarify base-relative validation patches 2026-04-30 06:52:09 +08:00
38ff4380e5 Make strong incumbent trigger validation phase 2026-04-28 20:54:05 +08:00
c9089cf4f0 Ignore non-SLO probe bookkeeping in bottleneck diagnosis 2026-04-28 06:58:38 +08:00
a9943e0240 Use probe sequence bottlenecks in harness 2026-04-28 06:57:45 +08:00
39aa47fbf1 Add generic decode-only harness guidance 2026-04-28 06:46:18 +08:00
29d0548e06 Stop after strong incumbent harness gains 2026-04-26 01:29:05 +08:00
6bac389aae Add infeasible plateau guard to harness 2026-04-25 18:49:23 +08:00
6c04b9dbbc Evaluate baseline before LLM tuning 2026-04-25 17:14:05 +08:00
2d7ebe50ee Drain inflight requests after early stop 2026-04-25 16:57:01 +08:00
2dc2815620 Make harness verification portable 2026-04-25 16:37:13 +08:00
2c5e9af02a Add harness-guided tuning prompts 2026-04-25 16:35:33 +08:00
4625fba487 trace: make window materialization atomic 2026-04-12 23:09:30 +08:00
631a076498 trace: include weekend legacy windows 2026-04-12 22:43:02 +08:00
3f20ddf87e Add qwen235b prefill-only tuning support 2026-04-11 21:00:02 +08:00
5e54e9c8f5 Add multi-window baseline vs tuned compare flow 2026-04-11 13:51:54 +08:00
83325b2f76 Reset new topology groups to full binary search 2026-04-11 00:36:45 +08:00
a4d54442db Fix topology-aware incumbents for qwen27b tuning 2026-04-11 00:32:41 +08:00
8d0777e5e2 Add topology-aware qwen27b 0-8k tuning 2026-04-10 17:41:54 +08:00
9422d43737 Prioritize topology exploration in decode tuning 2026-04-10 10:25:41 +08:00
d582a8ed1b Validate served model name consistency 2026-04-09 22:50:23 +08:00
ef78fe7eb5 Add topology-aware tuning constraints 2026-04-09 21:07:51 +08:00
7371d6635c Force codex stream to use chat completions 2026-04-09 14:49:40 +08:00