Commit Graph

214 Commits

SHA1 Message Date
6b25d56c1f Gate GMU climb on measured improvement 2026-06-29 02:00:41 +08:00
ee101a7c24 Harden prefill scheduler harness 2026-06-29 01:54:02 +08:00
bfd85793f3 Prioritize uncovered prefill scheduler candidates 2026-06-29 01:30:34 +08:00
36c301c128 Add normalized prefill scheduler harness 2026-06-29 01:12:19 +08:00
7ad439730e Add llm-first tuning proposal policy 2026-06-27 12:21:51 +08:00
9accf2575e Require harness proposals from candidate sets 2026-06-27 01:03:30 +08:00
bef260f183 Document bad-start robustness suite 2026-06-26 22:19:46 +08:00
2937539b49 Persist harness candidate set snapshots 2026-06-26 22:17:47 +08:00
5080b50315 Veto repeated materialized configs 2026-06-26 22:15:47 +08:00
825d3e03e9 Add harness candidate set audit 2026-06-26 22:02:09 +08:00
42f75553a6 Document full config signature validation 2026-06-26 21:52:18 +08:00
48911b658b Use normalized full config signatures 2026-06-26 21:28:10 +08:00
7f50b8b8ea Document bad-start validation results 2026-06-26 20:50:20 +08:00
c8a0f9870e Tighten topology and auto-high validation 2026-06-26 20:07:23 +08:00
1dd3eaebaa Add auto search high measurement policy 2026-06-26 20:05:22 +08:00
95ad124a1b Document auto search high policy 2026-06-26 19:53:30 +08:00
384cb58f1f Add declarative harness prototype 2026-06-26 18:07:02 +08:00
4075c7abf0 Design declarative intervention harness 2026-06-26 17:15:06 +08:00
92eb186006 Add bad-start harness recovery planning 2026-06-26 16:44:24 +08:00
ce36cd79af Document no-LLM harness mechanism 2026-06-25 10:32:29 +08:00
013b01baa1 Stop after gmu ceiling validation is exhausted 2026-06-24 22:45:42 +08:00
b075afe6f2 Continue gmu hill-climb after topology validation 2026-06-24 19:09:35 +08:00
8fa758797e Guard generic topology search from introducing EP 2026-06-24 15:21:22 +08:00
c245774d76 Ignore generated run configs 2026-06-24 11:48:21 +08:00
d85572e7b5 Update AITuner roadmap framing 2026-06-24 11:45:42 +08:00
c0a9235b80 Document vLLM-first harness roadmap 2026-06-24 11:23:39 +08:00
c4173b2b3b Document remote proxy setup 2026-06-23 20:12:53 +08:00
6d874ecbff Update Qwen235B progress snapshot 2026-06-23 18:24:57 +08:00
403ae2e2b7 Document Qwen235B 2x2 progress 2026-06-23 18:23:56 +08:00
861d754f29 Localize Qwen27B harness ablation doc 2026-06-23 18:14:35 +08:00
76ec19224c Document Qwen27B 2x2 harness ablation 2026-06-23 10:08:46 +08:00
e67bc86240 Probe coupled prefill runtime knobs before stop 2026-06-22 19:30:23 +08:00
fd94ab9f3b Prevent prefill convergence stop before seq probe 2026-06-22 14:43:55 +08:00
4607711bb5 Add reusable clean pair runner 2026-06-22 00:05:31 +08:00
d23b69219b Add clean dash1 harness ablation runner 2026-06-21 00:51:08 +08:00
488fae7e63 Add tuning progress report for harness evaluation 2026-06-21 00:48:21 +08:00
426151bc9f Harness stop uses full state baseline 2026-06-20 22:48:27 +08:00
a9d237bbfd Show effective flags in ablation trajectory 2026-06-20 10:24:53 +08:00
5257fbc1a2 Improve harness incumbent follow-up search 2026-06-20 05:37:15 +08:00
b3156a382a Harness: gate gpu-mem-util/seqs-raise on 'no untested TP increase' (frontier-closed)
The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled =
cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1)
and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92
at iter 2 instead of climbing to TP4. The candidate path runs before the topology-
frontier check, so a score>=0.35 runtime candidate wins.

Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier
being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU
count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with
TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 13:33:29 +08:00
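The frontier-closed gate described in commit b3156a382a can be illustrated with a minimal sketch. It is not the repository's code: `ALLOWED_TP`, `next_allowed_tp`, `topology_settled`, `propose`, and the dictionary keys are hypothetical stand-ins, and the candidate TP sizes are an assumption. It only shows the idea that runtime micro-tuning waits until no untested TP increase fits on the available GPUs.

```python
from typing import Dict, Optional, Set, Union

# Assumption: candidate tensor-parallel sizes are powers of two.
ALLOWED_TP = (1, 2, 4, 8)


def next_allowed_tp(cur_tp: int, gpu_count: int, tested: Set[int]) -> Optional[int]:
    """Smallest untested TP above cur_tp that fits on gpu_count GPUs, else None."""
    for tp in ALLOWED_TP:
        if cur_tp < tp <= gpu_count and tp not in tested:
            return tp
    return None


def topology_settled(cur_tp: int, gpu_count: int, tested: Set[int]) -> bool:
    """The TP frontier is closed when no untested TP increase remains."""
    return next_allowed_tp(cur_tp, gpu_count, tested) is None


def propose(cur_tp: int, gpu_count: int, tested: Set[int],
            gpu_mem_util: float) -> Dict[str, Union[int, float]]:
    """Climb TP while the frontier is open; only then try a runtime knob."""
    if not topology_settled(cur_tp, gpu_count, tested):
        return {"tensor-parallel-size": next_allowed_tp(cur_tp, gpu_count, tested)}
    # Frontier closed: runtime micro-tuning is now allowed (hypothetical +0.02 step).
    return {"gpu-memory-utilization": round(gpu_mem_util + 0.02, 2)}


if __name__ == "__main__":
    # TP2 incumbent on 6 GPUs: TP4 is still reachable, so TP must climb first.
    print(propose(2, 6, {1, 2}, 0.90))
    # TP4 incumbent on 6 GPUs: TP8 does not fit, so the frontier is closed.
    print(propose(4, 6, {1, 2, 4}, 0.90))
```

Under these assumptions the TP2 case mirrors the regression test in the commit message: the proposal is a TP increase and contains no gpu-memory-utilization change.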
76cca89a43 Add harness-only dash1 driver to verify the gpu-mem-util fix recovers ~0.87 + stops
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:29:32 +08:00
83162e7a64 Ablation: pin gpt-5.5 @ ai.gahow.org (chat.completions); re-read token per arm
- Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions
  in both ablation specs so both arms uniformly use the current ~/.codex model (the
  prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off).
- run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm
  instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:27:47 +08:00
a3523f5601 Harness: explore gpu-memory-utilization (and raise max-num-seqs) before Stop-B
The harness defined a gpu-memory-utilization family but hard-coded active_now=False
and never generated a candidate for it, and only ever *lowered* max-num-seqs for
decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the
naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873
(+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real
coverage gap, not bad luck.

Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology
has moved off the baseline, so a baseline latency bottleneck still gets a TP change):
- Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe
  ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block
  a premature Stop-B until it is tried; the incumbent guard keeps the step only if
  per-GPU rate improves and the engine launches, and the tested signature terminates
  the climb (so 0.96 OOM/regression backs off to 0.94 automatically).
- Let max-num-seqs *rise* for decode_tpot (not only fall) to exploit decode parallelism.
- Activate the gpu-memory-utilization harness family for decode_tpot/admission.

Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a
gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass.
Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:25:47 +08:00
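The gpu-memory-utilization hill-climb from commit a3523f5601 might look roughly like the sketch below. The step (+0.02) and the 0.97 ceiling come from the commit message; the function name `climb_gpu_mem_util`, the `measure` callback, and its "return `None` when the engine fails to launch" convention are assumptions made for illustration. The real harness proposes one candidate per iteration rather than looping in-process.

```python
from typing import Callable, Optional, Set, Tuple

STEP = 0.02      # per-step raise, from the commit message
CEILING = 0.97   # safe ceiling, from the commit message


def climb_gpu_mem_util(
    start: float,
    start_rate: float,
    measure: Callable[[float], Optional[float]],
) -> Tuple[float, float]:
    """Raise gpu-memory-utilization while the per-GPU rate keeps improving.

    measure(value) is a hypothetical callback returning the per-GPU rate,
    or None if the engine failed to launch (for example, on OOM).
    """
    best, best_rate = start, start_rate
    tested: Set[float] = {round(start, 2)}
    while True:
        cand = round(min(best + STEP, CEILING), 2)
        if cand in tested:
            break  # an already-tested signature terminates the climb
        tested.add(cand)
        rate = measure(cand)
        if rate is None or rate <= best_rate:
            break  # launch failure or regression: keep the incumbent
        best, best_rate = cand, rate
    return best, best_rate


if __name__ == "__main__":
    # 0.648 and 0.873 are the figures quoted in the commit; the 0.92 rate is
    # invented for the example, and 0.96 is modelled as a launch failure.
    fake_rates = {0.92: 0.70, 0.94: 0.873, 0.96: None}
    print(climb_gpu_mem_util(0.90, 0.648, fake_rates.get))
```

Because a failed or regressing step never replaces the incumbent, the climb "backs off" simply by returning the last value that improved.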
95c02d7dd9 Fig-18: chained driver for 2 extra naive runs (n=3 nondeterminism)
A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form
guess), which weakens the single-curve story. Run naive 2 more times on the same
real-output substrate to capture the fail/slow/lucky spread -- the actual finding.
Waits for ABLATION12_DONE so it never contends for GPUs with the main pair.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:06:05 +08:00
a1b804f879 Ablation: search.high 0.25 -> 0.15 (skip wildly-infeasible top probes)
Smoke on the real-output substrate measured feasible sampling_u = 0.0156 (TP2)
and 0.0742 (TP4, per-GPU 0.618 = 2.24x TP2). search.high=0.25 made the binary
search waste its two top probes (u=0.125/0.0625, always infeasible, admitting the
most long-output requests) on every trial. 0.15 keeps ~2x headroom over the TP4
boundary (0.0742) and trims ~15-20% of per-trial cost with identical feasibility
results; if a runtime-tuned config ever saturates 0.15 the harness search-high
saturation stop fires (informative, not silent).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 22:11:52 +08:00
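To see why lowering search.high changes which probes are issued (commit a1b804f879), here is a small sketch of a bisection over the sampling fraction u. It assumes the search halves the interval between 0 and search.high and that feasibility is monotone in u; the helper `probe_sequence` and the example feasibility boundary are hypothetical and not taken from the repository.

```python
from typing import Callable, List


def probe_sequence(high: float, is_feasible: Callable[[float], bool],
                   n_probes: int, low: float = 0.0) -> List[float]:
    """Return the sampling fractions a plain bisection would probe."""
    probes: List[float] = []
    for _ in range(n_probes):
        mid = (low + high) / 2
        probes.append(mid)
        if is_feasible(mid):
            low = mid    # feasible: search higher
        else:
            high = mid   # infeasible: search lower
    return probes


if __name__ == "__main__":
    # Hypothetical monotone boundary; a real run measures feasibility per probe.
    feasible = lambda u: u <= 0.02
    print(probe_sequence(0.25, feasible, 4))
    print(probe_sequence(0.15, feasible, 4))
```

With high=0.25 the first two probes are 0.125 and 0.0625, the values named in the commit message; starting from 0.15 the search begins at 0.075 instead.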
0c23285f39 Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline
Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one:
- Use the trace's real output_length (drop completion_tokens_override=128). The
  0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode
  (TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap.
- replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest
  scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays
  >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis
  far below the tau bar used everywhere else. New calibrator:
  scripts/calibrate_time_scale.py.
- Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the
  wall-clock a *feasible* config needs to drain the LCA-admitted set
  (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real
  outputs decode dominates wall-clock, so the old fixed 320s cap would truncate
  the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a
  hard ceiling; the per-probe deadline governs. The lag cap still cuts overload.

12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound):
scripts/run_ablation_pair_d1.sh. 115 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 17:24:00 +08:00
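The per-probe drain deadline from commit 0c23285f39 reduces to a simple sum capped by the hard ceiling. The sketch below follows the terms listed in the commit message (last arrival, worst-case TTFT, p99 output length times the TPOT budget, margin) and the 1000 s value of early_stop_max_elapsed_s; the function name, parameter names, and the example inputs other than p99=2436 are assumptions, not the signature of worker._probe_drain_deadline.

```python
def probe_drain_deadline(
    last_arrival_s: float,
    worst_ttft_s: float,
    p99_out_tokens: int,
    tpot_budget_s: float,
    margin_s: float,
    hard_ceiling_s: float = 1000.0,
) -> float:
    """Wall-clock seconds a feasible config needs to drain the admitted set.

    The result is capped by the hard ceiling so a probe can never run
    longer than the global early-stop limit.
    """
    needed = (
        last_arrival_s
        + worst_ttft_s
        + p99_out_tokens * tpot_budget_s
        + margin_s
    )
    return min(needed, hard_ceiling_s)


if __name__ == "__main__":
    # p99_out_tokens=2436 is quoted in the commit; the other inputs are made up.
    print(probe_drain_deadline(200.0, 5.0, 2436, 0.05, 30.0))
    # A very slow TPOT budget hits the hard ceiling instead.
    print(probe_drain_deadline(200.0, 5.0, 2436, 0.5, 30.0))
```

With these illustrative inputs the first deadline exceeds the old fixed 320 s cap, which is the truncation problem the commit describes.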
816765071f Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic
Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6
iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime
detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and
crashed the engine. Refined conclusion (matches paper §7.3): a strong model can
sometimes find the right knob unaided, so the harness's value is reliability + speed +
stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4,
no regression. Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:03:26 +08:00
97d2ddabb1 Ablation driver: force direct LLM connection (codex proxy is dash0-local)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 10:05:44 +08:00
8e58b4033d Note dash1 lacks LLM gateway access (naive-completion deferred to dash0)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:55:39 +08:00
b779f6e56a Add dash1 naive-completion driver for the ablation
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:52:54 +08:00