Commit Graph

204 Commits

SHA1 Message Date
42f75553a6 Document full config signature validation 2026-06-26 21:52:18 +08:00
48911b658b Use normalized full config signatures 2026-06-26 21:28:10 +08:00
7f50b8b8ea Document bad-start validation results 2026-06-26 20:50:20 +08:00
c8a0f9870e Tighten topology and auto-high validation 2026-06-26 20:07:23 +08:00
1dd3eaebaa Add auto search high measurement policy 2026-06-26 20:05:22 +08:00
95ad124a1b Document auto search high policy 2026-06-26 19:53:30 +08:00
384cb58f1f Add declarative harness prototype 2026-06-26 18:07:02 +08:00
4075c7abf0 Design declarative intervention harness 2026-06-26 17:15:06 +08:00
92eb186006 Add bad-start harness recovery planning 2026-06-26 16:44:24 +08:00
ce36cd79af Document no-LLM harness mechanism 2026-06-25 10:32:29 +08:00
013b01baa1 Stop after gmu ceiling validation is exhausted 2026-06-24 22:45:42 +08:00
b075afe6f2 Continue gmu hill-climb after topology validation 2026-06-24 19:09:35 +08:00
8fa758797e Guard generic topology search from introducing EP 2026-06-24 15:21:22 +08:00
c245774d76 Ignore generated run configs 2026-06-24 11:48:21 +08:00
d85572e7b5 Update AITuner roadmap framing 2026-06-24 11:45:42 +08:00
c0a9235b80 Document vLLM-first harness roadmap 2026-06-24 11:23:39 +08:00
c4173b2b3b Document remote proxy setup 2026-06-23 20:12:53 +08:00
6d874ecbff Update Qwen235B progress snapshot 2026-06-23 18:24:57 +08:00
403ae2e2b7 Document Qwen235B 2x2 progress 2026-06-23 18:23:56 +08:00
861d754f29 Localize Qwen27B harness ablation doc 2026-06-23 18:14:35 +08:00
76ec19224c Document Qwen27B 2x2 harness ablation 2026-06-23 10:08:46 +08:00
e67bc86240 Probe coupled prefill runtime knobs before stop 2026-06-22 19:30:23 +08:00
fd94ab9f3b Prevent prefill convergence stop before seq probe 2026-06-22 14:43:55 +08:00
4607711bb5 Add reusable clean pair runner 2026-06-22 00:05:31 +08:00
d23b69219b Add clean dash1 harness ablation runner 2026-06-21 00:51:08 +08:00
488fae7e63 Add tuning progress report for harness evaluation 2026-06-21 00:48:21 +08:00
426151bc9f Harness stop uses full state baseline 2026-06-20 22:48:27 +08:00
a9d237bbfd Show effective flags in ablation trajectory 2026-06-20 10:24:53 +08:00
5257fbc1a2 Improve harness incumbent follow-up search 2026-06-20 05:37:15 +08:00
b3156a382a Harness: gate gpu-mem-util/seqs-raise on 'no untested TP increase' (frontier-closed)
The first gpt-5.5 verification run exposed a bug in the prior gate: topology_settled =
cur_tp>base_tp let gpu-memory-utilization fire on a TP2 incumbent (TP2>baseline TP1)
and preempt the still-open TP4 frontier -- the harness proposed TP2+gpu-mem-util=0.92
at iter 2 instead of climbing to TP4. The candidate path runs before the topology-
frontier check, so a score>=0.35 runtime candidate wins.

Fix: gate runtime micro-tuning (gpu-mem-util, raising max-num-seqs) on the TP frontier
being closed -- topology_settled = no untested _next_allowed_tp remains (respects GPU
count, so TP4 is the real ceiling on 6 GPUs). New regression test: TP2 incumbent with
TP4 reachable must climb TP and must NOT propose gpu-mem-util. 116 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 13:33:29 +08:00
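
A minimal sketch of the frontier-closed gate described in b3156a382a. The helper names, the `tested` set, and the assumption that TP grows by doubling are illustrative only; the repository's `_next_allowed_tp` may have a different signature and rules.

```python
from typing import Optional, Set, Tuple


def next_allowed_tp(cur_tp: int, gpu_count: int, tested: Set[int]) -> Optional[int]:
    """Return the smallest untested TP above cur_tp that fits on gpu_count GPUs.

    Assumption: TP sizes grow by doubling (1, 2, 4, ...), so on 6 GPUs the
    ceiling is TP4.
    """
    tp = cur_tp * 2
    while tp <= gpu_count:
        if tp not in tested:
            return tp
        tp *= 2
    return None


def topology_settled(cur_tp: int, gpu_count: int, tested: Set[int]) -> bool:
    # Settled means the TP frontier is closed, not merely "above the baseline".
    return next_allowed_tp(cur_tp, gpu_count, tested) is None


def propose(cur_tp: int, gpu_count: int, tested: Set[int]) -> Tuple[str, Optional[int]]:
    """Climb TP while the frontier is open; only then allow runtime micro-tuning."""
    nxt = next_allowed_tp(cur_tp, gpu_count, tested)
    if nxt is not None:
        return ("tensor-parallel-size", nxt)
    return ("gpu-memory-utilization", None)
```

Under these assumptions a TP2 incumbent on 6 GPUs with TP4 untested yields a TP proposal, which is the behavior the regression test in the commit message asks for.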
76cca89a43 Add harness-only dash1 driver to verify the gpu-mem-util fix recovers ~0.87 + stops
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:29:32 +08:00
83162e7a64 Ablation: pin gpt-5.5 @ ai.gahow.org (chat.completions); re-read token per arm
- Pin endpoint.model=gpt-5.5, base_url=https://ai.gahow.org/v1, wire_api=chat.completions
  in both ablation specs so both arms uniformly use the current ~/.codex model (the
  prior runs used the stale ai.prism.uno/gpt-5.4 that config.toml has since moved off).
- run_ablation_pair_d1.sh re-reads the codex token from auth.json right before each arm
  instead of capturing it once at launch (the stale-at-use capture 401'd naive 2/3).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:27:47 +08:00
a3523f5601 Harness: explore gpu-memory-utilization (and raise max-num-seqs) before Stop-B
The harness defined a gpu-memory-utilization family but hard-coded active_now=False
and never generated a candidate for it, and only ever *lowered* max-num-seqs for
decode_tpot. So on the decode-bound 27B incumbent it stopped at TP4=0.648 while the
naive (use_harness=false) baseline freely found gpu-memory-utilization=0.94 -> 0.873
(+35%) and max-num-seqs=48. That made the harness look worse than naive -- a real
coverage gap, not bad luck.

Fix in _runtime_candidate_actions (topology-before-runtime gated: only once topology
has moved off the baseline, so a baseline latency bottleneck still gets a TP change):
- Add a gpu-memory-utilization hill-climb candidate (+0.02/step toward a 0.97 safe
  ceiling) for decode_tpot/admission incumbents, scored high enough (>=0.35) to block
  a premature Stop-B until it is tried; the incumbent guard keeps the step only if
  per-GPU rate improves and the engine launches, and the tested signature terminates
  the climb (so 0.96 OOM/regression backs off to 0.94 automatically).
- Let max-num-seqs *rise* for decode_tpot (not only fall) to exploit decode parallelism.
- Activate the gpu-memory-utilization harness family for decode_tpot/admission.

Verified: new unit test asserts a settled TP4 decode-bound incumbent gets a
gpu-memory-utilization raise (0.9->0.92) and no stop while untried. 115 tests pass.
Empirical reliability (harness recovers ~0.87 and stops) to be confirmed by re-run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:25:47 +08:00
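
The hill-climb in a3523f5601 can be sketched roughly as follows. This is not the code in `_runtime_candidate_actions`; the function name, the `measure` callback, and the 0.70 rate used in the test data are hypothetical. Only the +0.02 step, the 0.97 ceiling, and the keep-only-if-better rule come from the commit message.

```python
from typing import Callable, Optional, Set, Tuple

STEP = 0.02      # per-step increase, from the commit message
CEILING = 0.97   # safe ceiling, from the commit message


def climb_gpu_mem_util(
    start: float,
    base_rate: float,
    measure: Callable[[float], Optional[float]],
) -> Tuple[float, float]:
    """Raise gpu-memory-utilization while the per-GPU rate keeps improving.

    measure(util) returns the per-GPU rate, or None when the engine fails to
    launch (for example on OOM). A failed or regressing step is discarded,
    so the incumbent is kept and the climb ends.
    """
    best_util, best_rate = start, base_rate
    tested: Set[float] = {round(start, 2)}
    while True:
        cand = round(best_util + STEP, 2)
        if cand > CEILING or cand in tested:
            break  # ceiling reached or signature already tried
        tested.add(cand)
        rate = measure(cand)
        if rate is None or rate <= best_rate:
            break  # back off to the incumbent
        best_util, best_rate = cand, rate
    return best_util, best_rate
```

With a start of 0.90 at rate 0.648, an improving step at 0.94, and a failed launch at 0.96, the sketch returns 0.94, which mirrors the automatic back-off the commit describes.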
95c02d7dd9 Fig-18: chained driver for 2 extra naive runs (n=3 nondeterminism)
A single naive run can luck into the TP4 optimum at iter 1 (gpt-5.4 free-form
guess), which weakens the single-curve story. Run naive 2 more times on the same
real-output substrate to capture the fail/slow/lucky spread -- the actual finding.
Waits for ABLATION12_DONE so it never contends for GPUs with the main pair.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:06:05 +08:00
a1b804f879 Ablation: search.high 0.25 -> 0.15 (skip wildly-infeasible top probes)
Smoke on the real-output substrate measured feasible sampling_u = 0.0156 (TP2)
and 0.0742 (TP4, per-GPU 0.618 = 2.24x TP2). search.high=0.25 made the binary
search waste its two top probes (u=0.125/0.0625, always infeasible, admitting the
most long-output requests) on every trial. 0.15 keeps ~2x headroom over the TP4
boundary (0.0742) and trims ~15-20% of per-trial cost with identical feasibility
results; if a runtime-tuned config ever saturates 0.15 the harness search-high
saturation stop fires (informative, not silent).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 22:11:52 +08:00
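
To illustrate why `search.high` determines the first probes, here is a sketch of a plain bisection over `[0, high]`. It assumes feasibility is monotone in `u` and that the search starts from a lower bound of 0; the function name and the `feasible` callback are hypothetical, and the real search may differ.

```python
from typing import Callable, List, Tuple


def bisect_feasible(
    high: float,
    max_probes: int,
    feasible: Callable[[float], bool],
) -> Tuple[float, List[float]]:
    """Bisect [0, high] for the largest feasible u; return (best_u, probes)."""
    low = 0.0
    probes: List[float] = []
    for _ in range(max_probes):
        mid = (low + high) / 2
        probes.append(mid)
        if feasible(mid):
            low = mid    # feasible: search higher
        else:
            high = mid   # infeasible: search lower
    return low, probes
```

With `high=0.25` the first two probes are always 0.125 and 0.0625, whereas with `high=0.15` the first probe is already 0.075, close to the measured TP4 boundary.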
0c23285f39 Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline
Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one:
- Use the trace's real output_length (drop completion_tokens_override=128). The
  0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so decode
  (TPOT) becomes the dominant bottleneck instead of an artificial 128-token cap.
- replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest
  scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays
  >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis
  far below the tau bar used everywhere else. New calibrator:
  scripts/calibrate_time_scale.py.
- Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline): the
  wall-clock a *feasible* config needs to drain the LCA-admitted set
  (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real
  outputs decode dominates wall-clock, so the old fixed 320s cap would truncate
  the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now a
  hard ceiling; the per-probe deadline governs. The lag cap still cuts overload.

12-iter paired driver (both arms on dash1, removes the dash0/dash1 host confound):
scripts/run_ablation_pair_d1.sh. 115 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 17:24:00 +08:00
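
The per-probe drain deadline from 0c23285f39 follows directly from the formula in the commit message. The sketch below is not `worker._probe_drain_deadline` itself; the function name, parameter names, and the example inputs are assumptions. Only the formula terms and the 1000s hard ceiling come from the commit.

```python
def probe_drain_deadline(
    last_arrival_s: float,
    worst_ttft_s: float,
    p99_out_tokens: float,
    tpot_budget_s: float,
    margin_s: float,
    hard_ceiling_s: float = 1000.0,  # early_stop_max_elapsed_s in the commit
) -> float:
    """Wall-clock a feasible config needs to drain the admitted request set.

    deadline = last_arrival + worst-case TTFT + p99_out * TPOT budget + margin,
    capped by the hard elapsed ceiling.
    """
    deadline = (
        last_arrival_s
        + worst_ttft_s
        + p99_out_tokens * tpot_budget_s
        + margin_s
    )
    return min(deadline, hard_ceiling_s)
```

For example, with a hypothetical last arrival at 526.5s, a 5s TTFT bound, p99 output of 2436 tokens, a 0.05s TPOT budget, and a 30s margin, the deadline is about 683.3s, well above the old fixed 320s cap.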
816765071f Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic
Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6
iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime
detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and
crashed the engine. Refined conclusion (matches paper §7.3): a strong model can
sometimes find the right knob unaided, so the harness's value is reliability + speed +
stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4,
no regression. Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:03:26 +08:00
97d2ddabb1 Ablation driver: force direct LLM connection (codex proxy is dash0-local)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 10:05:44 +08:00
8e58b4033d Note dash1 lacks LLM gateway access (naive-completion deferred to dash0)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:55:39 +08:00
b779f6e56a Add dash1 naive-completion driver for the ablation
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:52:54 +08:00
e7d1b3ba01 Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders
Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag
differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse
refinements, premature LLM stop vetoed then honored -> converged, no regression.
Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5
trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible
config found. The bottleneck is compute; the harness steered to the knob family that
adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's
Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run
was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:51:56 +08:00
579dd86698 Ablation: --skip-baseline so loops climb from first proposal
The low-capacity TP1 auto-baseline is infeasible under tight TTFT/TPOT + time
compression, which tripped baseline_all_infeasible and terminated the loop before any
climb. Skip the auto-baseline so both runs start from the first LLM/harness proposal
(harness steers to TP from the long-prompt profile). The ablation is about the
proposal path, so an explicit TP1 row is not required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:59:46 +08:00
37342a5749 Add chained harness-vs-naive ablation driver (sequential runs + DONE marker)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:30:41 +08:00
5965f4fbbc Ablation substrate: scale=0.5 + out=128 + 6 probes (TP1 measurable, tractable)
scale=0.2 made TP1 uniformly infeasible (no baseline); bound decode to 128 tokens and
use mild 2x compression so TP1 registers a real, fast baseline, with 6 probes to span
TP1's low and TP4's high feasibility boundaries. Both configs identical except use_harness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:29:30 +08:00
a1cbab0e69 Document harness-vs-naive ablation: setup, substrate calibration, blocker
Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B:
- both configs committed and validated on dash0 (differ only in
  use_harness + study_id), LLM auth + clean engine launch confirmed;
- characterizes exactly what the harness toggles (Harnesses: prompt
  section with ranked bottleneck hypotheses + knob-family steering,
  deterministic guided/stop proposals, Stop-B validator/veto) vs naive;
- substrate calibration from a real harness-ON run: at scale=0.2 the
  180s elapsed cap fires correctly but TP1 is uniformly infeasible even
  at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a
  real baseline; comparability caveat documented.

Honest status: full two-run sweep NOT completed in-session (~5-6
GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM
teardown re-validated). Includes a precise continuation recipe and the
scripts/ablation_trajectory.py helper (validated against a prior store).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:16:27 +08:00
0794efa249 Reduce ablation probe budget to 3 per trial for tractability
First TP1 baseline probe under scale=0.2 ran ~6min (severe overload, 260
preemptions on the lighter half of the trace; TP1 is decode-bound and the
arrival-lag early-stop does not cut a decode-drain-bound probe). Cut
search.max_probes 5->3 to bound binary-search steps per trial. Caps stay
at elapsed=180/lag=30. Both configs still differ only in use_harness +
study_id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:01:19 +08:00
d975e57bb5 Scale ablation early-stop caps to the compressed window (scale=0.2)
At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so
the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn
~15min each (the tractability hazard the brief flagged). Scale the caps
down with the time axis: early_stop_max_elapsed_s 900->180 (0.2x),
early_stop_max_lag_s 120->30 (0.25x; strict 0.2x would be 24s). Feasible probes (~120s arrival + drain)
finish well inside 180s; overloaded probes die in ~3min. Both configs
still differ only in use_harness + study_id. Adds the ablation doc
skeleton and a read-only trajectory-extraction helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:49:57 +08:00
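
The cap arithmetic in d975e57bb5 can be checked with a short sketch. The helper name is hypothetical; the input values are taken from the commit message.

```python
def scale_seconds(value_s: float, time_scale: float) -> float:
    """Scale a wall-clock quantity by the replay time scale."""
    return value_s * time_scale


TIME_SCALE = 0.2  # replay_time_scale from the commit

window_s = scale_seconds(600, TIME_SCALE)   # arrival window: about 120s
elapsed_s = scale_seconds(900, TIME_SCALE)  # elapsed cap: about 180s
lag_s = scale_seconds(120, TIME_SCALE)      # lag cap: about 24s if strictly scaled

print(round(window_s), round(elapsed_s), round(lag_s))
```

Strict scaling gives a 24s lag cap, so the committed value of 30s leaves a little slack above the proportional figure.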
a16016a876 Add harness vs naive ablation configs (27b, scale=0.2 substrate)
Two configs identical except llm.use_harness and study_id, for the
controlled harness-ON vs naive-OFF tuning-trajectory ablation on dense
Qwen3.5-27B. Faster substrate (replay_time_scale=0.2, search.high=0.25,
max_probes=5) keeps the ablation tractable; Stop-A stays enabled.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:31:23 +08:00
07f5d92e1d Add consolidated two-stop summary doc
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:16:28 +08:00
f2ff0faebd Document Stop-B end-to-end on dense 27B: the improving climb + no-regression
Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x),
each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00
and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop
veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was
validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding:
at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical
loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven
to an explicit Stop-B firing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 18:07:00 +08:00