Commit Graph

78 Commits

Author SHA1 Message Date
92eb186006 Add bad-start harness recovery planning 2026-06-26 16:44:24 +08:00
ce36cd79af Document no-LLM harness mechanism 2026-06-25 10:32:29 +08:00
d85572e7b5 Update AITuner roadmap framing 2026-06-24 11:45:42 +08:00
c0a9235b80 Document vLLM-first harness roadmap 2026-06-24 11:23:39 +08:00
6d874ecbff Update Qwen235B progress snapshot 2026-06-23 18:24:57 +08:00
403ae2e2b7 Document Qwen235B 2x2 progress 2026-06-23 18:23:56 +08:00
861d754f29 Localize Qwen27B harness ablation doc 2026-06-23 18:14:35 +08:00
76ec19224c Document Qwen27B 2x2 harness ablation 2026-06-23 10:08:46 +08:00
816765071f Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic
Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6
iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime
detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and
crashed the engine. Refined conclusion (matches paper §7.3): a strong model can
sometimes find the right knob unaided, so the harness's value is reliability + speed +
stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4,
no regression. Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:03:26 +08:00
8e58b4033d Note dash1 lacks LLM gateway access (naive-completion deferred to dash0)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:55:39 +08:00
e7d1b3ba01 Harness-vs-naive ablation result: harness steers to TP & converges; naive wanders
Controlled use_harness on/off on dense 27B (same workload/SLO/substrate, only the flag
differs). Harness ON: TP2 -> TP4 (0.34 req/s/GPU) in 2 iters, rejected two worse
refinements, premature LLM stop vetoed then honored -> converged, no regression.
Naive OFF: kept TP=1 and cranked runtime knobs (mbt 16k->65k, seqs, caching), all 5
trials infeasible (same TPOT/TTFT compute bottleneck), one engine OOM crash, no feasible
config found. The bottleneck is compute; the harness steered to the knob family that
adds compute (TP) while naive wandered in knobs that cannot. Reproduces the paper's
Fig-18 finding. Substrate is compressed (process comparison, not peak-rate); naive run
was infra-interrupted at trial-5 (already conclusive). Read from cpfs via dash1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:51:56 +08:00
a1cbab0e69 Document harness-vs-naive ablation: setup, substrate calibration, blocker
Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B:
- both configs committed and validated on dash0 (differ only in
  use_harness + study_id), LLM auth + clean engine launch confirmed;
- characterizes exactly what the harness toggles (Harnesses: prompt
  section with ranked bottleneck hypotheses + knob-family steering,
  deterministic guided/stop proposals, Stop-B validator/veto) vs naive;
- substrate calibration from a real harness-ON run: at scale=0.2 the
  180s elapsed cap fires correctly but TP1 is uniformly infeasible even
  at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a
  real baseline; comparability caveat documented.

Honest status: full two-run sweep NOT completed in-session (~5-6
GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM
teardown re-validated). Includes a precise continuation recipe and the
scripts/ablation_trajectory.py helper (validated against a prior store).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:16:27 +08:00
d975e57bb5 Scale ablation early-stop caps to the compressed window (scale=0.2)
At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so
the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn
~15min each (the tractability hazard the brief flagged). Scale the caps
proportionately to the time axis: early_stop_max_elapsed_s 900->180,
early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain)
finish well inside 180s; overloaded probes die in ~3min. Both configs
still differ only in use_harness + study_id. Adds the ablation doc
skeleton and a read-only trajectory-extraction helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:49:57 +08:00
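
Note on d975e57bb5: the cap scaling in this commit can be checked with a few lines of arithmetic. The sketch below is illustrative and not taken from the repository; the helper name compress_seconds is hypothetical. It assumes every duration on the original time axis is multiplied by replay_time_scale. Under that assumption the elapsed cap lands exactly on the committed 180s, while a strictly proportional lag cap would be 24s, so the committed 30s leaves a little headroom.

```python
def compress_seconds(value_s: float, replay_time_scale: float) -> float:
    """Map a duration from the original time axis onto the compressed one.

    Hypothetical helper for illustration; the repository may compute this differently.
    """
    if replay_time_scale <= 0:
        raise ValueError("replay_time_scale must be positive")
    return value_s * replay_time_scale


scale = 0.2  # replay_time_scale from the commit message

# Values on the original (scale=1.0) axis, as quoted in the commit message.
arrival_window_s = compress_seconds(600, scale)   # compressed arrival window
elapsed_cap_s = compress_seconds(900, scale)      # early_stop_max_elapsed_s
strict_lag_cap_s = compress_seconds(120, scale)   # strict proportion for early_stop_max_lag_s

committed_lag_cap_s = 30  # value chosen in the commit, above the strict proportion

print(f"arrival window: {arrival_window_s:.0f}s")
print(f"elapsed cap:    {elapsed_cap_s:.0f}s")
print(f"lag cap:        strict {strict_lag_cap_s:.0f}s, committed {committed_lag_cap_s}s")
```
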
07f5d92e1d Add consolidated two-stop summary doc
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:16:28 +08:00
f2ff0faebd Document Stop-B end-to-end on dense 27B: the improving climb + no-regression
Real gpt-5.4 agentic loop raised per-GPU TP1 0.123 -> TP2 0.2925 -> TP4 1.0012 (8.1x),
each a correctly-diagnosed real gain; then a TP4 runtime tweak measured 0.942 < 1.00
and was correctly rejected (no regression). With the 30B run (validator stop + LLM-stop
veto), all Stop-B behaviors are now validated end-to-end. The SIGTERM-teardown fix was
validated in practice (clean engine teardown, no GPU leak on stop). Efficiency finding:
at scale=1.0, infeasible high-theta probes burn the 900s elapsed cap, so a practical
loop needs a lower cap; this is why the run was stopped after iter-4 rather than driven
to an explicit Stop-B firing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 18:07:00 +08:00
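
Note on f2ff0faebd: the no-regression behavior described in this commit amounts to an incumbent rule: a candidate replaces the current best only if its measured per-GPU rate is strictly higher. The sketch below is a minimal illustration using the rates quoted in the commit message. The function name, the tuple layout, and the strict-improvement rule are assumptions, not the repository's Stop-B implementation.

```python
from typing import List, Optional, Tuple

Trial = Tuple[str, float]  # (config label, measured req/s/GPU)


def climb(trials: List[Trial]) -> Tuple[Optional[Trial], List[str]]:
    """Keep the best trial seen so far; reject any candidate that does not improve on it."""
    incumbent: Optional[Trial] = None
    rejected: List[str] = []
    for label, rate in trials:
        if incumbent is None or rate > incumbent[1]:
            incumbent = (label, rate)   # real gain: adopt the candidate
        else:
            rejected.append(label)      # would regress: keep the incumbent
    return incumbent, rejected


# Rates as quoted in the commit message; labels are shortened for illustration.
trajectory = [
    ("TP1", 0.123),
    ("TP2", 0.2925),
    ("TP4", 1.0012),
    ("TP4 + runtime tweak", 0.942),
]

incumbent, rejected = climb(trajectory)
gain = incumbent[1] / trajectory[0][1]
print(incumbent, rejected, f"{gain:.1f}x")
```
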
93ce339d61 Document 27B TP sweep: per-GPU rises sharply with TP (dense), opposite of MoE
Under the length-aware TTFT SLO (4s + L_in/8k), dense Qwen3.5-27B per-GPU throughput:
TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (>=14x, ceiling-saturated). TP1 is TPOT-bound
(one H20 can't decode a 27B under 50ms/token once batched); loosening TTFT didn't move
TP1, confirming TPOT is the binding constraint. Opposite of MoE 30B-A3B where TP1 was
best per-GPU. Validates the harness + length-aware SLO produce meaningful, non-saturated
measurements (TP1/TP2). TP4 saturated -> lower bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:54:40 +08:00
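
Note on 93ce339d61: the length-aware TTFT SLO is written in the commit as "4s + L_in/8k". One plausible reading, sketched below, is a 4-second base plus one extra second per 8,000 input tokens. The function name, the choice of 8000 rather than 8192, and the interpretation of L_in as an input token count are all assumptions made for illustration.

```python
def ttft_slo_s(input_tokens: int, base_s: float = 4.0, tokens_per_extra_second: float = 8000.0) -> float:
    """Length-aware TTFT budget in seconds under the assumed reading of '4s + L_in/8k'."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return base_s + input_tokens / tokens_per_extra_second


def meets_ttft_slo(measured_ttft_s: float, input_tokens: int) -> bool:
    """True when a request's measured TTFT fits within its length-aware budget."""
    return measured_ttft_s <= ttft_slo_s(input_tokens)


# Longer prompts get a proportionally larger budget.
for tokens in (0, 2000, 8000):
    print(tokens, ttft_slo_s(tokens))
```
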
77af4ded2a Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:40:38 +08:00
90c3eb51c8 Document Stop-B end-to-end validation (Phase 5)
Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both
Stop-B paths: search-high-saturation (validator-authorized immediate stop) and
multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90
req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is
correctly never adopted (no regression). The Phase-4 authority model is exercised
live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later
justified stop is honored after the veto budget. EP launch-failures handled as
hard-negative evidence. Auditable reason chains throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:58:44 +08:00
f31e9ccfd5 Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved
With the guard enabled the binary search recovers best sampling_u=0.078125
(rate 2.30 req/s), identical to the full-replay baseline. The guard fired on
exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible);
the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial
with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:57:53 +08:00
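
Note on f31e9ccfd5: the probe values in this commit are consistent with a plain bisection over sampling_u. The sketch below is not the repository's search code. The search interval [0, 0.125], the probe count of 4, and the stand-in feasibility test with a knee at 0.08 are assumptions, chosen because they produce the reported best value 0.078125 and a final probe at 0.0859375 (0.08594 rounded).

```python
from typing import Callable, List, Optional, Tuple


def search_max_feasible_u(
    is_feasible: Callable[[float], bool],
    lo: float,
    hi: float,
    probes: int,
) -> Tuple[Optional[float], List[Tuple[float, bool]]]:
    """Bisect for the largest feasible sampling_u, recording every probe."""
    best: Optional[float] = None
    history: List[Tuple[float, bool]] = []
    for _ in range(probes):
        mid = (lo + hi) / 2
        ok = is_feasible(mid)
        history.append((mid, ok))
        if ok:
            best, lo = mid, mid   # feasible: remember it and search higher
        else:
            hi = mid              # infeasible: search lower
    return best, history


# Hypothetical stand-in for a real replay probe: feasible up to a knee at u = 0.08.
best_u, history = search_max_feasible_u(lambda u: u <= 0.08, lo=0.0, hi=0.125, probes=4)

for u, ok in history:
    print(f"u={u:.7f} feasible={ok}")
print("best sampling_u:", best_u)
```

In this reading, the last probe sits just above the feasibility knee, which is where a truncated replay is most likely to give the wrong verdict and where a boundary guard would re-measure in full.
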
9f52812753 Document Stop-A validation: calibration + GPU fidelity check
CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and
shows C-convergence difficulty is driven by signal noise (low-reuse chat) not
reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A
convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts
preserved; the one mismatch is a boundary false-positive at the feasibility knee
(prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered
L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:03:16 +08:00
8b4116fad0 Add reference paper and qwen27b tpot25 16-iter notes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:30 +08:00
984eb1f325 Document 8-GPU harness ablation results for qwen27b and qwen235b prefill
Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU)

Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation
log with completed run status.
2026-05-16 21:23:16 +08:00
f18765b235 Document eight-GPU harness rerun 2026-05-13 09:04:14 +08:00
5c2958e6c1 Constrain harness topology by visible GPUs 2026-05-13 01:25:31 +08:00
fb6d74a18c Document harness v2 rerun criteria 2026-05-12 22:23:12 +08:00
ef359c8eea Document profile-driven harness run 2026-05-12 21:40:19 +08:00
17e9681ca0 Add profile-driven harness planner 2026-05-12 21:28:44 +08:00
63d6a111f4 Document profile-driven harness design 2026-05-12 21:09:29 +08:00
14259fcec9 Measure lower-range performance for infeasible trials 2026-05-10 14:30:34 +08:00
bf7c02e721 Clarify qwen27b raw per-iteration performance 2026-05-10 14:24:10 +08:00
b0325ecfd9 Clarify qwen235b raw per-iteration performance 2026-05-10 14:21:49 +08:00
4cfd3757b6 Document qwen235b prefill harness ablation 2026-05-10 13:05:49 +08:00
307e2eb0e8 Document qwen27b harness ablation 2026-05-10 01:12:21 +08:00
adc4351e5d Report latency stats for infeasible baseline 2026-05-08 11:10:34 +08:00
eb137a0b62 Document TPOT40 baseline infeasible run 2026-05-08 02:57:03 +08:00
d7df1ebdac Add open source project metadata
2026-05-06 21:18:21 +08:00
871c4cfc02 Document qwen27b chat setup audit 2026-05-06 20:32:09 +08:00
98cd6dd81a Document qwen27b current config harness curve 2026-05-06 18:00:43 +08:00
5d96689ea6 Make harness runtime refinement memory safe 2026-05-06 17:37:31 +08:00
cf2e741550 Document high search rerun 2026-05-06 03:19:51 +08:00
915861b706 Document community vllm harness ablation 2026-05-02 11:17:24 +08:00
ccbf24ac47 Use time-compressed community vllm ablation 2026-05-02 10:03:59 +08:00
d3d4c234f6 Bound community vllm ablation replay 2026-05-02 09:58:56 +08:00
4ef69cce78 Make harness stop conservative for ablation 2026-05-02 09:47:16 +08:00
664aeb49b2 Use local cache for qwen30b vllm runs 2026-05-02 08:47:16 +08:00
1880e859b5 Use vllm cu129 wheel on dash0 2026-05-02 08:28:23 +08:00
e215827503 Use uv auto torch backend for vllm 0.20 2026-05-02 08:21:27 +08:00
a7c9518ef6 Use local vllm venv for dash0 community run 2026-05-02 08:17:04 +08:00
1a3d628268 Add harness early stop ablation 2026-05-02 08:08:14 +08:00
6d3459c82d Document decode harness one-shot mechanism 2026-05-02 06:25:06 +08:00