Commit Graph

63 Commits

Author SHA1 Message Date
93ce339d61 Document 27B TP sweep: per-GPU throughput rises sharply with TP (dense), opposite of MoE
Under the length-aware TTFT SLO (4s + L_in/8k), per-GPU throughput for dense Qwen3.5-27B is
TP1=0.065, TP2=0.2925 (4.5x), TP4>=0.908 (about 14x or more, ceiling-saturated). TP1 is TPOT-bound
(one H20 can't decode a 27B model under 50ms/token once batched); loosening TTFT didn't move
TP1, confirming TPOT is the binding constraint. This is the opposite of MoE 30B-A3B, where TP1 was
best per-GPU. The result validates that the harness + length-aware SLO produce meaningful,
non-saturated measurements (TP1/TP2). TP4 saturated, so its figure is a lower bound.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:54:40 +08:00
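
The length-aware TTFT SLO and the per-GPU ratios quoted in the commit above can be reproduced with a small sketch. The function and variable names are hypothetical, and reading "L_in/8k" as input tokens divided by 8000 (rather than 8192) is an assumption; the throughput figures are copied from the commit message.

```python
def ttft_slo_seconds(input_tokens: int,
                     base_s: float = 4.0,
                     tokens_per_extra_s: float = 8000.0) -> float:
    """Length-aware TTFT budget: a fixed base plus one second per 8k input tokens.

    Assumption: "8k" means 8000 tokens; the source does not say 8000 vs 8192.
    """
    return base_s + input_tokens / tokens_per_extra_s


# Per-GPU throughput from the commit message (TP4 is a lower bound).
per_gpu = {"TP1": 0.065, "TP2": 0.2925, "TP4": 0.908}

# Speedup of each tensor-parallel degree relative to the TP1 baseline.
speedups = {tp: value / per_gpu["TP1"] for tp, value in per_gpu.items()}

if __name__ == "__main__":
    for tokens in (0, 8000, 32000):
        print(f"L_in={tokens:>6}: TTFT budget {ttft_slo_seconds(tokens):.2f}s")
    for tp, ratio in speedups.items():
        print(f"{tp}: {ratio:.2f}x vs TP1")
```

Because TP4 hit the trace ceiling, its ratio here should be read as a floor rather than a measurement.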
77af4ded2a Flag Stop-B e2e per-GPU trajectory as non-benchmark (saturation + smoke regime)
The reported trajectory validates the Stop-B mechanics only. TP2-DP2/TP4 saturated
the trace ceiling (best_sampling_u~0.98) so their per-GPU peak is underestimated, and
the run used the smoke regime (scale=0.1 + 512 cap). The TP1>TP2 ordering may be real
for the small-active MoE but this run cannot establish it; the 27B TP A/B is the valid
follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:40:38 +08:00
90c3eb51c8 Document Stop-B end-to-end validation (Phase 5)
Real gpt-5.4 agentic loop on Qwen3-30B-A3B/H20 with Stop-A enabled. Validates both
Stop-B paths: search-high-saturation (validator-authorized immediate stop) and
multi-iteration convergence. The TP1 baseline stays the per-GPU incumbent (2.90
req/s/GPU); TP/DP scaling raises raw throughput but lowers per-GPU efficiency and is
correctly never adopted (no regression). The Phase-4 authority model is exercised
live: a premature LLM stop is vetoed (validator_did_not_authorize_stop), then a later
justified stop is honored after the veto budget. EP launch-failures handled as
hard-negative evidence. Auditable reason chains throughout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:58:44 +08:00
f31e9ccfd5 Record Stop-A boundary-guard A/B: correct verdict, ~38% replay saved
With the guard enabled the binary search recovers best sampling_u=0.078125
(rate 2.30 req/s), identical to the full-replay baseline. The guard fired on
exactly the one feasibility-knee probe (0.08594, re-measured full -> infeasible);
the other three probes truncated to ~45-50%. Net ~38% replay saved on the trial
with no peak-rate overestimate. Stop-A + boundary guard is safe to enable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:57:53 +08:00
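
The boundary-guard idea in the commit above can be illustrated with a toy binary search. This is a minimal sketch, not the harness implementation: the search range [0, 0.125], the 0.95 pass-rate target, the 0.02 guard margin, and the `toy_measure` function are all assumptions chosen for illustration. The 0.96 (prefix) and 0.946 (full) values are borrowed from the Stop-A fidelity-check commit purely as toy inputs.

```python
from typing import Callable, List, Optional, Tuple


def search_with_guard(measure: Callable[[float, bool], float],
                      lo: float = 0.0, hi: float = 0.125, steps: int = 4,
                      target: float = 0.95, margin: float = 0.02
                      ) -> Tuple[Optional[float], List[float]]:
    """Binary-search the largest feasible sampling_u using truncated probes.

    A truncated probe whose pass rate lands within `margin` of `target`
    is re-measured with a full replay before its verdict is trusted.
    """
    best: Optional[float] = None
    full_reruns: List[float] = []
    for _ in range(steps):
        u = (lo + hi) / 2
        rate = measure(u, False)           # truncated (early-stopped) replay
        if abs(rate - target) <= margin:   # too close to the knee: guard fires
            rate = measure(u, True)        # full replay decides
            full_reruns.append(u)
        if rate >= target:
            best, lo = u, u                # feasible: search higher
        else:
            hi = u                         # infeasible: search lower
    return best, full_reruns


def toy_measure(u: float, full: bool) -> float:
    """Hypothetical pass-rate model with a knee between 0.08 and 0.09."""
    if u < 0.08:
        return 0.99
    if u < 0.09:
        return 0.946 if full else 0.96     # prefix looks feasible, full is not
    return 0.80


if __name__ == "__main__":
    best_u, reruns = search_with_guard(toy_measure)
    print("best sampling_u:", best_u)
    print("probes re-measured in full:", reruns)
```

In this toy setup the guard fires only on the probe nearest the knee, so the remaining probes keep their truncation savings while the final verdict matches what a full replay would give.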
9f52812753 Document Stop-A validation: calibration + GPU fidelity check
CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and
shows C-convergence difficulty is driven by signal noise (low-reuse chat) not
reuse magnitude. GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A
convergence prefix saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts
preserved; the one mismatch is a boundary false-positive at the feasibility knee
(prefix 0.96 vs full 0.946), caused by second-half engine-state drift the offered
L-C-A cannot see. Argues for revisiting the SLO-boundary guard before enabling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:03:16 +08:00
8b4116fad0 Add reference paper and qwen27b tpot25 16-iter notes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:02:30 +08:00
984eb1f325 Document 8-GPU harness ablation results for qwen27b and qwen235b prefill
Add completed experiment results from dash0 runs after 2026-05-13:
- qwen27b chat 0-8k: harness +118.6% over no-harness (0.2696 vs 0.1233 req/s/GPU)
- qwen235b prefill TTFT 3s/6s/9s: harness +76.8% (0.3921 vs 0.2217 req/s/GPU)

Mark old 7-GPU and pre-5/13 docs as superseded. Update implementation
log with completed run status.
2026-05-16 21:23:16 +08:00
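
The relative gains in the commit above follow from the quoted per-GPU rates. The sketch below recomputes them with a hypothetical helper; because the inputs are the rounded values from the commit message, the results can differ from the reported percentages in the last digit.

```python
def relative_gain_pct(with_harness: float, without_harness: float) -> float:
    """Percentage improvement of the harness run over the no-harness run."""
    return (with_harness / without_harness - 1.0) * 100.0


# Rates in req/s/GPU, copied from the commit message.
qwen27b_gain = relative_gain_pct(0.2696, 0.1233)
qwen235b_gain = relative_gain_pct(0.3921, 0.2217)

if __name__ == "__main__":
    print(f"qwen27b chat 0-8k: +{qwen27b_gain:.2f}%")
    print(f"qwen235b prefill:  +{qwen235b_gain:.2f}%")
```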
f18765b235 Document eight-GPU harness rerun 2026-05-13 09:04:14 +08:00
5c2958e6c1 Constrain harness topology by visible GPUs 2026-05-13 01:25:31 +08:00
fb6d74a18c Document harness v2 rerun criteria 2026-05-12 22:23:12 +08:00
ef359c8eea Document profile-driven harness run 2026-05-12 21:40:19 +08:00
17e9681ca0 Add profile-driven harness planner 2026-05-12 21:28:44 +08:00
63d6a111f4 Document profile-driven harness design 2026-05-12 21:09:29 +08:00
14259fcec9 Measure lower-range performance for infeasible trials 2026-05-10 14:30:34 +08:00
bf7c02e721 Clarify qwen27b raw per-iteration performance 2026-05-10 14:24:10 +08:00
b0325ecfd9 Clarify qwen235b raw per-iteration performance 2026-05-10 14:21:49 +08:00
4cfd3757b6 Document qwen235b prefill harness ablation 2026-05-10 13:05:49 +08:00
307e2eb0e8 Document qwen27b harness ablation 2026-05-10 01:12:21 +08:00
adc4351e5d Report latency stats for infeasible baseline 2026-05-08 11:10:34 +08:00
eb137a0b62 Document TPOT40 baseline infeasible run 2026-05-08 02:57:03 +08:00
d7df1ebdac Add open source project metadata
2026-05-06 21:18:21 +08:00
871c4cfc02 Document qwen27b chat setup audit 2026-05-06 20:32:09 +08:00
98cd6dd81a Document qwen27b current config harness curve 2026-05-06 18:00:43 +08:00
5d96689ea6 Make harness runtime refinement memory safe 2026-05-06 17:37:31 +08:00
cf2e741550 Document high search rerun 2026-05-06 03:19:51 +08:00
915861b706 Document community vllm harness ablation 2026-05-02 11:17:24 +08:00
ccbf24ac47 Use time-compressed community vllm ablation 2026-05-02 10:03:59 +08:00
d3d4c234f6 Bound community vllm ablation replay 2026-05-02 09:58:56 +08:00
4ef69cce78 Make harness stop conservative for ablation 2026-05-02 09:47:16 +08:00
664aeb49b2 Use local cache for qwen30b vllm runs 2026-05-02 08:47:16 +08:00
1880e859b5 Use vllm cu129 wheel on dash0 2026-05-02 08:28:23 +08:00
e215827503 Use uv auto torch backend for vllm 0.20 2026-05-02 08:21:27 +08:00
a7c9518ef6 Use local vllm venv for dash0 community run 2026-05-02 08:17:04 +08:00
1a3d628268 Add harness early stop ablation 2026-05-02 08:08:14 +08:00
6d3459c82d Document decode harness one-shot mechanism 2026-05-02 06:25:06 +08:00
9e5394b557 Inherit incumbent topology for runtime validation 2026-04-30 09:33:49 +08:00
f59919e21c Clarify base-relative validation patches 2026-04-30 06:52:09 +08:00
46e9040613 Record decode validation follow-up 2026-04-28 21:20:41 +08:00
38ff4380e5 Make strong incumbent trigger validation phase 2026-04-28 20:54:05 +08:00
68cdaf56a8 Summarize qwen235b decode harness result 2026-04-28 20:36:17 +08:00
f982395aad Record qwen235b decode harness launch 2026-04-28 07:02:13 +08:00
c9089cf4f0 Ignore non-SLO probe bookkeeping in bottleneck diagnosis 2026-04-28 06:58:38 +08:00
a9943e0240 Use probe sequence bottlenecks in harness 2026-04-28 06:57:45 +08:00
39aa47fbf1 Add generic decode-only harness guidance 2026-04-28 06:46:18 +08:00
71902b9fc2 Record qwen235b harness convergence test 2026-04-27 18:59:25 +08:00
bc884f6701 Document AITuner harness behavior 2026-04-27 16:34:19 +08:00
a962781b6c Document qwen27b harness convergence curve 2026-04-26 01:32:18 +08:00
440f5b491b Record plateau guard verification 2026-04-25 18:50:23 +08:00
6bac389aae Add infeasible plateau guard to harness 2026-04-25 18:49:23 +08:00
6c04b9dbbc Evaluate baseline before LLM tuning 2026-04-25 17:14:05 +08:00