From 816765071ffca9aa9196ccb3a56a052d17d5cf51 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Wed, 17 Jun 2026 13:03:26 +0800
Subject: [PATCH] Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but
took 6 iters (vs 2), never stopped (full budget), and spent trials 2-5 on
worse TP2+runtime detours. The other naive run (dash0) wandered runtime-only
on TP1, found nothing, and crashed the engine.

Refined conclusion (matches paper §7.3): a strong model can sometimes find
the right knob unaided, so the harness's value is reliability + speed + stop
discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at
4, no regression. Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8
---
 .../harness-vs-naive-20260616.md              | 116 +++++++++++-------
 1 file changed, 69 insertions(+), 47 deletions(-)

diff --git a/docs/harness-ablation/harness-vs-naive-20260616.md b/docs/harness-ablation/harness-vs-naive-20260616.md
index 695aa99..6a8dde5 100644
--- a/docs/harness-ablation/harness-vs-naive-20260616.md
+++ b/docs/harness-ablation/harness-vs-naive-20260616.md
@@ -8,70 +8,92 @@ substrate — the only difference is `use_harness` (configs
 `dash0_qwen27b_ablation_harness_on.json` / `..._naive_off.json`, verified to differ only
 in that flag + study_id).
 
-- Model/host: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (run on dash0; cpfs shared with dash1).
+- Model: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (dash0 and dash1 share the cpfs mount).
 - Workload: chat 0–8k, length-aware TTFT SLO (4s + L_in/8k) + TPOT ≤ 50 ms, pass ≥ 95%.
 - Substrate (process comparison, not absolute peak-rate): `replay_time_scale=0.5`,
   `completion_tokens_override=128`, Stop-A on, `search.high=0.25`, 6 probes, max-trials 6,
   **`--skip-baseline`** (the low-capacity TP1 auto-baseline is infeasible under this
   SLO+compression and would trip `baseline_all_infeasible`; skipping it lets both loops
   climb from their first proposal).
-- This measures the tuning *process* (which knob family, convergence), not validated
-  peak-rate.
+- This measures the tuning *process* (which knob family, convergence speed, stop
+  discipline), not a validated peak-rate.
 
-## Result
-
-### Harness ON — converged to the right answer in 2 iterations
+## Harness ON — converged in 2 iterations, then stopped
 | iter | proposer | config | per_gpu | outcome |
 | --- | --- | --- | --- | --- |
 | 1 | LLM (harness-guided) | TP2 | 0.247 | feasible |
 | 2 | harness (deterministic) | **TP4** | **0.340** | feasible — incumbent |
 | 3 | harness | TP4 + chunked-prefill + mbt=16384 | 0.333 | worse → rejected |
-| (—) | LLM | `should_stop` | — | **VETOED** by validator ("decode TPOT still the bottleneck; adjacent probes weak") |
+| (—) | LLM | `should_stop` | — | **VETOED** ("decode TPOT still the bottleneck; adjacent probes weak") |
 | 4 | LLM | TP2 + DP2 | 0.194 | worse → rejected |
-| (—) | LLM | `should_stop` | STOP | honored (`llm_after_veto_budget`) |
+| (—) | LLM | `should_stop` | **STOP** | honored after veto budget |
 
-Incumbent **TP4 @ 0.340 req/s/GPU**; iters-to-best = 2; no regression (the two worse
-refinements were correctly not adopted); the premature LLM stop was vetoed once, then
-honored after the budget.
+Best **TP4 @ 0.340**; iters-to-best = **2**; ran **4 trials then stopped** (Stop-B +
+one veto of a premature stop); no regression.
 
-### Naive OFF — wandered in the wrong knob family, never converged
-| iter | config (TP never changed from 1) | outcome |
-| --- | --- | --- |
-| 1 | mbt=16384, seqs=128 | infeasible (`tpot>50`, `ttft>budget`) |
-| 2 | mbt=32768, seqs=256, prefix-cache off, chunked | infeasible (same) |
-| 3 | mbt=49152, seqs=384 | infeasible (same) |
-| 4 | mbt=65536, seqs=512 | **FAILED** — engine crash (OOM at huge mbt) |
-| 5 | mbt=57344, seqs=448 | interrupted by a dash0 outage |
+## Naive OFF — nondeterministic; reaches the optimum slowly at best, fails at worst
 
-Incumbent **None** — no feasible config found in 5 trials. The naive LLM kept tuning
-**runtime** knobs (batched-tokens / num-seqs / caching) and **never raised TP**.
+The naive (free-form) `gpt-5.4` loop behaved very differently across two runs — it has
+no harness steering and no stop logic:
 
-## Interpretation (the headline)
+**Run A (dash0, interrupted by an outage at trial-5):** kept **TP=1** the whole time and
+cycled runtime knobs (`max-num-batched-tokens` 16k→65k, `max-num-seqs`, caching). All
+trials **infeasible** (same `tpot>50` + `ttft>budget`), trial-4 **crashed the engine**
+(OOM at mbt=65536). Found **no feasible config** in 5 trials — never tried raising TP.
 
-The bottleneck here is **compute** (decode TPOT + prefill queueing). The harness
-diagnosed it and steered straight to the knob family that adds compute — **tensor
-parallelism** — reaching a feasible **TP4 @ 0.34 req/s/GPU in 2 iterations**, then
-correctly rejecting weaker refinements and stopping. The naive tuner spent its whole
-budget on **runtime knobs that cannot add compute**, never tried raising TP, found
-**zero** feasible configs, and even crashed the engine. This is a clean, stark
-quantification of the harness's value: **right-knob-family steering → fast convergence
-+ no regression, vs aimless runtime wandering → non-convergence.** It reproduces the
-paper's Figure-18 finding (harness converges in a few iters; the naive agentic tuner
-wastes the budget).
+**Run B (dash1, full budget):**
+| iter | config | per_gpu | note |
+| --- | --- | --- | --- |
+| 1 | TP2 | 0.247 | feasible |
+| 2 | TP2 + max-num-seqs=32 | 0.218 | worse |
+| 3 | TP2 + mbt=12288 | 0.218 | worse |
+| 4 | TP2 (re-proposal) | 0.218 | no gain |
+| 5 | TP2 + gpu-mem-util=0.85 | 0.218 | worse |
+| 6 | **TP4** | **0.340** | reaches the optimum — at the last trial |
 
-## Caveats / honesty
+Best **TP4 @ 0.340** — the *same* optimum as the harness — but iters-to-best = **6**,
+it used the **entire budget with no early stop**, and trials 2–5 were detours (TP2 +
+runtime tweaks, all worse than trial-1) before it stumbled onto TP4.
 
-- Compressed substrate (scale=0.5, out=128) → the per-GPU numbers are *process*
-  comparators, not validated peak-rates; the **direction/convergence** is the result.
-- The naive run was interrupted at trial-5 by a dash0 connectivity outage (not by the
-  experiment). The conclusion is already unambiguous (5 trials, never raised TP, all
-  infeasible / one crash). A confirmatory naive-to-completion was attempted on dash1
-  (shared cpfs, GPUs free) but **dash1 cannot reach the LLM gateway** (`Connection
-  refused` to the codex endpoint — dash0 has outbound access, dash1 does not), so the
-  LLM-driven loop only runs on dash0. The full-budget naive run can be completed when
-  dash0 is back; it will not change the conclusion.
-- LLM nondeterminism: a single run per arm. The harness arm's deterministic guided
-  proposals (TP4 at iter 2) and validator veto are reproducible; the naive arm's exact
-  path varies but its failure mode (wrong knob family, no TP) is the expected one.
-- dash0/dash1 share the cpfs mount, so the run artifacts under `.aituner/abl-*` are
-  readable/continuable from either host.
+## Comparison
+
+| | Harness ON | Naive OFF (B, dash1) | Naive OFF (A, dash0) |
+| --- | --- | --- | --- |
+| best per-GPU | 0.340 (TP4) | 0.340 (TP4) | none (failed) |
+| iters-to-best | **2** | 6 | — |
+| trials used | **4 (stopped)** | 6 (full budget, no stop) | 5 (interrupted) |
+| stopped early? | yes (Stop-B + veto) | no | — |
+| wasted trials | 2 (post-best refinements) | 4 (TP2+runtime detours) | 5 (runtime-only, infeasible) |
+| path to optimum | direct (TP2→TP4) | slow (TP2→runtime detour→TP4) | wrong family (runtime on TP1) |
+
+## Interpretation (honest)
+
+The bottleneck is **compute** (decode TPOT + prefill queueing), which only a
+compute-adding knob (**tensor parallelism**) fixes. Findings:
+
+1. **A strong frontier model can sometimes find the right knob unaided** — naive run B
+   eventually reached TP4 = 0.34, the same optimum as the harness. This matches the
+   paper's own caveat (§7.3): stronger models reduce, but do not remove, the need for
+   structured guidance. So the harness's value is **not** "naive always fails."
+2. **The harness's value is reliability, speed, and stop discipline.** With the harness:
+   converged in **2 iters** and **stopped at 4** (recognized convergence; vetoed a
+   premature stop). Naive: **3× slower** to the same answer (6 iters), **never stopped**
+   (burned the full budget on detours), and in run A **failed outright** — never tried
+   TP, found nothing, crashed the engine. Naive is **nondeterministic and unreliable**;
+   the harness is fast, monotone (no regression), and self-terminating.
+3. This reproduces the paper's Figure-18 story: the harness converges in a few
+   iterations and stops, while the naive agentic tuner wastes the budget (and can fail
+   to converge entirely).
+
+## Caveats
+
+- Compressed substrate (scale=0.5, out=128) → per-GPU numbers are *process* comparators,
+  not validated peak-rates; the convergence behavior is the result. (The TP4 optimum
+  reproduced at 0.340 across the harness run and naive run B, a useful consistency check.)
+- One run per arm per host; naive is nondeterministic (runs A and B differ markedly),
+  which is itself part of the finding. The harness arm's deterministic guided proposal
+  (TP4 at iter 2) and validator veto are reproducible.
+- Infra notes: dash0 (LLM-gateway reachable) went down mid-experiment; dash1 shares the
+  cpfs and ran the completion. The codex `config.toml` points at a dash0-local proxy
+  (`127.0.0.1:11235`); on dash1 the LLM endpoint must be reached directly (set empty
+  `*_proxy` env) — see `scripts/run_naive_d1.sh`.
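
The iters-to-best, trials-used, and wasted-trial figures in the Comparison table can be rechecked from the `per_gpu` columns of the per-trial tables. The following is a minimal sketch, not part of the tuner: `summarize` and the three list names are hypothetical, `None` stands for any trial with no feasible measurement (infeasible, crashed, or interrupted), and "wasted" is assumed here to mean any trial that did not set a new best, a definition that happens to reproduce the table's counts.

```python
from typing import Optional, Sequence

# per_gpu values copied from the trial tables in this patch.
# None marks a trial with no feasible measurement (assumption of this sketch).
HARNESS_ON = [0.247, 0.340, 0.333, 0.194]
NAIVE_B = [0.247, 0.218, 0.218, 0.218, 0.218, 0.340]
NAIVE_A = [None, None, None, None, None]


def summarize(per_gpu: Sequence[Optional[float]]) -> dict:
    """Return best rate, 1-based trial index of the best, trials run, and wasted trials."""
    best: Optional[float] = None
    iters_to_best: Optional[int] = None
    wasted = 0
    for i, rate in enumerate(per_gpu, start=1):
        if rate is not None and (best is None or rate > best):
            # This trial improved on the incumbent.
            best, iters_to_best = rate, i
        else:
            # Infeasible, failed, or no better than the incumbent.
            wasted += 1
    return {
        "best": best,
        "iters_to_best": iters_to_best,
        "trials": len(per_gpu),
        "wasted": wasted,
    }


if __name__ == "__main__":
    for name, run in [("harness", HARNESS_ON), ("naive B", NAIVE_B), ("naive A", NAIVE_A)]:
        print(name, summarize(run))
    # Ratio behind the "3x" claim: naive B iters-to-best over harness iters-to-best.
    ratio = summarize(NAIVE_B)["iters_to_best"] / summarize(HARNESS_ON)["iters_to_best"]
    print("slowdown", ratio)
```

Under this definition the first feasible trial always counts as an improvement, which is why the harness arm reports 2 wasted trials rather than 3.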