Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic

Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6 iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and crashed the engine.

Refined conclusion (matches paper §7.3): a strong model can sometimes find the right knob unaided, so the harness's value is reliability + speed + stop discipline, not that naive always fails.

Harness: 2 iters-to-best, stopped at 4, no regression.
Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:03:26 +08:00
parent 97d2ddabb1
commit 816765071f
1 changed files with 69 additions and 47 deletions
--- a/docs/harness-ablation/harness-vs-naive-20260616.md
+++ b/docs/harness-ablation/harness-vs-naive-20260616.md
@@ -8,70 +8,92 @@ substrate — the only difference is `use_harness` (configs
 `dash0_qwen27b_ablation_harness_on.json` / `..._naive_off.json`, verified to differ
 only in that flag + study_id).

- Model/host: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (run on dash0; cpfs shared with dash1).
+- Model: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (dash0 and dash1 share the cpfs mount).
 - Workload: chat 0–8k, length-aware TTFT SLO (4s + L_in/8k) + TPOT ≤ 50 ms, pass ≥ 95%.
 - Substrate (process comparison, not absolute peak-rate): `replay_time_scale=0.5`,
  `completion_tokens_override=128`, Stop-A on, `search.high=0.25`, 6 probes, max-trials 6,
  **`--skip-baseline`** (the low-capacity TP1 auto-baseline is infeasible under this
  SLO+compression and would trip `baseline_all_infeasible`; skipping it lets both loops
  climb from their first proposal).
- This measures the tuning *process* (which knob family, convergence), not validated
-  peak-rate.
+- This measures the tuning *process* (which knob family, convergence speed, stop
+  discipline), not a validated peak-rate.

-## Result
-
-### Harness ON — converged to the right answer in 2 iterations
+## Harness ON — converged in 2 iterations, then stopped
 | iter | proposer | config | per_gpu | outcome |
 | --- | --- | --- | --- | --- |
 | 1 | LLM (harness-guided) | TP2 | 0.247 | feasible |
 | 2 | harness (deterministic) | **TP4** | **0.340** | feasible — incumbent |
 | 3 | harness | TP4 + chunked-prefill + mbt=16384 | 0.333 | worse → rejected |
-| (—) | LLM | `should_stop` | — | **VETOED** by validator ("decode TPOT still the bottleneck; adjacent probes weak") |
+| (—) | LLM | `should_stop` | — | **VETOED** ("decode TPOT still the bottleneck; adjacent probes weak") |
 | 4 | LLM | TP2 + DP2 | 0.194 | worse → rejected |
-| (—) | LLM | `should_stop` | STOP | honored (`llm_after_veto_budget`) |
+| (—) | LLM | `should_stop` | **STOP** | honored after veto budget |

-Incumbent **TP4 @ 0.340 req/s/GPU**; iters-to-best = 2; no regression (the two worse
-refinements were correctly not adopted); the premature LLM stop was vetoed once, then
-honored after the budget.
+Best **TP4 @ 0.340**; iters-to-best = **2**; ran **4 trials then stopped** (Stop-B +
+one veto of a premature stop); no regression.

-### Naive OFF — wandered in the wrong knob family, never converged
-| iter | config (TP never changed from 1) | outcome |
-| --- | --- | --- |
-| 1 | mbt=16384, seqs=128 | infeasible (`tpot>50`, `ttft>budget`) |
-| 2 | mbt=32768, seqs=256, prefix-cache off, chunked | infeasible (same) |
-| 3 | mbt=49152, seqs=384 | infeasible (same) |
-| 4 | mbt=65536, seqs=512 | **FAILED** — engine crash (OOM at huge mbt) |
-| 5 | mbt=57344, seqs=448 | interrupted by a dash0 outage |
+## Naive OFF — nondeterministic; reaches the optimum slowly at best, fails at worst

-Incumbent **None** — no feasible config found in 5 trials. The naive LLM kept tuning
-**runtime** knobs (batched-tokens / num-seqs / caching) and **never raised TP**.
+The naive (free-form) `gpt-5.4` loop behaved very differently across two runs — it has
+no harness steering and no stop logic:

-## Interpretation (the headline)
+**Run A (dash0, interrupted by an outage at trial-5):** kept **TP=1** the whole time and
+cycled runtime knobs (`max-num-batched-tokens` 16k→65k, `max-num-seqs`, caching). All
+trials **infeasible** (same `tpot>50` + `ttft>budget`), trial-4 **crashed the engine**
+(OOM at mbt=65536). Found **no feasible config** in 5 trials — never tried raising TP.

-The bottleneck here is **compute** (decode TPOT + prefill queueing). The harness
-diagnosed it and steered straight to the knob family that adds compute — **tensor
-parallelism** — reaching a feasible **TP4 @ 0.34 req/s/GPU in 2 iterations**, then
-correctly rejecting weaker refinements and stopping. The naive tuner spent its whole
-budget on **runtime knobs that cannot add compute**, never tried raising TP, found
-**zero** feasible configs, and even crashed the engine. This is a clean, stark
-quantification of the harness's value: **right-knob-family steering → fast convergence
-+ no regression, vs aimless runtime wandering → non-convergence.** It reproduces the
-paper's Figure-18 finding (harness converges in a few iters; the naive agentic tuner
-wastes the budget).
+**Run B (dash1, full budget):**
+| iter | config | per_gpu | note |
+| --- | --- | --- | --- |
+| 1 | TP2 | 0.247 | feasible |
+| 2 | TP2 + max-num-seqs=32 | 0.218 | worse |
+| 3 | TP2 + mbt=12288 | 0.218 | worse |
+| 4 | TP2 (re-proposal) | 0.218 | no gain |
+| 5 | TP2 + gpu-mem-util=0.85 | 0.218 | worse |
+| 6 | **TP4** | **0.340** | reaches the optimum — at the last trial |

-## Caveats / honesty
+Best **TP4 @ 0.340** — the *same* optimum as the harness — but iters-to-best = **6**,
+it used the **entire budget with no early stop**, and trials 2–5 were detours (TP2 +
+runtime tweaks, all worse than trial-1) before it stumbled onto TP4.

- Compressed substrate (scale=0.5, out=128) → the per-GPU numbers are *process*
-  comparators, not validated peak-rates; the **direction/convergence** is the result.
- The naive run was interrupted at trial-5 by a dash0 connectivity outage (not by the
-  experiment). The conclusion is already unambiguous (5 trials, never raised TP, all
-  infeasible / one crash). A confirmatory naive-to-completion was attempted on dash1
-  (shared cpfs, GPUs free) but **dash1 cannot reach the LLM gateway** (`Connection
-  refused` to the codex endpoint — dash0 has outbound access, dash1 does not), so the
-  LLM-driven loop only runs on dash0. The full-budget naive run can be completed when
-  dash0 is back; it will not change the conclusion.
- LLM nondeterminism: a single run per arm. The harness arm's deterministic guided
-  proposals (TP4 at iter 2) and validator veto are reproducible; the naive arm's exact
-  path varies but its failure mode (wrong knob family, no TP) is the expected one.
- dash0/dash1 share the cpfs mount, so the run artifacts under `.aituner/abl-*` are
-  readable/continuable from either host.
+## Comparison
+
+| | Harness ON | Naive OFF (B, dash1) | Naive OFF (A, dash0) |
+| --- | --- | --- | --- |
+| best per-GPU | 0.340 (TP4) | 0.340 (TP4) | none (failed) |
+| iters-to-best | **2** | 6 | — |
+| trials used | **4 (stopped)** | 6 (full budget, no stop) | 5 (interrupted) |
+| stopped early? | yes (Stop-B + veto) | no | — |
+| wasted trials | 2 (post-best refinements) | 4 (TP2+runtime detours) | 5 (runtime-only, infeasible) |
+| path to optimum | direct (TP2→TP4) | slow (TP2→runtime detour→TP4) | wrong family (runtime on TP1) |
+
+## Interpretation (honest)
+
+The bottleneck is **compute** (decode TPOT + prefill queueing), which only a
+compute-adding knob (**tensor parallelism**) fixes. Findings:
+
+1. **A strong frontier model can sometimes find the right knob unaided** — naive run B
+   eventually reached TP4 = 0.34, the same optimum as the harness. This matches the
+   paper's own caveat (§7.3): stronger models reduce, but do not remove, the need for
+   structured guidance. So the harness's value is **not** "naive always fails."
+2. **The harness's value is reliability, speed, and stop discipline.** With the harness:
+   converged in **2 iters** and **stopped at 4** (recognized convergence; vetoed a
+   premature stop). Naive: **3× slower** to the same answer (6 iters), **never stopped**
+   (burned the full budget on detours), and in run A **failed outright**: it never raised
+   TP, found nothing, and crashed the engine. Naive is **nondeterministic and unreliable**;
+   the harness is fast, monotone (no regression), and self-terminating.
+3. This reproduces the paper's Figure-18 story: the harness converges in a few
+   iterations and stops, while the naive agentic tuner wastes the budget (and can fail
+   to converge entirely).
+
+## Caveats
+
+- Compressed substrate (scale=0.5, out=128) → per-GPU numbers are *process* comparators,
+  not validated peak-rates; the convergence behavior is the result. (The TP4 optimum
+  reproduced at 0.340 across the harness run and naive run B, a useful consistency check.)
+- One harness run and two naive runs (A on dash0, B on dash1); naive is nondeterministic
+  (runs A and B differ markedly), which is itself part of the finding. The harness arm's
+  deterministic guided proposal (TP4 at iter 2) and validator veto are reproducible.
+- Infra notes: dash0 (LLM-gateway reachable) went down mid-experiment; dash1 shares the
+  cpfs mount and ran the full-budget naive run (B). The codex `config.toml` points at a
+  dash0-local proxy (`127.0.0.1:11235`); on dash1 the LLM endpoint must be reached
+  directly (set the `*_proxy` env vars to empty); see `scripts/run_naive_d1.sh`.
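
The comparison-table metrics in this change (best per-GPU rate, iters-to-best, trials used, wasted trials) can be recomputed from the per-trial rates listed in the run tables. The following is a minimal Python sketch and not part of the tuner: the function names `non_improving` and `summarize` are hypothetical, infeasible, crashed, or interrupted trials are represented as `None`, and "wasted" is assumed to mean "did not improve the incumbent", a definition that happens to match all three columns of the table.

```python
from typing import Optional, Sequence

Rate = Optional[float]  # per-GPU rate, or None for infeasible / crashed / interrupted


def non_improving(per_gpu: Sequence[Rate]) -> int:
    """Count trials that did not raise the incumbent (assumed meaning of 'wasted')."""
    incumbent: Rate = None
    wasted = 0
    for rate in per_gpu:
        if rate is not None and (incumbent is None or rate > incumbent):
            incumbent = rate
        else:
            wasted += 1
    return wasted


def summarize(per_gpu: Sequence[Rate]) -> dict:
    """Recompute the comparison-table row for one run."""
    feasible = [r for r in per_gpu if r is not None]
    best = max(feasible) if feasible else None
    # 1-based index of the first trial that attains the best rate.
    iters_to_best = per_gpu.index(best) + 1 if best is not None else None
    return {
        "best": best,
        "iters_to_best": iters_to_best,
        "trials_used": len(per_gpu),
        "wasted": non_improving(per_gpu),
    }


# Per-trial rates copied from the run tables in the report.
harness = [0.247, 0.340, 0.333, 0.194]
naive_b = [0.247, 0.218, 0.218, 0.218, 0.218, 0.340]
naive_a = [None, None, None, None, None]  # all infeasible, crashed, or interrupted

for name, run in [("harness", harness), ("naive B", naive_b), ("naive A", naive_a)]:
    print(name, summarize(run))

# Ratio behind the "3x slower" claim: naive B iters-to-best over harness iters-to-best.
slowdown = summarize(naive_b)["iters_to_best"] / summarize(harness)["iters_to_best"]
print("slowdown:", slowdown)
```

Keeping the rates as plain lists makes it easy to append further runs per arm, which would help quantify the naive arm's run-to-run variance.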