Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic

Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6
iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime
detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and
crashed the engine. Refined conclusion (matches paper §7.3): a strong model can
sometimes find the right knob unaided, so the harness's value is reliability + speed +
stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4,
no regression. Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:03:26 +08:00
parent 97d2ddabb1
commit 816765071f


@@ -8,70 +8,92 @@ substrate — the only difference is `use_harness` (configs
`dash0_qwen27b_ablation_harness_on.json` / `..._naive_off.json`, verified to differ
only in that flag + study_id).
- Model/host: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (run on dash0; cpfs shared with dash1).
- Model: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (dash0 and dash1 share the cpfs mount).
- Workload: chat 08k, length-aware TTFT SLO (4s + L_in/8k) + TPOT ≤ 50 ms, pass ≥ 95%.
- Substrate (process comparison, not absolute peak-rate): `replay_time_scale=0.5`,
`completion_tokens_override=128`, Stop-A on, `search.high=0.25`, 6 probes, max-trials 6,
**`--skip-baseline`** (the low-capacity TP1 auto-baseline is infeasible under this
SLO+compression and would trip `baseline_all_infeasible`; skipping it lets both loops
climb from their first proposal).
- This measures the tuning *process* (which knob family, convergence), not validated
peak-rate.
- This measures the tuning *process* (which knob family, convergence speed, stop
discipline), not a validated peak-rate.
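The feasibility rule in the workload bullet above can be sketched as follows. This is a minimal illustration, not the tuner's actual code: the `Request` type and function names are hypothetical, and it assumes that "8k" means 8192 tokens and that `L_in/8k` adds seconds to the 4 s base.

```python
from dataclasses import dataclass
from typing import Sequence

TPOT_LIMIT_MS = 50.0   # from the workload line: TPOT <= 50 ms
PASS_FRACTION = 0.95   # from the workload line: pass >= 95%
TOKENS_PER_8K = 8192   # ASSUMPTION: "8k" is read as 8192 tokens


@dataclass
class Request:
    """Hypothetical per-request measurement record."""
    input_tokens: int
    ttft_s: float
    tpot_ms: float


def ttft_budget_s(input_tokens: int) -> float:
    # Length-aware TTFT budget: 4 s + L_in/8k (units assumed to be seconds).
    return 4.0 + input_tokens / TOKENS_PER_8K


def request_passes(r: Request) -> bool:
    # A request passes only if it meets both the TTFT and the TPOT limit.
    return r.ttft_s <= ttft_budget_s(r.input_tokens) and r.tpot_ms <= TPOT_LIMIT_MS


def is_feasible(requests: Sequence[Request]) -> bool:
    # A probe is feasible when at least 95% of its requests pass.
    if not requests:
        return False
    passed = sum(request_passes(r) for r in requests)
    return passed / len(requests) >= PASS_FRACTION
```

Under these assumptions a 4096-token prompt gets a 4.5 s TTFT budget, so a request with `ttft_s=4.6` fails even when its TPOT is well under 50 ms.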
## Result
### Harness ON — converged to the right answer in 2 iterations
## Harness ON — converged in 2 iterations, then stopped
| iter | proposer | config | per_gpu | outcome |
| --- | --- | --- | --- | --- |
| 1 | LLM (harness-guided) | TP2 | 0.247 | feasible |
| 2 | harness (deterministic) | **TP4** | **0.340** | feasible — incumbent |
| 3 | harness | TP4 + chunked-prefill + mbt=16384 | 0.333 | worse → rejected |
| (—) | LLM | `should_stop` | — | **VETOED** by validator ("decode TPOT still the bottleneck; adjacent probes weak") |
| (—) | LLM | `should_stop` | — | **VETOED** ("decode TPOT still the bottleneck; adjacent probes weak") |
| 4 | LLM | TP2 + DP2 | 0.194 | worse → rejected |
| (—) | LLM | `should_stop` | STOP | honored (`llm_after_veto_budget`) |
| (—) | LLM | `should_stop` | **STOP** | honored after veto budget |
Incumbent **TP4 @ 0.340 req/s/GPU**; iters-to-best = 2; no regression (the two worse
refinements were correctly not adopted); the premature LLM stop was vetoed once, then
honored after the budget.
Best **TP4 @ 0.340**; iters-to-best = **2**; ran **4 trials then stopped** (Stop-B +
one veto of a premature stop); no regression.
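The veto-then-honor sequence in the table above can be pictured as a small gate with a veto budget. This is only a sketch: `StopGate`, `veto_budget`, and `bottleneck_unresolved` are hypothetical names, the budget of one is inferred from the single veto in the table, and the validator's real criteria are not shown in this note.

```python
from dataclasses import dataclass


@dataclass
class StopGate:
    """Hypothetical gate for LLM `should_stop` requests."""
    veto_budget: int = 1   # ASSUMPTION: one veto, matching the table above
    vetoes_used: int = 0

    def decide(self, bottleneck_unresolved: bool) -> str:
        # Veto a stop request while the validator still sees an open
        # bottleneck and veto budget remains; otherwise honor the stop.
        if bottleneck_unresolved and self.vetoes_used < self.veto_budget:
            self.vetoes_used += 1
            return "veto"
        return "stop"


gate = StopGate()
first = gate.decide(bottleneck_unresolved=True)    # stop requested after iter 3
second = gate.decide(bottleneck_unresolved=True)   # stop requested after iter 4
```

With a budget of one, the first request is vetoed and the second is honored, which mirrors the `llm_after_veto_budget` outcome reported above.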
### Naive OFF — wandered in the wrong knob family, never converged
| iter | config (TP never changed from 1) | outcome |
| --- | --- | --- |
| 1 | mbt=16384, seqs=128 | infeasible (`tpot>50`, `ttft>budget`) |
| 2 | mbt=32768, seqs=256, prefix-cache off, chunked | infeasible (same) |
| 3 | mbt=49152, seqs=384 | infeasible (same) |
| 4 | mbt=65536, seqs=512 | **FAILED** — engine crash (OOM at huge mbt) |
| 5 | mbt=57344, seqs=448 | interrupted by a dash0 outage |
## Naive OFF — nondeterministic; reaches the optimum slowly at best, fails at worst
Incumbent **None** — no feasible config found in 5 trials. The naive LLM kept tuning
**runtime** knobs (batched-tokens / num-seqs / caching) and **never raised TP**.
The naive (free-form) `gpt-5.4` loop behaved very differently across two runs — it has
no harness steering and no stop logic:
## Interpretation (the headline)
**Run A (dash0, interrupted by an outage at trial-5):** kept **TP=1** the whole time and
cycled runtime knobs (`max-num-batched-tokens` 16k→65k, `max-num-seqs`, caching). All
trials **infeasible** (same `tpot>50` + `ttft>budget`), trial-4 **crashed the engine**
(OOM at mbt=65536). Found **no feasible config** in 5 trials — never tried raising TP.
The bottleneck here is **compute** (decode TPOT + prefill queueing). The harness
diagnosed it and steered straight to the knob family that adds compute — **tensor
parallelism** — reaching a feasible **TP4 @ 0.34 req/s/GPU in 2 iterations**, then
correctly rejecting weaker refinements and stopping. The naive tuner spent its whole
budget on **runtime knobs that cannot add compute**, never tried raising TP, found
**zero** feasible configs, and even crashed the engine. This is a clean, stark
quantification of the harness's value: **right-knob-family steering → fast convergence
+ no regression, vs aimless runtime wandering → non-convergence.** It reproduces the
paper's Figure-18 finding (harness converges in a few iters; the naive agentic tuner
wastes the budget).
**Run B (dash1, full budget):**
| iter | config | per_gpu | note |
| --- | --- | --- | --- |
| 1 | TP2 | 0.247 | feasible |
| 2 | TP2 + max-num-seqs=32 | 0.218 | worse |
| 3 | TP2 + mbt=12288 | 0.218 | worse |
| 4 | TP2 (re-proposal) | 0.218 | no gain |
| 5 | TP2 + gpu-mem-util=0.85 | 0.218 | worse |
| 6 | **TP4** | **0.340** | reaches the optimum — at the last trial |
## Caveats / honesty
Best **TP4 @ 0.340** — the *same* optimum as the harness — but iters-to-best = **6**,
it used the **entire budget with no early stop**, and trials 2–5 were detours (TP2 +
runtime tweaks, all worse than trial-1) before it stumbled onto TP4.
- Compressed substrate (scale=0.5, out=128) → the per-GPU numbers are *process*
comparators, not validated peak-rates; the **direction/convergence** is the result.
- The naive run was interrupted at trial-5 by a dash0 connectivity outage (not by the
experiment). The conclusion is already unambiguous (5 trials, never raised TP, all
infeasible / one crash). A confirmatory naive-to-completion was attempted on dash1
(shared cpfs, GPUs free) but **dash1 cannot reach the LLM gateway** (`Connection
refused` to the codex endpoint — dash0 has outbound access, dash1 does not), so the
LLM-driven loop only runs on dash0. The full-budget naive run can be completed when
dash0 is back; it will not change the conclusion.
- LLM nondeterminism: a single run per arm. The harness arm's deterministic guided
proposals (TP4 at iter 2) and validator veto are reproducible; the naive arm's exact
path varies but its failure mode (wrong knob family, no TP) is the expected one.
- dash0/dash1 share the cpfs mount, so the run artifacts under `.aituner/abl-*` are
readable/continuable from either host.
## Comparison
| | Harness ON | Naive OFF (B, dash1) | Naive OFF (A, dash0) |
| --- | --- | --- | --- |
| best per-GPU | 0.340 (TP4) | 0.340 (TP4) | none (failed) |
| iters-to-best | **2** | 6 | — |
| trials used | **4 (stopped)** | 6 (full budget, no stop) | 5 (interrupted) |
| stopped early? | yes (Stop-B + veto) | no | — |
| wasted trials | 2 (post-best refinements) | 4 (TP2+runtime detours) | 5 (runtime-only, infeasible) |
| path to optimum | direct (TP2→TP4) | slow (TP2→runtime detour→TP4) | wrong family (runtime on TP1) |
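The comparison rows can be recomputed from the per-trial `per_gpu` values in the tables above. The sketch below assumes that a "wasted" trial is one that did not improve the incumbent, which reproduces the counts in the table. `summarize` is a hypothetical helper, and run A's infeasible, crashed, and interrupted trials are all represented as `None`.

```python
from typing import Optional, Sequence


def summarize(per_gpu: Sequence[Optional[float]]) -> dict:
    """Best rate, iteration that reached it, and trials that did not improve."""
    best: Optional[float] = None
    iters_to_best: Optional[int] = None
    wasted = 0
    for i, rate in enumerate(per_gpu, start=1):
        if rate is not None and (best is None or rate > best):
            best, iters_to_best = rate, i   # new incumbent
        else:
            wasted += 1                      # infeasible or no improvement
    return {
        "best": best,
        "iters_to_best": iters_to_best,
        "trials": len(per_gpu),
        "wasted": wasted,
    }


harness = summarize([0.247, 0.340, 0.333, 0.194])
naive_b = summarize([0.247, 0.218, 0.218, 0.218, 0.218, 0.340])
naive_a = summarize([None] * 5)   # no feasible config in 5 trials

# Ratio behind the "3x" figure: iterations to reach the same optimum.
slowdown = naive_b["iters_to_best"] / harness["iters_to_best"]
```

The "3×" figure is a ratio of iteration counts (6 vs 2), not of wall-clock time.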
## Interpretation (honest)
The bottleneck is **compute** (decode TPOT + prefill queueing), which only a
compute-adding knob (**tensor parallelism**) fixes. Findings:
1. **A strong frontier model can sometimes find the right knob unaided** — naive run B
eventually reached TP4 = 0.34, the same optimum as the harness. This matches the
paper's own caveat (§7.3): stronger models reduce, but do not remove, the need for
structured guidance. So the harness's value is **not** "naive always fails."
2. **The harness's value is reliability, speed, and stop discipline.** With the harness:
converged in **2 iters** and **stopped at 4** (recognized convergence; vetoed a
premature stop). Naive: **3× slower** to the same answer (6 iters), **never stopped**
(burned the full budget on detours), and in run A **failed outright** — never tried
TP, found nothing, crashed the engine. Naive is **nondeterministic and unreliable**;
the harness is fast, monotone (no regression), and self-terminating.
3. This reproduces the paper's Figure-18 story: the harness converges in a few
iterations and stops, while the naive agentic tuner wastes the budget (and can fail
to converge entirely).
## Caveats
- Compressed substrate (scale=0.5, out=128) → per-GPU numbers are *process* comparators,
not validated peak-rates; the convergence behavior is the result. (The TP4 optimum
reproduced at 0.340 across the harness run and naive run B, a useful consistency check.)
- One harness run and two naive runs (A on dash0, B on dash1); naive is nondeterministic
(runs A and B differ markedly), which is itself part of the finding. The harness arm's
deterministic guided proposal (TP4 at iter 2) and validator veto are reproducible.
- Infra notes: dash0 (LLM-gateway reachable) went down mid-experiment; dash1 shares the
cpfs mount and ran the full-budget naive run (B). The codex `config.toml` points at a
dash0-local proxy (`127.0.0.1:11235`); on dash1 the LLM endpoint must be reached
directly, with the `*_proxy` environment variables set to empty (see
`scripts/run_naive_d1.sh`).
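For launching the loop from Python rather than the shell script, the proxy override might look like the sketch below. The contents of `scripts/run_naive_d1.sh` are not shown here, so the variable list is an assumption about what `*_proxy` covers, and `env_with_empty_proxies` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# ASSUMPTION: the variables that "*_proxy" is meant to cover.
PROXY_VARS = (
    "http_proxy", "https_proxy", "all_proxy",
    "HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY",
)


def env_with_empty_proxies(base: Optional[Mapping[str, str]] = None) -> dict:
    """Copy an environment and set every proxy variable to the empty string."""
    env = dict(os.environ if base is None else base)
    for name in PROXY_VARS:
        env[name] = ""
    return env
```

The returned dict can be passed as the `env` argument of `subprocess.run` when starting the tuner process on dash1; the caller's own environment is left unchanged.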