Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic
Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6 iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and crashed the engine.

Refined conclusion (matches paper §7.3): a strong model can sometimes find the right knob unaided, so the harness's value is reliability + speed + stop discipline, not that naive always fails.

Harness: 2 iters-to-best, stopped at 4, no regression.
Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
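The headline numbers in this message (iters-to-best, wasted trials, the 3x ratio) can be recomputed from the per-trial `per_gpu` values in the tables of the diff. The sketch below is only an illustration: `summarize_run` is a hypothetical helper, not part of the tuner, and it assumes a trial counts as "wasted" when it does not strictly improve the running best. Infeasible, crashed, or interrupted trials are represented as `None`.

```python
from typing import Dict, List, Optional


def summarize_run(per_gpu: List[Optional[float]]) -> Dict[str, object]:
    """Summarize one tuning run from its per-trial per-GPU rate.

    A value of None marks a trial with no feasible measurement
    (infeasible, crashed, or interrupted).
    """
    best: Optional[float] = None
    iters_to_best: Optional[int] = None
    wasted = 0
    for trial, value in enumerate(per_gpu, start=1):
        if value is not None and (best is None or value > best):
            # This trial became the new incumbent.
            best, iters_to_best = value, trial
        else:
            # Assumption: any trial that does not improve the incumbent is wasted.
            wasted += 1
    return {
        "best": best,
        "iters_to_best": iters_to_best,
        "trials": len(per_gpu),
        "wasted": wasted,
    }


# Per-trial values copied from the tables in the diff.
HARNESS = [0.247, 0.340, 0.333, 0.194]
NAIVE_B = [0.247, 0.218, 0.218, 0.218, 0.218, 0.340]
NAIVE_A = [None, None, None, None, None]  # no feasible trial in 5 attempts

if __name__ == "__main__":
    h, b, a = (summarize_run(r) for r in (HARNESS, NAIVE_B, NAIVE_A))
    print("harness:", h)
    print("naive B:", b)
    print("naive A:", a)
    print("naive B / harness iters-to-best:", b["iters_to_best"] / h["iters_to_best"])
```

Under that definition the harness run has 2 wasted trials, naive run B has 4, and naive run A has 5, which agrees with the comparison table in the diff.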
@@ -8,70 +8,92 @@ substrate — the only difference is `use_harness` (configs
 `dash0_qwen27b_ablation_harness_on.json` / `..._naive_off.json`, verified to differ
 only in that flag + study_id).

-- Model/host: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (run on dash0; cpfs shared with dash1).
+- Model: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (dash0 and dash1 share the cpfs mount).
 - Workload: chat 0–8k, length-aware TTFT SLO (4s + L_in/8k) + TPOT ≤ 50 ms, pass ≥ 95%.
 - Substrate (process comparison, not absolute peak-rate): `replay_time_scale=0.5`,
 `completion_tokens_override=128`, Stop-A on, `search.high=0.25`, 6 probes, max-trials 6,
 **`--skip-baseline`** (the low-capacity TP1 auto-baseline is infeasible under this
 SLO+compression and would trip `baseline_all_infeasible`; skipping it lets both loops
 climb from their first proposal).
-- This measures the tuning *process* (which knob family, convergence), not validated
-peak-rate.
+- This measures the tuning *process* (which knob family, convergence speed, stop
+discipline), not a validated peak-rate.

 ## Result

-### Harness ON — converged to the right answer in 2 iterations
+## Harness ON — converged in 2 iterations, then stopped
 | iter | proposer | config | per_gpu | outcome |
 | --- | --- | --- | --- | --- |
 | 1 | LLM (harness-guided) | TP2 | 0.247 | feasible |
 | 2 | harness (deterministic) | **TP4** | **0.340** | feasible — incumbent |
 | 3 | harness | TP4 + chunked-prefill + mbt=16384 | 0.333 | worse → rejected |
-| (—) | LLM | `should_stop` | — | **VETOED** by validator ("decode TPOT still the bottleneck; adjacent probes weak") |
+| (—) | LLM | `should_stop` | — | **VETOED** ("decode TPOT still the bottleneck; adjacent probes weak") |
 | 4 | LLM | TP2 + DP2 | 0.194 | worse → rejected |
-| (—) | LLM | `should_stop` | STOP | honored (`llm_after_veto_budget`) |
+| (—) | LLM | `should_stop` | **STOP** | honored after veto budget |

-Incumbent **TP4 @ 0.340 req/s/GPU**; iters-to-best = 2; no regression (the two worse
-refinements were correctly not adopted); the premature LLM stop was vetoed once, then
-honored after the budget.
+Best **TP4 @ 0.340**; iters-to-best = **2**; ran **4 trials then stopped** (Stop-B +
+one veto of a premature stop); no regression.

-### Naive OFF — wandered in the wrong knob family, never converged
-| iter | config (TP never changed from 1) | outcome |
-| --- | --- | --- |
-| 1 | mbt=16384, seqs=128 | infeasible (`tpot>50`, `ttft>budget`) |
-| 2 | mbt=32768, seqs=256, prefix-cache off, chunked | infeasible (same) |
-| 3 | mbt=49152, seqs=384 | infeasible (same) |
-| 4 | mbt=65536, seqs=512 | **FAILED** — engine crash (OOM at huge mbt) |
-| 5 | mbt=57344, seqs=448 | interrupted by a dash0 outage |
+## Naive OFF — nondeterministic; reaches the optimum slowly at best, fails at worst

-Incumbent **None** — no feasible config found in 5 trials. The naive LLM kept tuning
-**runtime** knobs (batched-tokens / num-seqs / caching) and **never raised TP**.
+The naive (free-form) `gpt-5.4` loop behaved very differently across two runs — it has
+no harness steering and no stop logic:

-## Interpretation (the headline)
+**Run A (dash0, interrupted by an outage at trial-5):** kept **TP=1** the whole time and
+cycled runtime knobs (`max-num-batched-tokens` 16k→65k, `max-num-seqs`, caching). All
+trials **infeasible** (same `tpot>50` + `ttft>budget`), trial-4 **crashed the engine**
+(OOM at mbt=65536). Found **no feasible config** in 5 trials — never tried raising TP.

-The bottleneck here is **compute** (decode TPOT + prefill queueing). The harness
-diagnosed it and steered straight to the knob family that adds compute — **tensor
-parallelism** — reaching a feasible **TP4 @ 0.34 req/s/GPU in 2 iterations**, then
-correctly rejecting weaker refinements and stopping. The naive tuner spent its whole
-budget on **runtime knobs that cannot add compute**, never tried raising TP, found
-**zero** feasible configs, and even crashed the engine. This is a clean, stark
-quantification of the harness's value: **right-knob-family steering → fast convergence
-+ no regression, vs aimless runtime wandering → non-convergence.** It reproduces the
-paper's Figure-18 finding (harness converges in a few iters; the naive agentic tuner
-wastes the budget).
+**Run B (dash1, full budget):**
+| iter | config | per_gpu | note |
+| --- | --- | --- | --- |
+| 1 | TP2 | 0.247 | feasible |
+| 2 | TP2 + max-num-seqs=32 | 0.218 | worse |
+| 3 | TP2 + mbt=12288 | 0.218 | worse |
+| 4 | TP2 (re-proposal) | 0.218 | no gain |
+| 5 | TP2 + gpu-mem-util=0.85 | 0.218 | worse |
+| 6 | **TP4** | **0.340** | reaches the optimum — at the last trial |

-## Caveats / honesty
+Best **TP4 @ 0.340** — the *same* optimum as the harness — but iters-to-best = **6**,
+it used the **entire budget with no early stop**, and trials 2–5 were detours (TP2 +
+runtime tweaks, all worse than trial-1) before it stumbled onto TP4.

-- Compressed substrate (scale=0.5, out=128) → the per-GPU numbers are *process*
-comparators, not validated peak-rates; the **direction/convergence** is the result.
-- The naive run was interrupted at trial-5 by a dash0 connectivity outage (not by the
-experiment). The conclusion is already unambiguous (5 trials, never raised TP, all
-infeasible / one crash). A confirmatory naive-to-completion was attempted on dash1
-(shared cpfs, GPUs free) but **dash1 cannot reach the LLM gateway** (`Connection
-refused` to the codex endpoint — dash0 has outbound access, dash1 does not), so the
-LLM-driven loop only runs on dash0. The full-budget naive run can be completed when
-dash0 is back; it will not change the conclusion.
-- LLM nondeterminism: a single run per arm. The harness arm's deterministic guided
-proposals (TP4 at iter 2) and validator veto are reproducible; the naive arm's exact
-path varies but its failure mode (wrong knob family, no TP) is the expected one.
-- dash0/dash1 share the cpfs mount, so the run artifacts under `.aituner/abl-*` are
-readable/continuable from either host.
+## Comparison
+
+| | Harness ON | Naive OFF (B, dash1) | Naive OFF (A, dash0) |
+| --- | --- | --- | --- |
+| best per-GPU | 0.340 (TP4) | 0.340 (TP4) | none (failed) |
+| iters-to-best | **2** | 6 | — |
+| trials used | **4 (stopped)** | 6 (full budget, no stop) | 5 (interrupted) |
+| stopped early? | yes (Stop-B + veto) | no | — |
+| wasted trials | 2 (post-best refinements) | 4 (TP2+runtime detours) | 5 (runtime-only, infeasible) |
+| path to optimum | direct (TP2→TP4) | slow (TP2→runtime detour→TP4) | wrong family (runtime on TP1) |
+
+## Interpretation (honest)
+
+The bottleneck is **compute** (decode TPOT + prefill queueing), which only a
+compute-adding knob (**tensor parallelism**) fixes. Findings:
+
+1. **A strong frontier model can sometimes find the right knob unaided** — naive run B
+eventually reached TP4 = 0.34, the same optimum as the harness. This matches the
+paper's own caveat (§7.3): stronger models reduce, but do not remove, the need for
+structured guidance. So the harness's value is **not** "naive always fails."
+2. **The harness's value is reliability, speed, and stop discipline.** With the harness:
+converged in **2 iters** and **stopped at 4** (recognized convergence; vetoed a
+premature stop). Naive: **3× slower** to the same answer (6 iters), **never stopped**
+(burned the full budget on detours), and in run A **failed outright** — never tried
+TP, found nothing, crashed the engine. Naive is **nondeterministic and unreliable**;
+the harness is fast, monotone (no regression), and self-terminating.
+3. This reproduces the paper's Figure-18 story: the harness converges in a few
+iterations and stops, while the naive agentic tuner wastes the budget (and can fail
+to converge entirely).
+
+## Caveats
+
+- Compressed substrate (scale=0.5, out=128) → per-GPU numbers are *process* comparators,
+not validated peak-rates; the convergence behavior is the result. (The TP4 optimum
+reproduced at 0.340 across the harness run and naive run B, a useful consistency check.)
+- One run per arm per host; naive is nondeterministic (runs A and B differ markedly),
+which is itself part of the finding. The harness arm's deterministic guided proposal
+(TP4 at iter 2) and validator veto are reproducible.
+- Infra notes: dash0 (LLM-gateway reachable) went down mid-experiment; dash1 shares the
+cpfs and ran the completion. The codex `config.toml` points at a dash0-local proxy
+(`127.0.0.1:11235`); on dash1 the LLM endpoint must be reached directly (set empty
+`*_proxy` env) — see `scripts/run_naive_d1.sh`.
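The workload line in the diff states the feasibility rule compactly: a length-aware TTFT SLO of `4s + L_in/8k`, TPOT ≤ 50 ms, and a pass rate of at least 95%. A minimal sketch of one possible reading follows. It assumes that `L_in/8k` means one extra second of TTFT budget per 8000 input tokens (the source does not say whether "8k" is 8000 or 8192) and that a request passes only if it meets both limits. `Request`, `ttft_budget_s`, `request_passes`, and `is_feasible` are hypothetical names, not the tuner's API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    input_tokens: int  # prompt length L_in
    ttft_s: float      # measured time to first token, in seconds
    tpot_ms: float     # measured time per output token, in milliseconds


def ttft_budget_s(input_tokens: int,
                  base_s: float = 4.0,
                  tokens_per_extra_s: float = 8000.0) -> float:
    """Length-aware TTFT budget: 4 s plus L_in / 8k (assumed to be 8000 tokens per second)."""
    return base_s + input_tokens / tokens_per_extra_s


def request_passes(r: Request, tpot_limit_ms: float = 50.0) -> bool:
    """A request passes only if it meets both the TTFT budget and the TPOT limit."""
    return r.ttft_s <= ttft_budget_s(r.input_tokens) and r.tpot_ms <= tpot_limit_ms


def is_feasible(requests: Sequence[Request], min_pass: float = 0.95) -> bool:
    """A configuration is feasible when at least min_pass of its requests pass."""
    if not requests:
        return False  # no evidence, so do not call it feasible
    passed = sum(request_passes(r) for r in requests)
    return passed / len(requests) >= min_pass


if __name__ == "__main__":
    ok = Request(input_tokens=4000, ttft_s=4.2, tpot_ms=40.0)    # budget is 4.5 s
    slow = Request(input_tokens=4000, ttft_s=4.2, tpot_ms=60.0)  # violates TPOT
    print(is_feasible([ok] * 19 + [slow]))      # 19/20 pass
    print(is_feasible([ok] * 18 + [slow] * 2))  # 18/20 pass
```

With these assumptions, a run in which 19 of 20 requests pass sits exactly on the 95% threshold and counts as feasible, while 18 of 20 does not.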
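The harness table in the diff shows the LLM's first `should_stop` being vetoed and its second being honored with the reason `llm_after_veto_budget`. The sketch below illustrates that veto-budget pattern under stated assumptions: a budget of one veto, and a validator whose objection is passed in as a boolean. `StopGate`, its method, and the reason strings other than `llm_after_veto_budget` are hypothetical and are not taken from the tuner's code.

```python
from typing import Tuple


class StopGate:
    """Let a validator veto stop requests, but only a limited number of times."""

    def __init__(self, veto_budget: int = 1) -> None:
        self.veto_budget = veto_budget  # assumed value; the source only shows one veto
        self.vetoes_used = 0

    def review(self, validator_objects: bool) -> Tuple[bool, str]:
        """Return (stop_now, reason) for one stop request from the LLM."""
        if not validator_objects:
            # Hypothetical label for the case where the validator agrees.
            return True, "llm_stop_accepted"
        if self.vetoes_used < self.veto_budget:
            self.vetoes_used += 1
            return False, "vetoed"  # hypothetical label
        # The budget is spent, so the stop is honored despite the objection.
        return True, "llm_after_veto_budget"


if __name__ == "__main__":
    gate = StopGate(veto_budget=1)
    print(gate.review(validator_objects=True))  # first stop request
    print(gate.review(validator_objects=True))  # second stop request
```

Capping the vetoes keeps the loop self-terminating: the validator can push the search past one premature stop, but it cannot hold the run open indefinitely.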