Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6 iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and crashed the engine. Refined conclusion (matches paper §7.3): a strong model can sometimes find the right knob unaided, so the harness's value is reliability + speed + stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4, no regression. Naive: 3x slower at best, no stop, failed at worst. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Harness vs naive (use_harness on/off) — convergence ablation — 2026-06-16/17
Controlled ablation of the paper's "harness" (domain-knowledge knob-family steering):
the same agentic loop with llm.use_harness=true vs false (= the paper's naive
agentic tuner: free-form LLM proposals, no Harnesses: prompt section, no
deterministic guided proposals, no Stop-B validator/veto). Same workload, model, SLO,
substrate — the only difference is use_harness (configs
dash0_qwen27b_ablation_harness_on.json / ..._naive_off.json, verified to differ
only in that flag + study_id).
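The "differ only in that flag + study_id" check can be sketched as a recursive diff over the two parsed configs. This is a minimal illustration, not the project's actual verification script: `differing_keys` is a hypothetical helper, and the key layout (a top-level `study_id`, a nested `llm.use_harness`) is assumed from the dotted names used in this note.

```python
def differing_keys(a: dict, b: dict, prefix: str = "") -> set[str]:
    """Return the dotted paths of every leaf where two configs differ."""
    diffs = set()
    for key in a.keys() | b.keys():
        path = f"{prefix}{key}"
        if key not in a or key not in b:
            diffs.add(path)  # key present in only one config
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            diffs |= differing_keys(a[key], b[key], prefix=f"{path}.")
        elif a[key] != b[key]:
            diffs.add(path)
    return diffs


# Hypothetical stand-ins for the two ablation configs (assumed layout).
harness_on = {
    "study_id": "ablation_harness_on",
    "llm": {"use_harness": True},
    "search": {"high": 0.25},
}
naive_off = {
    "study_id": "ablation_naive_off",
    "llm": {"use_harness": False},
    "search": {"high": 0.25},
}

allowed = {"study_id", "llm.use_harness"}
unexpected = differing_keys(harness_on, naive_off) - allowed
print("unexpected differences:", sorted(unexpected))
```

For the real files, the two dicts would come from `json.load` on each config; an empty `unexpected` set means the arms differ only in the intended keys.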
- Model: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (dash0 and dash1 share the cpfs mount).
- Workload: chat 0–8k, length-aware TTFT SLO (4s + L_in/8k) + TPOT ≤ 50 ms, pass ≥ 95%.
- Substrate (process comparison, not absolute peak-rate):
replay_time_scale=0.5, completion_tokens_override=128, Stop-A on, search.high=0.25, 6 probes, max-trials 6, --skip-baseline (the low-capacity TP1 auto-baseline is infeasible under this SLO+compression and would trip baseline_all_infeasible; skipping it lets both loops climb from their first proposal).
- This measures the tuning process (which knob family, convergence speed, stop discipline), not a validated peak-rate.
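The feasibility rule from the workload bullet can be sketched as follows. This is an illustration under stated assumptions, not the harness's own checker: it assumes the TTFT budget is in seconds, that "8k" means 8000 input tokens per extra second of budget, and that a request passes only if it meets both the TTFT and the TPOT limit. The `Request` type and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    input_tokens: int  # prompt length L_in
    ttft_s: float      # measured time to first token, seconds
    tpot_ms: float     # measured time per output token, milliseconds


def ttft_budget_s(input_tokens: int, base_s: float = 4.0,
                  tokens_per_extra_second: int = 8000) -> float:
    """Length-aware TTFT budget: 4 s plus L_in / 8k (units assumed)."""
    return base_s + input_tokens / tokens_per_extra_second


def request_passes(r: Request, tpot_limit_ms: float = 50.0) -> bool:
    """Assumption: a request must meet both limits to count as a pass."""
    return r.ttft_s <= ttft_budget_s(r.input_tokens) and r.tpot_ms <= tpot_limit_ms


def is_feasible(requests: list[Request], min_pass_rate: float = 0.95) -> bool:
    """A trial is feasible when at least 95% of its requests pass."""
    if not requests:
        return False
    passed = sum(request_passes(r) for r in requests)
    return passed / len(requests) >= min_pass_rate


good = Request(input_tokens=4000, ttft_s=3.0, tpot_ms=40.0)
slow_decode = Request(input_tokens=4000, ttft_s=3.0, tpot_ms=62.0)
print(is_feasible([good] * 19 + [slow_decode]))      # 19 of 20 pass
print(is_feasible([good] * 18 + [slow_decode] * 2))  # 18 of 20 pass
```

Under these assumptions the budget runs from 4 s for an empty prompt to 5 s at 8000 input tokens.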
Harness ON — converged in 2 iterations, then stopped
| iter | proposer | config | per_gpu | outcome |
|---|---|---|---|---|
| 1 | LLM (harness-guided) | TP2 | 0.247 | feasible |
| 2 | harness (deterministic) | TP4 | 0.340 | feasible — incumbent |
| 3 | harness | TP4 + chunked-prefill + mbt=16384 | 0.333 | worse → rejected |
| (—) | LLM | should_stop | — | VETOED ("decode TPOT still the bottleneck; adjacent probes weak") |
| 4 | LLM | TP2 + DP2 | 0.194 | worse → rejected |
| (—) | LLM | should_stop | STOP | honored after veto budget |
Best TP4 @ 0.340; iters-to-best = 2; ran 4 trials then stopped (Stop-B + one veto of a premature stop); no regression.
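The stop behavior in the table (one vetoed should_stop, then an honored one) can be sketched as a validator with a small veto budget. This is a hedged illustration only: the real Stop-B validator's rules are not given here, so `StopValidator`, its `veto_budget` of 1, and the `evidence_is_weak` flag are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class StopValidator:
    veto_budget: int = 1  # assumption: one veto allowed, as observed in this run
    vetoes_used: int = 0

    def review(self, evidence_is_weak: bool) -> str:
        """Decide on an LLM should_stop proposal: "veto" or "stop"."""
        if evidence_is_weak and self.vetoes_used < self.veto_budget:
            self.vetoes_used += 1
            return "veto"  # force at least one more trial
        return "stop"      # budget exhausted or evidence sufficient


validator = StopValidator()
# After trial 3: the validator judges the stop premature.
first = validator.review(evidence_is_weak=True)
# After trial 4: the veto budget is spent, so the stop is honored.
second = validator.review(evidence_is_weak=True)
print(first, second)
```

Capping the vetoes keeps the loop self-terminating: the validator can delay a stop it distrusts, but it cannot hold the run open indefinitely.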
Naive OFF — nondeterministic; reaches the optimum slowly at best, fails at worst
The naive (free-form) gpt-5.4 loop behaved very differently across two runs — it has
no harness steering and no stop logic:
Run A (dash0, interrupted by an outage at trial-5): kept TP=1 the whole time and
cycled runtime knobs (max-num-batched-tokens 16k→65k, max-num-seqs, caching). All
trials were infeasible (same tpot>50 + ttft>budget), and trial-4 crashed the engine
(OOM at mbt=65536). It found no feasible config in 5 trials and never tried raising TP.
Run B (dash1, full budget):
| iter | config | per_gpu | note |
|---|---|---|---|
| 1 | TP2 | 0.247 | feasible |
| 2 | TP2 + max-num-seqs=32 | 0.218 | worse |
| 3 | TP2 + mbt=12288 | 0.218 | worse |
| 4 | TP2 (re-proposal) | 0.218 | no gain |
| 5 | TP2 + gpu-mem-util=0.85 | 0.218 | worse |
| 6 | TP4 | 0.340 | reaches the optimum — at the last trial |
Best TP4 @ 0.340 — the same optimum as the harness — but iters-to-best = 6, it used the entire budget with no early stop, and trials 2–5 were detours (TP2 + runtime tweaks, all worse than trial-1) before it stumbled onto TP4.
Comparison
| | Harness ON | Naive OFF (B, dash1) | Naive OFF (A, dash0) |
|---|---|---|---|
| best per-GPU | 0.340 (TP4) | 0.340 (TP4) | none (failed) |
| iters-to-best | 2 | 6 | — |
| trials used | 4 (stopped) | 6 (full budget, no stop) | 5 (interrupted) |
| stopped early? | yes (Stop-B + veto) | no | — |
| wasted trials | 2 (post-best refinements) | 4 (TP2+runtime detours) | 5 (runtime-only, infeasible) |
| path to optimum | direct (TP2→TP4) | slow (TP2→runtime detour→TP4) | wrong family (runtime on TP1) |
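The iters-to-best and wasted-trials rows can be recomputed from the per-trial tables with a short sketch. It assumes a "wasted" trial is any trial that did not set a new incumbent, which is inferred from the table rather than stated as the harness's definition; `summarize` is a hypothetical helper, and infeasible or crashed trials are represented as `None`.

```python
from typing import Optional


def summarize(per_gpu: list[Optional[float]]) -> dict:
    """Summarize one run from its per-trial per-GPU values (None = infeasible)."""
    best = None
    iters_to_best = None
    wasted = 0
    for i, value in enumerate(per_gpu, start=1):
        if value is not None and (best is None or value > best):
            best, iters_to_best = value, i  # new incumbent
        else:
            wasted += 1  # assumption: no new incumbent means a wasted trial
    return {
        "best": best,
        "iters_to_best": iters_to_best,
        "trials_used": len(per_gpu),
        "wasted": wasted,
    }


# Per-trial values copied from the tables above.
harness_on = [0.247, 0.340, 0.333, 0.194]
naive_b = [0.247, 0.218, 0.218, 0.218, 0.218, 0.340]
naive_a = [None] * 5  # all five trials infeasible

for name, run in [("harness", harness_on), ("naive B", naive_b), ("naive A", naive_a)]:
    print(name, summarize(run))
```

With this definition the sketch reproduces the table: 2 wasted trials for the harness, 4 for naive run B, and 5 for naive run A.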
Interpretation (honest)
The bottleneck is compute (decode TPOT + prefill queueing), which only a compute-adding knob (tensor parallelism) fixes. Findings:
- A strong frontier model can sometimes find the right knob unaided — naive run B eventually reached TP4 = 0.34, the same optimum as the harness. This matches the paper's own caveat (§7.3): stronger models reduce, but do not remove, the need for structured guidance. So the harness's value is not "naive always fails."
- The harness's value is reliability, speed, and stop discipline. With the harness: converged in 2 iters and stopped at 4 (recognized convergence; vetoed a premature stop). Naive: 3× slower to the same answer (6 iters), never stopped (burned the full budget on detours), and in run A failed outright — never tried TP, found nothing, crashed the engine. Naive is nondeterministic and unreliable; the harness is fast, monotone (no regression), and self-terminating.
- This reproduces the paper's Figure-18 story: the harness converges in a few iterations and stops, while the naive agentic tuner wastes the budget (and can fail to converge entirely).
Caveats
- Compressed substrate (scale=0.5, out=128) → per-GPU numbers are process comparators, not validated peak-rates; the convergence behavior is the result. (The TP4 optimum reproduced at 0.340 across the harness run and naive run B, a useful consistency check.)
- One run per arm per host; naive is nondeterministic (runs A and B differ markedly), which is itself part of the finding. The harness arm's deterministic guided proposal (TP4 at iter 2) and validator veto are reproducible.
- Infra notes: dash0 (LLM-gateway reachable) went down mid-experiment; dash1 shares the
cpfs and ran the completion. The codex config.toml points at a dash0-local proxy (127.0.0.1:11235); on dash1 the LLM endpoint must be reached directly (set empty *_proxy env); see scripts/run_naive_d1.sh.