aituner/docs/harness-ablation/harness-vs-naive-20260616.md
Gahow Wang 816765071f Complete harness-vs-naive ablation: harness 3x faster + stops; naive nondeterministic
Full naive run (dash1) reached the same TP4=0.34 optimum as the harness but took 6
iters (vs 2), never stopped (full budget), and spent trials 2-5 on worse TP2+runtime
detours. The other naive run (dash0) wandered runtime-only on TP1, found nothing, and
crashed the engine. Refined conclusion (matches paper §7.3): a strong model can
sometimes find the right knob unaided, so the harness's value is reliability + speed +
stop discipline, not that naive always fails. Harness: 2 iters-to-best, stopped at 4,
no regression. Naive: 3x slower at best, no stop, failed at worst.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:03:26 +08:00


Harness vs naive (use_harness on/off) — convergence ablation — 2026-06-16/17

Controlled ablation of the paper's "harness" (domain-knowledge knob-family steering): the same agentic loop with llm.use_harness=true vs false (= the paper's naive agentic tuner: free-form LLM proposals, no Harnesses: prompt section, no deterministic guided proposals, no Stop-B validator/veto). Same workload, model, SLO, substrate — the only difference is use_harness (configs dash0_qwen27b_ablation_harness_on.json / ..._naive_off.json, verified to differ only in that flag + study_id).
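The "differ only in that flag + study_id" check can be reproduced with a small diff over the two config files. The sketch below is illustrative only: the nested layout (`llm.use_harness`, `search.high`) and the `study_id` values are assumptions for the example, not the real contents of the ablation configs.

```python
from typing import Any

_MISSING = object()  # sentinel for keys present in only one config


def flatten(cfg: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. llm.use_harness."""
    flat: dict[str, Any] = {}
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def differing_keys(a: dict[str, Any], b: dict[str, Any]) -> set[str]:
    """Return the dotted keys whose values differ between two configs."""
    fa, fb = flatten(a), flatten(b)
    return {
        k for k in fa.keys() | fb.keys()
        if fa.get(k, _MISSING) != fb.get(k, _MISSING)
    }


# Hypothetical stand-ins for the two ablation configs (in practice: json.load).
harness_on = {"study_id": "harness_on", "llm": {"use_harness": True},
              "search": {"high": 0.25}}
naive_off = {"study_id": "naive_off", "llm": {"use_harness": False},
             "search": {"high": 0.25}}

diff = differing_keys(harness_on, naive_off)
assert diff == {"llm.use_harness", "study_id"}, f"unexpected drift: {diff}"
```

Failing loudly on any extra key keeps the ablation honest: a stray difference in, say, the search bounds would confound the comparison.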

  • Model: dense Qwen3.5-27B, vLLM 0.11.1, 8×H20 (dash0 and dash1 share the cpfs mount).
  • Workload: chat 0-8k, length-aware TTFT SLO (4s + L_in/8k) + TPOT ≤ 50 ms, pass ≥ 95%.
  • Substrate (process comparison, not absolute peak-rate): replay_time_scale=0.5, completion_tokens_override=128, Stop-A on, search.high=0.25, 6 probes, max-trials 6, --skip-baseline (the low-capacity TP1 auto-baseline is infeasible under this SLO+compression and would trip baseline_all_infeasible; skipping it lets both loops climb from their first proposal).
  • This measures the tuning process (which knob family, convergence speed, stop discipline), not a validated peak-rate.
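As a rough illustration of the SLO in the workload bullet, the feasibility check can be written as below. This is a sketch under stated assumptions: "8k" is read as 8000 input tokens per extra second of TTFT budget, and the `Request` record and function names are hypothetical, not the tuner's actual types.

```python
from dataclasses import dataclass


@dataclass
class Request:
    input_tokens: int   # L_in
    ttft_s: float       # measured time to first token, seconds
    tpot_ms: float      # measured time per output token, milliseconds


def ttft_budget_s(input_tokens: int, base_s: float = 4.0,
                  tokens_per_extra_second: float = 8000.0) -> float:
    """Length-aware TTFT budget: 4 s + L_in / 8k (assumed to mean 8000 tokens)."""
    return base_s + input_tokens / tokens_per_extra_second


def request_passes(req: Request, tpot_limit_ms: float = 50.0) -> bool:
    """A request passes only if it meets both the TTFT and the TPOT limit."""
    return (req.ttft_s <= ttft_budget_s(req.input_tokens)
            and req.tpot_ms <= tpot_limit_ms)


def is_feasible(requests: list[Request], min_pass_rate: float = 0.95) -> bool:
    """A probe is feasible if at least 95% of its requests pass the SLO."""
    if not requests:
        return False
    passed = sum(request_passes(r) for r in requests)
    return passed / len(requests) >= min_pass_rate
```

Under this reading, a 4000-token prompt gets a 4.5 s TTFT budget, so the same absolute latency can pass for a long prompt and fail for a short one.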

Harness ON — converged in 2 iterations, then stopped

| iter | proposer | config | per_gpu | outcome |
|---|---|---|---|---|
| 1 | LLM (harness-guided) | TP2 | 0.247 | feasible |
| 2 | harness (deterministic) | TP4 | 0.340 | feasible (incumbent) |
| 3 | harness | TP4 + chunked-prefill + mbt=16384 | 0.333 | worse → rejected |
| (-) | LLM | should_stop | | VETOED ("decode TPOT still the bottleneck; adjacent probes weak") |
| 4 | LLM | TP2 + DP2 | 0.194 | worse → rejected |
| (-) | LLM | should_stop | | STOP honored after veto budget |

Best TP4 @ 0.340; iters-to-best = 2; ran 4 trials then stopped (Stop-B + one veto of a premature stop); no regression.
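The stop behavior in this trace (first should_stop vetoed, second honored once the veto budget is spent) can be sketched as a small gate. This is an assumed reconstruction for illustration: the `StopGate` name, the validator callable, and the default budget of 1 are hypothetical and do not describe the actual Stop-B implementation.

```python
from typing import Callable


class StopGate:
    """Decide whether to honor an LLM stop request, with a limited veto budget."""

    def __init__(self, validator: Callable[[], bool], veto_budget: int = 1):
        # validator() returns True when stopping looks justified
        # (hypothetical hook; the real validator's criteria are not shown here).
        self.validator = validator
        self.vetoes_left = veto_budget

    def handle_stop_request(self) -> bool:
        """Return True if the loop should stop now, False if the stop is vetoed."""
        if self.validator():
            return True            # validator agrees: stop immediately
        if self.vetoes_left > 0:
            self.vetoes_left -= 1
            return False           # premature stop: veto and keep searching
        return True                # veto budget exhausted: honor the stop


# Mirrors the trace above: the validator objects both times, budget is 1.
gate = StopGate(validator=lambda: False, veto_budget=1)
first = gate.handle_stop_request()   # vetoed
second = gate.handle_stop_request()  # honored after the veto budget
```

Bounding the vetoes is what makes the loop self-terminating: the validator can push back on an early stop, but it cannot keep the search running indefinitely.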

Naive OFF — nondeterministic; reaches the optimum slowly at best, fails at worst

The naive (free-form) gpt-5.4 loop behaved very differently across two runs — it has no harness steering and no stop logic:

Run A (dash0, interrupted by an outage at trial-5): kept TP=1 the whole time and cycled runtime knobs (max-num-batched-tokens 16k→65k, max-num-seqs, caching). All trials infeasible (same tpot>50 + ttft>budget), trial-4 crashed the engine (OOM at mbt=65536). Found no feasible config in 5 trials — never tried raising TP.

Run B (dash1, full budget):

| iter | config | per_gpu | note |
|---|---|---|---|
| 1 | TP2 | 0.247 | feasible |
| 2 | TP2 + max-num-seqs=32 | 0.218 | worse |
| 3 | TP2 + mbt=12288 | 0.218 | worse |
| 4 | TP2 (re-proposal) | 0.218 | no gain |
| 5 | TP2 + gpu-mem-util=0.85 | 0.218 | worse |
| 6 | TP4 | 0.340 | reaches the optimum at the last trial |

Best TP4 @ 0.340, the same optimum as the harness. However, iters-to-best = 6, it used the entire budget with no early stop, and trials 2-5 were detours (TP2 + runtime tweaks, all worse than trial 1) before it stumbled onto TP4.

Comparison

| | Harness ON | Naive OFF (B, dash1) | Naive OFF (A, dash0) |
|---|---|---|---|
| best per-GPU | 0.340 (TP4) | 0.340 (TP4) | none (failed) |
| iters-to-best | 2 | 6 | - |
| trials used | 4 (stopped) | 6 (full budget, no stop) | 5 (interrupted) |
| stopped early? | yes (Stop-B + veto) | no | - |
| wasted trials | 2 (post-best refinements) | 4 (TP2+runtime detours) | 5 (runtime-only, infeasible) |
| path to optimum | direct (TP2→TP4) | slow (TP2→runtime detour→TP4) | wrong family (runtime on TP1) |
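The iters-to-best and wasted-trials rows can be recomputed from the per-trial per_gpu values in the two tables above. The sketch below assumes one consistent definition of "wasted" (a trial that did not improve the running best) and encodes infeasible trials as `None`; the helper names are hypothetical.

```python
from typing import Optional

Scores = list[Optional[float]]  # per_gpu per trial; None = infeasible or crashed


def iters_to_best(scores: Scores) -> Optional[int]:
    """1-based index of the first trial that reached the best feasible score."""
    feasible = [s for s in scores if s is not None]
    if not feasible:
        return None
    return scores.index(max(feasible)) + 1


def wasted_trials(scores: Scores) -> int:
    """Count trials that did not improve on the running best."""
    best: Optional[float] = None
    wasted = 0
    for s in scores:
        if s is not None and (best is None or s > best):
            best = s
        else:
            wasted += 1
    return wasted


harness_on = [0.247, 0.340, 0.333, 0.194]            # stopped after 4 trials
naive_b = [0.247, 0.218, 0.218, 0.218, 0.218, 0.340]  # full budget of 6
naive_a = [None] * 5                                  # all trials infeasible

for name, run in [("harness", harness_on), ("naive B", naive_b),
                  ("naive A", naive_a)]:
    print(name, iters_to_best(run), wasted_trials(run))

# Ratio behind the "3x slower" claim: iterations needed to reach the optimum.
slowdown = iters_to_best(naive_b) / iters_to_best(harness_on)
```

With this definition the counts agree with the table (2, 4, and 5 wasted trials), and the slowdown is the ratio of iters-to-best, not of wall-clock time.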

Interpretation (honest)

The bottleneck is compute (decode TPOT + prefill queueing), which only a compute-adding knob (tensor parallelism) fixes. Findings:

  1. A strong frontier model can sometimes find the right knob unaided — naive run B eventually reached TP4 = 0.34, the same optimum as the harness. This matches the paper's own caveat (§7.3): stronger models reduce, but do not remove, the need for structured guidance. So the harness's value is not "naive always fails."
  2. The harness's value is reliability, speed, and stop discipline. With the harness: converged in 2 iters and stopped at 4 (recognized convergence; vetoed a premature stop). Naive: 3× slower to the same answer (6 iters), never stopped (burned the full budget on detours), and in run A failed outright — never tried TP, found nothing, crashed the engine. Naive is nondeterministic and unreliable; the harness is fast, monotone (no regression), and self-terminating.
  3. This reproduces the paper's Figure-18 story: the harness converges in a few iterations and stops, while the naive agentic tuner wastes the budget (and can fail to converge entirely).

Caveats

  • Compressed substrate (scale=0.5, out=128) → per-GPU numbers are process comparators, not validated peak-rates; the convergence behavior is the result. (The TP4 optimum reproduced at 0.340 across the harness run and naive run B, a useful consistency check.)
  • One run per arm per host; naive is nondeterministic (runs A and B differ markedly), which is itself part of the finding. The harness arm's deterministic guided proposal (TP4 at iter 2) and validator veto are reproducible.
  • Infra notes: dash0 (LLM-gateway reachable) went down mid-experiment; dash1 shares the cpfs and ran the completion. The codex config.toml points at a dash0-local proxy (127.0.0.1:11235); on dash1 the LLM endpoint must be reached directly (set empty *_proxy env) — see scripts/run_naive_d1.sh.