
Gahow Wang a1cbab0e69 Document harness-vs-naive ablation: setup, substrate calibration, blocker

Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B:
- both configs committed and validated on dash0 (differ only in
  use_harness + study_id), LLM auth + clean engine launch confirmed;
- characterizes exactly what the harness toggles (Harnesses: prompt
  section with ranked bottleneck hypotheses + knob-family steering,
  deterministic guided/stop proposals, Stop-B validator/veto) vs naive;
- substrate calibration from a real harness-ON run: at scale=0.2 the
  180s elapsed cap fires correctly but TP1 is uniformly infeasible even
  at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a
  real baseline; comparability caveat documented.

Honest status: full two-run sweep NOT completed in-session (~5-6
GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM
teardown re-validated). Includes a precise continuation recipe and the
scripts/ablation_trajectory.py helper (validated against a prior store).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-16 20:16:27 +08:00

9.0 KiB


Harness vs naive agentic tuner — controlled ablation on dense Qwen3.5-27B — 2026-06-16

Branch main. Goal: quantify the value of the paper's harness (domain-knowledge knob-family guidance) by running the agentic tuning loop twice on the same workload, identical in every respect except llm.use_harness.

Status: SET UP AND CALIBRATED; full two-run GPU sweep NOT completed in this session. The two ablation configs are committed and validated end-to-end on dash0 (LLM auth OK, engine launches clean, Stop-A/Stop-B machinery present). A substrate-calibration finding (below) was obtained from a real harness-ON run. The full sweep was not run to completion because each run takes ~2–3 GPU-hours (8 trials, each a ~6-min engine warmup plus a multi-probe binary search) and the two runs are sequential (each may need 4 GPUs for TP4), which puts the total at ~5–6 GPU-hours, beyond this session. GPUs were left clean (all 0 MiB, no orphans). A precise continuation recipe is at the end.

What the ablation toggles (the harness mechanism, verified in code)

With use_harness=true vs false (src/aituner/llm.py, src/aituner/cli.py, src/aituner/harness.py):

| Aspect | Harness ON | Naive OFF |
| --- | --- | --- |
| Prompt `Harnesses:` section | present: ranked bottleneck hypotheses (`_rank_bottleneck_hypotheses`, weights TTFT/TPOT/queueing from probe failure counts + workload default) and per-knob-family use-when / procedure / guards with an `active_now` flag (e.g. TP family `active_now` when bottleneck ∈ {ttft_prefill, decode_tpot}) | absent: only common preamble + study/SLO/trial-history JSON |
| Deterministic guided proposal | `build_harness_guided_proposal` can emit a deterministic first validation probe | none: the LLM proposes freely every iteration |
| Stop-B authority | `_stop_authority`: an LLM `should_stop` is honored only if the deterministic validator agrees (frontier closed + no high-value candidate); else vetoed (bounded, `cli.py:350`) | LLM's own `should_stop` honored immediately (`stop_authority is None` ⇒ `authorized=True`) |
| Convergence guard | `_convergence_guard`: stop only when ≥3 completed trials are all within 2% of incumbent | not applied |

The naive prompt still states TP/DP/EP are tunable and gives full context — so the only thing removed is the harness's bottleneck-diagnosis + knob-family steering (and the deterministic guided/stop scaffolding). That is exactly the paper's "naive agentic tuner."
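The stop-side difference between the two arms can be sketched in a few lines. This is an illustrative reconstruction from the table above, not the code in src/aituner: `ValidatorState`, `stop_authorized`, and `convergence_guard` are hypothetical stand-ins for `_stop_authority` and `_convergence_guard`, and the reading that every completed trial (rather than only the most recent three) must lie within 2% of the incumbent is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ValidatorState:
    """Hypothetical summary of the deterministic validator's view."""
    frontier_closed: bool
    has_high_value_candidate: bool


def convergence_guard(trial_rates: Sequence[float], incumbent: float,
                      min_trials: int = 3, tolerance: float = 0.02) -> bool:
    """True only if >= min_trials completed trials all sit within tolerance of the incumbent."""
    if incumbent <= 0 or len(trial_rates) < min_trials:
        return False
    return all(abs(r - incumbent) / incumbent <= tolerance for r in trial_rates)


def stop_authorized(llm_should_stop: bool,
                    validator: Optional[ValidatorState]) -> bool:
    """Harness ON passes a validator; naive OFF passes None."""
    if not llm_should_stop:
        return False
    if validator is None:
        # Naive arm: the LLM's own should_stop is honored immediately.
        return True
    # Harness arm: honor the stop only if the validator agrees, else veto.
    return validator.frontier_closed and not validator.has_high_value_candidate
```

With this shape, the naive arm can stop on the first `should_stop` the LLM emits, while the harness arm keeps going until the validator also sees a closed frontier.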

Configs (committed, reproducible)

configs/examples/dash0_qwen27b_ablation_harness_on.json (study_id=dash0-qwen27b-ablation-harness-on, use_harness=true)
configs/examples/dash0_qwen27b_ablation_naive_off.json (study_id=dash0-qwen27b-ablation-naive-off, use_harness=false)

The two files differ in exactly two keys (llm.use_harness, study_id) — verified by diff of sorted JSON. Both validate on dash0 (codex/gpt-5.4 endpoint resolves, base config inherited from dash0_qwen27b_stopB_loop.json).
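One way to repeat the "differ in exactly two keys" check is to flatten both JSON documents into dotted key paths and compare them. The helper names below (`flatten`, `differing_keys`) are hypothetical, and the sample dicts are trimmed stand-ins for the real config files; since both files inherit a base config, this compares only what each file itself contains.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel for keys present on one side only


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into {"a.b": leaf}; lists and empty dicts are leaves."""
    if isinstance(obj, dict) and obj:
        flat: Dict[str, Any] = {}
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else key))
        return flat
    return {prefix: obj}


def differing_keys(a: dict, b: dict) -> List[str]:
    """Sorted dotted paths whose values differ or exist on one side only."""
    fa, fb = flatten(a), flatten(b)
    return sorted(k for k in fa.keys() | fb.keys()
                  if fa.get(k, _MISSING) != fb.get(k, _MISSING))


# Trimmed, illustrative stand-ins for the two committed configs.
harness_on = {"study_id": "dash0-qwen27b-ablation-harness-on",
              "llm": {"use_harness": True},
              "trace": {"replay_time_scale": 0.2}}
naive_off = {"study_id": "dash0-qwen27b-ablation-naive-off",
             "llm": {"use_harness": False},
             "trace": {"replay_time_scale": 0.2}}

print(differing_keys(harness_on, naive_off))
```

For the real files, load each with `json.load` and pass the two dicts to `differing_keys`; the ablation is clean only if the result is exactly `llm.use_harness` and `study_id`.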

Substrate (and the calibration finding)

The ablation measures the tuning process, not absolute peak rate, so a faster replay substrate is used (at replay_time_scale=1.0 a single TP4 trial took ~3 h; see stop-b-e2e-27b-20260616.md).

| knob | value | rationale |
| --- | --- | --- |
| `trace.replay_time_scale` | 0.2 | `arrival_s = timestamp * time_scale` (`trace.py:223`): same request set arrives in 1/5 the wall-clock → ~5× effective offered load. The brief's prescribed lever (compress time, not just cut the cap). |
| `trace.early_stop_max_elapsed_s` | 180 (from 900) | the 600 s arrival window compresses to 120 s at scale 0.2, so the inherited 900 s wall cap was ~5× too large and let overloaded probes burn ~15 min each. Scaled proportionately to the compressed time axis. |
| `trace.early_stop_max_lag_s` | 30 (from 120) | proportionate to the 120 s compressed window. |
| `search.high` | 0.25 | sampling_u binary-search upper bound |
| `search.max_probes` | 3 (from 5) | bound the binary-search step count per trial |
| `--max-trials` | 8 | iteration budget |
| Stop-A | enabled (unchanged) | converged-prefix replay truncation on for both runs |
| SLO | length-aware TTFT (4 s + L_in/8k) + TPOT ≤ 50 ms | unchanged |
| GPUs | `2,3,4,5,6,7` | GPUs 0/1 avoided |

Calibration finding (real harness-ON run, trial-0001 baseline TP1): the first binary-search probe at sampling_u=0.125 measured pass_rate = 0.0 and early-stopped on probe_elapsed_s>180.0 (probe_history.json). So:

- The 180 s elapsed cap works (cut the overloaded probe at 3 min, as intended).
- At scale 0.2, TP1 cannot serve even the lightest binary-search threshold of this 0–8k chat window: it is hopelessly TPOT/decode-bound under 5× compression (engine logs: 260 preemptions over 311 requests, 100% GPU util, ≥12 reqs always queued). The baseline incumbent therefore sits at/near the search floor, leaving large headroom for TP scaling. That is a clean ablation shape, but every TP1 probe runs the full 180 s cap (there is no feasible point to find faster).
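To see why an infeasible topology costs the full probe budget, here is a minimal bisection over sampling_u. It is a sketch under stated assumptions, not the tuner's search code: the lower bound of 0.0 is assumed (it is consistent with the first probe landing at 0.125 when `search.high` is 0.25), and `is_feasible` is a hypothetical callback standing in for a full replay probe.

```python
from typing import Callable, List, Optional, Tuple


def binary_search_u(is_feasible: Callable[[float], bool],
                    low: float = 0.0, high: float = 0.25,
                    max_probes: int = 3
                    ) -> Tuple[Optional[float], List[Tuple[float, bool]]]:
    """Bisect sampling_u; return the highest feasible u seen and the probe log."""
    best: Optional[float] = None
    probes: List[Tuple[float, bool]] = []
    for _ in range(max_probes):
        u = (low + high) / 2
        ok = is_feasible(u)
        probes.append((u, ok))
        if ok:
            best, low = u, u   # feasible: search higher load
        else:
            high = u           # infeasible: search lower load
    return best, probes


# A topology that fails every probe, as TP1 did at scale 0.2.
best, probes = binary_search_u(lambda u: False)
print(best, [u for u, _ in probes])
```

When every probe fails, the search halves downward three times and returns no feasible point, and each of those probes runs until the elapsed cap fires.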

Substrate recommendation for the rerun (carried into the continuation recipe): scale 0.2 is too aggressive. It makes the whole TP1 family uniformly infeasible, so the baseline is uninformative and each probe pays the full elapsed cap. Use replay_time_scale=0.4–0.5 (a 240–300 s arrival window) so that TP1 registers a real feasible baseline and feasible probes finish before the cap; keep the caps proportionate (early_stop_max_elapsed_s = 900 × scale, early_stop_max_lag_s = 120 × scale).
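The proportional-cap rule is simple arithmetic, sketched here so the derived values can be checked for any candidate scale. The base values (600 s window, 900 s elapsed cap, 120 s lag cap) come from this note; `scaled_substrate` and its output keys `arrival_window_s` and `load_multiplier` are hypothetical names for illustration. Note that the scale-0.2 run used a lag cap of 30 s, whereas the rule as stated gives 24 s.

```python
BASE_WINDOW_S = 600.0        # arrival window at replay_time_scale=1.0
BASE_MAX_ELAPSED_S = 900.0   # inherited early_stop_max_elapsed_s
BASE_MAX_LAG_S = 120.0       # inherited early_stop_max_lag_s


def scaled_substrate(scale: float) -> dict:
    """Derive the compressed window and proportionate caps for a time scale."""
    if not 0 < scale <= 1:
        raise ValueError("scale must be in (0, 1]")
    return {
        "replay_time_scale": scale,
        "arrival_window_s": round(BASE_WINDOW_S * scale, 3),
        "early_stop_max_elapsed_s": round(BASE_MAX_ELAPSED_S * scale, 3),
        "early_stop_max_lag_s": round(BASE_MAX_LAG_S * scale, 3),
        # Same requests in less wall-clock time: offered load grows by 1/scale.
        "load_multiplier": round(1 / scale, 3),
    }


for s in (0.2, 0.4, 0.5):
    print(scaled_substrate(s))
```

At scale 0.4 this yields a 240 s window with caps of 360 s and 48 s, and an effective offered load of 2.5× instead of 5×.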

Comparability caveat (applies to whatever scale is used). Compressed arrivals mean the absolute request_rate_per_gpu is not comparable to the scale=1.0 ground-truth climb (TP1 0.123 → TP2 0.29 → TP4 1.00). The ablation reads trajectory shape (which knob family each iteration tries; monotonic incumbent climb; where each run stops) and relative per-topology ordering.

Expected contrast (hypothesis to be confirmed — do not treat as result)

The expectations below follow from the committed mechanism, the scale=1.0 27B climb (stop-b-e2e-27b-20260616.md), the older smoke-regime ablation (qwen27b-chat-0-8k-harness-fig18.md, iters-to-best 4→2), and a prior 235B naive run inspected this session (.aituner-prefill/...-noharness-minprompt-gpt54-20260514, which wandered into TP4+EP4 and TP4+DP2 launch failures, repeatedly fiddled with the max-num-seqs/mbt runtime knobs, and regressed at iters 6/8/9/11/12):

- Harness ON should diagnose TP1 as TPOT/decode-bound (with the tensor-parallel-size family flagged active_now) and steer to TP↑ early: climb TP1→TP2→TP4 with a monotonic incumbent, then pivot to runtime tuning on the winning family, and stop only when the Stop-B convergence guard authorizes it.
- Naive OFF is expected to wander (runtime knobs, EP/duplicate/launch-failing topologies) and possibly stop early on its own should_stop without a validator veto.

This is the quantity the rerun must measure; it is not yet measured here.

Continuation recipe (to finish the sweep)

1. On dash0, set replay_time_scale=0.4 (and early_stop_max_elapsed_s=360, early_stop_max_lag_s=48) in both ablation configs; keep max_probes=3, --max-trials 8, and everything else identical. Re-verify that the two configs differ only in use_harness and study_id.
2. export OPENAI_API_KEY=$(python3 -c "import json,pathlib;print(json.load(open(pathlib.Path.home()/'.codex/auth.json'))['OPENAI_API_KEY'])"); then confirm curl .../v1/models → 200.
3. Run sequentially (GPUs 2–7 free between runs): setsid env PYTHONPATH=src OPENAI_API_KEY=$OPENAI_API_KEY python3 -u -m aituner.cli study tune --spec <cfg> --store-root .aituner-ablation --max-trials 8 </dev/null >logs/<name>.log 2>&1 &
4. Extract trajectories with the committed helper: python3 scripts/ablation_trajectory.py .aituner-ablation/<study_id>. It prints the iter → config → per_gpu → incumbent table and the proposal path (it distinguishes baseline-* / proposal-* / harness-proposal-* / harness-stop-*, so metrics #2 and #5 fall out directly).
5. Fill the five comparison metrics: (1) iters-to-best, (2) proposal path, (3) oscillation/regression, (4) wasted/infeasible/launch-failed trials, (5) whether/when each run stops (harness Stop-B vs naive's own should_stop).
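Metrics (1), (3), and (4) can be computed mechanically once each run is reduced to a per-iteration list of per-GPU rates. The sketch below is not scripts/ablation_trajectory.py: `trajectory_metrics` is a hypothetical helper, the convention that `None` marks an infeasible or launch-failed trial is an assumption, "regression" is taken to mean a completed trial that scores below the current incumbent, and the sample numbers are made up.

```python
from typing import List, Optional, Sequence


def trajectory_metrics(per_gpu: Sequence[Optional[float]]) -> dict:
    """Summarize one run; per_gpu[i] is the rate of iteration i+1, None if it failed."""
    incumbent: Optional[float] = None
    iters_to_best: Optional[int] = None
    regressions = 0
    wasted = 0
    incumbent_path: List[Optional[float]] = []
    for i, rate in enumerate(per_gpu, start=1):
        if rate is None:
            wasted += 1                      # infeasible or launch-failed trial
        elif incumbent is None or rate > incumbent:
            incumbent, iters_to_best = rate, i
        elif rate < incumbent:
            regressions += 1                 # completed, but below the incumbent
        incumbent_path.append(incumbent)
    return {"best": incumbent, "iters_to_best": iters_to_best,
            "regressions": regressions, "wasted": wasted,
            "incumbent_path": incumbent_path}


# Made-up trajectory for illustration only.
print(trajectory_metrics([0.10, 0.20, None, 0.15, 0.40]))
```

Running it on both arms gives directly comparable numbers; metrics (2) and (5) still come from the proposal-path labels the committed helper prints.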

Operational notes confirmed this session

- LLM auth path works (export OPENAI_API_KEY from ~/.codex/auth.json; 200 from https://ai.prism.uno/v1/models). Both ON and OFF call the LLM.
- GPUs 0/1 were clean (0 MiB) this session; the earlier leaked-memory orphans appear to have been reset. Configs still pin GPUs 2–7.
- SIGTERM teardown fix validated again: killing study tune tore down the engine + EngineCore workers cleanly, GPUs 2–7 returned to 0 MiB, no orphan.
- Use setsid + </dev/null to fully detach the run from the (intermittently flaky) ssh session; poll state.json / trials/*/probe_history.json. The CLI buffers little to stdout: per-trial signal is in state.json; per-request signal is in each trial's engine.log.
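A small poller makes the detached run easy to watch from a fresh ssh session. This is a sketch only: `snapshot` and `poll` are hypothetical helpers, the file layout (state.json and trials/*/probe_history.json under the study directory) is taken from the note above, and the assumption that probe_history.json holds a JSON list of probes is unverified, so anything else is reported as `None`.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional


def read_json(path: Path) -> Optional[Any]:
    """Return parsed JSON, or None if the file is missing or mid-write."""
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def snapshot(study_dir: Path) -> dict:
    """Read state.json and count the probes recorded for each trial."""
    probes = {}
    for path in sorted(study_dir.glob("trials/*/probe_history.json")):
        data = read_json(path)
        # Assumes a JSON list of probes; other shapes are reported as None.
        probes[path.parent.name] = len(data) if isinstance(data, list) else None
    return {"state": read_json(study_dir / "state.json"),
            "probes_per_trial": probes}


def poll(study_dir: Path, interval_s: float = 60.0, rounds: int = 3) -> None:
    """Print a snapshot every interval_s seconds, rounds times."""
    for i in range(rounds):
        print(json.dumps(snapshot(study_dir), default=str))
        if i + 1 < rounds:
            time.sleep(interval_s)
```

Reading files this way never touches the tuner process, so it is safe to start and stop while the study is running.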
