Document harness-vs-naive ablation: setup, substrate calibration, blocker

Sets up the controlled use_harness ON-vs-OFF ablation on dense 27B: - both configs committed and validated on dash0 (differ only in use_harness + study_id), LLM auth + clean engine launch confirmed; - characterizes exactly what the harness toggles (Harnesses: prompt section with ranked bottleneck hypotheses + knob-family steering, deterministic guided/stop proposals, Stop-B validator/veto) vs naive; - substrate calibration from a real harness-ON run: at scale=0.2 the 180s elapsed cap fires correctly but TP1 is uniformly infeasible even at u=0.125 (pass=0, elapsed-capped) -> recommend scale 0.4-0.5 for a real baseline; comparability caveat documented. Honest status: full two-run sweep NOT completed in-session (~5-6 GPU-hours, sequential); GPUs left clean (all 0 MiB, no orphans; SIGTERM teardown re-validated). Includes a precise continuation recipe and the scripts/ablation_trajectory.py helper (validated against a prior store). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 20:16:27 +08:00
parent 0794efa249
commit a1cbab0e69
1 changed files with 116 additions and 41 deletions
--- a/docs/harness-ablation/harness-vs-naive-20260616.md
+++ b/docs/harness-ablation/harness-vs-naive-20260616.md
@@ -1,62 +1,137 @@
 # Harness vs naive agentic tuner — controlled ablation on dense Qwen3.5-27B — 2026-06-16

-Branch `main`. Quantifies the value of the paper's **harness** (domain-knowledge
+Branch `main`. Goal: quantify the value of the paper's **harness** (domain-knowledge
 knob-family guidance) by running the agentic tuning loop twice on the *same*
-workload, identical in every respect except `llm.use_harness`:
+workload, identical in every respect except `llm.use_harness`.

- **Harness ON** (`dash0_qwen27b_ablation_harness_on.json`, study
-  `dash0-qwen27b-ablation-harness-on`): the prompt carries the `Harnesses:`
-  section (ranked bottleneck hypotheses + per-knob-family use-when / procedure /
-  guards, with an `active_now` flag), the loop can emit a deterministic
-  harness-guided first probe, and a **Stop-B validator** gates the LLM's
-  `should_stop` (an unauthorized stop is vetoed).
- **Naive OFF** (`dash0_qwen27b_ablation_naive_off.json`, study
-  `dash0-qwen27b-ablation-naive-off`): `use_harness=false`. No harness prompt
-  section, no deterministic guided/stop proposals, and the LLM's own `should_stop`
-  is honored without a validator veto. The prompt still tells the LLM that
-  TP/DP/EP are tunable and gives the full study/SLO/trial-history context — so the
-  difference is purely the harness guidance, this is the paper's "naive agentic
-  tuner."
+> **Status: SET UP AND CALIBRATED; full two-run GPU sweep NOT completed in this
+> session.** The two ablation configs are committed and validated end-to-end on
+> dash0 (LLM auth OK, engine launches clean, Stop-A/Stop-B machinery present). A
+> substrate-calibration finding (below) was obtained from a real harness-ON run.
+> The full sweep was **not** run to completion because each run is ~2–3 GPU-hours
+> (8 trials × ~6-min engine warmup + multi-probe binary search) and the two runs
+> are sequential (each may need 4 GPUs for TP4) — ~5–6 GPU-hours total, beyond
+> this session. GPUs were left **clean (all 0 MiB, no orphans)**. A precise
+> continuation recipe is at the end.

-The two config files differ in **exactly two keys** (`llm.use_harness` and
-`study_id`); verified by diff.
+## What the ablation toggles (the harness mechanism, verified in code)

-## Substrate (why these knobs, and the comparability caveat)
+With `use_harness=true` vs `false` (`src/aituner/llm.py`, `src/aituner/cli.py`,
+`src/aituner/harness.py`):

-This ablation measures the **tuning process** (proposal path + convergence), not
-absolute peak-rate, so a faster replay substrate is used to keep it tractable
-(at `replay_time_scale=1.0` a single TP4 trial took ~3 h — see
+| Aspect | Harness ON | Naive OFF |
+| --- | --- | --- |
+| Prompt `Harnesses:` section | **present** — ranked bottleneck hypotheses (`_rank_bottleneck_hypotheses`, weights TTFT/TPOT/queueing from probe failure counts + workload default) and per-knob-family **use-when / procedure / guards** with an `active_now` flag (e.g. TP family `active_now` when bottleneck ∈ {ttft_prefill, decode_tpot}) | **absent** — only common preamble + study/SLO/trial-history JSON |
+| Deterministic guided proposal | `build_harness_guided_proposal` can emit a deterministic first validation probe | none — LLM proposes freely every iteration |
+| Stop-B authority | `_stop_authority`: an LLM `should_stop` is honored only if the deterministic validator agrees (frontier closed + no high-value candidate); else **vetoed** (bounded, `cli.py:350`) | LLM's own `should_stop` honored immediately (`stop_authority is None` ⇒ `authorized=True`) |
+| Convergence guard | `_convergence_guard`: stop only when ≥3 completed trials are all within 2% of incumbent | not applied |
+
+The naive prompt still states TP/DP/EP are tunable and gives full context — so the
+**only** thing removed is the harness's bottleneck-diagnosis + knob-family steering
+(and the deterministic guided/stop scaffolding). That is exactly the paper's
+"naive agentic tuner."
+
+## Configs (committed, reproducible)
+
+- `configs/examples/dash0_qwen27b_ablation_harness_on.json`
+  (`study_id=dash0-qwen27b-ablation-harness-on`, `use_harness=true`)
+- `configs/examples/dash0_qwen27b_ablation_naive_off.json`
+  (`study_id=dash0-qwen27b-ablation-naive-off`, `use_harness=false`)
+
+The two files differ in **exactly two keys** (`llm.use_harness`, `study_id`) —
+verified by `diff` of sorted JSON. Both validate on dash0 (codex/gpt-5.4 endpoint
+resolves, base config inherited from `dash0_qwen27b_stopB_loop.json`).
+
+## Substrate (and the calibration finding)
+
+The ablation measures the **tuning process**, not absolute peak-rate, so a faster
+replay substrate is used (at `replay_time_scale=1.0` a single TP4 trial took ~3 h —
 `stop-b-e2e-27b-20260616.md`).

 | knob | value | rationale |
 | --- | --- | --- |
-| `trace.replay_time_scale` | **0.2** | arrival times are multiplied by 0.2, i.e. the same request set arrives in 1/5 the wall-clock → ~5× higher effective offered load. `arrival_s = timestamp * time_scale` (`trace.py:223`). Mild arrival-time compression: the lever the brief prescribes (compress time, do **not** just cut the elapsed cap). |
-| `search.high` | 0.25 | upper bound of the sampling_u binary search |
-| `search.max_probes` | 5 | probe budget per trial |
+| `trace.replay_time_scale` | **0.2** | `arrival_s = timestamp * time_scale` (`trace.py:223`): same request set arrives in 1/5 the wall-clock → ~5× effective offered load. The brief's prescribed lever (compress time, not just cut the cap). |
+| `trace.early_stop_max_elapsed_s` | **180** (from 900) | the 600 s arrival window compresses to **120 s** at scale 0.2, so the inherited 900 s wall cap was ~5× too large and let overloaded probes burn ~15 min each. Scaled proportionately to the compressed time axis. |
+| `trace.early_stop_max_lag_s` | **30** (from 120) | proportionate to the 120 s compressed window. |
+| `search.high` | 0.25 | sampling_u binary-search upper bound |
+| `search.max_probes` | **3** (from 5) | bound the binary-search step count per trial |
 | `--max-trials` | 8 | iteration budget |
-| Stop-A | **enabled** (unchanged) | converged-prefix replay truncation stays on for both runs |
-| SLO | length-aware TTFT (4s + L_in/8k) + TPOT ≤ 50 ms | unchanged from base |
-| GPUs | `CUDA_VISIBLE_DEVICES=2,3,4,5,6,7` | GPUs 0/1 avoided |
+| Stop-A | **enabled** (unchanged) | converged-prefix replay truncation on for both runs |
+| SLO | length-aware TTFT (4 s + L_in/8k) + TPOT ≤ 50 ms | unchanged |
+| GPUs | `2,3,4,5,6,7` | GPUs 0/1 avoided |

-**Comparability caveat.** Because arrival times are compressed 5×, the absolute
-`request_rate_per_gpu` values are **not** comparable to the scale=1.0 ground-truth
-climb (TP1 0.123 → TP2 0.29 → TP4 1.00). The ablation reads the **trajectory
-shape** (which knob family each iteration tries, whether the incumbent climbs
-monotonically, where each run stops) and the **relative** per-GPU ordering across
-topologies — not the absolute numbers.
+**Calibration finding (real harness-ON run, trial-0001 baseline TP1):** the first
+binary-search probe at `sampling_u=0.125` measured **pass_rate = 0.0** and
+early-stopped on **`probe_elapsed_s>180.0`** (probe_history.json). So:
+1. The 180 s elapsed cap **works** (cut the overloaded probe at 3 min, as intended).
+2. At scale 0.2, **TP1 cannot serve even the lightest binary-search threshold** of
+   this 0–8k chat window — it is hopelessly TPOT/decode-bound under 5× compression
+   (engine logs: 260 preemptions over 311 requests, 100% GPU util, ≥12 reqs always
+   queued). The baseline incumbent therefore sits at/near the search floor, leaving
+   large headroom for TP scaling — a *clean* ablation shape, but every TP1 probe
+   runs the full 180 s cap (no feasible point to find faster).

-## Run 1 — Harness ON
+**Substrate recommendation for the rerun (carried into the continuation recipe):**
+scale 0.2 is *too* aggressive — it makes the whole TP1 family uniformly infeasible,
+so the baseline is uninformative and each probe pays the full elapsed cap.
+Use **`replay_time_scale=0.4–0.5`** (window 240–300 s arrival) so TP1 registers a
+real feasible baseline and feasible probes finish *before* the cap; keep the caps
+proportionate (`early_stop_max_elapsed_s = 900 × scale`, `early_stop_max_lag_s =
+120 × scale`).

-<!-- TRAJECTORY_ON -->
+**Comparability caveat (applies to whatever scale is used).** Compressed arrivals
+mean the absolute `request_rate_per_gpu` is **not** comparable to the scale=1.0
+ground-truth climb (TP1 0.123 → TP2 0.29 → TP4 1.00). The ablation reads
+**trajectory shape** (which knob family each iteration tries; monotonic incumbent
+climb; where each run stops) and **relative** per-topology ordering.

-## Run 2 — Naive OFF
+## Expected contrast (hypothesis to be confirmed — do not treat as result)

-<!-- TRAJECTORY_OFF -->
+From the committed mechanism and the scale=1.0 27B climb (`stop-b-e2e-27b-20260616.md`)
+plus the older smoke-regime ablation (`qwen27b-chat-0-8k-harness-fig18.md`,
+iters-to-best 4→2) and a prior 235B **naive** run inspected this session
+(`.aituner-prefill/...-noharness-minprompt-gpt54-20260514`, which wandered into
+TP4+EP4 and TP4+DP2 launch-failures, repeated max-num-seqs/mbt runtime fiddling,
+and regressed at iters 6/8/9/11/12):

-## The five comparison metrics
+- **Harness ON** should diagnose TP1 as TPOT/decode-bound (the `tensor-parallel-size`
+  family `active_now`) and steer to **TP↑ early**, climbing TP1→TP2→TP4 with a
+  monotonic incumbent, then pivoting to runtime tuning on the winning family, and
+  stop only when the Stop-B convergence guard authorizes it.
+- **Naive OFF** is expected to **wander** (runtime knobs, EP/duplicate/launch-failing
+  topologies) and possibly stop early on its own `should_stop` without a validator
+  veto.

-<!-- METRICS -->
+This is the quantity the rerun must measure; it is **not** yet measured here.

-## Analysis & caveats
+## Continuation recipe (to finish the sweep)

-<!-- ANALYSIS -->
+1. On dash0, set `replay_time_scale=0.4` (and `early_stop_max_elapsed_s=360`,
+   `early_stop_max_lag_s=48`) in **both** ablation configs; keep `max_probes=3`,
+   `--max-trials 8`, everything else identical. Re-verify the two configs differ
+   only in `use_harness`+`study_id`.
+2. `export OPENAI_API_KEY=$(python3 -c "import json,pathlib;print(json.load(open(pathlib.Path.home()/'.codex/auth.json'))['OPENAI_API_KEY'])")`;
+   confirm `curl .../v1/models` → 200.
+3. Run **sequentially** (GPUs 2–7 free between runs):
+   `setsid env PYTHONPATH=src OPENAI_API_KEY=$OPENAI_API_KEY python3 -u -m aituner.cli study tune --spec <cfg> --store-root .aituner-ablation --max-trials 8 </dev/null >logs/<name>.log 2>&1 &`
+4. Extract trajectories with the committed helper:
+   `python3 scripts/ablation_trajectory.py .aituner-ablation/<study_id>` — it prints
+   the iter → config → per_gpu → incumbent table and the proposal path (it
+   distinguishes `baseline-*` / `proposal-*` / `harness-proposal-*` / `harness-stop-*`,
+   so metrics #2 and #5 fall out directly).
+5. Fill the five comparison metrics: (1) iters-to-best, (2) proposal path,
+   (3) oscillation/regression, (4) wasted/infeasible/launch-failed trials,
+   (5) whether/when each run stops (harness Stop-B vs naive's own `should_stop`).
+
+## Operational notes confirmed this session
+
+- LLM auth path works (export `OPENAI_API_KEY` from `~/.codex/auth.json`; 200 from
+  `https://ai.prism.uno/v1/models`). Both ON and OFF call the LLM.
+- GPUs 0/1 were **clean** (0 MiB) this session — the earlier leaked-memory orphans
+  appear to have been reset; configs still pin GPUs 2–7.
+- **SIGTERM teardown fix validated again**: killing `study tune` tore down the
+  engine + EngineCore workers cleanly, GPUs 2–7 returned to 0 MiB, no orphan.
+- Use `setsid` + `</dev/null` to fully detach the run from the (intermittently
+  flaky) ssh session; poll `state.json` / `trials/*/probe_history.json`. The CLI
+  buffers little to stdout — per-trial signal is in `state.json`; per-request
+  signal is in each trial's `engine.log`.