Scale ablation early-stop caps to the compressed window (scale=0.2)

At replay_time_scale=0.2 the 600s arrival window compresses to 120s, so the inherited 900s wall-clock elapsed cap let overloaded TP1 probes burn ~15min each (the tractability hazard the brief flagged). Scale the caps proportionately to the time axis: early_stop_max_elapsed_s 900->180, early_stop_max_lag_s 120->30. Feasible probes (~120s arrival + drain) finish well inside 180s; overloaded probes die in ~3min. Both configs still differ only in use_harness + study_id. Adds the ablation doc skeleton and a read-only trajectory-extraction helper. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:49:57 +08:00
parent a16016a876
commit d975e57bb5
4 changed files with 137 additions and 4 deletions
--- a/configs/examples/dash0_qwen27b_ablation_harness_on.json
+++ b/configs/examples/dash0_qwen27b_ablation_harness_on.json
@@ -131,8 +131,8 @@
      "max_input_tokens": 8192
    },
    "replay_time_scale": 0.2,
-    "early_stop_max_lag_s": 120.0,
+    "early_stop_max_lag_s": 30.0,
-    "early_stop_max_elapsed_s": 900.0,
+    "early_stop_max_elapsed_s": 180.0,
    "adaptive_stop": {
      "enabled": true,
      "tau": 0.9,
--- a/configs/examples/dash0_qwen27b_ablation_naive_off.json
+++ b/configs/examples/dash0_qwen27b_ablation_naive_off.json
@@ -131,8 +131,8 @@
      "max_input_tokens": 8192
    },
    "replay_time_scale": 0.2,
-    "early_stop_max_lag_s": 120.0,
+    "early_stop_max_lag_s": 30.0,
-    "early_stop_max_elapsed_s": 900.0,
+    "early_stop_max_elapsed_s": 180.0,
    "adaptive_stop": {
      "enabled": true,
      "tau": 0.9,
--- a/docs/harness-ablation/harness-vs-naive-20260616.md
+++ b/docs/harness-ablation/harness-vs-naive-20260616.md
@@ -0,0 +1,62 @@
 # Harness vs naive agentic tuner — controlled ablation on dense Qwen3.5-27B — 2026-06-16
 Branch `main`. Quantifies the value of the paper's **harness** (domain-knowledge
 knob-family guidance) by running the agentic tuning loop twice on the *same*
 workload, identical in every respect except `llm.use_harness`:
 - **Harness ON** (`dash0_qwen27b_ablation_harness_on.json`, study
  `dash0-qwen27b-ablation-harness-on`): the prompt carries the `Harnesses:`
  section (ranked bottleneck hypotheses + per-knob-family use-when / procedure /
  guards, with an `active_now` flag), the loop can emit a deterministic
  harness-guided first probe, and a **Stop-B validator** gates the LLM's
  `should_stop` (an unauthorized stop is vetoed).
 - **Naive OFF** (`dash0_qwen27b_ablation_naive_off.json`, study
  `dash0-qwen27b-ablation-naive-off`): `use_harness=false`. No harness prompt
  section, no deterministic guided/stop proposals, and the LLM's own `should_stop`
  is honored without a validator veto. The prompt still tells the LLM that
  TP/DP/EP are tunable and gives the full study/SLO/trial-history context — so the
  difference is purely the harness guidance, this is the paper's "naive agentic
  tuner."
 The two config files differ in **exactly two keys** (`llm.use_harness` and
 `study_id`); verified by diff.
 ## Substrate (why these knobs, and the comparability caveat)
 This ablation measures the **tuning process** (proposal path + convergence), not
 absolute peak-rate, so a faster replay substrate is used to keep it tractable
 (at `replay_time_scale=1.0` a single TP4 trial took ~3 h — see
 `stop-b-e2e-27b-20260616.md`).
 | knob | value | rationale |
 | --- | --- | --- |
 | `trace.replay_time_scale` | **0.2** | arrival times are multiplied by 0.2, i.e. the same request set arrives in 1/5 the wall-clock → ~5× higher effective offered load. `arrival_s = timestamp * time_scale` (`trace.py:223`). Mild arrival-time compression: the lever the brief prescribes (compress time, do **not** just cut the elapsed cap). |
 | `search.high` | 0.25 | upper bound of the sampling_u binary search |
 | `search.max_probes` | 5 | probe budget per trial |
 | `--max-trials` | 8 | iteration budget |
 | Stop-A | **enabled** (unchanged) | converged-prefix replay truncation stays on for both runs |
 | SLO | length-aware TTFT (4s + L_in/8k) + TPOT ≤ 50 ms | unchanged from base |
 | GPUs | `CUDA_VISIBLE_DEVICES=2,3,4,5,6,7` | GPUs 0/1 avoided |
 **Comparability caveat.** Because arrival times are compressed 5×, the absolute
 `request_rate_per_gpu` values are **not** comparable to the scale=1.0 ground-truth
 climb (TP1 0.123 → TP2 0.29 → TP4 1.00). The ablation reads the **trajectory
 shape** (which knob family each iteration tries, whether the incumbent climbs
 monotonically, where each run stops) and the **relative** per-GPU ordering across
 topologies — not the absolute numbers.
 ## Run 1 — Harness ON
 <!-- TRAJECTORY_ON -->
 ## Run 2 — Naive OFF
 <!-- TRAJECTORY_OFF -->
 ## The five comparison metrics
 <!-- METRICS -->
 ## Analysis & caveats
 <!-- ANALYSIS -->
--- a/scripts/ablation_trajectory.py
+++ b/scripts/ablation_trajectory.py
@@ -0,0 +1,71 @@
 #!/usr/bin/env python3
 """Extract a per-iteration trajectory table from an ablation study store.
 Usage: python3 ablation_trajectory.py <study_store_dir>
 Prints iter, proposal source/name, config_patch summary, per_gpu, status,
 and the running incumbent per_gpu. Read-only.
 """
 import json
 import sys
 from pathlib import Path
 def topo(patch):
    fp = (patch or {}).get("flag_patch", {}) or {}
    ep = (patch or {}).get("env_patch", {}) or {}
    parts = []
    for k, label in (
        ("tensor-parallel-size", "TP"),
        ("data-parallel-size", "DP"),
        ("expert-parallel-size", "EP"),
    ):
        if k in fp:
            parts.append(f"{label}{fp[k]}")
    runtime = {
        k: v
        for k, v in fp.items()
        if k not in ("tensor-parallel-size", "data-parallel-size", "expert-parallel-size")
    }
    runtime.update({f"env:{k}": v for k, v in ep.items()})
    base = "+".join(parts) if parts else "baseline-topo"
    if runtime:
        base += " | " + ", ".join(f"{k}={v}" for k, v in runtime.items())
    return base
 def main():
    store = Path(sys.argv[1])
    state = json.load(open(store / "state.json"))
    print(f"study_id: {state.get('study_id')}")
    print(f"best_trial: {state.get('best_trial_id')}  best_per_gpu: {state.get('best_request_rate_per_gpu')}")
    print(f"stop_reason: {state.get('tuning_stop_reason')!r}")
    print(f"stop_diagnosis: {state.get('tuning_stop_diagnosis')!r}")
    print(f"stop_details: {json.dumps(state.get('tuning_stop_details'), ensure_ascii=False)}")
    print()
    incumbent = None
    hdr = f"{'iter':<5}{'trial':<11}{'status':<14}{'per_gpu':<10}{'incumbent':<11}config"
    print(hdr)
    print("-" * len(hdr))
    for i, t in enumerate(state.get("trials", []), 1):
        pg = t.get("best_request_rate_per_gpu")
        if pg is not None and (incumbent is None or pg > incumbent):
            incumbent = pg
        pgs = f"{pg:.4f}" if isinstance(pg, (int, float)) else str(pg)
        incs = f"{incumbent:.4f}" if isinstance(incumbent, (int, float)) else str(incumbent)
        print(
            f"{i:<5}{t.get('trial_id',''):<11}{str(t.get('status','')):<14}{pgs:<10}{incs:<11}{topo(t.get('config_patch'))}"
        )
    # also dump proposals dir to see what was *proposed* (incl. vetoed/failed)
    pdir = store / "proposals"
    if pdir.exists():
        print("\n-- proposal files (chronological) --")
        for p in sorted(pdir.glob("*.json")):
            try:
                pr = json.load(open(p))
            except Exception:
                continue
            print(f"  {p.stem}: should_stop={pr.get('should_stop')} | {topo(pr.get('config_patch'))}")
 if __name__ == "__main__":
    main()