Fig18 substrate: real output_length + criterion-A time_scale + Stop-A drain deadline

Replace the out=128 / scale=0.5 ablation substrate with a paper-faithful one:

- Use the trace's real output_length (drop completion_tokens_override=128).
  The 0-8k chat window has p50=531 / p99=2436 / max=35168 output tokens, so
  decode (TPOT) becomes the dominant bottleneck instead of an artificial
  128-token cap.
- replay_time_scale=0.8775, chosen by criterion-A: binary-search the smallest
  scale whose A-family L-C-A similarity to the real (scale=1.0) arrivals stays
  >= tau (0.90). The old scale=0.5 had sim_A=0.56, distorting the arrival axis
  far below the tau bar used everywhere else. New calibrator:
  scripts/calibrate_time_scale.py.
- Per-probe Stop-A-consistent drain deadline (worker._probe_drain_deadline):
  the wall-clock a *feasible* config needs to drain the LCA-admitted set
  (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin). With real
  outputs decode dominates wall-clock, so the old fixed 320s cap would truncate
  the Stop-A offered window mid-decode. early_stop_max_elapsed_s (1000s) is now
  a hard ceiling; the per-probe deadline governs. The lag cap still cuts
  overload.

12-iter paired driver (both arms on dash1, removes the dash0/dash1 host
confound): scripts/run_ablation_pair_d1.sh.

115 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 17:24:00 +08:00
parent 816765071f
commit 0c23285f39
6 changed files with 197 additions and 10 deletions
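
The drain-deadline arithmetic from the commit message (last_arrival + worst-case TTFT + p99_out * TPOT budget + margin, capped by the 1000s ceiling) can be sketched as below. This is an illustration only, not the body of worker._probe_drain_deadline, which is not shown here: the function name, parameter names, and every example value except p99_out=2436 and the 1000s ceiling are assumptions.

```python
def probe_drain_deadline_s(
    last_arrival_s: float,
    worst_case_ttft_s: float,
    p99_output_tokens: int,
    tpot_budget_s: float,
    margin_s: float,
    hard_ceiling_s: float = 1000.0,  # early_stop_max_elapsed_s per the commit message
) -> float:
    """Hypothetical sketch of the per-probe drain deadline.

    Wall-clock a feasible config would need to drain the admitted requests:
    the last arrival, plus its worst-case time to first token, plus the decode
    time of a p99-length output at the per-token budget, plus a safety margin.
    """
    decode_s = p99_output_tokens * tpot_budget_s
    deadline_s = last_arrival_s + worst_case_ttft_s + decode_s + margin_s
    # The global ceiling only bounds the deadline; it does not replace it.
    return min(deadline_s, hard_ceiling_s)


# Illustrative values: only p99_output_tokens=2436 comes from the commit message.
example_deadline_s = probe_drain_deadline_s(
    last_arrival_s=300.0,
    worst_case_ttft_s=5.0,
    p99_output_tokens=2436,
    tpot_budget_s=0.05,
    margin_s=30.0,
)
```

With these assumed inputs the decode term alone is 2436 * 0.05 = 121.8s, which puts the deadline past a fixed 320s cap. That is the truncation the commit message describes.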
--- a/scripts/calibrate_time_scale.py
+++ b/scripts/calibrate_time_scale.py
@@ -0,0 +1,99 @@
+#!/usr/bin/env python3
+"""Criterion-A time_scale calibration.
+
+Binary-search the smallest replay_time_scale whose A-family L-C-A similarity to the
+real (scale=1.0) arrival process stays >= tau. Uniform time scaling distorts only
+the A axis (rate + fano; interarrival CV is scale-invariant), so this bounds the
+arrival-axis distortion introduced by compression using the same similarity metric
+Stop-A uses. Pure trace metadata -> deterministic, no GPU needed.
+
+Usage:
+  PYTHONPATH=src python3 scripts/calibrate_time_scale.py \
+    --trace trace_windows/traces/chat_w20260311_1000.jsonl \
+    --gpu-count 8 --min-input 0 --max-input 8192 --tau 0.9
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import math
+from pathlib import Path
+
+from aituner.lca import _family_similarity, build_workload_profile
+from aituner.trace import TraceRequest, WindowRecord
+
+
+def load_rows(path: Path, lo: int, hi: int) -> list[dict]:
+    with path.open(encoding="utf-8") as fh:
+        rows = [json.loads(line) for line in fh if line.strip()]
+    return [r for r in rows if lo <= int(r["input_length"]) <= hi]
+
+
+def build_requests(rows: list[dict]) -> tuple[list[TraceRequest], float, float]:
+    reqs = []
+    for i, r in enumerate(rows):
+        reqs.append(
+            TraceRequest(
+                row_id=str(r.get("chat_id", i)),
+                arrival_s=float(r["timestamp"]),
+                sampling_u=float(r.get("sampling_u", 0.0)),
+                body={},
+                prompt_tokens_hint=int(r["input_length"]),
+                completion_tokens_hint=int(r["output_length"]),
+                metadata={"hash_ids": r.get("hash_ids") if isinstance(r.get("hash_ids"), list) else None},
+            )
+        )
+    amin = min(x.arrival_s for x in reqs)
+    amax = max(x.arrival_s for x in reqs)
+    return reqs, amin, amax
+
+
+def profile_at(reqs, amin, amax, gpu_count, scale):
+    rs = [
+        TraceRequest(
+            x.row_id, (x.arrival_s - amin) * scale, x.sampling_u, x.body,
+            x.prompt_tokens_hint, x.completion_tokens_hint, x.metadata,
+        )
+        for x in reqs
+    ]
+    span = (amax - amin) * scale
+    w = WindowRecord(
+        window_id="w", trace_path="", trace_type="chat",
+        window_start=0.0, window_end=span, source_payload={"block_size": 64},
+    )
+    return build_workload_profile(rs, w, gpu_count=gpu_count, length_mode="total")
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--trace", type=Path, required=True)
+    ap.add_argument("--gpu-count", type=int, default=8)
+    ap.add_argument("--min-input", type=int, default=0)
+    ap.add_argument("--max-input", type=int, default=8192)
+    ap.add_argument("--tau", type=float, default=0.9)
+    args = ap.parse_args()
+
+    rows = load_rows(args.trace, args.min_input, args.max_input)
+    reqs, amin, amax = build_requests(rows)
+    print(f"n={len(reqs)}  raw arrival span={amax - amin:.1f}s")
+    base = profile_at(reqs, amin, amax, args.gpu_count, 1.0)
+    print(f"{'scale':>6} {'simA':>7} {'rate/gpu':>9} {'fano':>8} {'span_s':>8}")
+    for s in (1.0, 0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2):
+        p = profile_at(reqs, amin, amax, args.gpu_count, s)
+        a = _family_similarity(base.vector, p.vector)["A"]
+        print(f"{s:6.2f} {a:7.3f} {math.expm1(p.vector[7]):9.3f} {math.expm1(p.vector[9]):8.2f} {(amax-amin)*s:8.1f}")
+
+    lo, hi = 0.05, 1.0
+    for _ in range(40):
+        mid = (lo + hi) / 2
+        a = _family_similarity(base.vector, profile_at(reqs, amin, amax, args.gpu_count, mid).vector)["A"]
+        if a >= args.tau:
+            hi = mid
+        else:
+            lo = mid
+    print(f"\nsmallest scale with simA>={args.tau}: {hi:.4f}  (arrival span {(amax-amin)*hi:.0f}s)")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
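
The docstring's claim that interarrival CV is scale-invariant while rate is not can be checked with a small standalone sketch. It uses plain Python rather than aituner's build_workload_profile, so the helper names and the toy arrival times are assumptions, and the way the real profile computes rate and CV may differ. The Fano factor is not covered here.

```python
import statistics


def rescale(arrivals_s: list[float], scale: float) -> list[float]:
    """Shift arrivals to start at 0 and compress the time axis uniformly."""
    t0 = min(arrivals_s)
    return [(t - t0) * scale for t in arrivals_s]


def rate_and_cv(arrivals_s: list[float]) -> tuple[float, float]:
    """Return (interarrival gaps per second, interarrival coefficient of variation)."""
    ts = sorted(arrivals_s)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    rate = len(gaps) / (ts[-1] - ts[0])
    cv = statistics.pstdev(gaps) / statistics.fmean(gaps)
    return rate, cv


# Toy arrival times in seconds (hypothetical, not taken from the trace).
toy_arrivals = [0.0, 1.0, 1.5, 4.0, 4.2, 9.0]
rate_real, cv_real = rate_and_cv(rescale(toy_arrivals, 1.0))
rate_fast, cv_fast = rate_and_cv(rescale(toy_arrivals, 0.5))
```

Uniform scaling multiplies every gap by the same factor, so the standard deviation and the mean of the gaps scale together and their ratio is unchanged. The rate grows by 1/scale.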
--- a/scripts/run_ablation_pair_d1.sh
+++ b/scripts/run_ablation_pair_d1.sh
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+# 12-iteration harness-vs-naive ablation, both arms on dash1 (clean paired run,
+# no host confound). Substrate: real output_length (no completion override),
+# replay_time_scale=0.8775 (criterion-A, sim_A>=0.90), Stop-A on (LCA offered
+# window), per-probe Stop-A-consistent drain deadline. Harness stops early; naive
+# runs the full budget. Run from the repo root on dash1.
+set -u
+export OPENAI_API_KEY=$(python3 -c 'import json,pathlib;print(json.load(open(pathlib.Path.home()/".codex/auth.json"))["OPENAI_API_KEY"])')
+# codex config.toml points at a dash0-local proxy (127.0.0.1:11235); on dash1 the
+# LLM endpoint is reachable directly, so force a direct connection.
+export http_proxy= https_proxy= all_proxy= HTTP_PROXY= HTTPS_PROXY= ALL_PROXY= no_proxy='*'
+mkdir -p .aituner
+rm -rf .aituner/abl12-harness .aituner/abl12-naive .aituner/ABLATION12_DONE
+
+echo "=== harness ON  (12-iter) start $(date -Is) ==="
+PYTHONPATH=src python3 -m aituner.cli study tune \
+  --spec configs/examples/dash0_qwen27b_ablation_harness_on.json \
+  --store-root .aituner/abl12-harness --max-trials 12 --skip-baseline > .aituner/abl12-harness.log 2>&1
+echo "=== harness ON  (12-iter) done  $(date -Is) ==="
+
+echo "=== naive OFF   (12-iter) start $(date -Is) ==="
+PYTHONPATH=src python3 -m aituner.cli study tune \
+  --spec configs/examples/dash0_qwen27b_ablation_naive_off.json \
+  --store-root .aituner/abl12-naive --max-trials 12 --skip-baseline > .aituner/abl12-naive.log 2>&1
+echo "=== naive OFF   (12-iter) done  $(date -Is) ==="
+
+touch .aituner/ABLATION12_DONE