aituner/docs/two-stop-summary.md
2026-06-16 19:16:28 +08:00

# Two-stop work — summary (feat/two-stop)
Aligns the tuning harness with paper.pdf by implementing and validating the two
distinct stopping mechanisms the paper conflates, plus a length-aware SLO and two
harness-robustness fixes. 117 unit tests pass.
## The two stops (they are different things)
- **Stop-A — evaluation sufficiency** (per replay / probe): how much trace to replay
before a measurement is trustworthy. Criterion: the replayed prefix's offered
L-C-A converges to the full set's. Saves per-evaluation GPU time.
- **Stop-B — tuning convergence** (across iterations): whether any config-improvement
opportunity remains; stop if not. Saves iterations.
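
The Stop-A criterion can be pictured with the sketch below. It is not the `lca.find_convergence_prefix` implementation: the relative-gap metric, the `tol` parameter, and the two-component `toy_features` (mean input length and cached-token fraction, standing in for the 10-dim L-C-A vector) are assumptions made for illustration, and the C-gate is not modeled.

```python
from typing import Callable, Sequence, Tuple

Request = Tuple[int, int]  # (input_tokens, cached_tokens); hypothetical record shape


def relative_gap(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest per-dimension relative difference between two feature vectors."""
    return max(abs(x - y) / max(abs(y), 1e-9) for x, y in zip(a, b))


def toy_features(reqs: Sequence[Request]) -> Tuple[float, float]:
    """Stand-in for the L-C-A vector: mean input length and cached-token fraction."""
    total_in = sum(r[0] for r in reqs)
    total_cached = sum(r[1] for r in reqs)
    return (total_in / len(reqs), total_cached / max(total_in, 1))


def convergence_prefix(
    requests: Sequence[Request],
    features: Callable[[Sequence[Request]], Sequence[float]],
    tol: float = 0.05,
) -> int:
    """Shortest prefix length n such that every prefix of length >= n stays
    within `tol` of the full set's features (deterministic, no sampling)."""
    if not requests:
        return 0
    target = features(requests)
    # Walk backwards: the answer is one past the last prefix that is still off-target.
    for m in range(len(requests) - 1, 0, -1):
        if relative_gap(features(requests[:m]), target) > tol:
            return m + 1
    return 1
```

Requiring every longer prefix to stay within tolerance, rather than stopping at the first prefix that happens to match, avoids declaring convergence on a transient crossing. The naive recomputation is quadratic in trace length; a real implementation would presumably use running sums.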
## What was built
| Area | Summary | Commit |
| --- | --- | --- |
| Unify L-C-A | The prompt's `workload_lca_profile` is now the canonical 10-dim `lca.py` vector, not an ad-hoc re-derivation | `6f8e3c9` |
| Session-coherent load axis | `sampling_u` assigned per session (parent_chat_id chain) so thresholding keeps/drops whole multi-turn sessions, preserving intra-session KV reuse (C) across load levels | `0f15bbc` |
| Stop-A | `lca.find_convergence_prefix` (deterministic offered-L-C-A convergence prefix + C-gate: never declare infeasible on a cold cache); `spec.AdaptiveStopSpec` (default off); `worker._adaptive_replay_set` truncates replay + writes a certificate | `51a9e4a` |
| Stop-A boundary guard | Re-measure the full window when a truncated probe's pass-rate is within `boundary_delta` of the SLO target (fixes feasibility-knee false positives) | `dfc823f` |
| Stop-B authority | `harness._stop_authority` (mirrors the deterministic validator); `study tune` honors an LLM `should_stop` only if the validator agrees, else a bounded veto | `a8f9034` |
| Length-aware SLO | `linear_ms` rule: `threshold = intercept_ms + per_token_ms·L_in` (e.g. 4s + L_in/8k) | `ed2bbe0` |
| Robustness: timeout | HTTP stream now wraps `OSError`/`TimeoutError` as `HttpClientError` — a request exceeding `request_timeout_s` is a failed request, not a crashed trial | `2fcaf80` |
| Robustness: SIGTERM | `run_trial` installs a SIGTERM handler so killing `study tune` tears down the engine (and EngineCore workers) instead of orphaning it / leaking GPU memory | `b17b213` |
| Tooling | `stop_a_calibration.py` (CPU convergence curve), `stop_a_validate.py` (offline truncation-fidelity) | `08e53fd`, `3af1d84` |
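
The `linear_ms` rule and the boundary guard from the table can be sketched as follows. The function names and the sample values in the usage note are hypothetical; only the formula and the "within `boundary_delta` of the SLO target" condition come from the table. The coefficients in the usage note read "4s + L_in/8k" as 4000 ms plus 0.125 ms per input token, which is one interpretation and not a confirmed setting.

```python
def linear_ms_threshold(l_in: int, intercept_ms: float, per_token_ms: float) -> float:
    """Length-aware SLO threshold: intercept_ms + per_token_ms * L_in."""
    return intercept_ms + per_token_ms * l_in


def needs_full_remeasure(pass_rate: float, slo_target: float, boundary_delta: float) -> bool:
    """Boundary guard: a truncated probe whose pass-rate lands within
    boundary_delta of the SLO target is too close to call, so the full
    window should be replayed before a verdict is recorded."""
    return abs(pass_rate - slo_target) <= boundary_delta
```

Under the assumed coefficients, `linear_ms_threshold(8000, 4000.0, 0.125)` gives a 5000 ms threshold for an 8000-token prompt. With an assumed target of 0.95 and `boundary_delta` of 0.05, a truncated probe at 0.93 would be re-measured, while one at 0.80 would keep its truncated verdict.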

A subagent code review found no blocking bugs and independently validated session
coherence against a real trace; three minor fixes were applied in `43125f4`.
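
The session-coherent load axis can be illustrated with a minimal sketch: every request in a session shares one `sampling_u`, so thresholding keeps or drops the session as a whole. The hash-based draw, the `chat_id` field name, and all helper names are assumptions; only `sampling_u` and `parent_chat_id` come from the table, and the harness may assign values differently.

```python
import hashlib
from typing import Dict, List, Optional


def session_root(chat_id: str, parent_of: Dict[str, Optional[str]]) -> str:
    """Follow parent_chat_id links up to the first turn of the session."""
    seen = set()
    while parent_of.get(chat_id) is not None and chat_id not in seen:
        seen.add(chat_id)  # guard against accidental cycles in the trace
        chat_id = parent_of[chat_id]
    return chat_id


def stable_u(key: str) -> float:
    """Deterministic pseudo-uniform value in [0, 1) derived from a key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def assign_sampling_u(requests: List[dict]) -> Dict[str, float]:
    """Give every request the sampling_u of its session root."""
    parent_of = {r["chat_id"]: r.get("parent_chat_id") for r in requests}
    return {
        r["chat_id"]: stable_u(str(session_root(r["chat_id"], parent_of)))
        for r in requests
    }


def select_at_load(requests: List[dict], theta: float) -> List[dict]:
    """Keep requests whose session-level sampling_u falls below the threshold."""
    u = assign_sampling_u(requests)
    return [r for r in requests if u[r["chat_id"]] < theta]
```

Because the draw is keyed on the session root, raising `theta` only ever adds whole sessions, so later turns never lose the earlier turns whose KV cache they would reuse.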
## Validation results (real GPU runs, dash0 H20)
- **Stop-A** (`stop-a-validation-20260615.md`): CPU calibration reproduces the paper's
C-slowest ordering; GPU fidelity check on Qwen3-30B-A3B saves ~52% replay at τ_c=0.90
with 3/4 probe verdicts preserved; the one mismatch is a feasibility-knee false
positive that the **boundary guard fixes** — with the guard, best threshold matches
full replay exactly while still saving ~38% replay.
- **27B TP sweep** (`qwen27b-tp-sweep-20260616.md`): under the length-aware SLO, dense
Qwen3.5-27B per-GPU throughput rises sharply with TP (TP1 0.065 → TP2 0.29 → TP4 ≥0.91), the
**opposite** of the MoE 30B-A3B (TP1 best per-GPU). TP1 is TPOT-bound.
- **Stop-B** (`stop-b-e2e-20260615.md`, `stop-b-e2e-27b-20260616.md`): the 30B run
shows the deterministic `validator` stop plus a **veto** of a premature LLM stop; the 27B run
(real gpt-5.4 loop) shows a genuine **improving climb**, TP1 0.123 → TP2 0.29 → TP4
1.00 req/s/GPU (8.1×), with each step a correctly diagnosed gain. The loop then correctly
**rejected** a TP4 runtime tweak that measured worse, so no regression was accepted. The
SIGTERM fix was validated in practice (clean teardown, no leak).
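
The stop-versus-veto behaviour seen in these runs can be sketched as a small decision rule. This is not `harness._stop_authority`: the class, the `max_vetoes` bound, and the reading of "bounded veto" as "override the LLM at most `max_vetoes` times, then honor its stop" are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class StopAuthority:
    """Hypothetical arbiter between an LLM stop request and a deterministic validator."""

    max_vetoes: int = 2   # assumed bound on how often the LLM can be overridden
    vetoes_used: int = 0

    def decide(self, llm_should_stop: bool, validator_should_stop: bool) -> str:
        if validator_should_stop:
            # The deterministic validator sees no remaining opportunity.
            return "stop"
        if not llm_should_stop:
            return "continue"
        if self.vetoes_used < self.max_vetoes:
            # Premature LLM stop: the validator disagrees, so keep tuning.
            self.vetoes_used += 1
            return "veto"
        # Veto budget exhausted: honor the LLM so the loop cannot run forever.
        return "stop"
```

Bounding the veto keeps the validator authoritative in the common case while still guaranteeing termination if the LLM and the validator disagree persistently.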
## Open items (next round)
- **Harness convergence ablation (NOT yet done on this branch).** The paper's harness
result — domain-knowledge knob-family rules steering the LLM and cutting iterations —
has only *qualitative* evidence here (the 27B climb shows correct steering) plus older
smoke-regime ablations (`qwen27b-chat-0-8k-harness-fig18.md`: iters-to-best 4→2). A
controlled `use_harness=true` vs `false` (naive tuner) comparison on the 27B is the
missing quantified result.
- **Loop efficiency**: at scale=1.0 infeasible high-θ probes burn the
`early_stop_max_elapsed_s=900` cap (a TP4 trial took ~3h). Lower it to ~300s for
practical agentic loops.
- **dash0 GPUs 0/1** still hold leaked memory (pre-fix orphans) — needs a root
`nvidia-smi --gpu-reset`.
- Deferred: more robust C feature for the low-reuse regime; Stop-C cross-day retune
trigger (paper Q1); §7 baselines (SCOOT / naive / community).