Add consolidated two-stop summary doc
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
63
docs/two-stop-summary.md
Normal file
63
docs/two-stop-summary.md
Normal file
@@ -0,0 +1,63 @@
|
||||
# Two-stop work — summary (feat/two-stop)
|
||||
|
||||
Aligns the tuning harness with paper.pdf by implementing and validating the two
|
||||
distinct stopping mechanisms the paper conflates, plus a length-aware SLO and two
|
||||
harness-robustness fixes. 117 unit tests pass.
|
||||
|
||||
## The two stops (they are different things)
|
||||
|
||||
- **Stop-A — evaluation sufficiency** (per replay / probe): how much trace to replay
|
||||
before a measurement is trustworthy. Criterion: the replayed prefix's offered
|
||||
L-C-A converges to the full set's. Saves per-evaluation GPU time.
|
||||
- **Stop-B — tuning convergence** (across iterations): whether any config-improvement
|
||||
opportunity remains; stop if not. Saves iterations.
|
||||
|
||||
## What was built
|
||||
|
||||
| Area | Summary | Commit |
|
||||
| --- | --- | --- |
|
||||
| Unify L-C-A | The prompt's `workload_lca_profile` is now the canonical 10-dim `lca.py` vector, not an ad-hoc re-derivation | `6f8e3c9` |
|
||||
| Session-coherent load axis | `sampling_u` assigned per session (parent_chat_id chain) so thresholding keeps/drops whole multi-turn sessions, preserving intra-session KV reuse (C) across load levels | `0f15bbc` |
|
||||
| Stop-A | `lca.find_convergence_prefix` (deterministic offered-L-C-A convergence prefix + C-gate: never declare infeasible on a cold cache); `spec.AdaptiveStopSpec` (default off); `worker._adaptive_replay_set` truncates replay + writes a certificate | `51a9e4a` |
|
||||
| Stop-A boundary guard | Re-measure the full window when a truncated probe's pass-rate is within `boundary_delta` of the SLO target (fixes feasibility-knee false positives) | `dfc823f` |
|
||||
| Stop-B authority | `harness._stop_authority` (mirrors the deterministic validator); `study tune` honors an LLM `should_stop` only if the validator agrees, else a bounded veto | `a8f9034` |
|
||||
| Length-aware SLO | `linear_ms` rule: `threshold = intercept_ms + per_token_ms·L_in` (e.g. 4s + L_in/8k) | `ed2bbe0` |
|
||||
| Robustness: timeout | HTTP stream now wraps `OSError`/`TimeoutError` as `HttpClientError` — a request exceeding `request_timeout_s` is a failed request, not a crashed trial | `2fcaf80` |
|
||||
| Robustness: SIGTERM | `run_trial` installs a SIGTERM handler so killing `study tune` tears down the engine (and EngineCore workers) instead of orphaning it / leaking GPU memory | `b17b213` |
|
||||
| Tooling | `stop_a_calibration.py` (CPU convergence curve), `stop_a_validate.py` (offline truncation-fidelity) | `08e53fd`, `3af1d84` |
|
||||
|
||||
A subagent code review found no blocking bugs (and independently validated session
|
||||
coherence against a real trace); three minor fixes applied in `43125f4`.
|
||||
|
||||
## Validation results (real GPU runs, dash0 H20)
|
||||
|
||||
- **Stop-A** (`stop-a-validation-20260615.md`): CPU calibration reproduces the paper's
|
||||
C-slowest ordering; GPU fidelity check on Qwen3-30B-A3B saves ~52% replay at τ_c=0.90
|
||||
with 3/4 probe verdicts preserved; the one mismatch is a feasibility-knee false
|
||||
positive that the **boundary guard fixes** — with the guard, best threshold matches
|
||||
full replay exactly while still saving ~38% replay.
|
||||
- **27B TP sweep** (`qwen27b-tp-sweep-20260616.md`): under the length-aware SLO, dense
|
||||
Qwen3.5-27B per-GPU rises sharply with TP — TP1 0.065 → TP2 0.29 → TP4 ≥0.91 — the
|
||||
**opposite** of the MoE 30B-A3B (TP1 best per-GPU). TP1 is TPOT-bound.
|
||||
- **Stop-B** (`stop-b-e2e-20260615.md`, `stop-b-e2e-27b-20260616.md`): the 30B run
|
||||
shows the deterministic `validator` stop + a premature-LLM-stop **veto**; the 27B run
|
||||
(real gpt-5.4 loop) shows a genuine **improving climb** TP1 0.123 → TP2 0.29 → TP4
|
||||
1.00 req/s/GPU (8.1×), each a correctly-diagnosed gain, then correctly **rejecting** a
|
||||
TP4 runtime tweak that measured worse (no regression). The SIGTERM fix was validated
|
||||
in practice (clean teardown, no leak).
|
||||
|
||||
## Open items (next round)
|
||||
|
||||
- **Harness convergence ablation (NOT yet done on this branch).** The paper's harness
|
||||
result — domain-knowledge knob-family rules steering the LLM and cutting iterations —
|
||||
has only *qualitative* evidence here (the 27B climb shows correct steering) plus older
|
||||
smoke-regime ablations (`qwen27b-chat-0-8k-harness-fig18.md`: iters-to-best 4→2). A
|
||||
controlled `use_harness=true` vs `false` (naive tuner) comparison on the 27B is the
|
||||
missing quantified result.
|
||||
- **Loop efficiency**: at scale=1.0 infeasible high-θ probes burn the
|
||||
`early_stop_max_elapsed_s=900` cap (a TP4 trial took ~3h). Lower it to ~300s for
|
||||
practical agentic loops.
|
||||
- **dash0 GPUs 0/1** still hold leaked memory (pre-fix orphans) — needs a root
|
||||
`nvidia-smi --gpu-reset`.
|
||||
- Deferred: more robust C feature for the low-reuse regime; Stop-C cross-day retune
|
||||
trigger (paper Q1); §7 baselines (SCOOT / naive / community).
|
||||
Reference in New Issue
Block a user