Two-stop work — summary (feat/two-stop)
Aligns the tuning harness with paper.pdf by implementing and validating the two distinct stopping mechanisms the paper conflates, plus a length-aware SLO and two harness-robustness fixes. 117 unit tests pass.
The two stops (they are different things)
- Stop-A — evaluation sufficiency (per replay / probe): how much trace to replay before a measurement is trustworthy. Criterion: the replayed prefix's offered L-C-A converges to the full set's. Saves per-evaluation GPU time.
- Stop-B — tuning convergence (across iterations): whether any config-improvement opportunity remains; stop if not. Saves iterations.
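
The Stop-A criterion can be illustrated with a minimal sketch. It assumes each request contributes a numeric feature vector and that the offered L-C-A of a prefix is the per-dimension mean of those vectors; both are assumptions, as are the names `convergence_prefix` and `tol`. The real lca.find_convergence_prefix and its C-gate are not shown in this summary and may work differently.

```python
from typing import Sequence


def convergence_prefix(features: Sequence[Sequence[float]], tol: float = 0.05) -> int:
    """Return the smallest prefix length n such that, for every m >= n,
    the per-dimension mean of features[:m] stays within a relative
    tolerance `tol` of the mean over the full set.

    Hypothetical helper: the mean-based summary and the relative-error
    test are illustrative choices, not the harness's actual definition.
    """
    total = len(features)
    if total == 0:
        return 0
    dims = len(features[0])
    full = [sum(row[d] for row in features) / total for d in range(dims)]

    running = [0.0] * dims
    last_bad = 0  # largest prefix length that was outside the tolerance
    for m, row in enumerate(features, start=1):
        for d in range(dims):
            running[d] += row[d]
        for d in range(dims):
            scale = max(abs(full[d]), 1e-12)  # avoid dividing by zero
            if abs(running[d] / m - full[d]) / scale > tol:
                last_bad = m
                break
    # Everything after the last out-of-tolerance prefix is converged.
    return min(last_bad + 1, total)
```

Requiring the prefix to stay within tolerance for all longer prefixes, rather than stopping at the first match, avoids declaring convergence on a prefix mean that merely crosses the full-set value in passing.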
What was built
| Area | Summary | Commit |
|---|---|---|
| Unify L-C-A | The prompt's workload_lca_profile is now the canonical 10-dim lca.py vector, not an ad-hoc re-derivation | 6f8e3c9 |
| Session-coherent load axis | sampling_u assigned per session (parent_chat_id chain) so thresholding keeps/drops whole multi-turn sessions, preserving intra-session KV reuse (C) across load levels | 0f15bbc |
| Stop-A | lca.find_convergence_prefix (deterministic offered-L-C-A convergence prefix + C-gate: never declare infeasible on a cold cache); spec.AdaptiveStopSpec (default off); worker._adaptive_replay_set truncates replay + writes a certificate | 51a9e4a |
| Stop-A boundary guard | Re-measure the full window when a truncated probe's pass-rate is within boundary_delta of the SLO target (fixes feasibility-knee false positives) | dfc823f |
| Stop-B authority | harness._stop_authority (mirrors the deterministic validator); study tune honors an LLM should_stop only if the validator agrees, else a bounded veto | a8f9034 |
| Length-aware SLO | linear_ms rule: threshold = intercept_ms + per_token_ms·L_in (e.g. 4s + L_in/8k) | ed2bbe0 |
| Robustness: timeout | HTTP stream now wraps OSError/TimeoutError as HttpClientError, so a request exceeding request_timeout_s is a failed request, not a crashed trial | 2fcaf80 |
| Robustness: SIGTERM | run_trial installs a SIGTERM handler so killing study tune tears down the engine (and EngineCore workers) instead of orphaning it / leaking GPU memory | b17b213 |
| Tooling | stop_a_calibration.py (CPU convergence curve), stop_a_validate.py (offline truncation-fidelity) | 08e53fd, 3af1d84 |
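
The length-aware SLO and the Stop-A boundary guard in the table can be sketched together as follows. The class and function names (`LinearSlo`, `pass_rate`, `needs_full_replay`) are hypothetical, and the SLO is applied to a generic per-request latency because this summary does not say which latency metric the rule targets. The value 0.125 ms/token is one reading of the "4s + L_in/8k" example and should be treated as an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LinearSlo:
    """linear_ms rule: threshold = intercept_ms + per_token_ms * L_in."""
    intercept_ms: float
    per_token_ms: float

    def threshold_ms(self, input_tokens: int) -> float:
        return self.intercept_ms + self.per_token_ms * input_tokens


def pass_rate(slo: LinearSlo, samples: List[Tuple[int, float]]) -> float:
    """Fraction of (input_tokens, latency_ms) samples meeting their own
    length-dependent threshold. Returns 0.0 for an empty sample list."""
    if not samples:
        return 0.0
    passed = sum(1 for tokens, latency in samples
                 if latency <= slo.threshold_ms(tokens))
    return passed / len(samples)


def needs_full_replay(truncated_pass_rate: float, target: float,
                      boundary_delta: float) -> bool:
    """Boundary guard: a truncated probe whose pass-rate lands within
    boundary_delta of the target is too close to call, so the full
    window should be re-measured before a verdict is recorded."""
    return abs(truncated_pass_rate - target) <= boundary_delta


# Illustrative values only (assumed, not taken from the harness config).
slo = LinearSlo(intercept_ms=4000.0, per_token_ms=0.125)
samples = [(8000, 4900.0), (8000, 5100.0), (0, 3999.0), (16000, 6000.0)]
rate = pass_rate(slo, samples)
```

With these values an 8000-token request gets a 5000 ms threshold, so the second sample fails while the other three pass.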
A subagent code review found no blocking bugs (and independently validated session
coherence against a real trace); three minor fixes applied in 43125f4.
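
The session-coherence property can be sketched as below. The sketch assumes each request is a dict with a `chat_id` and an optional `parent_chat_id`, and that sampling_u is derived from a hash of the session's root id; the `chat_id` field name, the hashing scheme, and the helper names are all assumptions, not the harness's actual implementation.

```python
import hashlib
from typing import Dict, List, Optional


def session_root(chat_id: str, parents: Dict[str, Optional[str]]) -> str:
    """Follow the parent_chat_id chain to the first turn of the session.
    The `seen` set guards against malformed traces containing a cycle."""
    seen = set()
    current = chat_id
    while parents.get(current) is not None and current not in seen:
        seen.add(current)
        current = parents[current]
    return current


def sampling_u(root_id: str) -> float:
    """Deterministic value in [0, 1) shared by every turn of a session."""
    digest = hashlib.sha256(root_id.encode("utf-8")).digest()
    # Keep 53 bits so the division below is exact and strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def select_requests(requests: List[dict], load_fraction: float) -> List[dict]:
    """Keep the requests whose session-level u falls below load_fraction,
    so a multi-turn session is kept or dropped as a whole."""
    parents = {r["chat_id"]: r.get("parent_chat_id") for r in requests}
    return [r for r in requests
            if sampling_u(session_root(r["chat_id"], parents)) < load_fraction]
```

Because u depends only on the session root, raising the load fraction only adds whole sessions to the kept set and never splits one, which is what preserves intra-session KV reuse across load levels.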
Validation results (real GPU runs, dash0 H20)
- Stop-A (stop-a-validation-20260615.md): CPU calibration reproduces the paper's C-slowest ordering. The GPU fidelity check on Qwen3-30B-A3B saves ~52% replay at τ_c=0.90 with 3/4 probe verdicts preserved; the one mismatch is a feasibility-knee false positive that the boundary guard fixes. With the guard, the best threshold matches full replay exactly while still saving ~38% replay.
- 27B TP sweep (qwen27b-tp-sweep-20260616.md): under the length-aware SLO, dense Qwen3.5-27B per-GPU throughput rises sharply with TP (TP1 0.065 → TP2 0.29 → TP4 ≥0.91), the opposite of the MoE 30B-A3B (TP1 best per-GPU). TP1 is TPOT-bound.
- Stop-B (stop-b-e2e-20260615.md, stop-b-e2e-27b-20260616.md): the 30B run shows the deterministic validator stop plus a premature-LLM-stop veto; the 27B run (real gpt-5.4 loop) shows a genuine improving climb TP1 0.123 → TP2 0.29 → TP4 1.00 req/s/GPU (8.1×), each step a correctly diagnosed gain, followed by correct rejection of a TP4 runtime tweak that measured worse (no regression). The SIGTERM fix was validated in practice (clean teardown, no leak).
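
The Stop-B authority rule (honor an LLM should_stop only if the validator agrees, else a bounded veto) can be sketched as a small state machine. This is one possible reading, not the code in harness._stop_authority. It assumes that a validator stop is sufficient on its own, and that once the veto budget is exhausted the LLM's stop is honored. The class name, the `max_vetoes` parameter, and the string verdicts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StopAuthority:
    """Arbitrates between the LLM's should_stop and a deterministic validator."""
    max_vetoes: int = 2   # assumed bound on how often an LLM stop is overruled
    vetoes_used: int = 0

    def decide(self, llm_should_stop: bool, validator_should_stop: bool) -> str:
        # Assumption: the deterministic validator can end the loop by itself.
        if validator_should_stop:
            return "stop"
        if not llm_should_stop:
            return "continue"
        # The LLM wants to stop but the validator still sees opportunity:
        # overrule it, but only a bounded number of times.
        if self.vetoes_used < self.max_vetoes:
            self.vetoes_used += 1
            return "veto"
        return "stop"
```

Bounding the veto keeps the loop from running indefinitely when the validator and the LLM disagree persistently.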
Open items (next round)
- Harness convergence ablation (NOT yet done on this branch). The paper's harness result (domain-knowledge knob-family rules steering the LLM and cutting iterations) has only qualitative evidence here (the 27B climb shows correct steering) plus older smoke-regime ablations (qwen27b-chat-0-8k-harness-fig18.md: iters-to-best 4→2). A controlled use_harness=true vs false (naive tuner) comparison on the 27B is the missing quantified result.
- Loop efficiency: at scale=1.0 infeasible high-θ probes burn the early_stop_max_elapsed_s=900 cap (a TP4 trial took ~3h). Lower it to ~300s for practical agentic loops.
- dash0 GPUs 0/1 still hold leaked memory (pre-fix orphans); needs a root nvidia-smi --gpu-reset.
- Deferred: more robust C feature for the low-reuse regime; Stop-C cross-day retune trigger (paper Q1); §7 baselines (SCOOT / naive / community).