Files

Gahow Wang 07f5d92e1d Add consolidated two-stop summary doc

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-16 19:16:28 +08:00

4.6 KiB

Raw Blame History

Two-stop work — summary (feat/two-stop)

Aligns the tuning harness with paper.pdf by implementing and validating the two distinct stopping mechanisms the paper conflates, plus a length-aware SLO and two harness-robustness fixes. 117 unit tests pass.

The two stops (they are different things)

Stop-A — evaluation sufficiency (per replay / probe): how much trace to replay before a measurement is trustworthy. Criterion: the replayed prefix's offered L-C-A converges to the full set's. Saves per-evaluation GPU time.
Stop-B — tuning convergence (across iterations): whether any config-improvement opportunity remains; stop if not. Saves iterations.

What was built

Area	Summary	Commit
Unify L-C-A	The prompt's `workload_lca_profile` is now the canonical 10-dim `lca.py` vector, not an ad-hoc re-derivation	`6f8e3c9`
Session-coherent load axis	`sampling_u` assigned per session (parent_chat_id chain) so thresholding keeps/drops whole multi-turn sessions, preserving intra-session KV reuse (C) across load levels	`0f15bbc`
Stop-A	`lca.find_convergence_prefix` (deterministic offered-L-C-A convergence prefix + C-gate: never declare infeasible on a cold cache); `spec.AdaptiveStopSpec` (default off); `worker._adaptive_replay_set` truncates replay + writes a certificate	`51a9e4a`
Stop-A boundary guard	Re-measure the full window when a truncated probe's pass-rate is within `boundary_delta` of the SLO target (fixes feasibility-knee false positives)	`dfc823f`
Stop-B authority	`harness._stop_authority` (mirrors the deterministic validator); `study tune` honors an LLM `should_stop` only if the validator agrees, else a bounded veto	`a8f9034`
Length-aware SLO	`linear_ms` rule: `threshold = intercept_ms + per_token_ms·L_in` (e.g. 4s + L_in/8k)	`ed2bbe0`
Robustness: timeout	HTTP stream now wraps `OSError`/`TimeoutError` as `HttpClientError` — a request exceeding `request_timeout_s` is a failed request, not a crashed trial	`2fcaf80`
Robustness: SIGTERM	`run_trial` installs a SIGTERM handler so killing `study tune` tears down the engine (and EngineCore workers) instead of orphaning it / leaking GPU memory	`b17b213`
Tooling	`stop_a_calibration.py` (CPU convergence curve), `stop_a_validate.py` (offline truncation-fidelity)	`08e53fd`, `3af1d84`

A subagent code review found no blocking bugs (and independently validated session coherence against a real trace); three minor fixes applied in 43125f4.

Validation results (real GPU runs, dash0 H20)

Stop-A (stop-a-validation-20260615.md): CPU calibration reproduces the paper's C-slowest ordering; GPU fidelity check on Qwen3-30B-A3B saves ~52% replay at τ_c=0.90 with 3/4 probe verdicts preserved; the one mismatch is a feasibility-knee false positive that the boundary guard fixes — with the guard, best threshold matches full replay exactly while still saving ~38% replay.
27B TP sweep (qwen27b-tp-sweep-20260616.md): under the length-aware SLO, dense Qwen3.5-27B per-GPU rises sharply with TP — TP1 0.065 → TP2 0.29 → TP4 ≥0.91 — the opposite of the MoE 30B-A3B (TP1 best per-GPU). TP1 is TPOT-bound.
Stop-B (stop-b-e2e-20260615.md, stop-b-e2e-27b-20260616.md): the 30B run shows the deterministic validator stop + a premature-LLM-stop veto; the 27B run (real gpt-5.4 loop) shows a genuine improving climb TP1 0.123 → TP2 0.29 → TP4 1.00 req/s/GPU (8.1×), each a correctly-diagnosed gain, then correctly rejecting a TP4 runtime tweak that measured worse (no regression). The SIGTERM fix was validated in practice (clean teardown, no leak).

Open items (next round)

Harness convergence ablation (NOT yet done on this branch). The paper's harness result — domain-knowledge knob-family rules steering the LLM and cutting iterations — has only qualitative evidence here (the 27B climb shows correct steering) plus older smoke-regime ablations (qwen27b-chat-0-8k-harness-fig18.md: iters-to-best 4→2). A controlled use_harness=true vs false (naive tuner) comparison on the 27B is the missing quantified result.
Loop efficiency: at scale=1.0 infeasible high-θ probes burn the early_stop_max_elapsed_s=900 cap (a TP4 trial took ~3h). Lower it to ~300s for practical agentic loops.
dash0 GPUs 0/1 still hold leaked memory (pre-fix orphans) — needs a root nvidia-smi --gpu-reset.
Deferred: more robust C feature for the low-reuse regime; Stop-C cross-day retune trigger (paper Q1); §7 baselines (SCOOT / naive / community).

4.6 KiB Raw Blame History Unescape Escape