Document Stop-A validation: calibration + GPU fidelity check
CPU calibration (chat vs coder) reproduces the paper's C-slowest ordering and
shows C-convergence difficulty is driven by signal noise (low-reuse chat), not
reuse magnitude.

GPU fidelity check on Qwen3-30B-A3B: truncating at the L-C-A convergence prefix
saves ~52% replay (tau_c=0.90) with 3/4 probe verdicts preserved; the one
mismatch is a boundary false-positive at the feasibility knee (prefix 0.96 vs
full 0.946), caused by second-half engine-state drift the offered L-C-A cannot
see. Argues for revisiting the SLO-boundary guard before enabling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs/harness-ablation/stop-a-validation-20260615.md

# Stop-A validation (Phase 3), 2026-06-15

Branch `feat/two-stop`. Stop-A truncates each probe's replay once the offered
L-C-A of the replayed prefix converges to that of the full set (pure L-C-A
criterion plus C-gate). This note records the CPU calibration and the GPU
fidelity check.

## 1. Calibration (CPU, no serving)

`scripts/stop_a_calibration.py` on the dash0 0321 10:00–10:10 windows:

| dim | chat (19239 req, hit≈7%) | coder (2451 req, structured reuse) |
| --- | --- | --- |
| A | ≥0.95 by frac 0.10 | fast |
| L | ≥0.96 from frac 0.05 | 0.75 at frac 0.05 (heavy tail), ≥0.94 by frac 0.20 |
| **C (slowest)** | noisy, with dips (0.885 at frac 0.50, 0.835 at frac 0.55); stable ≥0.92 only from frac ~0.85 | smooth, stable ≥0.92 by frac ~0.70 |

Stop fraction (τ_L=τ_A=0.90, W=3):

| τ_c | chat | coder |
| --- | --- | --- |
| 0.85 | 0.45 (273 s) | 0.45 (255 s) |
| 0.90 | 0.70 (423 s) | 0.55 (318 s) |
| 0.92 | 0.85 (513 s) | 0.70 (411 s) |

Findings:

- **C is the slowest dimension in both workloads**, which reproduces paper §5.2 / Fig 9.
- **What makes C hard to call converged is signal *noise*, not reuse magnitude.**
  Low-reuse chat has a sparse, spiky ideal-hit-length series, so its C similarity
  oscillates and is *harder* to stabilize than that of the structured, higher-reuse
  coder. Consequence: a strict τ_c (0.92) gives chat only a ~15% saving (stop
  fraction 0.85). A more robust C feature for the low-reuse regime is future work.

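The stop fractions above depend on how the thresholds and W interact. The sketch below shows one plausible reading, in which W is a number of consecutive checkpoints that must all sit at or above threshold. That reading, the function name `first_stable_index`, and the similarity series are assumptions for illustration; they are not taken from `scripts/stop_a_calibration.py` and the series is synthetic, not measured.

```python
from typing import Dict, Optional, Sequence


def first_stable_index(
    sims: Dict[str, Sequence[float]],
    taus: Dict[str, float],
    window: int,
) -> Optional[int]:
    """Return the first checkpoint index i at which every dimension has been
    at or above its threshold for `window` consecutive checkpoints ending at i.
    Returns None if no such checkpoint exists."""
    n = min(len(series) for series in sims.values())
    for i in range(window - 1, n):
        lo = i - window + 1
        if all(min(sims[d][lo : i + 1]) >= taus[d] for d in sims):
            return i
    return None


# Synthetic similarity series, one value per checkpoint (not measured data).
fractions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
sims = {
    "L": [0.96] * 7,
    "A": [0.95] * 7,
    "C": [0.80, 0.91, 0.93, 0.88, 0.91, 0.92, 0.95],  # one dip at index 3
}

for tau_c in (0.85, 0.90):
    idx = first_stable_index(sims, {"L": 0.90, "A": 0.90, "C": tau_c}, window=3)
    print(tau_c, None if idx is None else fractions[idx])
```

Under this reading, a single dip in C that falls between two thresholds is enough to push the stricter threshold's stop point well past the dip, which is the same qualitative effect the noisy chat series shows in the table.
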
## 2. GPU fidelity check (Qwen3-30B-A3B, vLLM 0.11.1, H20)

We take one full-window run (`adaptive_stop` disabled, `replay_time_scale=1.0`,
window `chat_w20260311_1000`, 0–8k, out=128). `scripts/stop_a_validate.py` then
recomputes each probe's convergence prefix and compares the truncated verdict to
the full verdict, so a single GPU run validates truncation fidelity without a
second run.

Trial result: best feasible `sampling_u=0.078125`, request_rate **2.30 req/s**,
pass_rate 0.973.

Per-probe verdict (τ=0.9):

| τ_c | verdict matches | mean replay saved |
| --- | --- | --- |
| 0.85 | 3/4 | 54% |
| 0.90 | 3/4 | 52% |
| 0.92 | 3/4 | 38% |

The mismatch is the same probe at every τ_c, the feasibility knee `0.08594`:

```
thresh   full_pass  prefix_pass  full_feas  prefix_feas
0.08594  0.946      0.956–0.961  False      True         <- mismatch
0.07812  0.973      0.987–0.990  True       True
0.06250  0.986      1.000        True       True
0.09375  0.268      0.49–0.54    False      False
```

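The 3/4 figure can be re-derived from the table alone. The sketch below is not `scripts/stop_a_validate.py`; it assumes a probe is feasible when its pass rate is at least the 0.95 SLO target (the `>=` comparison is an assumption) and that each prefix range spans the values seen across the three τ_c settings. The names `feasible` and `verdict_preserved` are hypothetical.

```python
SLO_TARGET = 0.95  # pass-rate target; using ">=" as the feasibility test is an assumption

# (threshold, full-window pass rate, (min, max) prefix pass rate from the table)
PROBES = [
    (0.08594, 0.946, (0.956, 0.961)),
    (0.07812, 0.973, (0.987, 0.990)),
    (0.06250, 0.986, (1.000, 1.000)),
    (0.09375, 0.268, (0.49, 0.54)),
]


def feasible(pass_rate: float, target: float = SLO_TARGET) -> bool:
    """A probe is feasible when its pass rate meets the SLO target."""
    return pass_rate >= target


def verdict_preserved(full_pass: float, prefix_range: tuple) -> bool:
    """True when both ends of the prefix range give the full-window verdict."""
    full_ok = feasible(full_pass)
    return all(feasible(p) == full_ok for p in prefix_range)


preserved = {thresh: verdict_preserved(full, prefix) for thresh, full, prefix in PROBES}
print(preserved)
print(sum(preserved.values()), "of", len(preserved), "verdicts preserved")
```

Only the `0.08594` probe straddles the target, so it is the only one whose truncated verdict differs from the full-window verdict.
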
## 3. Interpretation

- **Stop-A works and saves ~50% of replay** (vs the full 600 s window) while
  preserving 3/4 probe verdicts. (The paper's ~70% is measured against a 30-min
  fixed baseline; our baseline is the 600 s window, so the percentages are not
  directly comparable.)
- **The one failure is a boundary false-positive at the feasibility knee.** At
  `0.08594` the full-window pass rate is 0.946 (just below the 0.95 SLO) but the
  prefix pass rate is 0.956–0.961 (just above). The *second half* of the window
  degraded because of engine-state drift (KV fill, fragmentation, later-arriving
  harder requests) that the *offered* L-C-A cannot see. The C-gate did not help
  because offered-C had converged; the divergence is in the measured pass-rate,
  not in C.
- If Stop-A were enabled, the binary search would accept `0.08594` instead of
  `0.078125`, overestimating the peak sustainable rate by one binary step
  (0.08594 / 0.078125 ≈ 1.10, i.e. ~10%).

**This is the boundary jitter we accepted when choosing the pure-L-C-A criterion.**
The data now argue for revisiting the previously declined **SLO-boundary guard**:
keep replaying while the measured pass-rate is within ±δ of the target, even after
L-C-A converges. The guard targets exactly this knee case at low extra cost, since
it only extends replay on probes sitting on the feasibility boundary. We recommend
adding it as a small Stop-A enhancement before enabling Stop-A in production studies.

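A minimal sketch of what such a guard could look like follows. The function name `should_stop`, its signature, and the value δ = 0.02 are hypothetical choices for illustration; the note does not fix δ, and nothing here reflects existing harness code.

```python
def should_stop(
    lca_converged: bool,
    measured_pass_rate: float,
    target: float,
    delta: float,
) -> bool:
    """Stop replay only when L-C-A has converged AND the measured pass rate
    is clearly away from the SLO target (more than `delta` on either side)."""
    near_boundary = abs(measured_pass_rate - target) <= delta
    return lca_converged and not near_boundary


TARGET = 0.95
DELTA = 0.02  # illustrative value only; the note leaves delta unspecified

# Prefix pass rates taken from the per-probe table (lower end of each range).
for thresh, prefix_pass in [(0.08594, 0.956), (0.07812, 0.987),
                            (0.06250, 1.000), (0.09375, 0.49)]:
    print(thresh, should_stop(True, prefix_pass, TARGET, DELTA))
```

With this illustrative δ, only the knee probe at `0.08594` keeps replaying past L-C-A convergence; the other three probes stop as before, so the extra replay cost is confined to the boundary case.
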
## Repro

```
# calibration
PYTHONPATH=src python3 scripts/stop_a_calibration.py \
  --jsonl <DIR>/qwen_chat_blksz_64_032109-032111.jsonl --block-size 64 \
  --window-start 3600 --window-end 4200 --gpu-count 8 --label chat
# GPU run + fidelity
PYTHONPATH=src python3 -m aituner.cli study tune \
  --spec configs/examples/dash0_qwen30b_a3b_stopA_fulldata.json \
  --store-root .aituner/stopA-fulldata --max-trials 1
PYTHONPATH=src python3 scripts/stop_a_validate.py \
  --spec configs/examples/dash0_qwen30b_a3b_stopA_fulldata.json \
  --store-root .aituner/stopA-fulldata --tau 0.9 --tau-c 0.90
```