Implementation log (docs/18) + Phase-3 row (evolution.md): the two ops + gates,
pair-gen (gold chosen / sampled-wrong rejected), reference-logprob caching, the
training loop, and the honest finding — reward margin + pref-acc rise but
held-out arithmetic correctness stays ~5-8% (flat within std-error) and
over-optimizes to collapse (margin +34 → 0% format). DPO reweights, it does not
install the capability; motivates M4 GRPO (optimize the verifiable reward online).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>