Files
xtrain/docs
Gahow Wang 99090465bf docs: M3 — DPO results (infra correct, held-out correctness flat, over-optimization collapse)
Implementation log (docs/18) + Phase-3 row (evolution.md): the two ops + gates,
pair-gen (gold chosen / sampled-wrong rejected), reference-logprob caching, the
training loop, and the honest finding — reward margin + pref-acc rise but
held-out arithmetic correctness stays ~5-8% (flat within std-error) and
over-optimizes to collapse (margin +34 → 0% format). DPO reweights, it does not
install the capability; motivates M4 GRPO (optimize the verifiable reward online).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:38:06 +08:00
..