Files
xtrain/crates
Gahow Wang 2f827fd6d8 post-train: M3 — DPO pair-gen + training loop (verifiable arithmetic)
gen_dpo_pairs: chosen = gold answer, rejected = the SFT model's own greedy
(KV-cache engine, M2a) completion when it's a format-valid WRONG boxed answer —
a hard negative from the model's distribution. ~8% of prompts skipped (greedy
correct). Writes question<TAB>chosen<TAB>rejected (bare, SFT-framed at train).

train_dpo: loads the SFT ckpt as policy AND frozen reference; precomputes the
reference logprobs ONCE (policy==ref) and caches them (one resident model). Each
step forwards the policy on chosen+rejected, seq_logprob each, minimises
dpo_loss; the two forwards share params so backward accumulates both branches.
Tracks reward margin + preference accuracy (the doc-13 "don't trust loss alone"
health signal). Loss starts at exactly log2 (Δ=0 at init) — a built-in check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:37:01 +08:00
..