xtrain

Files

Gahow Wang 2f827fd6d8 post-train: M3 — DPO pair-gen + training loop (verifiable arithmetic)

gen_dpo_pairs: chosen = gold answer; rejected = the SFT model's own greedy
completion (decoded with the M2a KV-cache engine), kept only when it is a
format-valid but WRONG boxed answer, i.e. a hard negative drawn from the
model's own distribution. ~8% of prompts are skipped because the greedy answer
is already correct. Writes one question<TAB>chosen<TAB>rejected line per pair;
the fields are bare text and the SFT chat framing is applied at train time.
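
The pair-selection rule above can be sketched in a few lines. This is an illustrative sketch only, not the repo's code: `boxed_answer` and `make_pair` are hypothetical names, "format-valid" is assumed to mean "contains a `\boxed{...}` answer", the chosen field is assumed to be the gold value wrapped in `\boxed{}`, and completions are assumed to be single-line with no embedded tabs.

```python
import re

# Assumption: a "format-valid" completion contains at least one \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def boxed_answer(text):
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    matches = BOXED.findall(text)
    return matches[-1].strip() if matches else None

def make_pair(question, gold, greedy_completion):
    """Return one TSV line (question, chosen, rejected), or None to skip."""
    predicted = boxed_answer(greedy_completion)
    if predicted is None:
        # No boxed answer: not format-valid, so not usable as a rejected sample.
        return None
    if predicted == gold.strip():
        # Greedy decode is already correct: there is no negative to learn from.
        return None
    chosen = f"\\boxed{{{gold.strip()}}}"   # assumed framing of the gold answer
    rejected = greedy_completion            # the model's own wrong answer
    return "\t".join([question, chosen, rejected])
```

For example, `make_pair("2+3=", "5", "\\boxed{6}")` yields a pair, while a completion of `\boxed{5}` or one with no boxed answer yields `None`.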

train_dpo: loads the SFT ckpt as both the policy and the frozen reference.
Because policy == ref at init, the reference logprobs are precomputed ONCE from
those weights and cached, so only one model stays resident. Each step forwards
the policy on chosen and rejected, takes seq_logprob of each, and minimises
dpo_loss; the two forwards share params, so backward accumulates gradients from
both branches. Tracks reward margin + preference accuracy (the doc-13 "don't
trust loss alone" health signal). Loss starts at exactly log 2 ≈ 0.693 (Δ=0 at
init), which is a built-in check.
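
The log 2 starting value follows from the standard DPO objective, as this small numeric sketch shows. It does not use the repo's autograd ops. The function names mirror the commit message but the signatures are hypothetical, `seq_logprob` is assumed to be a sum of per-token logprobs over the response, and `beta=0.1` is an assumed value.

```python
import math

def seq_logprob(token_logprobs):
    """Sequence log-probability as the sum of per-token logprobs (assumed)."""
    return sum(token_logprobs)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (loss, margin) for one preference pair.

    margin = beta * [(pi_c - ref_c) - (pi_r - ref_r)]
    loss   = -log(sigmoid(margin)) = log(1 + exp(-margin))
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Numerically stable softplus(-margin).
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin

# At init the policy equals the reference, so the margin is 0 and loss = log 2.
chosen_lp = seq_logprob([-0.5, -1.2, -0.3])
rejected_lp = seq_logprob([-0.4, -0.9])
loss0, margin0 = dpo_loss(chosen_lp, rejected_lp, chosen_lp, rejected_lp)

# After training moves the policy toward chosen, the margin turns positive
# and the pair counts as "preferred correctly" for preference accuracy.
loss1, margin1 = dpo_loss(chosen_lp + 1.0, rejected_lp - 1.0, chosen_lp, rejected_lp)
prefers_chosen = margin1 > 0
```

A first-step loss that differs noticeably from log 2 would suggest the cached reference logprobs and the policy logprobs are not being computed over the same tokens.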

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 12:37:01 +08:00

xtrain-autodiff

post-train: M3 — seq_logprob + dpo_loss autograd ops

2026-06-30 12:11:01 +08:00

xtrain-cuda

post-train: M2 — decode primitives (rope_at + decode_attention)

2026-06-30 12:00:03 +08:00

xtrain-distributed

sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval

2026-06-29 16:19:02 +08:00

xtrain-model

post-train: M2a — KV-cache incremental decode engine (token-identical)