xtrain

Files

Gahow Wang f3c764ce95 post-train: M3 — seq_logprob + dpo_loss autograd ops

Two new ops for DPO (M3), both reusing existing kernels (no new CUDA):

- seq_logprob(logits, target): Σ log πθ(target) over non-ignored (target≥0)
  positions, i.e. the per-sequence logprob that DPO compares between policy and
  reference. Forward = −Σ of cross_entropy's per-row losses (ignored rows are
  already 0, as in SFT masking); backward = cross_entropy_backward(probs,
  target, −upstream) (sum, no mean division). Gate: finite-diff grad-check with
  a -100 completion mask.

- dpo_loss(lpθ_chosen, lpθ_rejected, lpref_chosen, lpref_rejected, β): scalar
  L = −log σ(Δ) = softplus(−Δ), where
  Δ = β·[(lpθ_chosen − lpref_chosen) − (lpθ_rejected − lpref_rejected)], with
  the two policy logprobs as parents (ref logprobs constant). Gate: grad-check
  both parents + degenerate points (policy==ref ⇒ Δ=0, L=log2, grads ∓β/2;
  β=0 ⇒ grads 0). Same formula as TRL.
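A minimal pure-Python sketch of the math behind both ops, for reference only. It is not the repo's implementation: the function names, the list-of-lists logits layout, and the helper functions are assumptions made for illustration, and Δ is taken to be the standard DPO margin β·[(lpθ_chosen − lpref_chosen) − (lpθ_rejected − lpref_rejected)].

```python
import math


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def seq_logprob(logits, target):
    # Sum of log pi(target) over positions with target >= 0.
    # Rows with a negative target (e.g. -100) are ignored.
    return sum(log_softmax(row)[t] for row, t in zip(logits, target) if t >= 0)


def seq_logprob_grad(logits, target, upstream=1.0):
    # Analytic gradient w.r.t. logits: upstream * (onehot - probs) on kept
    # rows and 0 on ignored rows. This equals the cross-entropy gradient
    # (probs - onehot) scaled by -upstream, with no mean division.
    grad = []
    for row, t in zip(logits, target):
        if t < 0:
            grad.append([0.0] * len(row))
            continue
        probs = [math.exp(lp) for lp in log_softmax(row)]
        grad.append([upstream * ((1.0 if j == t else 0.0) - p)
                     for j, p in enumerate(probs)])
    return grad


def softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    # Stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta):
    # Returns (loss, dL/d lp_chosen, dL/d lp_rejected).
    # The reference logprobs are treated as constants.
    delta = beta * ((lp_chosen - ref_chosen) - (lp_rejected - ref_rejected))
    loss = softplus(-delta)        # = -log(sigmoid(delta))
    dl_ddelta = -sigmoid(-delta)   # derivative of softplus(-delta)
    return loss, beta * dl_ddelta, -beta * dl_ddelta


if __name__ == "__main__":
    logits = [[0.1, -0.3, 0.7], [1.0, 0.2, -0.5], [0.0, 0.4, 0.9]]
    target = [-100, 2, 0]  # first position masked out
    print(seq_logprob(logits, target))
    print(dpo_loss(-3.0, -5.0, -3.0, -5.0, beta=0.1))
```

At policy==ref the second print shows L = log 2 with gradients −β/2 and +β/2, matching the degenerate point listed above.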

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 12:11:01 +08:00

xtrain-autodiff

post-train: M3 — seq_logprob + dpo_loss autograd ops

2026-06-30 12:11:01 +08:00

xtrain-cuda

post-train: M2 — decode primitives (rope_at + decode_attention)

2026-06-30 12:00:03 +08:00

xtrain-distributed

sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval

2026-06-29 16:19:02 +08:00

xtrain-model

post-train: M2a — KV-cache incremental decode engine (token-identical)