Two new ops for DPO (M3), both reusing existing kernels (no new CUDA):
- seq_logprob(logits, target): Σ log πθ(target) over non-ignored (target≥0)
positions, i.e. the per-sequence logprob DPO compares between policy and
reference. Equals −Σ of cross_entropy's per_row losses (ignored rows are
already 0, as in SFT masking); backward =
cross_entropy_backward(probs, target, −upstream) (sum, no mean division).
Gate: finite-diff grad-check with a -100 completion mask.
- dpo_loss(lpθ_chosen, lpθ_rejected, lpref_chosen, lpref_rejected, β): scalar
L = −log σ(Δ) = softplus(−Δ), where Δ = β·[(lpθ_chosen − lpθ_rejected) −
(lpref_chosen − lpref_rejected)], with the two policy logprobs as parents (ref
logprobs constant). Gate: grad-check both parents + degenerate points
(policy==ref ⇒ Δ=0, L=log 2, grads ∓β/2 for chosen/rejected; β=0 ⇒ grads 0).
Same formula as TRL.
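
Reference sketch of both ops and their gates, for reviewers. This is a
pure-Python illustration and not the kernel-backed implementation: the helper
names (log_softmax, seq_logprob_backward, max_fd_error) and the list-of-rows
layout are assumptions made for the example, and Δ is taken to be the standard
β-scaled policy-minus-reference margin.

```python
import math

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def seq_logprob(logits, target):
    # Sum of log pi(target) over rows with target >= 0; ignored rows add 0.
    return sum(log_softmax(r)[t] for r, t in zip(logits, target) if t >= 0)

def seq_logprob_backward(logits, target, upstream=1.0):
    # upstream * (onehot - probs) on kept rows, zeros on ignored rows.
    # This is the cross-entropy gradient (probs - onehot) with -upstream.
    grad = []
    for row, t in zip(logits, target):
        if t < 0:
            grad.append([0.0] * len(row))
            continue
        probs = [math.exp(v) for v in log_softmax(row)]
        grad.append([upstream * ((1.0 if j == t else 0.0) - p)
                     for j, p in enumerate(probs)])
    return grad

def softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def dpo_loss(lp_c, lp_r, ref_c, ref_r, beta):
    # Returns (loss, dL/d lp_c, dL/d lp_r); reference logprobs are constants.
    delta = beta * ((lp_c - lp_r) - (ref_c - ref_r))
    g_chosen = -beta * sigmoid(-delta)
    return softplus(-delta), g_chosen, -g_chosen

def max_fd_error(logits, target, eps=1e-6):
    # Largest gap between central finite differences and the analytic gradient.
    analytic = seq_logprob_backward(logits, target)
    worst = 0.0
    for i, row in enumerate(logits):
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            up = seq_logprob(logits, target)
            row[j] = old - eps
            down = seq_logprob(logits, target)
            row[j] = old
            worst = max(worst, abs((up - down) / (2 * eps) - analytic[i][j]))
    return worst

# First row plays the role of a masked prompt token (-100).
demo_logits = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7], [0.0, 2.0, -1.0]]
demo_target = [-100, 2, 0]
print("max finite-diff error:", max_fd_error(demo_logits, demo_target))
print("policy == ref:", dpo_loss(-3.0, -5.0, -3.0, -5.0, beta=0.1))
```

The masked row contributes nothing to either the sum or the gradient, which is
what the -100 grad-check is meant to catch; the last line exercises the
policy==ref degenerate point.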
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>