Add the GRPO (M4) token-level loss op and the one primitive it needs:
- scale_rows(x[r,c], s[r]): per-row scale, multiplying row r of x by s[r]
(new ~5-line CUDA kernel). The clipped-PG backward scales each completion
token's row of (probs − onehot) by its own per-token coefficient, which
cross_entropy_backward's single scalar scale can't express.
- clipped_pg_loss(logits, target, logp_old, logp_ref, A, eps, beta): per-token
ratio ρ_t = exp(logπθ_t − logp_old_t), with loss
L = −mean min(ρ_t·A_t, clip(ρ_t, 1−ε, 1+ε)·A_t) + β·mean KL_t (k3 estimator),
masked to completion tokens. Backward reuses the CE machinery
(probs − onehot) plus scale_rows. Gates: grad-check both the active PG path and
the A=0 (KL-only) path; degenerate value checks: ε→∞ ⇒ vanilla PG, β=0 ⇒ no KL.
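For reference, the math above can be sketched in pure Python. This is not the CUDA implementation, only a minimal model of what the two ops are described as computing. Assumptions not stated in this message: the completion mask is passed as an explicit 0/1 `mask` argument, the means are taken over masked tokens, k3 is taken as exp(d) − d − 1 with d = logp_ref − logπθ, and `softmax` and `finite_diff` are hypothetical helpers added for illustration.

```python
import math

def scale_rows(x, s):
    """Return a copy of x with row r multiplied by s[r]."""
    return [[v * s[r] for v in row] for r, row in enumerate(x)]

def softmax(row):
    """Numerically stable softmax of one row of logits."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    z = sum(e)
    return [v / z for v in e]

def clipped_pg_loss(logits, target, logp_old, logp_ref, A, eps, beta, mask):
    """Return (loss, dloss/dlogits). `mask` is an assumed 0/1 completion mask."""
    n = sum(mask)  # number of completion tokens (assumed normaliser)
    probs = [softmax(row) for row in logits]
    loss, coef = 0.0, []
    for t, p in enumerate(probs):
        logp = math.log(p[target[t]])
        rho = math.exp(logp - logp_old[t])
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        surrogate = min(rho * A[t], clipped * A[t])
        # The unclipped branch carries gradient only when min() selects it.
        active = rho * A[t] <= clipped * A[t]
        d = logp_ref[t] - logp
        r = math.exp(d)            # pi_ref / pi_theta for the target token
        k3 = r - d - 1.0           # k3 KL estimator (assumed form)
        loss += mask[t] * (-surrogate + beta * k3) / n
        pg = rho * A[t] if active else 0.0
        # Per-token coefficient applied to this token's (probs - onehot) row.
        coef.append(mask[t] * (pg + beta * (r - 1.0)) / n)
    diff = [[pv - (1.0 if j == target[t] else 0.0) for j, pv in enumerate(p)]
            for t, p in enumerate(probs)]
    return loss, scale_rows(diff, coef)

def finite_diff(logits, t, j, h, *rest):
    """Central-difference d(loss)/d(logits[t][j]), for grad-checking."""
    up = [row[:] for row in logits]
    dn = [row[:] for row in logits]
    up[t][j] += h
    dn[t][j] -= h
    return (clipped_pg_loss(up, *rest)[0] - clipped_pg_loss(dn, *rest)[0]) / (2 * h)
```

In this sketch the per-token coefficient is mask_t·(ρ_t·A_t·[unclipped branch selected] + β·(r_t − 1))/N, which is why a per-row scale is needed where CE backward uses a single scalar. Comparing the returned gradient against finite_diff mirrors the grad-check gates, both with nonzero A and with A=0.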
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>