Files
xtrain/crates
Gahow Wang b0086b5214 autodiff: bf16 mixed-precision path (fp32 master via cast op)
Tensor ops dispatch on dtype: fp32 branch unchanged (bit-identical),
bf16 branch routes matmul/attention through GemmEx and elementwise
through the bf16 kernels. Norm/softmax/RoPE/cross-entropy upcast to
fp32 around the existing fp32 kernels (standard AMP: reductions/loss
fp32, matmuls bf16). Transposes route bf16 through fp32 (pure layout).

New autodiff `cast` op is the AMP bridge: forward downcasts a fp32
master leaf to bf16 for the matmul; backward upcasts the bf16 grad
back to fp32. So the fp32 leaf accumulates an fp32 grad and AdamW /
clip / DDP all-reduce stay fp32 and completely unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:14:48 +08:00
..
2026-06-16 00:44:25 +08:00