Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd
(row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step
param/grad roundtrip), drop per-op sync + device memset. Includes the
verification table (regression suite green + tok/s 2770→8220 ~3x), the
deferred bf16/recompute follow-up rationale, and the T8 all-reduce note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>