Files
xtrain/docs
Gahow Wang 5e8add2a41 docs: Phase T7 — performance
Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd
(row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step
param/grad roundtrip), drop per-op sync + device memset. Includes the
verification table (regression suite green + tok/s 2770→8220 ~3x), the
deferred bf16/recompute follow-up rationale, and the T8 all-reduce note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:00:29 +08:00
..