Files
xtrain/csrc
Gahow Wang 08c88bf360 gemm: tiled F32 forward + transpose + backward (dA/dB)
Hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32 accumulate,
boundary-masked) plus an out-of-place transpose kernel. Wire both through
xtrain-cuda FFI (no_cuda-gated) and expose at the tensor level:
Tensor::matmul, transpose_2d, and matmul_backward computing
dA = dC·Bᵀ and dB = Aᵀ·dC by materializing transposes and reusing the
forward. Also declare cuBLAS sgemm FFI + link cublas, used only as a
correctness reference in tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:26:51 +08:00
..