xtrain

Gahow Wang 08c88bf360 gemm: tiled F32 forward + transpose + backward (dA/dB)

Add a hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32
accumulate, boundary-masked) and an out-of-place transpose kernel. Wire
both through the xtrain-cuda FFI (no_cuda-gated) and expose them at the
tensor level as Tensor::matmul, transpose_2d, and matmul_backward. The
backward pass computes dA = dC·Bᵀ and dB = Aᵀ·dC by materializing the
transposes and reusing the forward kernel. Also declare the cuBLAS sgemm
FFI and link cublas; cuBLAS is used only as a correctness reference in
tests.
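
For readers who want to check the shapes in the backward pass, here is a
minimal CPU-side sketch in Rust of the same idea: a tiled forward matmul, an
out-of-place transpose, and a backward pass that reuses the forward. It is
not the code in csrc/ops/gemm.cu or the xtrain tensor API. The free-function
signatures, the row-major layout, and the use of `min` to clamp edge tiles
(standing in for the kernel's boundary masking) are all assumptions made for
illustration; only TILE = 32 and the two gradient formulas come from the
commit message.

```rust
// Illustrative CPU reference only; function names and signatures are
// hypothetical and do not mirror the xtrain API.
const TILE: usize = 32; // same tile edge as TILE_SIZE=32 in the commit message

/// C (m x n) = A (m x k) * B (k x n), row-major, processed tile by tile.
fn matmul_tiled(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i0 in (0..m).step_by(TILE) {
        for j0 in (0..n).step_by(TILE) {
            for p0 in (0..k).step_by(TILE) {
                // Clamp partial tiles at the matrix edges (the CPU analogue
                // of masking out-of-range threads in a GPU kernel).
                let i1 = (i0 + TILE).min(m);
                let j1 = (j0 + TILE).min(n);
                let p1 = (p0 + TILE).min(k);
                for i in i0..i1 {
                    for p in p0..p1 {
                        let a_ip = a[i * k + p];
                        for j in j0..j1 {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}

/// Out-of-place transpose of a row-major (rows x cols) matrix.
fn transpose_2d(x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(x.len(), rows * cols);
    let mut t = vec![0.0f32; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            t[c * rows + r] = x[r * cols + c];
        }
    }
    t
}

/// Given dC (m x n), return (dA, dB) with dA = dC * B^T and dB = A^T * dC.
fn matmul_backward(
    a: &[f32],
    b: &[f32],
    dc: &[f32],
    m: usize,
    k: usize,
    n: usize,
) -> (Vec<f32>, Vec<f32>) {
    let bt = transpose_2d(b, k, n); // B^T is n x k
    let at = transpose_2d(a, m, k); // A^T is k x m
    let da = matmul_tiled(dc, &bt, m, n, k); // (m x n)(n x k) -> m x k
    let db = matmul_tiled(&at, dc, k, m, n); // (k x m)(m x n) -> k x n
    (da, db)
}
```

Materializing the transposes costs two extra buffers per backward call, but
it means a single forward routine covers all three products, which matches
the trade-off described in the commit message.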

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 15:26:51 +08:00

ops

gemm: tiled F32 forward + transpose + backward (dA/dB)

2026-06-15 15:26:51 +08:00

test

T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)

2026-06-15 14:42:43 +08:00