xserv

Files

Gahow Wang d77f921a12 phase 3: GEMM kernels (naive, tiled, cuBLAS)

- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small, medium, and rectangular sizes; all backends; F32 and BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates
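
The three backends and the "row-major trick" can be illustrated with a CPU-only sketch. This is not the repository's code: the function names (`gemm_naive`, `gemm_tiled`, `gemm_col_major`, `gemm_row_major_via_col_major`) are hypothetical, plain Rust loops stand in for the CUDA kernels, and `gemm_col_major` stands in for a column-major routine such as cuBLAS. The trick relies on the fact that a row-major matrix has the same memory layout as its column-major transpose, so computing C^T = B^T * A^T in column-major form produces row-major C without any copies.

```rust
/// Naive row-major GEMM: C (m x n) = A (m x k) * B (k x n).
/// Each output element (i, j) is computed independently, which is the
/// CPU analogue of "one thread per output element".
fn gemm_naive(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Tile edge length; 32 matches the tile size named in the commit message.
const TILE: usize = 32;

/// Tiled row-major GEMM: the i, j, and k ranges are walked in TILE-sized
/// blocks, and each block of C accumulates one partial product per k-block.
/// Edge tiles are clipped with `min`, so sizes need not be multiples of TILE.
fn gemm_tiled(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i0 in (0..m).step_by(TILE) {
        for j0 in (0..n).step_by(TILE) {
            for p0 in (0..k).step_by(TILE) {
                for i in i0..(i0 + TILE).min(m) {
                    for j in j0..(j0 + TILE).min(n) {
                        let mut acc = c[i * n + j];
                        for p in p0..(p0 + TILE).min(k) {
                            acc += a[i * k + p] * b[p * n + j];
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
    c
}

/// Column-major GEMM (hypothetical stand-in for a cuBLAS-style routine):
/// C (m x n) = A (m x k) * B (k x n), with every matrix stored column-major.
fn gemm_col_major(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[p * m + i] * b[j * k + p];
            }
            c[j * m + i] = acc;
        }
    }
    c
}

/// Row-major trick: pass the operands in swapped order with m and n swapped.
/// The column-major routine then computes C^T = B^T * A^T, and the bytes it
/// writes are exactly row-major C.
fn gemm_row_major_via_col_major(
    a: &[f32],
    b: &[f32],
    m: usize,
    n: usize,
    k: usize,
) -> Vec<f32> {
    gemm_col_major(b, a, n, m, k)
}
```

Comparing the outputs of the three functions on the same inputs gives a small-scale version of the cross-backend consistency check described above. On a GPU the backends generally accumulate in different orders, so a real comparison would likely need a tolerance rather than exact equality, especially for BF16.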

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 19:48:05 +08:00

naive.cu    phase 3: GEMM kernels (naive, tiled, cuBLAS)    2026-05-21 19:48:05 +08:00
tiled.cu    phase 3: GEMM kernels (naive, tiled, cuBLAS)    2026-05-21 19:48:05 +08:00