phase 3: GEMM kernels (naive, tiled, cuBLAS)

- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates
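
Note on the row-major trick: a row-major M x N buffer holds the same bytes as its column-major N x M transpose, so row-major C = A * B can be computed by asking a column-major GEMM for C^T = B^T * A^T, with the operands swapped and no explicit transposes. The following is a minimal CPU sketch of that identity, not the wrapper code from this commit. The names gemm_col_major and gemm_row_major are hypothetical, and the plain-loop column-major routine only stands in for a column-major library call such as cublasGemmEx.

```rust
/// Reference column-major GEMM (hypothetical stand-in for a column-major
/// library routine): C (m x n) = A (m x k) * B (k x n), with leading
/// dimensions lda, ldb, ldc.
fn gemm_col_major(
    m: usize, n: usize, k: usize,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Column-major indexing: element (row, col) lives at row + col * ld.
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

/// Row-major C (m x n) = A (m x k) * B (k x n), computed by the
/// column-major routine above with swapped operands (hypothetical helper).
fn gemm_row_major(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    // Row-major A (m x k) reads as column-major A^T (k x m), ld = k.
    // Row-major B (k x n) reads as column-major B^T (n x k), ld = n.
    // Row-major C (m x n) reads as column-major C^T (n x m), ld = n.
    // So C^T = B^T * A^T: pass B first, A second, and swap m with n.
    gemm_col_major(n, m, k, b, n, a, k, c, n);
}
```

With A = [[1, 2, 3], [4, 5, 6]] and B = [[7, 8], [9, 10], [11, 12]], both stored row-major, gemm_row_major(2, 2, 3, ...) fills C with [58, 64, 139, 154], matching the row-major product.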

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 19:48:05 +08:00
parent a83971fa25
commit d77f921a12
9 changed files with 519 additions and 1 deletions

@@ -23,7 +23,7 @@ impl std::error::Error for CudaError {}
 pub type Result<T> = std::result::Result<T, CudaError>;
-pub(crate) fn check(code: i32) -> Result<()> {
+pub fn check(code: i32) -> Result<()> {
     if code == ffi::CUDA_SUCCESS {
         return Ok(());
     }