phase 3: GEMM kernels (naive, tiled, cuBLAS)

- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
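
Note on the "row-major trick": the message does not spell it out, so the following is a minimal CPU-only sketch of the usual form of the trick, assuming that is what the wrapper does. cuBLAS treats matrices as column-major, and a row-major buffer read as column-major is the transpose, so computing C^T = B^T * A^T with the operands swapped writes row-major C directly. The names `gemm_col_major` and `gemm_row_major` are hypothetical and are not taken from this commit; `gemm_col_major` only stands in for a column-major routine such as cublasGemmEx.

```rust
/// Column-major reference GEMM: C (m x n) = A (m x k) * B (k x n).
/// Element (i, j) of a column-major matrix with leading dimension ld
/// lives at index i + j * ld. Hypothetical stand-in for a cuBLAS call.
fn gemm_col_major(
    m: usize, n: usize, k: usize,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

/// Row-major C (m x n) = A (m x k) * B (k x n), using only the
/// column-major routine above and no explicit transposes.
fn gemm_row_major(m: usize, n: usize, k: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    // Row-major A (m x k) has the same memory layout as column-major A^T (k x m), ld = k.
    // Row-major B (k x n) has the same memory layout as column-major B^T (n x k), ld = n.
    // So C^T (n x m) = B^T * A^T: pass B first, A second, and swap m and n.
    // Column-major C^T with ld = n is exactly row-major C.
    gemm_col_major(n, m, k, b, n, a, k, &mut c, n);
    c
}
```

Calling `gemm_row_major(2, 2, 3, &a, &b)` with row-major `a = [1, 2, 3, 4, 5, 6]` and `b = [7, 8, 9, 10, 11, 12]` yields row-major `[58, 64, 139, 154]`. In a real cuBLAS call the same swap would apply to the operand pointers, dimensions, and leading dimensions.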
2026-05-21 19:48:05 +08:00
parent a83971fa25
commit d77f921a12
9 changed files with 519 additions and 1 deletions
--- a/crates/xserv-cuda/src/error.rs
+++ b/crates/xserv-cuda/src/error.rs
@@ -23,7 +23,7 @@ impl std::error::Error for CudaError {}

 pub type Result<T> = std::result::Result<T, CudaError>;

-pub(crate) fn check(code: i32) -> Result<()> {
+pub fn check(code: i32) -> Result<()> {
    if code == ffi::CUDA_SUCCESS {
        return Ok(());
    }