# Phase 0+1: CUDA FFI Infrastructure Design Document

## Goal

Build `xserv-cuda`, a Rust crate that wraps the CUDA Runtime API with safe abstractions:

- Device query and selection
- GPU memory allocation with RAII (GpuBuffer)
- Caching allocator (avoid repeated cudaMalloc/cudaFree)
- CUDA streams for async operations
- Host↔Device memory transfers
- Error handling wrapping all CUDA calls

## Module Layout

```
crates/xserv-cuda/
├── Cargo.toml
├── build.rs          # compiles csrc/*.cu via cc crate
└── src/
    ├── lib.rs        # re-exports
    ├── ffi.rs        # raw extern "C" bindings to CUDA runtime
    ├── error.rs      # CudaError type
    ├── device.rs     # device query, DeviceInfo
    ├── stream.rs     # CudaStream wrapper
    ├── memory.rs     # GpuBuffer, H2D/D2H/D2D copy
    └── allocator.rs  # CachingAllocator
```

## Key Design Decisions

### FFI Bindings (ffi.rs)

Hand-written extern "C" bindings (~25 functions). No bindgen; this keeps things explicit and readable.

Core functions needed:

- Memory: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaFreeHost
- Stream: cudaStreamCreate, cudaStreamDestroy, cudaStreamSynchronize
- Device: cudaGetDeviceCount, cudaSetDevice, cudaGetDevice, cudaGetDeviceProperties
- Sync: cudaDeviceSynchronize
- Error: cudaGetLastError, cudaGetErrorString

### Error Handling (error.rs)

Every CUDA call returns cudaError_t. We wrap all calls:

```rust
pub(crate) fn check(code: i32) -> Result<(), CudaError>
```

### GpuBuffer (memory.rs)

RAII wrapper around a GPU pointer. Drop frees memory.

```rust
pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize,  // in bytes
    device: u32,
}
```

- No Clone (explicit copy_from instead)
- Send + !Sync (can move across threads, but not shared)

### CachingAllocator (allocator.rs)

Avoids cudaMalloc/cudaFree per allocation. Maintains a free-list keyed by size bucket.

Bucket boundaries: round up to next power of 2, minimum 512 bytes.
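
The bucket rule can be sketched in a few lines. This is only an illustration under stated assumptions: `bucket_size` and `MIN_BUCKET_BYTES` are hypothetical names chosen for this example and are not taken from the crate.

```rust
/// Smallest bucket handed out, in bytes (512, per the rule above).
/// `MIN_BUCKET_BYTES` is a hypothetical name used only in this sketch.
const MIN_BUCKET_BYTES: usize = 512;

/// Round a requested size up to its bucket: the next power of two,
/// but never below `MIN_BUCKET_BYTES`.
/// `bucket_size` is a hypothetical helper, not the crate's actual API.
fn bucket_size(requested: usize) -> usize {
    requested.max(MIN_BUCKET_BYTES).next_power_of_two()
}

fn main() {
    // Print the bucket chosen for a few representative request sizes.
    for requested in [1, 512, 513, 1 << 20, (1 << 20) + 1] {
        println!("{requested:>8} B -> {:>8} B bucket", bucket_size(requested));
    }
}
```

Because every request in the same bucket maps to one key, the free list never needs a search for a close-enough block; the price is up to roughly 2x over-allocation just above a power of two.
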
- alloc(size) → find bucket, pop from free list or cudaMalloc
- dealloc(ptr, size) → push to free list (don't cudaFree)
- trim() → actually cudaFree everything in free lists

### CudaStream (stream.rs)

Wraps cudaStream_t. RAII with Drop calling cudaStreamDestroy.

## Build Pipeline

- `csrc/test/vecadd.cu`: minimal vector-add kernel for smoke test
- `build.rs` uses `cc` crate to compile .cu files, link CUDA runtime

## Test Plan

- [x] Device info: print GPU name, memory, compute capability, SM count
- [x] GpuBuffer: alloc → H2D copy → D2H copy → verify data (256B, 64MB)
- [x] GpuBuffer: D2D copy verification
- [x] GpuBuffer: zero fill verification
- [x] Vector add kernel: launch from Rust, verify output
- [x] CachingAllocator: alloc→free→realloc same size uses cache (no new cudaMalloc)
- [x] CachingAllocator: different size buckets are cached independently
- [x] CudaStream: create, synchronize, Drop
- [x] PinnedBuffer: page-locked host memory
- [x] Async copy: H2D async + D2H async via stream

## Takeaways

1. **The `cudaDeviceProp` struct layout is unreliable**: field offsets in `cudaDeviceProp` change between CUDA versions. We initially read `total_global_mem` through a struct mapping and got a garbage value (12 TB). The correct approach: use `cudaMemGetInfo` for memory information and `cudaDeviceGetAttribute` for other attributes. Read only the `name` field from `cudaDeviceProp` (it is always at the start of the struct, so its layout is stable).

2. **Unsafe semantics changes in the Rust 2024 edition**:
   - `extern "C"` blocks must carry the `unsafe` prefix → `unsafe extern "C"`
   - Unsafe calls inside an `unsafe fn` also need an explicit `unsafe {}` block (the `unsafe_op_in_unsafe_fn` lint warns by default in this edition)
   - This makes the code safer, but it needs attention on a first port

3. **CUDA support in the `cc` crate is built in**: no `features = ["cuda"]` is needed (that feature does not exist). `.cuda(true).cudart("shared")` is enough.

4. **Bucket strategy of the Caching Allocator**: round up to next power of 2 (minimum 512 B). A 513 B request therefore allocates 1024 B, which means internal fragmentation. It is simple and efficient, though, because it avoids the exact-match problem in the free list. PyTorch's CUDACachingAllocator uses a more complex strategy (best-fit with splitting), but for inference workloads power-of-2 buckets are sufficient.

5. **The `into_raw` + `from_raw` pattern**: GpuBuffer's RAII Drop conflicts with the CachingAllocator's caching needs, because the allocator must hold raw pointers without triggering Drop. `into_raw()` consumes self (`mem::forget`) and returns the raw pointer; `from_raw()` wraps it again. This is a standard Rust pattern for managing RAII lifetimes.

6. **dash5 environment**: CUDA 12.9 is installed but `nvcc` is not on PATH (`/usr/local/cuda/bin` must be added). Rust has to be installed manually via rustup. There is no rsync, so code is synced with `tar | ssh tar`. Development workflow: write code locally → tar sync → remote build+test.
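
The interaction between the power-of-2 free lists and the `into_raw` + `from_raw` pattern is easier to see in a small sketch. The one below makes loud assumptions: host memory from `std::alloc` stands in for cudaMalloc/cudaFree so it runs without a GPU, and `Buffer`, `Cache`, and `backend_allocs` are hypothetical names for illustration, not the crate's real types or signatures.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::collections::HashMap;
use std::mem;

/// Stand-in for GpuBuffer: owns a raw allocation and frees it on Drop.
/// Host memory replaces cudaMalloc/cudaFree (assumption of this sketch).
struct Buffer {
    ptr: *mut u8,
    len: usize, // in bytes
}

impl Buffer {
    fn new(len: usize) -> Self {
        assert!(len > 0, "zero-sized allocations are not supported here");
        let layout = Layout::from_size_align(len, 1).expect("invalid layout");
        // SAFETY: the layout has a non-zero size.
        let ptr = unsafe { alloc(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Buffer { ptr, len }
    }

    /// Give up ownership without running Drop; the caller now owns the memory.
    fn into_raw(self) -> (*mut u8, usize) {
        let parts = (self.ptr, self.len);
        mem::forget(self);
        parts
    }

    /// Re-wrap a pointer previously returned by `into_raw`.
    /// SAFETY: `ptr` and `len` must come from `into_raw` and be wrapped only once.
    unsafe fn from_raw(ptr: *mut u8, len: usize) -> Self {
        Buffer { ptr, len }
    }
}

impl Drop for Buffer {
    fn drop(&mut self) {
        let layout = Layout::from_size_align(self.len, 1).expect("invalid layout");
        // SAFETY: `ptr` was allocated with this exact layout.
        unsafe { dealloc(self.ptr, layout) };
    }
}

/// Free lists keyed by bucket size; cached pointers are held raw so no Drop runs.
#[derive(Default)]
struct Cache {
    free: HashMap<usize, Vec<*mut u8>>,
    backend_allocs: usize, // counts calls that would be cudaMalloc
}

impl Cache {
    fn alloc(&mut self, size: usize) -> Buffer {
        let bucket = size.max(512).next_power_of_two();
        if let Some(ptr) = self.free.get_mut(&bucket).and_then(|list| list.pop()) {
            // SAFETY: every cached pointer came from `into_raw` with len == bucket.
            return unsafe { Buffer::from_raw(ptr, bucket) };
        }
        self.backend_allocs += 1;
        Buffer::new(bucket)
    }

    fn dealloc(&mut self, buf: Buffer) {
        let (ptr, len) = buf.into_raw();
        self.free.entry(len).or_default().push(ptr);
    }

    /// Release everything in the free lists back to the backend.
    fn trim(&mut self) {
        for (len, list) in self.free.drain() {
            for ptr in list {
                // SAFETY: see `alloc`; dropping the re-wrapped buffer frees it.
                drop(unsafe { Buffer::from_raw(ptr, len) });
            }
        }
    }
}

fn main() {
    let mut cache = Cache::default();
    let a = cache.alloc(513);
    let first_ptr = a.ptr;
    cache.dealloc(a);
    let b = cache.alloc(1000); // same 1024 B bucket, so the cached block is reused
    assert_eq!(b.ptr, first_ptr);
    assert_eq!(cache.backend_allocs, 1);
    cache.dealloc(b);
    cache.trim();
    assert!(cache.free.is_empty());
}
```

In this sketch `dealloc` takes the buffer by value instead of a `(ptr, size)` pair, which lets the type system guarantee the block is no longer in use when it enters the free list. A real allocator would also need a policy for buffers still cached when the allocator itself is dropped; here they are only released by `trim()`.
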