Phase 0+1: CUDA FFI Infrastructure — Design Document
Goal
Build xserv-cuda, a Rust crate that wraps CUDA Runtime API with safe abstractions:
- Device query and selection
- GPU memory allocation with RAII (GpuBuffer)
- Caching allocator (avoid repeated cudaMalloc/cudaFree)
- CUDA streams for async operations
- Host↔Device memory transfers
- Error handling wrapping all CUDA calls
Module Layout
crates/xserv-cuda/
├── Cargo.toml
├── build.rs # compiles csrc/*.cu via cc crate
└── src/
├── lib.rs # re-exports
├── ffi.rs # raw extern "C" bindings to CUDA runtime
├── error.rs # CudaError type
├── device.rs # device query, DeviceInfo
├── stream.rs # CudaStream wrapper
├── memory.rs # GpuBuffer, H2D/D2H/D2D copy
└── allocator.rs # CachingAllocator
Key Design Decisions
FFI Bindings (ffi.rs)
Hand-written extern "C" bindings (~25 functions). No bindgen — keeps things explicit and readable.
Core functions needed:
- Memory: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaFreeHost
- Stream: cudaStreamCreate, cudaStreamDestroy, cudaStreamSynchronize
- Device: cudaGetDeviceCount, cudaSetDevice, cudaGetDevice, cudaGetDeviceProperties
- Sync: cudaDeviceSynchronize
- Error: cudaGetLastError, cudaGetErrorString
Error Handling (error.rs)
Every CUDA call returns cudaError_t. We wrap all calls:
pub(crate) fn check(code: i32) -> Result<(), CudaError>
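A minimal sketch of how `check` might turn a raw status code into a `Result`. Only the `check` signature comes from this document. The fields of `CudaError` and the `describe` helper are assumptions for illustration; `describe` stands in for a call to `cudaGetErrorString` through the FFI layer so that the sketch compiles without a CUDA toolkit.

```rust
use std::fmt;

/// Hypothetical shape for the error type: the raw `cudaError_t` value plus a message.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CudaError {
    pub code: i32,
    pub message: String,
}

impl fmt::Display for CudaError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "CUDA error {}: {}", self.code, self.message)
    }
}

impl std::error::Error for CudaError {}

/// Stand-in for `cudaGetErrorString` (assumption: the real crate would call it via FFI).
fn describe(code: i32) -> String {
    match code {
        1 => "invalid argument".to_string(), // cudaErrorInvalidValue
        2 => "out of memory".to_string(),    // cudaErrorMemoryAllocation
        other => format!("unrecognized error code {}", other),
    }
}

/// Maps a raw status code to a `Result`; 0 is `cudaSuccess`.
pub(crate) fn check(code: i32) -> Result<(), CudaError> {
    if code == 0 {
        Ok(())
    } else {
        Err(CudaError { code, message: describe(code) })
    }
}
```

With a helper of this shape, each wrapper can end in `check(status)?`, so failures propagate as ordinary Rust errors.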
GpuBuffer (memory.rs)
RAII wrapper around a GPU pointer. Drop frees memory.
pub struct GpuBuffer {
ptr: *mut u8,
len: usize, // in bytes
device: u32,
}
- No Clone (explicit copy_from instead)
- Send + !Sync (can move across threads, but not shared)
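The ownership rules above can be sketched as follows. This is an illustration under stated assumptions, not the crate's actual code: the host allocator from `std::alloc` stands in for `cudaMalloc`/`cudaFree` so the example runs without a GPU, and the `ALIGN` constant, the `new` constructor, and the tuple returned by `into_raw` are hypothetical. The `into_raw`/`from_raw` pair is the one described in the Takeaways.

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Alignment for the host stand-in allocator (an assumption of this sketch).
const ALIGN: usize = 256;

pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize, // in bytes
    device: u32,
}

// A raw pointer field makes the struct neither Send nor Sync by default.
// Opting in to Send only gives "movable across threads, but not shared".
unsafe impl Send for GpuBuffer {}

impl GpuBuffer {
    /// Allocates `len` bytes; returns None for a zero-length or failed request.
    pub fn new(len: usize, device: u32) -> Option<GpuBuffer> {
        if len == 0 {
            return None;
        }
        let layout = Layout::from_size_align(len, ALIGN).ok()?;
        // Stand-in for cudaMalloc.
        let ptr = unsafe { alloc(layout) };
        if ptr.is_null() {
            None
        } else {
            Some(GpuBuffer { ptr, len, device })
        }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Gives up ownership without running Drop, e.g. to park the block in a cache.
    pub fn into_raw(self) -> (*mut u8, usize, u32) {
        let parts = (self.ptr, self.len, self.device);
        std::mem::forget(self);
        parts
    }

    /// Safety: the parts must come from `into_raw` and must not be wrapped twice.
    pub unsafe fn from_raw(ptr: *mut u8, len: usize, device: u32) -> GpuBuffer {
        GpuBuffer { ptr, len, device }
    }
}

impl Drop for GpuBuffer {
    fn drop(&mut self) {
        // Stand-in for cudaFree(self.ptr).
        let layout = Layout::from_size_align(self.len, ALIGN).unwrap();
        unsafe { dealloc(self.ptr, layout) };
    }
}
```

Because there is no `Clone` impl, a buffer is freed exactly once: either by `Drop`, or by whoever takes over the pointer after `into_raw`.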
CachingAllocator (allocator.rs)
Avoids cudaMalloc/cudaFree per allocation. Maintains a free-list keyed by size bucket.
Bucket boundaries: round up to next power of 2, minimum 512 bytes.
- alloc(size) → find bucket, pop from free list or cudaMalloc
- dealloc(ptr, size) → push to free list (don't cudaFree)
- trim() → actually cudaFree everything in free lists
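The bucket logic above can be sketched without touching the GPU. In this illustration, device addresses are plain `usize` values handed out by a counter that stands in for `cudaMalloc`; the `bucket_size` helper, the `backend_allocs` counter, and the return value of `trim` are assumptions made so the caching behaviour is observable, not part of the real crate.

```rust
use std::collections::HashMap;

const MIN_BUCKET: usize = 512;

/// Rounds a request up to its bucket: the next power of two, at least 512 bytes.
pub fn bucket_size(size: usize) -> usize {
    size.max(MIN_BUCKET).next_power_of_two()
}

pub struct CachingAllocator {
    free: HashMap<usize, Vec<usize>>, // bucket size -> cached block addresses
    next_addr: usize,                 // fake device address counter
    pub backend_allocs: usize,        // how often the cudaMalloc stand-in ran
}

impl CachingAllocator {
    pub fn new() -> Self {
        CachingAllocator {
            free: HashMap::new(),
            next_addr: 0x1000,
            backend_allocs: 0,
        }
    }

    /// Pops a cached block from the matching bucket, or allocates a new one.
    pub fn alloc(&mut self, size: usize) -> usize {
        let bucket = bucket_size(size);
        if let Some(addr) = self.free.get_mut(&bucket).and_then(|list| list.pop()) {
            return addr;
        }
        // Stand-in for cudaMalloc(bucket).
        let addr = self.next_addr;
        self.next_addr += bucket;
        self.backend_allocs += 1;
        addr
    }

    /// Returns a block to its bucket's free list instead of freeing it.
    pub fn dealloc(&mut self, addr: usize, size: usize) {
        self.free.entry(bucket_size(size)).or_default().push(addr);
    }

    /// Drops every cached block (the real version would cudaFree each one)
    /// and reports how many were released.
    pub fn trim(&mut self) -> usize {
        let released: usize = self.free.values().map(|list| list.len()).sum();
        self.free.clear();
        released
    }
}
```

Requests of 600 and 1000 bytes both land in the 1024-byte bucket, so freeing one and then requesting the other reuses the same block without a second backend allocation.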
CudaStream (stream.rs)
Wraps cudaStream_t. RAII with Drop calling cudaStreamDestroy.
Build Pipeline
- csrc/test/vecadd.cu: minimal vector-add kernel for smoke test
- build.rs uses the cc crate to compile .cu files and link the CUDA runtime
Test Plan
- Device info: print GPU name, memory, compute capability, SM count
- GpuBuffer: alloc → H2D copy → D2H copy → verify data (256B, 64MB)
- GpuBuffer: D2D copy verification
- GpuBuffer: zero-fill verification
- Vector add kernel: launch from Rust, verify output
- CachingAllocator: alloc→free→realloc same size uses cache (no new cudaMalloc)
- CachingAllocator: different size buckets are cached independently
- CudaStream: create, synchronize, Drop
- PinnedBuffer: page-locked host memory
- Async copy: H2D async + D2H async via stream
Takeaways
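- cudaDeviceProp struct layout is unreliable: the field offsets of cudaDeviceProp change between CUDA versions. We initially read total_global_mem through a hand-mapped struct and got a garbage value (12 TB). The correct approach is to use cudaMemGetInfo for memory information and cudaDeviceGetAttribute for other attributes. Read only the name field from cudaDeviceProp (it is always at the start of the struct, so its layout is stable).
- Rust 2024 edition changes to unsafe semantics:
  - extern "C" blocks must carry the unsafe prefix → unsafe extern "C"
  - Unsafe calls inside an unsafe fn also need an explicit unsafe {} block (in Rust 2024 the unsafe_op_in_unsafe_fn lint warns by default)
  - This makes the code safer, but it needs attention during the initial port
- The cc crate's CUDA support is built in: features = ["cuda"] is not needed (that feature does not exist). Calling .cuda(true).cudart("shared") is enough.
- Caching allocator bucket strategy: round up to the next power of 2 (minimum 512 B). A 513 B request therefore allocates 1024 B, which causes internal fragmentation. The scheme is simple and efficient, though, because it avoids the exact-match problem in the free list. PyTorch's CUDACachingAllocator uses a more complex strategy (best-fit with splitting), but for inference workloads power-of-2 buckets are sufficient.
- into_raw + from_raw pattern: GpuBuffer's RAII Drop conflicts with the caching needs of CachingAllocator, because the allocator has to hold a raw pointer without triggering Drop. into_raw() consumes self (mem::forget) and returns the raw pointer; from_raw() wraps it again. This is the standard Rust pattern for managing RAII lifetimes.
- dash5 environment: CUDA 12.9 is installed but nvcc is not in PATH (it needs /usr/local/cuda/bin). Rust has to be installed manually via rustup. There is no rsync, so code is synced with tar | ssh tar. Development workflow: write code locally → tar sync → remote build+test.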
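The Rust 2024 takeaway above can be illustrated with a small sketch. It assumes a toolchain recent enough to accept `unsafe extern` blocks (Rust 1.82 or later). The C standard library's `abs` stands in for a CUDA runtime function so the example links without a CUDA toolkit; `raw_abs` and `checked_abs` are hypothetical names.

```rust
use std::ffi::c_int;

// Rust 2024 requires the `unsafe` keyword on the extern block itself.
unsafe extern "C" {
    // Stand-in for a CUDA runtime binding; `abs` comes from the C standard library.
    fn abs(value: c_int) -> c_int;
}

/// Safety: `value` must not be `c_int::MIN`, for which C's `abs` is undefined.
unsafe fn raw_abs(value: c_int) -> c_int {
    // Being inside an `unsafe fn` is not enough on its own:
    // the unsafe call gets its own explicit block.
    unsafe { abs(value) }
}

/// Safe wrapper that rules out the one input the C function cannot handle.
pub fn checked_abs(value: c_int) -> Option<c_int> {
    if value == c_int::MIN {
        None
    } else {
        // Sound: the precondition of `raw_abs` was checked just above.
        Some(unsafe { raw_abs(value) })
    }
}
```

The same layering applies to the real bindings: `unsafe extern "C"` declarations in ffi.rs, with safe wrappers that establish the preconditions before entering an `unsafe` block.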