xserv/docs/01-cuda-ffi.md
Gahow Wang 9806b4db35 phase 0+1: project scaffold + xserv-cuda crate
- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 18:40:22 +08:00


Phase 0+1: CUDA FFI Infrastructure — Design Document

Goal

Build xserv-cuda, a Rust crate that wraps the CUDA Runtime API in safe abstractions:

  • Device query and selection
  • GPU memory allocation with RAII (GpuBuffer)
  • Caching allocator (avoid repeated cudaMalloc/cudaFree)
  • CUDA streams for async operations
  • Host↔Device memory transfers
  • Error handling wrapping all CUDA calls

Module Layout

crates/xserv-cuda/
├── Cargo.toml
├── build.rs          # compiles csrc/*.cu via cc crate
└── src/
    ├── lib.rs        # re-exports
    ├── ffi.rs        # raw extern "C" bindings to CUDA runtime
    ├── error.rs      # CudaError type
    ├── device.rs     # device query, DeviceInfo
    ├── stream.rs     # CudaStream wrapper
    ├── memory.rs     # GpuBuffer, H2D/D2H/D2D copy
    └── allocator.rs  # CachingAllocator

Key Design Decisions

FFI Bindings (ffi.rs)

Hand-written extern "C" bindings (~25 functions). No bindgen — keeps things explicit and readable.

Core functions needed:

  • Memory: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaFreeHost
  • Stream: cudaStreamCreate, cudaStreamDestroy, cudaStreamSynchronize
  • Device: cudaGetDeviceCount, cudaSetDevice, cudaGetDevice, cudaGetDeviceProperties
  • Sync: cudaDeviceSynchronize
  • Error: cudaGetLastError, cudaGetErrorString
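
As an illustration of what the hand-written bindings might look like, here is a minimal sketch covering a subset of the functions listed above. The Rust-side names (CudaErrorT, CudaStreamT, the CUDA_MEMCPY_* constants) are assumptions made for this example, not names taken from the crate. Linking against libcudart is assumed to be handled by build.rs.

```rust
use std::ffi::{c_char, c_int, c_void};

/// cudaError_t is a C enum; this sketch treats it as a plain C int.
pub type CudaErrorT = c_int;
/// cudaStream_t is an opaque pointer on the C side.
pub type CudaStreamT = *mut c_void;

/// Values of the C enum cudaMemcpyKind (constant names are hypothetical).
pub const CUDA_MEMCPY_HOST_TO_DEVICE: c_int = 1;
pub const CUDA_MEMCPY_DEVICE_TO_HOST: c_int = 2;
pub const CUDA_MEMCPY_DEVICE_TO_DEVICE: c_int = 3;

// `unsafe extern` requires Rust 1.82 or newer; on older toolchains
// (edition 2021 and earlier) write plain `extern "C"` instead.
unsafe extern "C" {
    pub fn cudaMalloc(dev_ptr: *mut *mut c_void, size: usize) -> CudaErrorT;
    pub fn cudaFree(dev_ptr: *mut c_void) -> CudaErrorT;
    pub fn cudaMemcpy(
        dst: *mut c_void,
        src: *const c_void,
        count: usize,
        kind: c_int,
    ) -> CudaErrorT;
    pub fn cudaStreamCreate(stream: *mut CudaStreamT) -> CudaErrorT;
    pub fn cudaStreamDestroy(stream: CudaStreamT) -> CudaErrorT;
    pub fn cudaStreamSynchronize(stream: CudaStreamT) -> CudaErrorT;
    pub fn cudaGetDeviceCount(count: *mut c_int) -> CudaErrorT;
    // Returns a pointer to a static, NUL-terminated string owned by the runtime.
    pub fn cudaGetErrorString(error: CudaErrorT) -> *const c_char;
}
```

Every function declared here is unsafe to call; the safe wrappers in the other modules are where the invariants would be enforced.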

Error Handling (error.rs)

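Nearly every CUDA runtime call returns a cudaError_t (cudaGetErrorString, which returns a C string, is the exception in the list above). We route every such return code through a single helper: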

pub(crate) fn check(code: i32) -> Result<(), CudaError>
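
A minimal sketch of how check could be implemented, assuming CudaError simply stores the raw code. The struct layout and the code() accessor are assumptions for illustration; the real type may also carry the message from cudaGetErrorString.

```rust
use std::fmt;

/// Hypothetical error type: wraps the raw cudaError_t value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CudaError {
    code: i32,
}

impl CudaError {
    /// The raw cudaError_t value reported by the runtime.
    pub fn code(&self) -> i32 {
        self.code
    }
}

impl fmt::Display for CudaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "CUDA error code {}", self.code)
    }
}

impl std::error::Error for CudaError {}

/// cudaSuccess is 0; any other value is turned into an Err.
pub(crate) fn check(code: i32) -> Result<(), CudaError> {
    if code == 0 {
        Ok(())
    } else {
        Err(CudaError { code })
    }
}
```

With this shape, a wrapper can write `check(unsafe { cudaFree(ptr) })?` and propagate failures with the `?` operator.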

GpuBuffer (memory.rs)

RAII wrapper around a GPU pointer. Drop frees memory.

pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize,       // in bytes
    device: u32,
}
  • No Clone (explicit copy_from instead)
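  • Send + !Sync: a buffer can be moved to another thread but not shared between threads. Because the struct holds a raw pointer, it is neither Send nor Sync by default, so Send needs an explicit unsafe impl.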
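
The thread-safety markers can be sketched as follows. To keep the example runnable without a GPU, the Drop impl and the cudaMalloc/cudaFree calls are left out, and from_raw_parts, the accessors, and len_on_worker are hypothetical helpers added for illustration.

```rust
use std::thread;

pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize, // in bytes
    device: u32,
}

// A raw pointer field makes the struct !Send and !Sync automatically.
// Send is opted back in; Sync is deliberately left unimplemented.
// SAFETY (assumed invariant): the buffer uniquely owns its device allocation,
// so moving it to another thread cannot create aliasing.
unsafe impl Send for GpuBuffer {}

impl GpuBuffer {
    /// Hypothetical constructor that adopts an existing allocation.
    ///
    /// # Safety
    /// `ptr` must be null or point to `len` bytes owned by the caller on `device`.
    pub unsafe fn from_raw_parts(ptr: *mut u8, len: usize, device: u32) -> Self {
        GpuBuffer { ptr, len, device }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    pub fn device(&self) -> u32 {
        self.device
    }

    pub fn as_ptr(&self) -> *const u8 {
        self.ptr
    }
}

/// Moves the buffer into a worker thread and reads its length there.
/// This only compiles because GpuBuffer is Send.
pub fn len_on_worker(buf: GpuBuffer) -> usize {
    thread::spawn(move || buf.len())
        .join()
        .expect("worker thread panicked")
}
```

Sharing `&GpuBuffer` across threads (for example through `Arc<GpuBuffer>` handed to a spawned thread) would be rejected at compile time, since the type is not Sync.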

CachingAllocator (allocator.rs)

Avoids a cudaMalloc/cudaFree round trip on every allocation by maintaining free lists keyed by size bucket.

Bucket boundaries: round up to next power of 2, minimum 512 bytes.

  • alloc(size) → find bucket, pop from free list or cudaMalloc
  • dealloc(ptr, size) → push to free list (don't cudaFree)
  • trim() → actually cudaFree everything in free lists
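
The bucket rule and the three operations above can be sketched as below. So that the example runs without a GPU, cudaMalloc and cudaFree are replaced by a hypothetical RawDevice trait with a counting FakeDevice, and device addresses are modeled as plain usize values. None of these names come from the crate.

```rust
use std::collections::HashMap;

/// Smallest bucket, per the rule above.
const MIN_BUCKET: usize = 512;

/// Round a request up to its bucket: next power of two, at least 512 bytes.
pub fn bucket_size(size: usize) -> usize {
    size.max(MIN_BUCKET).next_power_of_two()
}

/// Hypothetical stand-in for cudaMalloc/cudaFree.
pub trait RawDevice {
    fn raw_alloc(&mut self, bytes: usize) -> usize;
    fn raw_free(&mut self, addr: usize);
}

pub struct CachingAllocator<D: RawDevice> {
    device: D,
    /// bucket size -> cached addresses that are currently unused
    free: HashMap<usize, Vec<usize>>,
}

impl<D: RawDevice> CachingAllocator<D> {
    pub fn new(device: D) -> Self {
        CachingAllocator { device, free: HashMap::new() }
    }

    /// Pop a cached block from the matching bucket, or fall back to the device.
    pub fn alloc(&mut self, size: usize) -> usize {
        let bucket = bucket_size(size);
        if let Some(addr) = self.free.get_mut(&bucket).and_then(|list| list.pop()) {
            return addr;
        }
        self.device.raw_alloc(bucket)
    }

    /// Return a block to its bucket's free list without freeing it.
    pub fn dealloc(&mut self, addr: usize, size: usize) {
        self.free.entry(bucket_size(size)).or_default().push(addr);
    }

    /// Release every cached block back to the device.
    pub fn trim(&mut self) {
        for (_, list) in self.free.drain() {
            for addr in list {
                self.device.raw_free(addr);
            }
        }
    }

    pub fn device(&self) -> &D {
        &self.device
    }
}

/// Fake device that hands out increasing addresses and counts calls.
#[derive(Default)]
pub struct FakeDevice {
    next: usize,
    pub mallocs: usize,
    pub frees: usize,
}

impl RawDevice for FakeDevice {
    fn raw_alloc(&mut self, bytes: usize) -> usize {
        let addr = 0x1000 + self.next;
        self.next += bytes;
        self.mallocs += 1;
        addr
    }

    fn raw_free(&mut self, _addr: usize) {
        self.frees += 1;
    }
}
```

In this sketch a 1000-byte request and a 600-byte request both land in the 1024-byte bucket, so freeing the first and then allocating the second reuses the same block without a second device allocation.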

CudaStream (stream.rs)

Wraps cudaStream_t. RAII with Drop calling cudaStreamDestroy.

Build Pipeline

  • csrc/test/vecadd.cu: minimal vector-add kernel for smoke test
  • build.rs uses cc crate to compile .cu files, link CUDA runtime

Test Plan

  1. Device info: print GPU name, memory, compute capability, SM count
  2. GpuBuffer: allocate 1 GB, H2D copy, D2H copy, verify data
  3. Vector add kernel: launch from Rust, verify output
  4. CachingAllocator: alloc→free→realloc same size uses cache (no new cudaMalloc)
  5. Multi-stream: two concurrent memcpy on different streams
  6. Benchmark: caching allocator vs raw cudaMalloc (100 cycles)