xserv/docs/01-cuda-ffi.md
Gahow Wang 51a0f2eb14 docs: add design docs + takeaways for Phase 2 and Phase 3
- docs/01-cuda-ffi.md: added takeaways (struct layout pitfall,
  Rust 2024 unsafe changes, caching allocator strategy, etc.)
- docs/02-tensor.md: design doc + takeaways for tensor abstraction
- docs/03-gemm.md: design doc + takeaways for GEMM kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 20:59:45 +08:00


Phase 0+1: CUDA FFI Infrastructure — Design Document

Goal

Build xserv-cuda, a Rust crate that wraps CUDA Runtime API with safe abstractions:

  • Device query and selection
  • GPU memory allocation with RAII (GpuBuffer)
  • Caching allocator (avoid repeated cudaMalloc/cudaFree)
  • CUDA streams for async operations
  • Host↔Device memory transfers
  • Error handling wrapping all CUDA calls

Module Layout

crates/xserv-cuda/
├── Cargo.toml
├── build.rs          # compiles csrc/*.cu via cc crate
└── src/
    ├── lib.rs        # re-exports
    ├── ffi.rs        # raw extern "C" bindings to CUDA runtime
    ├── error.rs      # CudaError type
    ├── device.rs     # device query, DeviceInfo
    ├── stream.rs     # CudaStream wrapper
    ├── memory.rs     # GpuBuffer, H2D/D2H/D2D copy
    └── allocator.rs  # CachingAllocator

Key Design Decisions

FFI Bindings (ffi.rs)

Hand-written extern "C" bindings (~25 functions). No bindgen — keeps things explicit and readable.

Core functions needed:

  • Memory: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaFreeHost
  • Stream: cudaStreamCreate, cudaStreamDestroy, cudaStreamSynchronize
  • Device: cudaGetDeviceCount, cudaSetDevice, cudaGetDevice, cudaGetDeviceProperties
  • Sync: cudaDeviceSynchronize
  • Error: cudaGetLastError, cudaGetErrorString

Error Handling (error.rs)

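Nearly every CUDA Runtime API call returns a cudaError_t status code (cudaGetErrorString, which returns a string, is the exception in the list above). We wrap each call in a check that converts the status code into a Result: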

pub(crate) fn check(code: i32) -> Result<(), CudaError>
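A minimal sketch of what check could look like, assuming CudaError simply stores the raw status code. The field name code and the Display text are illustrative assumptions, not the crate's actual definition. The only CUDA fact relied on is that cudaSuccess is 0.

```rust
use std::fmt;

/// Hypothetical error type: carries the raw `cudaError_t` value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CudaError {
    pub code: i32,
}

impl fmt::Display for CudaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // A real implementation would probably append the message from
        // cudaGetErrorString(code); this sketch stays GPU-free.
        write!(f, "CUDA error code {}", self.code)
    }
}

impl std::error::Error for CudaError {}

/// `cudaSuccess` is 0 in the CUDA Runtime API.
const CUDA_SUCCESS: i32 = 0;

/// Turn a raw status code into a `Result` so callers can use `?`.
pub(crate) fn check(code: i32) -> Result<(), CudaError> {
    if code == CUDA_SUCCESS {
        Ok(())
    } else {
        Err(CudaError { code })
    }
}
```

With this shape, a wrapper can write check(unsafe { ffi::cudaSetDevice(id) })? and let the error propagate.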

GpuBuffer (memory.rs)

RAII wrapper around a GPU pointer. Drop frees memory.

pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize,       // in bytes
    device: u32,
}
  • No Clone (explicit copy_from instead)
  • Send + !Sync (can move across threads, but not shared)
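The ownership rules above can be sketched without a GPU. In the sketch below, fake_cuda_malloc and fake_cuda_free are hypothetical stand-ins backed by host memory (the real crate would call cudaMalloc and cudaFree through ffi.rs), and FREE_CALLS exists only to make Drop observable. The into_raw and from_raw signatures are assumptions based on the pattern described in the takeaways.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts frees so Drop behavior can be observed (illustration only).
pub static FREE_CALLS: AtomicUsize = AtomicUsize::new(0);

const ALIGN: usize = 256;

// Hypothetical stand-in for cudaMalloc, backed by host memory.
fn fake_cuda_malloc(len: usize) -> *mut u8 {
    let layout = Layout::from_size_align(len.max(1), ALIGN).unwrap();
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    ptr
}

// Hypothetical stand-in for cudaFree.
fn fake_cuda_free(ptr: *mut u8, len: usize) {
    let layout = Layout::from_size_align(len.max(1), ALIGN).unwrap();
    unsafe { dealloc(ptr, layout) };
    FREE_CALLS.fetch_add(1, Ordering::SeqCst);
}

pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize, // in bytes
    device: u32,
}

// A raw pointer field makes the type !Send and !Sync by default.
// Opting back in to Send only gives "Send + !Sync".
unsafe impl Send for GpuBuffer {}

impl GpuBuffer {
    pub fn new(len: usize, device: u32) -> Self {
        GpuBuffer { ptr: fake_cuda_malloc(len), len, device }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Give up ownership without running Drop.
    pub fn into_raw(self) -> (*mut u8, usize, u32) {
        let parts = (self.ptr, self.len, self.device);
        std::mem::forget(self);
        parts
    }

    /// Safety: the parts must come from `into_raw` and be wrapped only once.
    pub unsafe fn from_raw(ptr: *mut u8, len: usize, device: u32) -> Self {
        GpuBuffer { ptr, len, device }
    }
}

impl Drop for GpuBuffer {
    fn drop(&mut self) {
        fake_cuda_free(self.ptr, self.len);
    }
}
```

Because Clone is not derived, the only way to duplicate the contents is an explicit copy, and the only way to release the memory is to drop the single owner.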

CachingAllocator (allocator.rs)

Avoids cudaMalloc/cudaFree per allocation. Maintains a free-list keyed by size bucket.

Bucket boundaries: round up to next power of 2, minimum 512 bytes.

  • alloc(size) → find bucket, pop from free list or cudaMalloc
  • dealloc(ptr, size) → push to free list (don't cudaFree)
  • trim() → actually cudaFree everything in free lists
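The bucket rule and the three operations above can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: the RawAlloc trait and CountingBackend are hypothetical stand-ins for cudaMalloc and cudaFree so the logic runs without a GPU, and device addresses are modeled as plain usize values.

```rust
use std::collections::HashMap;

const MIN_BUCKET: usize = 512;

/// Round up to the next power of two, with a 512-byte minimum.
pub fn bucket_size(size: usize) -> usize {
    size.max(MIN_BUCKET).next_power_of_two()
}

/// Hypothetical backend; the real one would call cudaMalloc / cudaFree.
pub trait RawAlloc {
    fn raw_alloc(&mut self, size: usize) -> usize;
    fn raw_free(&mut self, addr: usize);
}

pub struct CachingAllocator<B: RawAlloc> {
    backend: B,
    /// bucket size -> cached addresses of that size
    free_lists: HashMap<usize, Vec<usize>>,
}

impl<B: RawAlloc> CachingAllocator<B> {
    pub fn new(backend: B) -> Self {
        CachingAllocator { backend, free_lists: HashMap::new() }
    }

    /// Pop a cached block from the bucket, or fall back to the backend.
    pub fn alloc(&mut self, size: usize) -> usize {
        let bucket = bucket_size(size);
        if let Some(addr) = self.free_lists.get_mut(&bucket).and_then(|l| l.pop()) {
            return addr;
        }
        self.backend.raw_alloc(bucket)
    }

    /// Return a block to its bucket's free list without freeing it.
    pub fn dealloc(&mut self, addr: usize, size: usize) {
        self.free_lists.entry(bucket_size(size)).or_default().push(addr);
    }

    /// Release everything that is currently cached.
    pub fn trim(&mut self) {
        for (_, list) in self.free_lists.drain() {
            for addr in list {
                self.backend.raw_free(addr);
            }
        }
    }

    pub fn backend(&self) -> &B {
        &self.backend
    }
}

/// Fake backend that hands out distinct addresses and counts calls.
#[derive(Default)]
pub struct CountingBackend {
    pub mallocs: usize,
    pub frees: usize,
    next_offset: usize,
}

impl RawAlloc for CountingBackend {
    fn raw_alloc(&mut self, size: usize) -> usize {
        self.mallocs += 1;
        let addr = 0x1000 + self.next_offset;
        self.next_offset += size;
        addr
    }

    fn raw_free(&mut self, _addr: usize) {
        self.frees += 1;
    }
}
```

Requests of 600 and 1000 bytes both land in the 1024-byte bucket, so a freed 600-byte block can serve a later 1000-byte request without touching the backend.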

CudaStream (stream.rs)

Wraps cudaStream_t. RAII with Drop calling cudaStreamDestroy.

Build Pipeline

  • csrc/test/vecadd.cu: minimal vector-add kernel for smoke test
  • build.rs uses cc crate to compile .cu files, link CUDA runtime

Test Plan

  • Device info: print GPU name, memory, compute capability, SM count
  • GpuBuffer: alloc → H2D copy → D2H copy → verify data (256B, 64MB)
  • GpuBuffer: D2D copy verification
  • GpuBuffer: zero-fill verification
  • Vector add kernel: launch from Rust, verify output
  • CachingAllocator: alloc→free→realloc same size uses cache (no new cudaMalloc)
  • CachingAllocator: different size buckets are cached independently
  • CudaStream: create, synchronize, Drop
  • PinnedBuffer: page-locked host memory
  • Async copy: H2D async + D2H async via stream

Takeaways

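  1. The cudaDeviceProp struct layout is unreliable: field offsets in cudaDeviceProp change between CUDA versions. We initially read total_global_mem through a struct mapping and got a garbage value (12 TB). The correct approach is to use cudaMemGetInfo for memory information and cudaDeviceGetAttribute for other attributes. Read only the name field from cudaDeviceProp (it is always at the start of the struct, so its layout is stable).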

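  2. unsafe semantics changed in the Rust 2024 edition

    • extern "C" blocks must carry the unsafe prefix → unsafe extern "C"
    • Unsafe calls inside an unsafe fn also need an explicit unsafe {} block (the unsafe_op_in_unsafe_fn lint warns by default in this edition)
    • This makes the code safer, but it needs attention when first porting
  3. CUDA support in the cc crate is built in: features = ["cuda"] is not needed (that feature does not exist). Calling .cuda(true).cudart("shared") is enough.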

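  4. Caching allocator bucket strategy: round up to the next power of 2, with a 512 B minimum. A 513 B request therefore allocates 1024 B, which is internal fragmentation. The scheme is simple and efficient, though: it avoids the exact-match problem in the free list. PyTorch's CUDACachingAllocator uses a more elaborate strategy (best-fit with splitting), but for inference workloads power-of-2 buckets are sufficient.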

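  5. The into_raw + from_raw pattern: GpuBuffer's RAII Drop conflicts with the CachingAllocator's need to cache memory, because the allocator has to hold the raw pointer without triggering Drop. into_raw() consumes self (via mem::forget) and returns the raw pointer; from_raw() wraps it again. This is the standard Rust pattern for handing off an RAII-managed resource.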

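  6. dash5 environment: CUDA 12.9 is installed but nvcc is not on PATH, so /usr/local/cuda/bin has to be added. Rust has to be installed manually via rustup. There is no rsync, so code is synced with tar | ssh tar. Development workflow: write code locally → tar sync → build and test remotely.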
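Takeaway 2 can be illustrated with a small FFI sketch. To stay runnable without CUDA it binds strlen from the C standard library instead of a CUDA function; the wrapper names c_len and safe_len are hypothetical. The same shape would apply to the bindings in ffi.rs.

```rust
use std::ffi::{c_char, CStr};

// Rust 2024 requires the `unsafe` keyword on the extern block itself.
unsafe extern "C" {
    // From the C standard library; stands in for a CUDA binding here.
    fn strlen(s: *const c_char) -> usize;
}

/// Safety: `s` must point to a valid NUL-terminated string.
pub unsafe fn c_len(s: *const c_char) -> usize {
    // The body of an `unsafe fn` is no longer treated as one big unsafe
    // block, so the FFI call gets its own explicit block.
    unsafe { strlen(s) }
}

/// Safe wrapper: `CStr` guarantees the NUL terminator.
pub fn safe_len(s: &CStr) -> usize {
    unsafe { c_len(s.as_ptr()) }
}
```

Keeping each unsafe block this small makes it easier to state, next to the call, which invariant justifies it.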