
Gahow Wang 51a0f2eb14 docs: add design docs + takeaways for Phase 2 and Phase 3

- docs/01-cuda-ffi.md: added takeaways (struct layout pitfall,
  Rust 2024 unsafe changes, caching allocator strategy, etc.)
- docs/02-tensor.md: design doc + takeaways for tensor abstraction
- docs/03-gemm.md: design doc + takeaways for GEMM kernels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-21 20:59:45 +08:00


Phase 0+1: CUDA FFI Infrastructure — Design Document

Goal

Build xserv-cuda, a Rust crate that wraps the CUDA Runtime API in safe abstractions:

Device query and selection
GPU memory allocation with RAII (GpuBuffer)
Caching allocator (avoid repeated cudaMalloc/cudaFree)
CUDA streams for async operations
Host↔Device memory transfers
Error handling wrapping all CUDA calls

Module Layout

crates/xserv-cuda/
├── Cargo.toml
├── build.rs          # compiles csrc/*.cu via cc crate
└── src/
    ├── lib.rs        # re-exports
    ├── ffi.rs        # raw extern "C" bindings to CUDA runtime
    ├── error.rs      # CudaError type
    ├── device.rs     # device query, DeviceInfo
    ├── stream.rs     # CudaStream wrapper
    ├── memory.rs     # GpuBuffer, H2D/D2H/D2D copy
    └── allocator.rs  # CachingAllocator

Key Design Decisions

FFI Bindings (ffi.rs)

Hand-written extern "C" bindings (~25 functions). No bindgen — keeps things explicit and readable.

Core functions needed:

Memory: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaFreeHost
Stream: cudaStreamCreate, cudaStreamDestroy, cudaStreamSynchronize
Device: cudaGetDeviceCount, cudaSetDevice, cudaGetDevice, cudaGetDeviceProperties
Sync: cudaDeviceSynchronize
Error: cudaGetLastError, cudaGetErrorString

Error Handling (error.rs)

Every CUDA call returns cudaError_t. We wrap all calls:

pub(crate) fn check(code: i32) -> Result<(), CudaError>
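
A minimal sketch of how such a check helper might look. Only the `check` signature comes from this document; the shape of `CudaError`, the `fake_cuda_call` stand-in, and `try_alloc` are hypothetical and exist so the example runs without a GPU.

```rust
use std::fmt;

/// Hypothetical error type; the real `CudaError` in error.rs may carry more detail.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CudaError {
    /// Raw `cudaError_t` value returned by the runtime.
    pub code: i32,
}

impl fmt::Display for CudaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // A real implementation could append the text from cudaGetErrorString here.
        write!(f, "CUDA error code {}", self.code)
    }
}

impl std::error::Error for CudaError {}

/// `cudaSuccess` is 0 in the CUDA runtime; any other value is treated as an error.
pub(crate) fn check(code: i32) -> Result<(), CudaError> {
    if code == 0 {
        Ok(())
    } else {
        Err(CudaError { code })
    }
}

/// Stand-in for an FFI call (assumption), so the sketch needs no CUDA toolchain.
fn fake_cuda_call(fail: bool) -> i32 {
    if fail { 2 } else { 0 } // 2 is cudaErrorMemoryAllocation
}

/// Shows the intended call-site shape: wrap the raw call, propagate with `?`.
pub fn try_alloc(fail: bool) -> Result<(), CudaError> {
    check(fake_cuda_call(fail))?;
    Ok(())
}
```

Returning `Result` from every wrapper lets callers propagate failures with `?` instead of inspecting integer codes at each call site.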

GpuBuffer (memory.rs)

RAII wrapper around a GPU pointer. Drop frees memory.

pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize,       // in bytes
    device: u32,
}

No Clone (explicit copy_from instead)
Send + !Sync (can move across threads, but not shared)
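
The ownership rules above, together with the into_raw + from_raw pattern described in the takeaways, can be sketched roughly as follows. This is an illustration under assumptions: host memory from `std::alloc` stands in for cudaMalloc/cudaFree, the `FREES` counter and `round_trip` helper are hypothetical, and `into_raw` returns a tuple here only so the buffer can be rebuilt, whereas the real API may differ.

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts stand-in frees so the behaviour can be observed without a GPU.
static FREES: AtomicUsize = AtomicUsize::new(0);

// Host-memory stand-ins for cudaMalloc / cudaFree (assumption: the real crate
// calls the FFI bindings here instead).
fn backend_malloc(len: usize) -> *mut u8 {
    unsafe { alloc(Layout::from_size_align(len, 256).unwrap()) }
}

fn backend_free(ptr: *mut u8, len: usize) {
    unsafe { dealloc(ptr, Layout::from_size_align(len, 256).unwrap()) };
    FREES.fetch_add(1, Ordering::SeqCst);
}

pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize, // in bytes
    device: u32,
}

// The raw pointer makes the type neither Send nor Sync by default. Opting in
// to Send only gives "movable across threads, never shared".
unsafe impl Send for GpuBuffer {}

impl GpuBuffer {
    pub fn new(len: usize, device: u32) -> Self {
        assert!(len > 0, "zero-sized buffers are not handled in this sketch");
        let ptr = backend_malloc(len);
        assert!(!ptr.is_null(), "stand-in allocation failed");
        GpuBuffer { ptr, len, device }
    }

    /// Gives up ownership without running Drop.
    pub fn into_raw(self) -> (*mut u8, usize, u32) {
        let parts = (self.ptr, self.len, self.device);
        std::mem::forget(self);
        parts
    }

    /// Safety: the parts must come from `into_raw` and be re-wrapped only once.
    pub unsafe fn from_raw(ptr: *mut u8, len: usize, device: u32) -> Self {
        GpuBuffer { ptr, len, device }
    }
}

impl Drop for GpuBuffer {
    fn drop(&mut self) {
        backend_free(self.ptr, self.len);
    }
}

/// Returns the number of frees seen after `into_raw` and after the final drop.
pub fn round_trip() -> (usize, usize) {
    let before = FREES.load(Ordering::SeqCst);
    let buf = GpuBuffer::new(1024, 0);
    let (ptr, len, device) = buf.into_raw(); // ownership released, nothing freed
    let after_into_raw = FREES.load(Ordering::SeqCst) - before;
    let rebuilt = unsafe { GpuBuffer::from_raw(ptr, len, device) };
    drop(rebuilt); // Drop runs exactly once, on the re-wrapped buffer
    let after_drop = FREES.load(Ordering::SeqCst) - before;
    (after_into_raw, after_drop)
}
```

Because `Clone` is not derived, duplicating a buffer has to go through an explicit copy call, which keeps device-to-device traffic visible in the code.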

CachingAllocator (allocator.rs)

Avoids cudaMalloc/cudaFree per allocation. Maintains a free-list keyed by size bucket.

Bucket boundaries: round up to next power of 2, minimum 512 bytes.

alloc(size) → find bucket, pop from free list or cudaMalloc
dealloc(ptr, size) → push to free list (don't cudaFree)
trim() → actually cudaFree everything in free lists
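
The bucket rule and the three operations above can be sketched as below. This is not the crate's implementation: addresses are plain integers handed out by a fake backend, and the `backend_mallocs` / `backend_frees` counters are hypothetical fields added so cache hits can be observed without a GPU.

```rust
use std::collections::HashMap;

const MIN_BUCKET: usize = 512;

/// Rounds a request up to its bucket: next power of two, at least 512 bytes.
pub fn bucket_size(size: usize) -> usize {
    size.max(MIN_BUCKET).next_power_of_two()
}

pub struct CachingAllocator {
    /// Bucket size -> cached device addresses ready for reuse.
    free: HashMap<usize, Vec<usize>>,
    /// Next fake address; a stand-in for what cudaMalloc would return.
    next_addr: usize,
    /// How many times the (fake) backend allocator was called.
    pub backend_mallocs: usize,
    /// How many blocks were actually released by `trim`.
    pub backend_frees: usize,
}

impl CachingAllocator {
    pub fn new() -> Self {
        CachingAllocator {
            free: HashMap::new(),
            next_addr: 0x1000,
            backend_mallocs: 0,
            backend_frees: 0,
        }
    }

    /// Pops a cached block from the matching bucket, or asks the backend.
    pub fn alloc(&mut self, size: usize) -> usize {
        let bucket = bucket_size(size);
        if let Some(addr) = self.free.get_mut(&bucket).and_then(|list| list.pop()) {
            return addr; // cache hit: no backend call
        }
        self.backend_mallocs += 1; // the real crate would call cudaMalloc here
        let addr = self.next_addr;
        self.next_addr += bucket;
        addr
    }

    /// Returns a block to its bucket instead of freeing it.
    pub fn dealloc(&mut self, addr: usize, size: usize) {
        self.free.entry(bucket_size(size)).or_default().push(addr);
    }

    /// Releases every cached block; the real crate would call cudaFree here.
    pub fn trim(&mut self) {
        for (_, list) in self.free.drain() {
            self.backend_frees += list.len();
        }
    }
}
```

Keying the free list by bucket size means a 600-byte request can reuse a block that was first allocated for 1000 bytes, since both round up to 1024.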

CudaStream (stream.rs)

Wraps cudaStream_t. RAII with Drop calling cudaStreamDestroy.

Build Pipeline

csrc/test/vecadd.cu: minimal vector-add kernel for smoke test
build.rs uses the cc crate to compile .cu files and link the CUDA runtime

Test Plan

Device info: print GPU name, memory, compute capability, SM count
GpuBuffer: alloc → H2D copy → D2H copy → verify data (256B, 64MB)
GpuBuffer: D2D copy verification
GpuBuffer: zero-fill verification
Vector add kernel: launch from Rust, verify output
CachingAllocator: alloc→free→realloc same size uses cache (no new cudaMalloc)
CachingAllocator: different size buckets are cached independently
CudaStream: create, synchronize, Drop
PinnedBuffer: page-locked host memory
Async copy: H2D async + D2H async via stream

Takeaways

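cudaDeviceProp struct layout is unreliable: field offsets in cudaDeviceProp change between CUDA versions. We first read total_global_mem through a struct mapping and got a garbage value (12 TB). The correct approach: use cudaMemGetInfo for memory information and cudaDeviceGetAttribute for other attributes. Read only the name field from cudaDeviceProp (it is always at the start of the struct, so its layout is stable).
Rust 2024 edition unsafe semantic changes:
- extern "C" blocks must carry the unsafe prefix → unsafe extern "C"
- unsafe operations inside an unsafe fn also need an explicit unsafe {} block (the unsafe_op_in_unsafe_fn lint warns by default in this edition)
- This makes the code safer, but it needs attention on a first port
CUDA support in the cc crate is built in: features = ["cuda"] is not needed (no such feature exists). Only .cuda(true).cudart("shared") is required.
Caching Allocator bucket strategy: round up to the next power of 2 (minimum 512 B). A 513 B request therefore allocates 1024 B, which causes internal fragmentation. It is simple and efficient, though, because it avoids exact-size matching in the free list. PyTorch's CUDACachingAllocator uses a more complex strategy (best-fit with splitting), but power-of-2 buckets are sufficient for inference workloads.
into_raw + from_raw pattern: GpuBuffer's RAII Drop conflicts with what CachingAllocator needs for caching, because the allocator must hold a raw pointer without triggering Drop. into_raw() consumes self (mem::forget) and returns the raw pointer; from_raw() wraps it again. This is the standard Rust pattern for managing RAII lifetimes.
dash5 environment: CUDA 12.9 is installed but nvcc is not on PATH (/usr/local/cuda/bin must be added). Rust must be installed manually via rustup. There is no rsync, so code is synced with tar | ssh tar. Development workflow: write code locally → tar sync → remote build+test.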
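
The Rust 2024 unsafe changes from the takeaways can be illustrated with a small sketch. `strlen` from the C standard library stands in for a CUDA runtime function here (an assumption, so the example links without a GPU toolchain), and `raw_len` / `safe_len` are hypothetical names.

```rust
use std::ffi::{c_char, CStr};

// Rust 2024 requires the `unsafe` keyword on extern blocks. In the real crate
// this block would declare cudaMalloc, cudaFree, and the other bindings.
unsafe extern "C" {
    fn strlen(s: *const c_char) -> usize;
}

/// Safety: `s` must point to a valid NUL-terminated string.
pub unsafe fn raw_len(s: *const c_char) -> usize {
    // Even inside an `unsafe fn`, the unsafe call gets its own explicit block.
    unsafe { strlen(s) }
}

/// Safe wrapper: the `CStr` type upholds the contract `raw_len` relies on.
pub fn safe_len(s: &CStr) -> usize {
    // SAFETY: `CStr` guarantees a valid NUL-terminated pointer.
    unsafe { raw_len(s.as_ptr()) }
}
```

Writing the inner `unsafe {}` block marks exactly which operation carries the safety obligation, so a `// SAFETY:` comment can sit next to it.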
