# Phase 0+1: CUDA FFI Infrastructure — Design Document

## Goal

Build `xserv-cuda`, a Rust crate that wraps the CUDA Runtime API in safe abstractions:
- Device query and selection
- GPU memory allocation with RAII (GpuBuffer)
- Caching allocator (avoid repeated cudaMalloc/cudaFree)
- CUDA streams for async operations
- Host↔Device memory transfers
- Error handling wrapping all CUDA calls

## Module Layout

```
crates/xserv-cuda/
├── Cargo.toml
├── build.rs          # compiles csrc/*.cu via cc crate
└── src/
    ├── lib.rs        # re-exports
    ├── ffi.rs        # raw extern "C" bindings to CUDA runtime
    ├── error.rs      # CudaError type
    ├── device.rs     # device query, DeviceInfo
    ├── stream.rs     # CudaStream wrapper
    ├── memory.rs     # GpuBuffer, H2D/D2H/D2D copy
    └── allocator.rs  # CachingAllocator
```

## Key Design Decisions

### FFI Bindings (ffi.rs)
Hand-written `extern "C"` bindings (~25 functions). We do not use bindgen, which keeps the FFI surface explicit and readable.

Core functions needed:
- Memory: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaFreeHost
- Stream: cudaStreamCreate, cudaStreamDestroy, cudaStreamSynchronize
- Device: cudaGetDeviceCount, cudaSetDevice, cudaGetDevice, cudaGetDeviceProperties (read for `name` only), cudaDeviceGetAttribute, cudaMemGetInfo
- Sync: cudaDeviceSynchronize
- Error: cudaGetLastError, cudaGetErrorString

### Error Handling (error.rs)
Nearly every CUDA Runtime call returns a `cudaError_t`. We route each call through one helper that converts the status code into a `Result`:
```rust
pub(crate) fn check(code: i32) -> Result<(), CudaError>
```
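A minimal sketch of what `check` could look like, assuming `CudaError` simply carries the raw status code. The `CudaError` fields, `fake_cuda_call`, and `run` are hypothetical names for illustration; the real type may also store the message from `cudaGetErrorString`, which is not modeled here.

```rust
use std::fmt;

/// `cudaSuccess` is 0 in the CUDA runtime; any other value is an error.
const CUDA_SUCCESS: i32 = 0;

/// Assumed shape of the error type: it only stores the raw status code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CudaError {
    pub code: i32,
}

impl fmt::Display for CudaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "CUDA error code {}", self.code)
    }
}

impl std::error::Error for CudaError {}

/// Convert a raw `cudaError_t` value into a `Result`.
pub(crate) fn check(code: i32) -> Result<(), CudaError> {
    if code == CUDA_SUCCESS {
        Ok(())
    } else {
        Err(CudaError { code })
    }
}

/// Hypothetical stand-in for an FFI call, so the sketch runs without a GPU.
/// 2 is the value of `cudaErrorMemoryAllocation`.
fn fake_cuda_call(fail: bool) -> i32 {
    if fail { 2 } else { CUDA_SUCCESS }
}

/// Typical call-site shape: wrap the raw call and propagate with `?`.
pub fn run(fail: bool) -> Result<(), CudaError> {
    check(fake_cuda_call(fail))?;
    Ok(())
}
```

With this shape, every safe wrapper in the crate can end in `check(unsafe { ffi::some_call(..) })?`, so no status code is silently dropped.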

### GpuBuffer (memory.rs)
RAII wrapper around a GPU pointer. Drop frees memory.
```rust
pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize,       // in bytes
    device: u32,
}
```
- No Clone (explicit copy_from instead)
- Send + !Sync (can move across threads, but not shared)
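The ownership rules above can be sketched as follows. This is not the crate's implementation: `raw_alloc`, `raw_free`, and the `LIVE` counter are hypothetical stand-ins backed by host memory so the example runs without a GPU, where the real code would call `cudaMalloc` and `cudaFree`. The `into_raw`/`from_raw` pair is the handoff described in the takeaways; the tuple signature used here is an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts live raw allocations so the effect of Drop is observable.
pub static LIVE: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for cudaMalloc (assumption: host memory instead of device memory).
fn raw_alloc(len: usize) -> *mut u8 {
    LIVE.fetch_add(1, Ordering::SeqCst);
    Box::into_raw(vec![0u8; len].into_boxed_slice()) as *mut u8
}

/// Stand-in for cudaFree. `ptr` and `len` must come from `raw_alloc`.
fn raw_free(ptr: *mut u8, len: usize) {
    // SAFETY: the pointer/length pair was produced by `raw_alloc`.
    unsafe { drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, len))) };
    LIVE.fetch_sub(1, Ordering::SeqCst);
}

/// No `#[derive(Clone)]`: copies must be explicit.
pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize, // in bytes
    device: u32,
}

// The raw pointer field makes the struct !Send and !Sync by default.
// Only Send is opted back in, so the buffer can move between threads
// but cannot be shared by reference.
unsafe impl Send for GpuBuffer {}

impl GpuBuffer {
    pub fn new(len: usize, device: u32) -> Self {
        GpuBuffer { ptr: raw_alloc(len), len, device }
    }

    /// Give up ownership without running Drop.
    pub fn into_raw(self) -> (*mut u8, usize, u32) {
        let parts = (self.ptr, self.len, self.device);
        std::mem::forget(self);
        parts
    }

    /// # Safety
    /// The parts must come from `into_raw` and must not be reused afterwards.
    pub unsafe fn from_raw(ptr: *mut u8, len: usize, device: u32) -> Self {
        GpuBuffer { ptr, len, device }
    }
}

impl Drop for GpuBuffer {
    fn drop(&mut self) {
        raw_free(self.ptr, self.len);
    }
}
```

Relying on the raw pointer's auto-trait behavior means `!Sync` needs no extra code; only the `Send` opt-in has to be justified.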

### CachingAllocator (allocator.rs)
Avoids cudaMalloc/cudaFree per allocation. Maintains a free-list keyed by size bucket.

Bucket boundaries: round up to next power of 2, minimum 512 bytes.
- alloc(size) → find bucket, pop from free list or cudaMalloc
- dealloc(ptr, size) → push to free list (don't cudaFree)
- trim() → actually cudaFree everything in free lists
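The bucket rule and the three operations can be sketched as below. To keep the example runnable without a GPU, device pointers are modeled as plain `usize` addresses handed out by a counter, and `raw_allocs`/`raw_frees` count the places where the real allocator would call `cudaMalloc`/`cudaFree`. Those fields, `bucket_size`, and `MIN_BUCKET` are illustrative names, not the crate's actual API.

```rust
use std::collections::HashMap;

/// Smallest bucket, in bytes.
const MIN_BUCKET: usize = 512;

/// Round a request up to its bucket: next power of 2, at least 512 bytes.
pub fn bucket_size(size: usize) -> usize {
    size.max(MIN_BUCKET).next_power_of_two()
}

pub struct CachingAllocator {
    /// bucket size -> cached (free) addresses of that size
    free: HashMap<usize, Vec<usize>>,
    /// Fake address source standing in for cudaMalloc.
    next_addr: usize,
    pub raw_allocs: usize,
    pub raw_frees: usize,
}

impl CachingAllocator {
    pub fn new() -> Self {
        CachingAllocator {
            free: HashMap::new(),
            next_addr: 0x1000,
            raw_allocs: 0,
            raw_frees: 0,
        }
    }

    /// Pop a cached block from the matching bucket, or allocate a new one.
    pub fn alloc(&mut self, size: usize) -> usize {
        let bucket = bucket_size(size);
        if let Some(ptr) = self.free.get_mut(&bucket).and_then(Vec::pop) {
            return ptr; // cache hit: no raw allocation
        }
        // Cache miss: the real allocator would call cudaMalloc(bucket) here.
        self.raw_allocs += 1;
        let ptr = self.next_addr;
        self.next_addr += bucket;
        ptr
    }

    /// Return a block to its bucket's free list without releasing it.
    pub fn dealloc(&mut self, ptr: usize, size: usize) {
        self.free.entry(bucket_size(size)).or_default().push(ptr);
    }

    /// Release everything held in the free lists.
    pub fn trim(&mut self) {
        for (_, ptrs) in self.free.drain() {
            // The real allocator would call cudaFree on each pointer here.
            self.raw_frees += ptrs.len();
        }
    }
}
```

Because `dealloc` recomputes the bucket from `size`, a 600-byte block and a later 1000-byte request both map to the 1024-byte bucket and can reuse the same memory.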

### CudaStream (stream.rs)
Wraps cudaStream_t. RAII with Drop calling cudaStreamDestroy.

## Build Pipeline
- `csrc/test/vecadd.cu`: minimal vector-add kernel for smoke test
- `build.rs` uses `cc` crate to compile .cu files, link CUDA runtime

## Test Plan

- [x] Device info: print GPU name, memory, compute capability, SM count
- [x] GpuBuffer: alloc → H2D copy → D2H copy → verify data (256B, 64MB)
- [x] GpuBuffer: D2D copy verification
- [x] GpuBuffer: zero-fill verification
- [x] Vector add kernel: launch from Rust, verify output
- [x] CachingAllocator: alloc→free→realloc same size uses cache (no new cudaMalloc)
- [x] CachingAllocator: different size buckets are cached independently
- [x] CudaStream: create, synchronize, Drop
- [x] PinnedBuffer: page-locked host memory
- [x] Async copy: H2D async + D2H async via stream

## Takeaways

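1. **The `cudaDeviceProp` struct layout is unreliable**: field offsets in `cudaDeviceProp` change between CUDA versions. We initially read `total_global_mem` through a hand-mapped struct and got a garbage value (12 TB). The correct approach is to query device memory with `cudaMemGetInfo` and other properties with `cudaDeviceGetAttribute`, and to read only the `name` field from `cudaDeviceProp` (it is always at the start of the struct, so its offset is stable).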

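2. **Rust 2024 edition changes to `unsafe` semantics**:
   - `extern "C"` blocks must carry the `unsafe` prefix: `unsafe extern "C"`
   - Unsafe calls inside an `unsafe fn` also need an explicit `unsafe {}` block (the `unsafe_op_in_unsafe_fn` lint warns by default in this edition)
   - This makes the code safer, but it needs attention when first porting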

3. **CUDA support is built into the `cc` crate**: no `features = ["cuda"]` is needed (that feature does not exist). `.cuda(true).cudart("shared")` is enough.

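4. **Caching allocator bucket strategy**: round up to the next power of 2 (minimum 512 B). A 513 B request therefore allocates 1024 B, so there is internal fragmentation. The scheme is still simple and efficient, because it avoids the exact-size matching problem in the free list. PyTorch's CUDACachingAllocator uses a more complex strategy (best-fit with splitting), but for inference workloads power-of-2 buckets are sufficient.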

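5. **The `into_raw` + `from_raw` pattern**: GpuBuffer's RAII Drop conflicts with what CachingAllocator needs for caching, because the allocator must hold a raw pointer without triggering Drop. `into_raw()` consumes self (via `mem::forget`) and returns the raw pointer; `from_raw()` wraps it again. This is the standard pattern for managing RAII lifetimes in Rust.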

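6. **dash5 environment**: CUDA 12.9 is installed, but `nvcc` is not on PATH (it needs `/usr/local/cuda/bin`). Rust has to be installed manually via rustup. There is no rsync, so code is synced with `tar | ssh tar`. Development workflow: write code locally → tar sync → remote build + test.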
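To illustrate takeaway 2, here is a small sketch of the Rust 2024 `unsafe` rules. It declares `strlen` from the C standard library as a stand-in for a CUDA runtime function, so it links without CUDA; `raw_len` and `safe_len` are hypothetical helper names. It assumes a toolchain that accepts `unsafe extern` blocks (Rust 1.82 or newer).

```rust
use std::ffi::{c_char, CString};

// Rust 2024: the extern block itself must be marked `unsafe`.
unsafe extern "C" {
    // C standard library function, standing in for a CUDA runtime binding.
    fn strlen(s: *const c_char) -> usize;
}

/// Thin unsafe wrapper, the shape a raw FFI helper usually takes.
///
/// # Safety
/// `s` must point to a valid NUL-terminated string.
pub unsafe fn raw_len(s: *const c_char) -> usize {
    // Being inside an `unsafe fn` is not enough in the 2024 edition:
    // the unsafe call still gets its own explicit block.
    unsafe { strlen(s) }
}

/// Safe entry point: validates the input, then upholds the contract above.
pub fn safe_len(text: &str) -> Option<usize> {
    // CString::new fails if the text contains an interior NUL byte.
    let c = CString::new(text).ok()?;
    // SAFETY: `c` is a valid NUL-terminated string that outlives the call.
    Some(unsafe { raw_len(c.as_ptr()) })
}
```

The same layering applies to the CUDA bindings: `ffi.rs` holds the `unsafe extern "C"` declarations, and the safe wrappers document which invariant each `unsafe {}` block relies on.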