phase 0+1: project scaffold + xserv-cuda crate

- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 18:40:22 +08:00
commit 9806b4db35
16 changed files with 2629 additions and 0 deletions
--- a/docs/00-roadmap.md
+++ b/docs/00-roadmap.md
--- a/docs/01-cuda-ffi.md
+++ b/docs/01-cuda-ffi.md
@@ -0,0 +1,80 @@
+# Phase 0+1: CUDA FFI Infrastructure — Design Document
+
+## Goal
+
+Build `xserv-cuda`, a Rust crate that wraps the CUDA Runtime API in safe abstractions:
+- Device query and selection
+- GPU memory allocation with RAII (GpuBuffer)
+- Caching allocator (avoid repeated cudaMalloc/cudaFree)
+- CUDA streams for async operations
+- Host↔Device memory transfers
+- Error handling wrapping all CUDA calls
+
+## Module Layout
+
+```
+crates/xserv-cuda/
+├── Cargo.toml
+├── build.rs          # compiles csrc/*.cu via cc crate
+└── src/
+    ├── lib.rs        # re-exports
+    ├── ffi.rs        # raw extern "C" bindings to CUDA runtime
+    ├── error.rs      # CudaError type
+    ├── device.rs     # device query, DeviceInfo
+    ├── stream.rs     # CudaStream wrapper
+    ├── memory.rs     # GpuBuffer, H2D/D2H/D2D copy
+    └── allocator.rs  # CachingAllocator
+```
+
+## Key Design Decisions
+
+### FFI Bindings (ffi.rs)
+Hand-written extern "C" bindings (~25 functions). No bindgen, which keeps the bindings explicit and readable.
+
+Core functions needed:
+- Memory: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaFreeHost
+- Stream: cudaStreamCreate, cudaStreamDestroy, cudaStreamSynchronize
+- Device: cudaGetDeviceCount, cudaSetDevice, cudaGetDevice, cudaGetDeviceProperties
+- Sync: cudaDeviceSynchronize
+- Error: cudaGetLastError, cudaGetErrorString
+
+### Error Handling (error.rs)
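+Almost every CUDA runtime call returns a cudaError_t status code (cudaGetErrorString, which returns a string, is an exception). Every such call is routed through a single helper that converts the status into a Result: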
+```rust
+pub(crate) fn check(code: i32) -> Result<(), CudaError>
+```
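A minimal sketch of how such a `check` helper might look. The shape of `CudaError` here (a struct holding only the raw status code) is an assumption for illustration; the design above names the type but does not show its fields. The one fact relied on is that `cudaSuccess` is 0 in the CUDA runtime.

```rust
use std::fmt;

/// Hypothetical error type: carries only the raw `cudaError_t` status code.
/// The real crate may also store the message from `cudaGetErrorString`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CudaError {
    pub code: i32,
}

impl fmt::Display for CudaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "CUDA error code {}", self.code)
    }
}

impl std::error::Error for CudaError {}

/// `cudaSuccess` is 0; any other status is treated as a failure.
const CUDA_SUCCESS: i32 = 0;

/// Convert a raw status code into a `Result` so call sites can use `?`.
pub(crate) fn check(code: i32) -> Result<(), CudaError> {
    if code == CUDA_SUCCESS {
        Ok(())
    } else {
        Err(CudaError { code })
    }
}
```

With this in place, a wrapper around an FFI call would typically read `check(unsafe { ffi::some_call(..) })?;`, where `some_call` stands for any binding in `ffi.rs`.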
+
+### GpuBuffer (memory.rs)
+RAII wrapper around a GPU pointer. Drop frees memory.
+```rust
+pub struct GpuBuffer {
+    ptr: *mut u8,
+    len: usize,       // in bytes
+    device: u32,
+}
+```
+- No Clone (explicit copy_from instead)
+- Send + !Sync (ownership can move to another thread, but a buffer is never shared between threads by reference)
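The `Send + !Sync` choice can be sketched as follows. A struct containing a raw pointer is neither `Send` nor `Sync` by default, so only `Send` needs to be opted into, and `!Sync` falls out for free. The `new_unowned` constructor and the `assert_send` helper are hypothetical names added for illustration, and the sketch omits the `Drop` impl because it would need the real `cudaFree` binding.

```rust
/// Same field layout as the design above.
pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize, // in bytes
    device: u32,
}

// SAFETY (assumption of this sketch): the buffer exclusively owns its device
// allocation, so handing the whole value to another thread is sound.
// No `unsafe impl Sync` is written; the raw pointer field keeps the type !Sync.
unsafe impl Send for GpuBuffer {}

impl GpuBuffer {
    /// Hypothetical constructor for the sketch: wraps an existing pointer
    /// without allocating. The real type would call cudaMalloc instead.
    pub fn new_unowned(ptr: *mut u8, len: usize, device: u32) -> Self {
        GpuBuffer { ptr, len, device }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    pub fn device(&self) -> u32 {
        self.device
    }

    pub fn as_ptr(&self) -> *mut u8 {
        self.ptr
    }
}

/// Compile-time check: this only type-checks when `T: Send`.
pub fn assert_send<T: Send>() {}
```

Since `Clone` is not derived either, the only way to duplicate contents is an explicit copy method such as the `copy_from` mentioned above.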
+
+### CachingAllocator (allocator.rs)
+Avoids a cudaMalloc/cudaFree pair for every allocation by keeping freed blocks in free lists keyed by size bucket.
+
+Bucket boundaries: round up to next power of 2, minimum 512 bytes.
+- alloc(size) → find bucket, pop from free list or cudaMalloc
+- dealloc(ptr, size) → push to free list (don't cudaFree)
+- trim() → actually cudaFree everything in free lists
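The bucket rule and the three operations above can be sketched without a GPU by hiding the raw allocation calls behind a trait. Everything here other than the names `alloc`, `dealloc`, `trim` and the 512-byte minimum is an assumption: the `RawAlloc` trait and `CountingBackend` are hypothetical stand-ins for cudaMalloc/cudaFree, and device addresses are modelled as plain `usize` values.

```rust
use std::collections::HashMap;

/// Minimum bucket size from the design: 512 bytes.
const MIN_BUCKET: usize = 512;

/// Round a request up to its bucket: next power of two, at least 512 bytes.
pub fn bucket_size(size: usize) -> usize {
    size.max(MIN_BUCKET).next_power_of_two()
}

/// Hypothetical stand-in for cudaMalloc/cudaFree so the sketch runs anywhere.
pub trait RawAlloc {
    fn raw_alloc(&mut self, bytes: usize) -> usize; // returns a fake device address
    fn raw_free(&mut self, ptr: usize);
}

pub struct CachingAllocator<A: RawAlloc> {
    pub backend: A,
    /// bucket size -> cached blocks of exactly that size
    free_lists: HashMap<usize, Vec<usize>>,
}

impl<A: RawAlloc> CachingAllocator<A> {
    pub fn new(backend: A) -> Self {
        CachingAllocator { backend, free_lists: HashMap::new() }
    }

    /// Reuse a cached block from the matching bucket, else allocate a new one.
    pub fn alloc(&mut self, size: usize) -> usize {
        let bucket = bucket_size(size);
        if let Some(ptr) = self.free_lists.get_mut(&bucket).and_then(|l| l.pop()) {
            return ptr;
        }
        self.backend.raw_alloc(bucket)
    }

    /// Return a block to its bucket's free list without releasing it.
    pub fn dealloc(&mut self, ptr: usize, size: usize) {
        self.free_lists.entry(bucket_size(size)).or_default().push(ptr);
    }

    /// Release every cached block back to the backend.
    pub fn trim(&mut self) {
        for (_, list) in self.free_lists.drain() {
            for ptr in list {
                self.backend.raw_free(ptr);
            }
        }
    }
}

/// Test backend that counts calls and hands out distinct fake addresses.
#[derive(Default)]
pub struct CountingBackend {
    pub mallocs: usize,
    pub frees: usize,
    next: usize,
}

impl RawAlloc for CountingBackend {
    fn raw_alloc(&mut self, bytes: usize) -> usize {
        self.mallocs += 1;
        self.next += bytes;
        self.next
    }

    fn raw_free(&mut self, _ptr: usize) {
        self.frees += 1;
    }
}
```

Because requests of 900 and 1000 bytes both round to the 1024-byte bucket, an alloc, dealloc, alloc sequence with those sizes reuses the same block and the backend sees a single allocation. The trade-off of power-of-two buckets is internal waste of up to nearly half of each block in exchange for a high cache hit rate.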
+
+### CudaStream (stream.rs)
+Wraps cudaStream_t. RAII with Drop calling cudaStreamDestroy.
+
+## Build Pipeline
+- `csrc/test/vecadd.cu`: minimal vector-add kernel for smoke test
+- `build.rs` uses `cc` crate to compile .cu files, link CUDA runtime
+
+## Test Plan
+1. Device info: print GPU name, memory, compute capability, SM count
+2. GpuBuffer: alloc 1 GB, H2D copy, D2H copy, verify the data round-trips unchanged
+3. Vector add kernel: launch from Rust, verify output
+4. CachingAllocator: alloc→free→realloc same size uses cache (no new cudaMalloc)
+5. Multi-stream: two concurrent memcpy on different streams
+6. Benchmark: caching allocator vs raw cudaMalloc (100 cycles)