phase 0+1: project scaffold + xserv-cuda crate

- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1754
docs/00-roadmap.md
Normal file
File diff suppressed because it is too large
80
docs/01-cuda-ffi.md
Normal file
@@ -0,0 +1,80 @@
# Phase 0+1: CUDA FFI Infrastructure (Design Document)

## Goal

Build `xserv-cuda`, a Rust crate that wraps the CUDA Runtime API with safe abstractions:
- Device query and selection
- GPU memory allocation with RAII (GpuBuffer)
- Caching allocator (avoids repeated cudaMalloc/cudaFree)
- CUDA streams for async operations
- Host↔Device memory transfers
- Error handling that wraps all CUDA calls

## Module Layout

```
crates/xserv-cuda/
├── Cargo.toml
├── build.rs          # compiles csrc/*.cu via cc crate
└── src/
    ├── lib.rs        # re-exports
    ├── ffi.rs        # raw extern "C" bindings to CUDA runtime
    ├── error.rs      # CudaError type
    ├── device.rs     # device query, DeviceInfo
    ├── stream.rs     # CudaStream wrapper
    ├── memory.rs     # GpuBuffer, H2D/D2H/D2D copy
    └── allocator.rs  # CachingAllocator
```

## Key Design Decisions

### FFI Bindings (ffi.rs)
Hand-written `extern "C"` bindings (~25 functions). No bindgen, which keeps the bindings explicit and readable.

Core functions needed:
- Memory: cudaMalloc, cudaFree, cudaMemcpy, cudaMemcpyAsync, cudaMallocHost, cudaFreeHost
- Stream: cudaStreamCreate, cudaStreamDestroy, cudaStreamSynchronize
- Device: cudaGetDeviceCount, cudaSetDevice, cudaGetDevice, cudaGetDeviceProperties
- Sync: cudaDeviceSynchronize
- Error: cudaGetLastError, cudaGetErrorString
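
As an illustration of the hand-written approach, here is a minimal sketch of how a few of these declarations might look. The alias names `CudaErrorT` and `CudaStreamT` and the Rust enum `CudaMemcpyKind` are assumptions of this sketch, not the crate's actual definitions; the numeric values follow the CUDA Runtime's `cudaMemcpyKind`. The `#[link(name = "cudart")]` attribute is deliberately left out so the snippet compiles on a machine without CUDA; a real `ffi.rs` or `build.rs` would have to link the runtime.

```rust
use std::os::raw::{c_char, c_int, c_void};

/// cudaError_t is a C enum; it crosses the FFI boundary as a plain int.
pub type CudaErrorT = c_int;
/// cudaStream_t is an opaque pointer to a runtime-owned stream object.
pub type CudaStreamT = *mut c_void;

/// Mirrors the values of the CUDA Runtime's cudaMemcpyKind.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CudaMemcpyKind {
    HostToHost = 0,
    HostToDevice = 1,
    DeviceToHost = 2,
    DeviceToDevice = 3,
    Default = 4,
}

// Declarations only: nothing here is called, so no CUDA library is needed
// to compile this sketch.
#[allow(dead_code)]
unsafe extern "C" {
    pub fn cudaMalloc(dev_ptr: *mut *mut c_void, size: usize) -> CudaErrorT;
    pub fn cudaFree(dev_ptr: *mut c_void) -> CudaErrorT;
    pub fn cudaMemcpy(
        dst: *mut c_void,
        src: *const c_void,
        count: usize,
        kind: CudaMemcpyKind,
    ) -> CudaErrorT;
    pub fn cudaStreamCreate(stream: *mut CudaStreamT) -> CudaErrorT;
    pub fn cudaStreamDestroy(stream: CudaStreamT) -> CudaErrorT;
    pub fn cudaStreamSynchronize(stream: CudaStreamT) -> CudaErrorT;
    pub fn cudaGetDeviceCount(count: *mut c_int) -> CudaErrorT;
    pub fn cudaGetErrorString(error: CudaErrorT) -> *const c_char;
}

fn main() {
    // Only the constants are exercised; calling the functions requires
    // a linked CUDA runtime and a device.
    println!("HostToDevice = {}", CudaMemcpyKind::HostToDevice as i32);
}
```

Keeping each declaration next to its C signature in the CUDA headers makes a mismatch (for example, a `usize` where the runtime expects an `int`) easy to spot in review, which is the main argument for skipping bindgen at this size.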

### Error Handling (error.rs)
Nearly every CUDA Runtime call returns a `cudaError_t`. All such calls go through a single checker that turns the code into a `Result`:
```rust
pub(crate) fn check(code: i32) -> Result<(), CudaError>
```
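
A minimal sketch of how `check` and `CudaError` could fit together. The single `code` field and the `Display` text are assumptions of this sketch; the design above only fixes the signature of `check`. The one fact relied on is that `cudaSuccess` is 0.

```rust
use std::fmt;

/// Error carrying the raw cudaError_t value returned by a runtime call.
/// (Hypothetical layout; the real CudaError may hold more context.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CudaError {
    pub code: i32,
}

impl fmt::Display for CudaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // A real implementation would likely append cudaGetErrorString(code).
        write!(f, "CUDA error code {}", self.code)
    }
}

impl std::error::Error for CudaError {}

/// cudaSuccess is 0; every other value is reported as an error.
pub(crate) fn check(code: i32) -> Result<(), CudaError> {
    if code == 0 {
        Ok(())
    } else {
        Err(CudaError { code })
    }
}

fn main() {
    assert!(check(0).is_ok());
    // 2 is cudaErrorMemoryAllocation in the CUDA Runtime's error enum.
    let err = check(2).unwrap_err();
    println!("{err}");
}
```

With this shape, call sites can wrap each FFI call as `check(unsafe { ffi::cudaFree(ptr) })?`, so error propagation uses the ordinary `?` operator.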

### GpuBuffer (memory.rs)
RAII wrapper around a GPU pointer. `Drop` frees the memory.
```rust
pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize, // in bytes
    device: u32,
}
```
- No `Clone` (explicit `copy_from` instead)
- `Send + !Sync` (can be moved across threads, but not shared between them)
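
The ownership rules above can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: `fake_cuda_free` and the `FREED` counter are stand-ins for `cudaFree` so the example runs without a GPU, and the address `0x1000` is a made-up value that is never dereferenced.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for cudaFree (assumption): it only counts how often it is called.
static FREED: AtomicUsize = AtomicUsize::new(0);

fn fake_cuda_free(_ptr: *mut u8) {
    FREED.fetch_add(1, Ordering::SeqCst);
}

pub struct GpuBuffer {
    ptr: *mut u8,
    len: usize, // in bytes
    device: u32,
}

// A raw pointer field makes the struct neither Send nor Sync by default.
// Opting into Send alone lets ownership move to another thread while
// `&GpuBuffer` still cannot be shared across threads (no Sync impl).
unsafe impl Send for GpuBuffer {}

impl Drop for GpuBuffer {
    fn drop(&mut self) {
        if !self.ptr.is_null() {
            // Real code would call cudaFree on the device pointer here.
            fake_cuda_free(self.ptr);
        }
    }
}

fn main() {
    // Hypothetical device address; never dereferenced on the host.
    let buf = GpuBuffer { ptr: 0x1000 as *mut u8, len: 256, device: 0 };

    // Moving the buffer into another thread compiles because of Send.
    let handle = std::thread::spawn(move || {
        println!("{} bytes on device {}", buf.len, buf.device);
        // buf goes out of scope here, so Drop runs on the worker thread.
    });
    handle.join().unwrap();

    assert_eq!(FREED.load(Ordering::SeqCst), 1);
}
```

Because there is no `Clone`, two `GpuBuffer` values can never own the same pointer, so `Drop` frees each allocation exactly once.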

### CachingAllocator (allocator.rs)
Avoids a cudaMalloc/cudaFree pair per allocation by maintaining a free list keyed by size bucket.

Bucket sizes: each request is rounded up to the next power of 2, with a minimum of 512 bytes.
- alloc(size) → find the bucket, pop from its free list, or fall back to cudaMalloc
- dealloc(ptr, size) → push onto the free list (no cudaFree)
- trim() → cudaFree everything held in the free lists
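
The bucketing and free-list behavior described above can be sketched as follows. To keep the example runnable without a GPU, device pointers are modeled as `usize` and `raw_alloc` is a stand-in for `cudaMalloc`; the field names and the `raw_allocs` counter are assumptions made for illustration only.

```rust
use std::collections::HashMap;

const MIN_BUCKET: usize = 512;

/// Round a request up to its bucket: next power of two, at least 512 bytes.
fn bucket_size(size: usize) -> usize {
    size.max(MIN_BUCKET).next_power_of_two()
}

/// Free lists keyed by bucket size (hypothetical layout).
#[derive(Default)]
pub struct CachingAllocator {
    free: HashMap<usize, Vec<usize>>,
    next_fake_ptr: usize,
    pub raw_allocs: usize, // how many times the backend was hit
}

impl CachingAllocator {
    // Stand-in for cudaMalloc: hands out fake, increasing addresses.
    fn raw_alloc(&mut self, bytes: usize) -> usize {
        self.raw_allocs += 1;
        self.next_fake_ptr += bytes;
        self.next_fake_ptr
    }

    pub fn alloc(&mut self, size: usize) -> usize {
        let bucket = bucket_size(size);
        // Reuse a cached block of the same bucket if one is available.
        if let Some(ptr) = self.free.get_mut(&bucket).and_then(|list| list.pop()) {
            return ptr;
        }
        self.raw_alloc(bucket)
    }

    pub fn dealloc(&mut self, ptr: usize, size: usize) {
        // Cache the block instead of returning it to the backend.
        self.free.entry(bucket_size(size)).or_default().push(ptr);
    }

    /// Empties all free lists and returns how many blocks were released.
    /// Real code would call cudaFree on each cached pointer.
    pub fn trim(&mut self) -> usize {
        self.free.drain().map(|(_, list)| list.len()).sum()
    }
}

fn main() {
    let mut a = CachingAllocator::default();
    let p = a.alloc(600); // bucket 1024, backend is hit
    a.dealloc(p, 600);
    let q = a.alloc(1000); // same bucket, served from the cache
    assert_eq!(p, q);
    assert_eq!(a.raw_allocs, 1);
}
```

Power-of-two buckets trade memory for reuse: a 600-byte request occupies a 1024-byte block, but any later request between 513 and 1024 bytes can reuse it without touching the backend.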

### CudaStream (stream.rs)
Wraps `cudaStream_t`. RAII: `Drop` calls cudaStreamDestroy.

## Build Pipeline
- `csrc/test/vecadd.cu`: minimal vector-add kernel for the smoke test
- `build.rs` uses the `cc` crate to compile the .cu files and link the CUDA runtime

## Test Plan
1. Device info: print GPU name, memory, compute capability, SM count
2. GpuBuffer: alloc 1 GB, H2D copy, D2H copy, verify data
3. Vector add kernel: launch from Rust, verify output
4. CachingAllocator: alloc→free→realloc of the same size is served from the cache (no new cudaMalloc)
5. Multi-stream: two concurrent memcpys on different streams
6. Benchmark: caching allocator vs raw cudaMalloc (100 cycles)