- Thread-local launch stream (xserv_cuda::stream): every kernel
wrapper, cublasSetStream call, and NCCL collective now uses
current_stream_raw(). By default that is the legacy null stream
(behavior unchanged); during graph capture it is the capture stream
installed via push_stream, because the legacy stream cannot be captured.
- Allocator retain mode: blocks freed inside a retain window are
quarantined (RetainedBlocks) instead of pooled, so an instantiated
graph keeps exclusive ownership of every intermediate buffer it
references across replays.
- Capture mode GLOBAL -> THREAD_LOCAL: in global mode a cudaMalloc
issued by one TP rank thread invalidates a capture in progress on
another thread; thread-local mode restricts that check to the
capturing thread.
- embedding_device_ids / rope_inplace_device_pos: variants reading
token ids / positions from persistent device buffers, replacing the
per-call host upload that a captured region cannot contain.
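A minimal host-only sketch of the thread-local stream stack described above. The names push_stream and current_stream_raw come from this change; the RawStream alias, the NULL_STREAM constant, and the drop-based StreamGuard are assumptions made for illustration and need not match the real xserv_cuda::stream API.

```rust
use std::cell::RefCell;

/// Stand-in for a raw `cudaStream_t` (assumption: modeled as an integer handle).
pub type RawStream = usize;
/// Stand-in for the legacy null stream.
pub const NULL_STREAM: RawStream = 0;

thread_local! {
    // One stack per thread, so rank threads never see each other's capture stream.
    static STREAM_STACK: RefCell<Vec<RawStream>> = RefCell::new(Vec::new());
}

/// Hypothetical guard: the pushed stream is popped when the guard is dropped.
pub struct StreamGuard;

impl Drop for StreamGuard {
    fn drop(&mut self) {
        STREAM_STACK.with(|s| {
            s.borrow_mut().pop();
        });
    }
}

/// Install `stream` as the launch stream for the current thread.
pub fn push_stream(stream: RawStream) -> StreamGuard {
    STREAM_STACK.with(|s| s.borrow_mut().push(stream));
    StreamGuard
}

/// Stream that wrappers should launch on: top of the stack, else the null stream.
pub fn current_stream_raw() -> RawStream {
    STREAM_STACK.with(|s| s.borrow().last().copied().unwrap_or(NULL_STREAM))
}
```

With this shape, code outside a capture keeps launching on the null stream, and a capture region only has to hold the guard for its duration.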
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Expose the caching allocator's trim() through a public free function.
It is called after weight fusion during model loading to release
temporary buffers that would otherwise sit in the pool and cause OOM.
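Why trimming after weight fusion helps can be sketched with a byte-accounting model. PoolModel, POOL, and trim_allocator are hypothetical names; only the existence of a trim() method and a public free function wrapping it is taken from this change, and no device memory is touched here.

```rust
use std::sync::Mutex;

/// Hypothetical byte-accounting model of a caching allocator.
pub struct PoolModel {
    live_bytes: usize,   // bytes currently handed out to callers
    cached_bytes: usize, // bytes freed by callers but still held by the pool
}

impl PoolModel {
    pub fn alloc(&mut self, size: usize) {
        // Simplification: every allocation is fresh; cached blocks are not reused.
        self.live_bytes += size;
    }

    pub fn free(&mut self, size: usize) {
        // A freed block stays reserved on the device; it only moves to the cache.
        self.live_bytes -= size;
        self.cached_bytes += size;
    }

    /// Total device memory the pool is holding.
    pub fn reserved_bytes(&self) -> usize {
        self.live_bytes + self.cached_bytes
    }

    /// Release every cached block and report how many bytes were released.
    pub fn trim(&mut self) -> usize {
        let released = self.cached_bytes;
        self.cached_bytes = 0;
        released
    }
}

/// Process-wide pool (assumption: the real allocator is also reachable globally).
pub static POOL: Mutex<PoolModel> = Mutex::new(PoolModel { live_bytes: 0, cached_bytes: 0 });

/// Public free function wrapping the method, so callers need no allocator handle.
pub fn trim_allocator() -> usize {
    POOL.lock().unwrap().trim()
}
```

In this model, three 100-byte temporaries fused into one 300-byte weight leave 600 bytes reserved until trim_allocator() releases the 300 cached bytes.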
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
pool backing) and offset-aware D2H/H2D copies used to move KV blocks
between the GPU pool and pinned host memory.
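The offset-aware copies can be illustrated with plain slices standing in for the GPU pool and the pinned host pool. copy_at, swap_out, and swap_in are hypothetical helpers; the real code goes through the D2H/H2D bindings, whose signatures are not shown in this change.

```rust
/// Copy `len` elements from `src[src_off..]` into `dst[dst_off..]`,
/// rejecting spans that fall outside either buffer.
pub fn copy_at(
    dst: &mut [f32],
    dst_off: usize,
    src: &[f32],
    src_off: usize,
    len: usize,
) -> Result<(), String> {
    let src_end = src_off
        .checked_add(len)
        .filter(|&e| e <= src.len())
        .ok_or("source span out of range")?;
    let dst_end = dst_off
        .checked_add(len)
        .filter(|&e| e <= dst.len())
        .ok_or("destination span out of range")?;
    dst[dst_off..dst_end].copy_from_slice(&src[src_off..src_end]);
    Ok(())
}

/// Move one KV block from the (modeled) GPU pool into a host swap slot.
pub fn swap_out(
    host: &mut [f32],
    host_slot: usize,
    gpu: &[f32],
    gpu_block: usize,
    block_elems: usize,
) -> Result<(), String> {
    copy_at(host, host_slot * block_elems, gpu, gpu_block * block_elems, block_elems)
}

/// Bring a swapped block back into a (possibly different) GPU block.
pub fn swap_in(
    gpu: &mut [f32],
    gpu_block: usize,
    host: &[f32],
    host_slot: usize,
    block_elems: usize,
) -> Result<(), String> {
    copy_at(gpu, gpu_block * block_elems, host, host_slot * block_elems, block_elems)
}
```

Addressing both pools by block index times block size is what lets a block return to a different GPU slot than the one it left.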
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
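- Fixed critical bug: seq_len must advance once, AFTER all layers
have written their K/V for the step (not per layer inside the layer
loop, which shifts the write offset for every layer after the first)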
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor
Correctness: GPU KV cache vs CPU KV cache: 50 of 50 bit-identical
Performance: ~neutral (KV cache was never the main bottleneck;
reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)
TTFT: 122ms, TBT: 142ms, 7.0 tok/s (vs 7.3 before, a marginal change)
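The strided layout and the seq_len fix can be sketched on the host as follows. KvCacheModel and its methods are hypothetical; only the [num_kv_heads, max_seq_len, head_dim] layout and the rule that seq_len advances after all layers have written come from the notes above. Only K is modeled, and V would be handled the same way.

```rust
/// Hypothetical host model of the per-layer K cache.
pub struct KvCacheModel {
    pub num_kv_heads: usize,
    pub max_seq_len: usize,
    pub head_dim: usize,
    pub seq_len: usize,
    /// One buffer per layer, each laid out as [num_kv_heads, max_seq_len, head_dim].
    pub k: Vec<Vec<f32>>,
}

impl KvCacheModel {
    pub fn new(num_layers: usize, num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let per_layer = num_kv_heads * max_seq_len * head_dim;
        Self {
            num_kv_heads,
            max_seq_len,
            head_dim,
            seq_len: 0,
            k: vec![vec![0.0; per_layer]; num_layers],
        }
    }

    /// Flat index of element `d` of head `head` at sequence position `pos`.
    pub fn index(&self, head: usize, pos: usize, d: usize) -> usize {
        (head * self.max_seq_len + pos) * self.head_dim + d
    }

    /// Append `new_tokens` positions for one layer at the current offset.
    /// `new_k` is [num_kv_heads, new_tokens, head_dim]. Does NOT advance seq_len.
    pub fn write_layer(&mut self, layer: usize, new_k: &[f32], new_tokens: usize) {
        assert!(self.seq_len + new_tokens <= self.max_seq_len);
        for h in 0..self.num_kv_heads {
            for t in 0..new_tokens {
                let src = (h * new_tokens + t) * self.head_dim;
                let dst = self.index(h, self.seq_len + t, 0);
                self.k[layer][dst..dst + self.head_dim]
                    .copy_from_slice(&new_k[src..src + self.head_dim]);
            }
        }
    }

    /// Called once per step, after every layer has written.
    pub fn advance(&mut self, new_tokens: usize) {
        self.seq_len += new_tokens;
    }
}
```

If advance were called inside the layer loop, layer 1 would write its first token at position 1 instead of position 0, which is the bug described above.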
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates
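The row-major trick can be checked on the CPU. cuBLAS assumes column-major storage, and a row-major m x n buffer holds the same bytes as a column-major n x m matrix (its transpose), so computing C^T = B^T * A^T with the operands swapped yields row-major C without any explicit transpose. The sketch below uses a hypothetical column-major reference routine in place of cublasGemmEx; the argument order mirrors the usual (m, n, k, A, lda, B, ldb, C, ldc) convention.

```rust
/// Reference column-major GEMM: C (m x n) = A (m x k) * B (k x n),
/// with leading dimensions lda, ldb, ldc.
pub fn gemm_col_major(
    m: usize, n: usize, k: usize,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

/// Row-major C (m x n) = A (m x k) * B (k x n), using only the column-major routine.
pub fn gemm_row_major(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    // Viewed column-major, `b` is B^T (n x k, ld = n) and `a` is A^T (k x m, ld = k).
    // Their product is C^T (n x m, ld = n), which is exactly row-major C.
    gemm_col_major(n, m, k, b, n, a, k, c, n);
}
```

The same swap (B first, A second, m and n exchanged, leading dimensions equal to the row lengths) is what a wrapper would pass to a column-major BLAS call.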
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc
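A size-bucketed free list can be sketched as below. The power-of-two rounding, the 256-byte minimum, and all names (bucket_size, BucketedPool, fresh_allocs) are assumptions for illustration; the actual CachingAllocator bucketing policy is not described here. Addresses are simulated integers rather than device pointers.

```rust
use std::collections::HashMap;

/// Round a request up to its bucket (assumption: powers of two, at least 256 bytes).
pub fn bucket_size(size: usize) -> usize {
    size.max(256).next_power_of_two()
}

/// Hypothetical pool that keeps one free list per bucket size.
#[derive(Default)]
pub struct BucketedPool {
    free_lists: HashMap<usize, Vec<u64>>,
    next_addr: u64,
    /// Number of times the pool had to fall back to a fresh allocation.
    pub fresh_allocs: usize,
}

impl BucketedPool {
    /// Return (address, bucket). Reuses a cached block of the same bucket if any.
    pub fn alloc(&mut self, size: usize) -> (u64, usize) {
        let bucket = bucket_size(size);
        if let Some(addr) = self.free_lists.get_mut(&bucket).and_then(|list| list.pop()) {
            return (addr, bucket);
        }
        // Stand-in for a real device allocation: hand out the next address range.
        let addr = self.next_addr;
        self.next_addr += bucket as u64;
        self.fresh_allocs += 1;
        (addr, bucket)
    }

    /// Return a block to the free list of its bucket instead of releasing it.
    pub fn free(&mut self, addr: u64, bucket: usize) {
        self.free_lists.entry(bucket).or_default().push(addr);
    }
}
```

Requests of 1000 and 900 bytes both land in the 1024-byte bucket, so the second one reuses the block freed by the first instead of triggering a new allocation.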
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>