Three new CUDA kernels and one rewrite:
- reshape_and_cache: scatter K/V into paged pool in a single kernel per
layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
Includes both single-sequence (prefill) and batched (decode) variants.
- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
decode now only D2H-transfers B×4 bytes (token ids) instead of the
full [B, vocab] logits tensor.
- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
pool backing) and offset-aware D2H/H2D copies used to move KV blocks
between the GPU pool and pinned host memory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three performance optimizations targeting decode throughput:
1. Decode Attention Kernel (csrc/attention/flash_attention.cu):
- Specialized kernel for Q_len=1 (decode step)
- 256 threads parallelize across KV sequence dimension
- Online softmax with block-level warp-shuffle reduction
- Replaces FA2 kernel which wasted 63/64 threads for decode
- flash_attention() auto-dispatches when q_len==1
2. Fused SiLU×Mul (csrc/activation/activations.cu):
- Single kernel: out = silu(gate) * up
- Saves 1 HBM read + 1 HBM write per FFN layer (N elements)
- Eliminates intermediate tensor allocation
3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu):
- Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual)
- Saves 1 full HBM round-trip per attention block
- Eliminates separate add + rmsnorm kernel pair
Performance analysis:
- At current short sequences (max 79 tokens), these optimizations provide
marginal benefit because the bottleneck is cuBLAS GEMV overhead:
252 weight matrix reads × ~32MB each = 15.5 GB per decode step.
Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap).
- The fused kernels and decode attention will show larger gains at
longer sequences where attention and element-wise ops dominate.
- Next optimization target: CUDA Graphs to eliminate kernel launch
overhead, or custom GEMV kernels to replace cuBLAS for M=1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kernel additions:
- add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU)
- Refactored activation.rs with dispatch_unary/dispatch_binary helpers
- Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips
GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add)
BF16 precision analysis (docs/benchmarks/phase10-qwen3.md):
- Root cause: separate attention kernels materialize BF16 intermediates
(QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path)
- HF itself SDPA vs Eager also differs by ~0.125 logit
- xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% cases
- Industry standard for BF16: top-5 overlap (we achieve 100%)
- Fix path: Flash Attention (Phase 14) to fuse attention in FP32
Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>