xserv

Author	SHA1	Message	Date
Gahow Wang	531cd3fe08	style: format Rust workspace	2026-06-18 18:11:58 +08:00
Gahow Wang	4088f49b7d	cuda: infrastructure for whole-step CUDA graph capture - Thread-local launch stream (xserv_cuda::stream): every kernel wrapper, cublasSetStream, and NCCL collective now launches on current_stream_raw() — the legacy null stream by default (behavior unchanged), or the capture stream installed via push_stream during graph capture. Capture is impossible on the legacy stream. - Allocator retain mode: blocks freed inside a retain window are quarantined (RetainedBlocks) instead of pooled, so an instantiated graph keeps exclusive ownership of every intermediate buffer it references across replays. - Capture mode GLOBAL -> THREAD_LOCAL: concurrent TP rank threads must not poison each other's captures with their own cudaMallocs. - embedding_device_ids / rope_inplace_device_pos: variants reading token ids / positions from persistent device buffers, replacing the per-call host upload that a captured region cannot contain. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 20:12:37 +08:00
Gahow Wang	6ce21345be	cuda: add cached_trim() to release pooled GPU buffers Exposes the caching allocator's trim() through a public free function. Called after weight fusion during model loading to free temporary buffers that would otherwise sit in the pool and cause OOM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:04 +08:00
Gahow Wang	4c3f914459	kernels/cuda: paged-attention kernel, dispatch, pinned host memory CUDA layer for the paged-KV + swap work: - csrc: new paged_attention.cu plus updates across attention/gemm/norm/ activation/embedding/reduce kernels and common.cuh. - xserv-kernels: new dispatch module and kernel-binding updates. - xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap pool backing) and offset-aware D2H/H2D copies used to move KV blocks between the GPU pool and pinned host memory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:58:36 +08:00
Gahow Wang	986a289616	fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090 P0 fixes (blocking usability): - FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul) - FIX-16: EOS token no longer leaks into API responses - FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256) - FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400 P1 fixes (bugs & performance): - FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat) - FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety - FIX-09: tokenizer byte_fallback graceful degradation (was panic) - FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf) - FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm - FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches P2 fixes (improvements): - FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations - FIX-23: RoPE cache no longer artificially capped at 8192 positions Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 14:13:43 +08:00
Gahow Wang	d5532ef209	phase 15: Tensor::empty + CUDA Graph infra — 50.3 tok/s (140% of HF, 45% roofline) Two optimizations: 1. Tensor::empty() — skip cudaMemset for output tensors All kernel wrappers that fully overwrite their output now use Tensor::empty() instead of Tensor::zeros(). Eliminates ~756 cudaMemset calls per decode step (21 per layer × 36 layers). Improvement: 46.6 → 50.3 tok/s (+8%). 2. CUDA Graph infrastructure (for future use) Added FFI bindings (cudaStreamBeginCapture, cudaGraphInstantiate, cudaGraphLaunch) and RAII CudaGraph wrapper. Not yet used in the forward pass due to variable kv_len, but provides foundation for future graph-based decode optimization. Ablation (dash5, RTX 5090, Qwen3-8B BF16, serial decode): \| Optimization \| tok/s \| vs HF \| Roofline \| \|-------------\|-------\|-------\|----------\| \| Phase 14 baseline \| 12.9 \| 36% \| 12% \| \| + Fused kernels \| 13.2 \| 37% \| 12% \| \| + Batched decode \| 13.2 (serial) \| 37% \| 12% \| \| + Custom GEMV \| 46.6 \| 130% \| 42% \| \| + Tensor::empty \| 50.3 \| 140% \| 45% \| Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 23:57:34 +08:00
Gahow Wang	ee68d3565d	fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul Strict code review identified 30+ issues across correctness, performance, and architecture. This commit addresses 14 of them with verified fixes, restructures Phase 12 for honest continuous batching, and updates Phase 14 to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4). Bug fixes: - FIX-01: Global cuBLAS handle (thread-local singleton, was per-call) - FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels - FIX-03: Qwen3 ChatML template (was plain text concatenation) - FIX-04: EOS token from tokenizer (was hardcoded 151645) - FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0)) - FIX-06: unsqueeze stride preserves contiguous layout - FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding) - FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic) Feature additions: - FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible) - FIX-11: Correct usage statistics (prompt/completion/total tokens) - FIX-13: Temperature / top-k / top-p sampling with SamplingParams Performance improvements: - FIX-07: Caching allocator wired up (thread-local pool, pooled flag) - FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw) - FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip) Architecture: - Phase 12 engine restructured: prefill/decode separation, honest TODO for batched GPU forward (requires Flash Attention) - Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090) - Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096) Validated on dash5 (8x RTX 5090): - 52/52 API prompts pass (EN/CN/code), SSE streaming verified - Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap - 8 concurrent requests: 5.99x scheduling speedup (batch_size=4) - Throughput: 10.3 tok/s (serial), 30% of HF baseline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:53:28 +08:00
Gahow Wang	2d48f25e66	phase 11: GPU-resident KV cache - GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset - Per-head strided layout [num_kv_heads, max_seq_len, head_dim] - Fixed critical bug: seq_len must advance AFTER all layers write (not inside the loop per-layer) - GpuBuffer::copy_from_device_at for offset-based D2D copy - Tensor::from_storage constructor for wrapping raw GPU buffers - Exported Storage and Dims from xserv-tensor Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical Performance: ~neutral (KV cache was never the main bottleneck — reshape/merge/transpose CPU round-trips dominate for Qwen3-8B) TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 11:50:12 +08:00
Gahow Wang	d77f921a12	phase 3: GEMM kernels (naive, tiled, cuBLAS) - Naive GEMM kernel: one thread per output element (F32 + BF16) - Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16) - cuBLAS wrapper: cublasGemmEx with row-major trick - GemmBackend enum for runtime backend selection - CublasContext RAII handle - Made error::check public for cross-crate use - 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16 - Cross-backend consistency verified (naive vs tiled vs cuBLAS) - All 44 tests pass across all crates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 19:48:05 +08:00
Gahow Wang	c8f7bc0c3c	phase 0+1: fix Rust 2024 edition compat + memory query - unsafe extern "C" blocks (Rust 2024 requirement) - unsafe blocks inside unsafe fn bodies - Use cudaMemGetInfo for accurate GPU memory reporting - Remove cc "cuda" feature (doesn't exist, built-in) - All 12 tests pass on RTX 5090 (CC 12.0, 170 SMs, 32GB) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 19:40:49 +08:00
Gahow Wang	9806b4db35	phase 0+1: project scaffold + xserv-cuda crate - Cargo workspace with xserv-cuda crate - CUDA FFI bindings (cudart: memory, stream, device, error) - GpuBuffer RAII wrapper with H2D/D2H/D2D copy - CudaStream wrapper with RAII Drop - CachingAllocator with size-bucketed free lists - PinnedBuffer for page-locked host memory - Device info query via cudaDeviceGetAttribute - Vector-add CUDA kernel smoke test - Integration test suite (11 tests) - build.rs: cc crate compiles .cu for SM 12.0 - sync-and-build.sh for remote build on dash5 - Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 18:40:22 +08:00

11 Commits