11 Commits

Author SHA1 Message Date
531cd3fe08 style: format Rust workspace 2026-06-18 18:11:58 +08:00
4088f49b7d cuda: infrastructure for whole-step CUDA graph capture
- Thread-local launch stream (xserv_cuda::stream): every kernel
  wrapper, cublasSetStream, and NCCL collective now launches on
  current_stream_raw() — the legacy null stream by default (behavior
  unchanged), or the capture stream installed via push_stream during
  graph capture. Capture is impossible on the legacy stream.
- Allocator retain mode: blocks freed inside a retain window are
  quarantined (RetainedBlocks) instead of pooled, so an instantiated
  graph keeps exclusive ownership of every intermediate buffer it
  references across replays.
- Capture mode GLOBAL -> THREAD_LOCAL: concurrent TP rank threads
  must not poison each other's captures with their own cudaMallocs.
- embedding_device_ids / rope_inplace_device_pos: variants reading
  token ids / positions from persistent device buffers, replacing the
  per-call host upload that a captured region cannot contain.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
Gahow Wang
6ce21345be cuda: add cached_trim() to release pooled GPU buffers
Exposes the caching allocator's trim() through a public free function.
Called after weight fusion during model loading to free temporary buffers
that would otherwise sit in the pool and cause OOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:04 +08:00
4c3f914459 kernels/cuda: paged-attention kernel, dispatch, pinned host memory
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
  activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
  pool backing) and offset-aware D2H/H2D copies used to move KV blocks
  between the GPU pool and pinned host memory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:58:36 +08:00
986a289616 fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090
P0 fixes (blocking usability):
- FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul)
- FIX-16: EOS token no longer leaks into API responses
- FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256)
- FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400

P1 fixes (bugs & performance):
- FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat)
- FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety
- FIX-09: tokenizer byte_fallback graceful degradation (was panic)
- FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf)
- FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm
- FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches

P2 fixes (improvements):
- FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations
- FIX-23: RoPE cache no longer artificially capped at 8192 positions

Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:13:43 +08:00
d5532ef209 phase 15: Tensor::empty + CUDA Graph infra — 50.3 tok/s (140% of HF, 45% roofline)
Two optimizations:

1. Tensor::empty() — skip cudaMemset for output tensors
   All kernel wrappers that fully overwrite their output now use
   Tensor::empty() instead of Tensor::zeros(). Eliminates ~756
   cudaMemset calls per decode step (21 per layer × 36 layers).
   Improvement: 46.6 → 50.3 tok/s (+8%).

2. CUDA Graph infrastructure (for future use)
   Added FFI bindings (cudaStreamBeginCapture, cudaGraphInstantiate,
   cudaGraphLaunch) and RAII CudaGraph wrapper. Not yet used in the
   forward pass due to variable kv_len, but provides foundation for
   future graph-based decode optimization.

Ablation (dash5, RTX 5090, Qwen3-8B BF16, serial decode):

| Optimization | tok/s | vs HF | Roofline |
|-------------|-------|-------|----------|
| Phase 14 baseline | 12.9 | 36% | 12% |
| + Fused kernels | 13.2 | 37% | 12% |
| + Batched decode | 13.2 (serial) | 37% | 12% |
| + Custom GEMV | 46.6 | 130% | 42% |
| + Tensor::empty | 50.3 | 140% | 45% |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 23:57:34 +08:00
ee68d3565d fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul
Strict code review identified 30+ issues across correctness, performance,
and architecture. This commit addresses 14 of them with verified fixes,
restructures Phase 12 for honest continuous batching, and updates Phase 14
to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4).

Bug fixes:
- FIX-01: Global cuBLAS handle (thread-local singleton, was per-call)
- FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels
- FIX-03: Qwen3 ChatML template (was plain text concatenation)
- FIX-04: EOS token from tokenizer (was hardcoded 151645)
- FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0))
- FIX-06: unsqueeze stride preserves contiguous layout
- FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding)
- FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic)

Feature additions:
- FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible)
- FIX-11: Correct usage statistics (prompt/completion/total tokens)
- FIX-13: Temperature / top-k / top-p sampling with SamplingParams

Performance improvements:
- FIX-07: Caching allocator wired up (thread-local pool, pooled flag)
- FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw)
- FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip)

Architecture:
- Phase 12 engine restructured: prefill/decode separation, honest TODO
  for batched GPU forward (requires Flash Attention)
- Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090)
- Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096)

Validated on dash5 (8x RTX 5090):
- 52/52 API prompts pass (EN/CN/code), SSE streaming verified
- Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap
- 8 concurrent requests: 5.99x scheduling speedup (batch_size=4)
- Throughput: 10.3 tok/s (serial), 30% of HF baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:53:28 +08:00
2d48f25e66 phase 11: GPU-resident KV cache
- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
- Fixed critical bug: seq_len must advance AFTER all layers write
  (not inside the loop per-layer)
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor

Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical
Performance: ~neutral (KV cache was never the main bottleneck —
reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)

TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 11:50:12 +08:00
d77f921a12 phase 3: GEMM kernels (naive, tiled, cuBLAS)
- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 19:48:05 +08:00
c8f7bc0c3c phase 0+1: fix Rust 2024 edition compat + memory query
- unsafe extern "C" blocks (Rust 2024 requirement)
- unsafe blocks inside unsafe fn bodies
- Use cudaMemGetInfo for accurate GPU memory reporting
- Remove cc "cuda" feature (doesn't exist, built-in)
- All 12 tests pass on RTX 5090 (CC 12.0, 170 SMs, 32GB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 19:40:49 +08:00
9806b4db35 phase 0+1: project scaffold + xserv-cuda crate
- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 18:40:22 +08:00