4 Commits

Author SHA1 Message Date
a67753f516 softmax: cap block size at 512 threads
launch_softmax_{f32,bf16} clamped the block size to 1024 threads whenever
cols was larger. Halving the ceiling to 512 keeps two blocks resident per
SM on the large-vocab kernels that dominate speculative-verify workloads.
The rows/block indexing is unchanged, and the block size still never
exceeds cols.
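
The clamp described in this message can be sketched on the host side as
follows. This is a minimal illustration, not the actual
launch_softmax_{f32,bf16} code: the helper name `softmax_block_threads` and
the constant `MAX_SOFTMAX_THREADS` are hypothetical, and the sketch assumes
the block size is simply the smaller of cols and the ceiling.

```rust
/// Hypothetical ceiling on threads per block; the commit lowers it from 1024 to 512.
const MAX_SOFTMAX_THREADS: u32 = 512;

/// Hypothetical helper: choose a block size for a softmax row of `cols` elements.
/// Assumption: block size = min(cols, ceiling), so it never exceeds `cols`.
fn softmax_block_threads(cols: u32) -> u32 {
    cols.min(MAX_SOFTMAX_THREADS)
}
```

Under this assumption a 64-column row launches 64 threads, while any row
wider than 512 columns (for example a vocab-sized row) launches 512.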
2026-07-01 14:16:32 +08:00
Gahow Wang
13ae3de69e kernels: reshape_and_cache, GPU argmax, single-launch GEMV
Three new CUDA kernels and one rewrite:

- reshape_and_cache: scatter K/V into paged pool in a single kernel per
  layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
  Includes both single-sequence (prefill) and batched (decode) variants.

- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
  decode now copies only B×4 bytes (token ids) device-to-host instead
  of the full [B, vocab] logits tensor.

- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
  the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
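
To make the argmax item concrete, here is a CPU reference sketch of a
row-wise argmax plus the transfer-size arithmetic. It is not the CUDA
kernel: the function names are hypothetical, the logits are held as f32 on
the CPU for simplicity, ties resolve to the lowest index (the kernel's tie
rule is not stated in the message), and the full-logits size assumes 2-byte
BF16 elements.

```rust
/// CPU reference: argmax of each row of a row-major [batch, vocab] matrix.
/// Returns one u32 token id per row; ties pick the lowest index (an assumption).
fn argmax_rows(logits: &[f32], vocab: usize) -> Vec<u32> {
    assert!(vocab > 0 && logits.len() % vocab == 0);
    logits
        .chunks_exact(vocab)
        .map(|row| {
            let mut best = 0usize;
            for (i, &v) in row.iter().enumerate() {
                // Strict comparison keeps the first maximum on ties.
                if v > row[best] {
                    best = i;
                }
            }
            best as u32
        })
        .collect()
}

/// Device-to-host bytes per decode step: (u32 token ids only, full BF16 logits).
fn d2h_bytes(batch: usize, vocab: usize) -> (usize, usize) {
    (batch * 4, batch * vocab * 2)
}
```

With a batch of 8 and a hypothetical 32000-entry vocabulary, `d2h_bytes`
gives 32 bytes of token ids against 512000 bytes of logits.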

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:17 +08:00
4c3f914459 kernels/cuda: paged-attention kernel, dispatch, pinned host memory
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
  activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
  pool backing) and offset-aware D2H/H2D copies used to move KV blocks
  between the GPU pool and pinned host memory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:58:36 +08:00
c8e8153702 phase 4: transformer core kernels
CUDA kernels (csrc/):
- common.cuh: shared warp_reduce_sum/max, block_reduce_sum/max
- normalization/rmsnorm.cu: RMSNorm (F32 + BF16)
- normalization/layernorm.cu: LayerNorm with Welford (F32 + BF16)
- activation/activations.cu: GELU tanh-approx + SiLU (F32 + BF16)
- reduce/softmax.cu: safe softmax, 3-pass (F32 + BF16)
- embedding/embedding.cu: gather lookup (F32 + BF16)
- embedding/rope.cu: RoPE in-place + precomputed cos/sin cache (F32 + BF16)
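
As a point of reference for the "safe softmax, 3-pass" entry, the following
is a minimal CPU sketch of that scheme for a single row. It is not the
contents of reduce/softmax.cu: the function name is hypothetical and the
sketch ignores the block-level reductions the CUDA version would need.

```rust
/// CPU reference for a numerically safe softmax over one row, in three passes.
/// Subtracting the row maximum keeps exp() from overflowing on large logits.
fn safe_softmax_row(row: &mut [f32]) {
    if row.is_empty() {
        return;
    }
    // Pass 1: find the row maximum.
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Pass 2: accumulate the sum of exp(x - max).
    let sum: f32 = row.iter().map(|&x| (x - max).exp()).sum();
    // Pass 3: write the normalized probabilities in place.
    for x in row.iter_mut() {
        *x = (*x - max).exp() / sum;
    }
}
```

A reference like this is the kind of baseline a kernel test could compare
against when measuring max_err.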

Rust wrappers (xserv-kernels/src/):
- rmsnorm.rs, layernorm.rs, activation.rs, softmax.rs, embedding.rs, rope.rs
- RopeCache struct with GPU-side precomputation

Tests: 12 new tests (ops_test.rs), all passing. Observed max_err ranges:
- F32: max_err 1e-9 to 1e-6
- BF16: max_err 2e-3 to 7e-3
Total: 29 kernel tests + 27 prior = 56 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:07:24 +08:00