xserv

Files

Gahow Wang 13ae3de69e kernels: reshape_and_cache, GPU argmax, single-launch GEMV

Three new CUDA kernels and one rewrite:

- reshape_and_cache: scatter K/V into paged pool in a single kernel per
  layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
  Includes both single-sequence (prefill) and batched (decode) variants.

- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
  decode now copies only B×4 bytes device-to-host (one 4-byte token id
  per sequence) instead of the full [B, vocab] logits tensor.

- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
  the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
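
The reshape_and_cache scatter described above can be pictured with a
host-side reference. This is only a sketch under stated assumptions, not
the kernel in reshape_and_cache.cu: the pool layout
[num_blocks, block_size, num_kv_heads, head_dim], the `slot_mapping`
array, the negative-slot padding convention, and the names `PagedPool`
and `reshape_and_cache_ref` are all hypothetical, and `float` stands in
for BF16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical paged pool, flattened from
// [num_blocks, block_size, num_kv_heads, head_dim].
struct PagedPool {
    std::size_t block_size;    // token slots per block
    std::size_t num_kv_heads;
    std::size_t head_dim;
    std::vector<float> data;   // float stands in for BF16 in this sketch
};

// Scatter src, shaped [num_tokens, num_kv_heads, head_dim], into the pool.
// Assumption: slot_mapping[t] = block_id * block_size + offset_in_block,
// and a negative slot marks a padding token that is skipped.
void reshape_and_cache_ref(const std::vector<float>& src,
                           const std::vector<std::int64_t>& slot_mapping,
                           PagedPool& pool) {
    const std::size_t row = pool.num_kv_heads * pool.head_dim;
    assert(src.size() == slot_mapping.size() * row);

    for (std::size_t t = 0; t < slot_mapping.size(); ++t) {
        const std::int64_t slot = slot_mapping[t];
        if (slot < 0) continue;  // padding token, nothing to write

        const std::size_t block  = static_cast<std::size_t>(slot) / pool.block_size;
        const std::size_t offset = static_cast<std::size_t>(slot) % pool.block_size;
        const std::size_t dst = (block * pool.block_size + offset) * row;
        assert(dst + row <= pool.data.size());

        // On the GPU, each (t, i) pair would map to one thread, so a
        // single launch covers all tokens and heads of a layer.
        for (std::size_t i = 0; i < row; ++i) {
            pool.data[dst + i] = src[t * row + i];
        }
    }
}
```

Because every destination address is derived from the token index alone,
all writes are independent, which is what lets one kernel launch replace
a per-token, per-head copy loop.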

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-30 12:50:17 +08:00

causal_mask.cu

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

flash_attention.cu

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

paged_attention.cu

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

reshape_and_cache.cu

kernels: reshape_and_cache, GPU argmax, single-launch GEMV

2026-05-30 12:50:17 +08:00