launch_softmax_{f32,bf16} clamped block to 1024 threads when cols was
larger. Halving the ceiling to 512 keeps two blocks per SM resident on
the large vocab kernels that dominate speculative verify workloads
without changing rows/block indexing, and never exceeds cols.
Three new CUDA kernels and one rewrite:
- reshape_and_cache: scatter K/V into paged pool in a single kernel per
layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
Includes both single-sequence (prefill) and batched (decode) variants.
- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
decode now only D2H-transfers B×4 bytes (token ids) instead of the
full [B, vocab] logits tensor.
- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
pool backing) and offset-aware D2H/H2D copies used to move KV blocks
between the GPU pool and pinned host memory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>