Three new CUDA kernels and one rewrite:
- reshape_and_cache: scatters K/V into the paged pool with a single
kernel launch per layer, replacing the Rust-side per-token, per-head
cudaMemcpy loop. Includes both single-sequence (prefill) and batched
(decode) variants.
- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
decode now copies only B×4 bytes (the token ids) device-to-host
instead of the full [B, vocab] logits tensor.
- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.
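
For reviewers, the argmax bullet above can be illustrated with a minimal
sketch. This is not the kernel from this commit: the names
argmax_bf16_sketch and warp_argmax, the one-block-per-row launch, the
row-major [B, vocab] layout, and the lowest-index tie-break are all
assumptions made for illustration.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// Hypothetical helper: reduce (val, idx) pairs across one warp so that
// lane 0 ends up holding the maximum. Ties go to the lower index.
__device__ void warp_argmax(float& val, int& idx) {
    for (int off = 16; off > 0; off >>= 1) {
        float ov = __shfl_down_sync(0xffffffffu, val, off);
        int   oi = __shfl_down_sync(0xffffffffu, idx, off);
        if (ov > val || (ov == val && oi < idx)) { val = ov; idx = oi; }
    }
}

// Assumed layout: logits is row-major [B, vocab] in BF16, one block per
// row. blockDim.x must be a multiple of 32 and at most 1024.
__global__ void argmax_bf16_sketch(const __nv_bfloat16* logits,
                                   int* out_ids, int vocab) {
    const __nv_bfloat16* row = logits + (size_t)blockIdx.x * vocab;
    float best_val = -INFINITY;
    int   best_idx = 0;

    // Strided scan: each thread tracks its own running maximum.
    // NaN logits never compare greater, so they are skipped.
    for (int i = threadIdx.x; i < vocab; i += blockDim.x) {
        float v = __bfloat162float(row[i]);
        if (v > best_val) { best_val = v; best_idx = i; }
    }

    // Stage 1: reduce within each warp using shuffles only.
    warp_argmax(best_val, best_idx);

    // Stage 2: lane 0 of each warp publishes its result, then warp 0
    // reduces the per-warp partials.
    __shared__ float s_val[32];
    __shared__ int   s_idx[32];
    const int lane = threadIdx.x & 31;
    const int warp = threadIdx.x >> 5;
    if (lane == 0) { s_val[warp] = best_val; s_idx[warp] = best_idx; }
    __syncthreads();

    if (warp == 0) {
        const int nwarps = blockDim.x >> 5;
        best_val = (lane < nwarps) ? s_val[lane] : -INFINITY;
        best_idx = (lane < nwarps) ? s_idx[lane] : 0;
        warp_argmax(best_val, best_idx);
        if (lane == 0) out_ids[blockIdx.x] = best_idx;
    }
}

int main() {
    const int B = 2, vocab = 1000;
    // Toy logits: all zeros except one peak per row.
    std::vector<__nv_bfloat16> h(B * vocab, __float2bfloat16(0.0f));
    h[0 * vocab + 123] = __float2bfloat16(5.0f);
    h[1 * vocab + 777] = __float2bfloat16(5.0f);

    __nv_bfloat16* d_logits = nullptr;
    int* d_ids = nullptr;
    cudaMalloc(&d_logits, h.size() * sizeof(__nv_bfloat16));
    cudaMalloc(&d_ids, B * sizeof(int));
    cudaMemcpy(d_logits, h.data(), h.size() * sizeof(__nv_bfloat16),
               cudaMemcpyHostToDevice);

    argmax_bf16_sketch<<<B, 256>>>(d_logits, d_ids, vocab);

    // Only B * 4 bytes come back to the host, not the logits.
    int ids[B];
    cudaMemcpy(ids, d_ids, B * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < B; ++b) printf("row %d -> token %d\n", b, ids[b]);

    cudaFree(d_logits);
    cudaFree(d_ids);
    return 0;
}
```

The sketch keeps the reduction in registers and a small shared-memory
scratch area, so the only device-to-host copy is the B-element id array.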

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>