2d48f25e66aa055b6ff183df402b8343cbb72c67
- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset - Per-head strided layout [num_kv_heads, max_seq_len, head_dim] - Fixed critical bug: seq_len must advance AFTER all layers write (not inside the loop per-layer) - GpuBuffer::copy_from_device_at for offset-based D2D copy - Tensor::from_storage constructor for wrapping raw GPU buffers - Exported Storage and Dims from xserv-tensor Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical Performance: ~neutral (KV cache was never the main bottleneck — reshape/merge/transpose CPU round-trips dominate for Qwen3-8B) TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
Rust
67.5%
Python
15.1%
Cuda
13.5%
Shell
3.9%