phase 11: GPU-resident KV cache

- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
- Fixed critical bug: seq_len must advance AFTER all layers write (not inside the loop per-layer)
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor

Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical

Performance: ~neutral (KV cache was never the main bottleneck; reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)
TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
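
The per-head strided layout and the seq_len fix from the message above can be sketched on the host as follows. This is a minimal illustration, not the actual GpuKVCache code: `LayerCache` and `KvCacheSketch` are hypothetical names, a `Vec<f32>` stands in for each pre-allocated GPU buffer, and `copy_from_slice` stands in for the offset-based D2D copy. It also assumes the incoming token is laid out as [num_kv_heads, head_dim].

```rust
/// Host-side stand-in for one layer's K (or V) cache buffer (hypothetical type).
/// Layout: [num_kv_heads, max_seq_len, head_dim], row-major.
struct LayerCache {
    data: Vec<f32>,
    num_kv_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
}

impl LayerCache {
    fn new(num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        Self {
            data: vec![0.0; num_kv_heads * max_seq_len * head_dim],
            num_kv_heads,
            max_seq_len,
            head_dim,
        }
    }

    /// Element offset of the vector stored for (head, pos).
    fn offset(&self, head: usize, pos: usize) -> usize {
        (head * self.max_seq_len + pos) * self.head_dim
    }

    /// Write one token, shaped [num_kv_heads, head_dim], at sequence position `pos`.
    /// Each head needs its own copy because heads are `max_seq_len * head_dim` apart.
    fn write_token(&mut self, pos: usize, token: &[f32]) {
        assert!(pos < self.max_seq_len);
        assert_eq!(token.len(), self.num_kv_heads * self.head_dim);
        for h in 0..self.num_kv_heads {
            let dst = self.offset(h, pos);
            let src = h * self.head_dim;
            self.data[dst..dst + self.head_dim]
                .copy_from_slice(&token[src..src + self.head_dim]);
        }
    }
}

/// All layers share a single seq_len counter (hypothetical type).
struct KvCacheSketch {
    layers: Vec<LayerCache>,
    seq_len: usize,
}

impl KvCacheSketch {
    fn new(num_layers: usize, num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let layers = (0..num_layers)
            .map(|_| LayerCache::new(num_kv_heads, max_seq_len, head_dim))
            .collect();
        Self { layers, seq_len: 0 }
    }

    /// Append one token's data for every layer, then advance seq_len once.
    fn append_token(&mut self, per_layer: &[Vec<f32>]) {
        assert_eq!(per_layer.len(), self.layers.len());
        let pos = self.seq_len; // every layer writes at the same position
        for (layer, token) in self.layers.iter_mut().zip(per_layer) {
            layer.write_token(pos, token);
        }
        // Advancing inside the loop would shift each later layer by one slot.
        self.seq_len += 1;
    }
}
```

With 2 KV heads, max_seq_len 4 and head_dim 3, head 1 of the first token lands at element offset (1 * 4 + 0) * 3 = 12 in every layer, and seq_len is 1 after the append.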
2026-05-22 11:50:12 +08:00
parent be5c64ea8a
commit 2d48f25e66
9 changed files with 274 additions and 14 deletions
--- a/crates/xserv-cuda/src/memory.rs
+++ b/crates/xserv-cuda/src/memory.rs
@@ -87,6 +87,20 @@ impl GpuBuffer {
        error::check(unsafe { ffi::cudaMemset(self.ptr, 0, self.len) })
    }

+    /// Copy `count` bytes from `src` buffer at `src_offset` to this buffer at `dst_offset`.
+    pub fn copy_from_device_at(&mut self, src: &GpuBuffer, src_offset: usize, dst_offset: usize, count: usize) -> Result<()> {
+        assert!(src_offset + count <= src.len);
+        assert!(dst_offset + count <= self.len);
+        error::check(unsafe {
+            ffi::cudaMemcpy(
+                self.ptr.add(dst_offset),
+                src.ptr.add(src_offset),
+                count,
+                ffi::CUDA_MEMCPY_D2D,
+            )
+        })
+    }
+
    /// Consume the buffer without freeing GPU memory. Returns the raw pointer and length.
    /// Caller is responsible for eventually calling cudaFree.
    pub fn into_raw(self) -> (*mut u8, usize) {