phase 11: GPU-resident KV cache

- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
- Fixed critical bug: seq_len must advance AFTER all layers write
  (not inside the loop per-layer)
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor

Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical
Performance: ~neutral (KV cache was never the main bottleneck;
reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)
TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
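
For readers who want to see why the seq_len fix matters, here is a minimal host-side sketch of the per-head strided layout and the append step. It uses plain `Vec<f32>` instead of GPU buffers, and every name in it (`LayerCache`, `KvCache`, `write_at`, `append_step`) is hypothetical and chosen for illustration; it is not the actual `GpuKVCache` code. The point it tries to show is that all layers write at the same base position, and the shared `seq_len` moves only once per step, after the last layer has written.

```rust
/// Host-side stand-in for one layer's K (or V) buffer, laid out as
/// [num_kv_heads, max_seq_len, head_dim]. Hypothetical type, not from xserv.
struct LayerCache {
    data: Vec<f32>,
    num_kv_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
}

impl LayerCache {
    fn new(num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        Self {
            data: vec![0.0; num_kv_heads * max_seq_len * head_dim],
            num_kv_heads,
            max_seq_len,
            head_dim,
        }
    }

    /// Element offset of (head, pos, 0) in the strided layout.
    fn offset(&self, head: usize, pos: usize) -> usize {
        (head * self.max_seq_len + pos) * self.head_dim
    }

    /// Copy `new_tokens` positions per head into the cache, starting at
    /// position `seq_len`. `src` is contiguous [num_kv_heads, new_tokens, head_dim].
    fn write_at(&mut self, seq_len: usize, src: &[f32], new_tokens: usize) {
        assert_eq!(src.len(), self.num_kv_heads * new_tokens * self.head_dim);
        assert!(seq_len + new_tokens <= self.max_seq_len);
        let row = new_tokens * self.head_dim;
        for h in 0..self.num_kv_heads {
            let dst = self.offset(h, seq_len);
            self.data[dst..dst + row].copy_from_slice(&src[h * row..(h + 1) * row]);
        }
    }
}

/// All layers share a single sequence length. Hypothetical type.
struct KvCache {
    layers: Vec<LayerCache>,
    seq_len: usize,
}

impl KvCache {
    fn new(num_layers: usize, num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let layers = (0..num_layers)
            .map(|_| LayerCache::new(num_kv_heads, max_seq_len, head_dim))
            .collect();
        Self { layers, seq_len: 0 }
    }

    /// Append one step of new keys (or values) for every layer.
    fn append_step(&mut self, per_layer: &[Vec<f32>], new_tokens: usize) {
        assert_eq!(per_layer.len(), self.layers.len());
        // Every layer writes at the same base position.
        let base = self.seq_len;
        for (layer, src) in self.layers.iter_mut().zip(per_layer) {
            layer.write_at(base, src, new_tokens);
        }
        // Advance once, after all layers have written. Advancing inside the
        // loop would shift each later layer to a different position.
        self.seq_len += new_tokens;
    }
}
```

With this layout, each head's positions are contiguous, so an append is one contiguous copy per head at `offset(head, seq_len)`.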
@@ -87,6 +87,20 @@ impl GpuBuffer {
         error::check(unsafe { ffi::cudaMemset(self.ptr, 0, self.len) })
     }
 
+    /// Copy `count` bytes from `src` buffer at `src_offset` to this buffer at `dst_offset`.
+    pub fn copy_from_device_at(&mut self, src: &GpuBuffer, src_offset: usize, dst_offset: usize, count: usize) -> Result<()> {
+        assert!(src_offset + count <= src.len);
+        assert!(dst_offset + count <= self.len);
+        error::check(unsafe {
+            ffi::cudaMemcpy(
+                self.ptr.add(dst_offset),
+                src.ptr.add(src_offset),
+                count,
+                ffi::CUDA_MEMCPY_D2D,
+            )
+        })
+    }
+
     /// Consume the buffer without freeing GPU memory. Returns the raw pointer and length.
     /// Caller is responsible for eventually calling cudaFree.
     pub fn into_raw(self) -> (*mut u8, usize) {
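
The bounds checks in `copy_from_device_at` can be tried out without a GPU. The sketch below is a host-side analogue over byte slices, not the code from this commit: `copy_at` is a hypothetical name, and it differs from the diff on purpose in two ways. It uses `checked_add`, so an `offset + count` that overflows `usize` is rejected instead of wrapping, and it returns an error instead of panicking through `assert!`. Both choices are assumptions about what a caller might prefer.

```rust
/// Hypothetical host-side analogue of an offset-based copy: copies `count`
/// bytes from `src[src_offset..]` into `dst[dst_offset..]`.
/// Offsets and count are in bytes, as in the diff above.
fn copy_at(
    dst: &mut [u8],
    src: &[u8],
    src_offset: usize,
    dst_offset: usize,
    count: usize,
) -> Result<(), String> {
    // checked_add returns None on overflow, so a huge offset cannot wrap
    // around and slip past the length comparison.
    let src_end = src_offset
        .checked_add(count)
        .filter(|&end| end <= src.len())
        .ok_or_else(|| "source range out of bounds".to_string())?;
    let dst_end = dst_offset
        .checked_add(count)
        .filter(|&end| end <= dst.len())
        .ok_or_else(|| "destination range out of bounds".to_string())?;

    // Both ranges are validated before any byte is written.
    dst[dst_offset..dst_end].copy_from_slice(&src[src_offset..src_end]);
    Ok(())
}
```

Because validation happens before the copy, a rejected call leaves the destination untouched.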