xserv

Gahow Wang 2d48f25e66 phase 11: GPU-resident KV cache

- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
- Fixed critical bug: seq_len must advance AFTER all layers write
  (not inside the loop per-layer)
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor
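
A minimal host-side sketch of the layout and the seq_len fix described above. It is not the xserv implementation: `KvCacheSketch`, `write_layer`, and `advance` are hypothetical names, a `Vec<f32>` stands in for each pre-allocated GPU buffer, and `copy_from_slice` stands in for the offset-based D2D copy (the real `GpuBuffer::copy_from_device_at` signature is not shown here, so it is not used).

```rust
/// Hypothetical host-side stand-in for a GPU-resident KV cache.
/// Each layer owns one flat buffer laid out as
/// [num_kv_heads, max_seq_len, head_dim].
struct KvCacheSketch {
    num_kv_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
    seq_len: usize,      // shared by every layer
    keys: Vec<Vec<f32>>, // keys[layer] = one pre-allocated buffer
}

impl KvCacheSketch {
    fn new(num_layers: usize, num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let len = num_kv_heads * max_seq_len * head_dim;
        Self {
            num_kv_heads,
            max_seq_len,
            head_dim,
            seq_len: 0,
            keys: vec![vec![0.0; len]; num_layers],
        }
    }

    /// Element offset of (head, pos) in the per-head strided layout.
    fn offset(&self, head: usize, pos: usize) -> usize {
        (head * self.max_seq_len + pos) * self.head_dim
    }

    /// Append `new_tokens` positions for ONE layer at the current seq_len.
    /// `src` is laid out as [num_kv_heads, new_tokens, head_dim].
    /// Deliberately does not touch seq_len.
    fn write_layer(&mut self, layer: usize, src: &[f32], new_tokens: usize) {
        assert!(self.seq_len + new_tokens <= self.max_seq_len, "cache overflow");
        assert_eq!(src.len(), self.num_kv_heads * new_tokens * self.head_dim);
        let chunk = new_tokens * self.head_dim;
        for head in 0..self.num_kv_heads {
            let dst = self.offset(head, self.seq_len);
            // Stand-in for one D2D copy at an element offset.
            self.keys[layer][dst..dst + chunk]
                .copy_from_slice(&src[head * chunk..(head + 1) * chunk]);
        }
    }

    /// Advance the shared position once per forward pass.
    fn advance(&mut self, new_tokens: usize) {
        self.seq_len += new_tokens;
    }
}

fn main() {
    let (layers, heads, max_len, dim) = (2, 2, 4, 3);
    let mut cache = KvCacheSketch::new(layers, heads, max_len, dim);

    for step in 0..2 {
        let src = vec![(step + 1) as f32; heads * dim];
        for layer in 0..layers {
            cache.write_layer(layer, &src, 1);
        }
        // Advancing inside the layer loop would make layer 1 write one
        // position later than layer 0; advance only after all layers wrote.
        cache.advance(1);
    }

    assert_eq!(cache.seq_len, 2);
    // Last layer, head 0: positions 0 and 1 hold the step-1 and step-2 values.
    assert_eq!(cache.keys[1][cache.offset(0, 0)], 1.0);
    assert_eq!(cache.keys[1][cache.offset(0, 1)], 2.0);
    // Head 1 starts max_seq_len * head_dim elements later.
    assert_eq!(cache.offset(1, 0), max_len * dim);
    println!("seq_len = {}", cache.seq_len);
}
```

Keeping `seq_len` out of `write_layer` means every layer in a forward pass writes at the same position, which is the invariant the bug fix restores.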

Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical
Performance: ~neutral (KV cache was never the main bottleneck —
reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)

TTFT: 122 ms, TBT: 142 ms, 7.0 tok/s (down marginally from 7.3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 11:50:12 +08:00

crates

phase 11: GPU-resident KV cache

2026-05-22 11:50:12 +08:00

csrc

phase 10: GPU add/mul kernels + BF16 precision analysis

2026-05-22 11:35:26 +08:00

docs

phase 11: GPU-resident KV cache

2026-05-22 11:50:12 +08:00

tools

phase 10: add Qwen3-8B benchmark + performance fix