xserv

Gahow Wang cc4bd4cfe5 paged-kv: kernel-based scatter + fix data_ptr offset bug

Replace the Rust cudaMemcpy loop in append_tokens() with the new
reshape_and_cache kernel. Add append_tokens_batched() for the decode
path using the batched variant.
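
The scatter described above can be illustrated with a small CPU reference. This is only a sketch: the kernel's real signature and cache layout are not shown here, so `scatter_rows`, `slot_mapping`, and the flat `[num_slots, row_len]` cache layout are assumptions chosen for illustration (a per-token slot mapping is a common design for paged KV caches).

```rust
/// CPU reference of a paged-KV scatter (hypothetical layout, not the real kernel).
/// `src` is [num_tokens, row_len] row-major; `cache` is [num_slots, row_len].
/// `slot_mapping[t]` names the destination cache slot for token `t`.
fn scatter_rows(src: &[f32], cache: &mut [f32], slot_mapping: &[usize], row_len: usize) {
    assert_eq!(src.len(), slot_mapping.len() * row_len);
    for (t, &slot) in slot_mapping.iter().enumerate() {
        let dst = slot * row_len;
        assert!(dst + row_len <= cache.len(), "slot out of range");
        // One contiguous row copy per token; a GPU kernel would do all rows
        // in a single launch instead of one memcpy call per token.
        cache[dst..dst + row_len].copy_from_slice(&src[t * row_len..(t + 1) * row_len]);
    }
}

/// Scatter two tokens (row_len = 2) into a 4-slot cache.
fn demo_scatter() -> Vec<f32> {
    let src = [1.0, 1.0, 2.0, 2.0]; // token 0 row, then token 1 row
    let mut cache = vec![0.0; 8];
    scatter_rows(&src, &mut cache, &[3, 0], 2); // token 0 -> slot 3, token 1 -> slot 0
    cache
}
```

Replacing a host-side loop of per-token copies with one kernel launch removes the per-call overhead, which matters most on the decode path where each step appends only one token per sequence.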

Fix: use data_ptr() instead of storage().gpu_buffer().as_ptr() so that
the tensor offset is respected. The old code silently read from the
storage base (element 0) instead of the tensor's logical start, which
produced wrong results when the K/V tensors were narrow() views into a
fused QKV buffer.
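
A minimal sketch of the offset bug, using a host-side stand-in for the GPU types. `Storage`, `TensorView`, and `storage_base_ptr()` are hypothetical names invented for this example; only `data_ptr()` and `narrow()` mirror names from the commit message, and their real signatures may differ.

```rust
/// Hypothetical stand-in for a GPU storage buffer (a plain Vec here).
struct Storage {
    data: Vec<f32>,
}

/// Hypothetical tensor view: shares storage and starts `offset` elements in.
struct TensorView<'a> {
    storage: &'a Storage,
    offset: usize,
    len: usize,
}

impl<'a> TensorView<'a> {
    /// Sub-view of `len` elements starting at `start`, relative to this view.
    fn narrow(&self, start: usize, len: usize) -> TensorView<'a> {
        assert!(start + len <= self.len);
        TensorView { storage: self.storage, offset: self.offset + start, len }
    }

    /// Buggy pattern: pointer to the storage base, ignoring the view offset.
    fn storage_base_ptr(&self) -> *const f32 {
        self.storage.data.as_ptr()
    }

    /// Fixed pattern: pointer to the view's logical first element.
    fn data_ptr(&self) -> *const f32 {
        self.storage.data[self.offset..].as_ptr()
    }
}

/// Returns (value read via the buggy pointer, value read via the fixed pointer).
fn demo_offset_bug() -> (f32, f32) {
    // Fused buffer laid out as [Q0..Q3 | K0..K3 | V0..V3], holding 0.0..=11.0.
    let storage = Storage { data: (0..12).map(|i| i as f32).collect() };
    let fused = TensorView { storage: &storage, offset: 0, len: 12 };
    let k = fused.narrow(4, 4); // K starts at element 4

    // SAFETY: both pointers point at live elements inside `storage.data`.
    let wrong = unsafe { *k.storage_base_ptr() };
    let right = unsafe { *k.data_ptr() };
    (wrong, right)
}
```

In this sketch the buggy pointer reads Q's first element (0.0) while the fixed pointer reads K's first element (4.0). For a tensor that owns its whole storage the two pointers coincide, which is why such a bug can go unnoticed until views into a fused buffer are passed in.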

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-30 12:50:28 +08:00

xserv-cuda

cuda: add cached_trim() to release pooled GPU buffers

2026-05-30 12:50:04 +08:00

xserv-distributed

distributed: NCCL tensor-parallel primitives (TpContext + AllReduce)

2026-05-29 11:10:14 +08:00

xserv-kernels

kernels: reshape_and_cache, GPU argmax, single-launch GEMV

2026-05-30 12:50:17 +08:00

xserv-model

paged-kv: kernel-based scatter + fix data_ptr offset bug