xserv

Files

Gahow Wang 9a01c60100 server: GPU argmax fast path for greedy decode

When all active sequences use temperature=0, run argmax on the GPU and
copy only the token ids device-to-host (~B×4 bytes) instead of the full
[B, vocab_size] BF16 logits (~1.2 MB at B=4 with Qwen3's ~152K vocab).
Batches that mix greedy and sampled sequences fall back to the existing
CPU path.
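
The dispatch rule and the transfer-size estimate above can be sketched on the CPU as follows. This is only an illustration under stated assumptions, not the code from this commit: the function names (`transfer_bytes`, `all_greedy`, `select_tokens`) are hypothetical, token ids are assumed to be 32-bit, and the argmax runs in plain Python where the real fast path would run it in a GPU kernel.

```python
from typing import List, Optional, Sequence, Tuple

BF16_BYTES = 2      # one BF16 logit
TOKEN_ID_BYTES = 4  # assumption: 32-bit token ids, matching the "~B×4 bytes" estimate


def transfer_bytes(batch: int, vocab_size: int) -> Tuple[int, int]:
    """Return (full_logits_bytes, token_ids_bytes) copied to the host per decode step."""
    return batch * vocab_size * BF16_BYTES, batch * TOKEN_ID_BYTES


def all_greedy(temperatures: Sequence[float]) -> bool:
    """True when every active sequence decodes greedily (temperature == 0)."""
    return all(t == 0.0 for t in temperatures)


def argmax_rows(logits: Sequence[Sequence[float]]) -> List[int]:
    """Index of the largest logit in each row; ties resolve to the lowest index."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def select_tokens(
    logits: Sequence[Sequence[float]],
    temperatures: Sequence[float],
) -> Optional[List[int]]:
    """Hypothetical dispatcher: return token ids on the fast path.

    Returns None for mixed-sampling batches, signalling that the caller
    should fall back to its full-logits sampling path.
    """
    if all_greedy(temperatures):
        return argmax_rows(logits)
    return None


if __name__ == "__main__":
    full, ids = transfer_bytes(batch=4, vocab_size=152_000)
    print(f"full logits: {full} bytes, token ids only: {ids} bytes")
    print(select_tokens([[0.1, 2.0, -1.0], [3.0, 0.5, 3.0]], [0.0, 0.0]))
```

With B=4 and a vocabulary rounded to 152,000 entries, the full-logits copy is 4 × 152,000 × 2 = 1,216,000 bytes, against 16 bytes for the token ids alone.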

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-30 12:50:47 +08:00

xserv-cuda

cuda: add cached_trim() to release pooled GPU buffers

2026-05-30 12:50:04 +08:00

xserv-distributed

distributed: NCCL tensor-parallel primitives (TpContext + AllReduce)

2026-05-29 11:10:14 +08:00

xserv-kernels

kernels: reshape_and_cache, GPU argmax, single-launch GEMV

2026-05-30 12:50:17 +08:00

xserv-model

model: fuse QKV/gate_up projections, batched decode ops