xserv

Gahow Wang 8414f8d1e6 sampling: GPU argmax fast path for greedy decode

sample() at temperature 0 copied the full [seq, 201088] BF16 logits
to the host and scanned them on every token (~1 ms/token). Use the
Phase 15 argmax kernel (block reduction + 4-byte D2H) instead when the
logits are BF16 on the GPU; bench-gpt-oss's greedy sampler now takes
the same path. Exact-tie logits may break differently than in the host
scan, so greedy trajectories can legitimately diverge at a tie token
(GSM8K unchanged).
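
Why a tie can resolve differently is easiest to see in a small host-side
simulation. The sketch below is illustrative only: host_argmax and
block_argmax are hypothetical names, the block size of 4 is arbitrary, and
the ">=" tie rule in the reduction is an assumption chosen to make the
effect visible. It is not taken from the Phase 15 kernel, whose actual
tie-breaking rule is not stated here.

```python
def host_argmax(logits):
    """Linear left-to-right scan: the first maximum wins on ties."""
    best = 0
    for i in range(1, len(logits)):
        if logits[i] > logits[best]:
            best = i
    return best


def block_argmax(logits, block_size=4):
    """Simulate a block reduction: strided per-thread scan, then a tree.

    block_size must be a power of two. The tie rule (>=, so the higher
    lane wins) is an assumption made for illustration.
    """
    # Step 1: "thread" t scans elements t, t + block_size, ...
    partial = []
    for t in range(block_size):
        best = None
        for i in range(t, len(logits), block_size):
            if best is None or logits[i] > logits[best]:
                best = i
        partial.append(best)

    # Step 2: pairwise tree reduction over the per-thread candidates.
    stride = block_size // 2
    while stride > 0:
        for t in range(stride):
            a, b = partial[t], partial[t + stride]
            if b is not None and (a is None or logits[b] >= logits[a]):
                partial[t] = b
        stride //= 2
    return partial[0]


# Three entries tie for the maximum value 3.0 (indices 1, 3 and 5).
tied = [1.0, 3.0, 0.5, 3.0, 2.0, 3.0, 0.0, 1.0]
host_idx = host_argmax(tied)
gpu_idx = block_argmax(tied)
print(host_idx, gpu_idx, tied[host_idx] == tied[gpu_idx])

# Without a tie, both orders agree.
unique = [0.1, 0.7, 0.3, 0.2]
print(host_argmax(unique), block_argmax(unique))
```

Both functions return an index holding the maximum value, so either
answer is a valid greedy choice; only the selected token id differs. On
the transfer side, one row of 201088 BF16 logits is 201088 * 2 = 402176
bytes, against 4 bytes for a single index.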

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 20:12:37 +08:00

xserv-cuda

cuda: infrastructure for whole-step CUDA graph capture

2026-06-12 20:12:37 +08:00

xserv-distributed

cuda: infrastructure for whole-step CUDA graph capture

2026-06-12 20:12:37 +08:00

xserv-kernels

cuda: infrastructure for whole-step CUDA graph capture

2026-06-12 20:12:37 +08:00

xserv-model

sampling: GPU argmax fast path for greedy decode