Files
xserv/crates
Gahow Wang 8414f8d1e6 sampling: GPU argmax fast path for greedy decode
sample() at temperature 0 copied the full [seq, 201088] BF16 logits
to the host and scanned them every token (~1 ms/token). Use the
Phase 15 argmax kernel (block reduction + 4-byte D2H) when logits are
BF16 on GPU; bench-gpt-oss's greedy sampler likewise. Exact-tie
logits may break differently than the host scan — greedy trajectories
can legitimately diverge at a tie token (GSM8K unchanged).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:12:37 +08:00
..