sample() at temperature 0 copied the full [seq, 201088] BF16 logits
to the host and scanned them every token (~1 ms/token). Use the
Phase 15 argmax kernel (block reduction + 4-byte D2H) when logits are
BF16 on GPU; bench-gpt-oss's greedy sampler likewise. Exact-tie
logits may break differently than the host scan — greedy trajectories
can legitimately diverge at a tie token (GSM8K unchanged).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>