xserv

Files

Gahow Wang a67753f516 softmax: cap block size at 512 threads

launch_softmax_{f32,bf16} clamped the block size to 1024 threads when cols
was larger. Halving the ceiling to 512 keeps two blocks resident per SM on
the large-vocab kernels that dominate speculative-verify workloads. The
rows/block indexing is unchanged, and the block size still never exceeds
cols.
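
The clamp described in this commit message can be sketched on the host side roughly as follows. This is a minimal illustration, not the repository's code: the helper name `softmax_block_threads` and the constant `kSoftmaxMaxThreads` are hypothetical, and the real launchers may additionally round the result (for example to a warp multiple), which the commit message does not say.

```cpp
#include <algorithm>

// Hypothetical constant: the ceiling after this commit (previously 1024).
constexpr int kSoftmaxMaxThreads = 512;

// Hypothetical helper: threads per block for a softmax row of `cols` elements.
// The result is capped at `max_threads` and never exceeds `cols`.
// Assumption: no further rounding is applied; the actual launchers may differ.
inline int softmax_block_threads(int cols, int max_threads = kSoftmaxMaxThreads) {
    // Guard against cols <= 0 so the launch configuration stays valid.
    return std::max(1, std::min(cols, max_threads));
}
```

Under these assumptions, a large-vocab row gets the full 512-thread ceiling, while a row narrower than the ceiling gets exactly `cols` threads. Passing `1024` as the second argument reproduces the earlier behavior for comparison.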

2026-07-01 14:16:32 +08:00

argmax.cu

kernels: reshape_and_cache, GPU argmax, single-launch GEMV

2026-05-30 12:50:17 +08:00

softmax.cu

softmax: cap block size at 512 threads

2026-07-01 14:16:32 +08:00