Three CUDA bugs from the review after 5b350ee / cfbd64d that were missed
by those commits:
- flash_attention.cu decode_attention_bf16_kernel used atomicAdd to
merge per-warp partials into smem_O — same nondeterminism pattern
that 5b350ee already fixed in paged_attention.cu and gemv.cu. This
kernel is on the legacy forward_gpu_cache path plus the speculative
bench baseline, so verify/decode parity depended on it. Replace with
smem_O_warp[32][HEAD_DIM_MAX] partials reduced in fixed warp-id order.
- causal_mask.cu computed the flat address as
`batch_idx * rows * cols + row * cols + col` in int; batch=128 heads=28
seq=32768 already overflows int32. Promote the index to long long.
- quantization/dequant_fp8.cu had `int total = num_experts * rows * cols`
and `int expert_stride = rows * cols`; 32 experts × 8k × 8k overflows.
Same fix pattern as the MoE dense kernels in cfbd64d — 64-bit total /
idx / expert_stride, and grid computed in long long.