Files
xserv/csrc/quantization
Gahow Wang 5f060902f6 cuda: fix remaining int32-address and nondeterministic-reduction bugs
Three CUDA bugs from the review after 5b350ee / cfbd64d that were missed
by those commits:

- flash_attention.cu decode_attention_bf16_kernel used atomicAdd to
  merge per-warp partials into smem_O — same nondeterminism pattern
  that 5b350ee already fixed in paged_attention.cu and gemv.cu. This
  kernel is on the legacy forward_gpu_cache path plus the speculative
  bench baseline, so verify/decode parity depended on it. Replace with
  smem_O_warp[32][HEAD_DIM_MAX] partials reduced in fixed warp-id order.
- causal_mask.cu computed the flat address as
  `batch_idx * rows * cols + row * cols + col` in int; batch=128 heads=28
  seq=32768 already overflows int32. Promote the index to long long.
- quantization/dequant_fp8.cu had `int total = num_experts * rows * cols`
  and `int expert_stride = rows * cols`; 32 experts × 8k × 8k overflows.
  Same fix pattern as the MoE dense kernels in cfbd64d — 64-bit total /
  idx / expert_stride, and grid computed in long long.
2026-07-01 15:13:07 +08:00
..