xserv

Gahow Wang 5f060902f6 cuda: fix remaining int32-address and nondeterministic-reduction bugs

Fix three CUDA bugs found in the review after 5b350ee / cfbd64d that
those commits missed:

- flash_attention.cu decode_attention_bf16_kernel used atomicAdd to
  merge per-warp partials into smem_O, so the summation order could vary
  between runs. This is the same nondeterminism pattern that 5b350ee
  already fixed in paged_attention.cu and gemv.cu. The kernel is on the
  legacy forward_gpu_cache path and the speculative bench baseline, so
  verify/decode parity depended on it. Replace the atomics with
  smem_O_warp[32][HEAD_DIM_MAX] partials reduced in fixed warp-id order.
- causal_mask.cu computed the flat address as
  `batch_idx * rows * cols + row * cols + col` in int; batch=128 heads=28
  seq=32768 already overflows int32 (if rows and cols both equal seq,
  rows * cols alone is 2^30). Promote the index to long long.
- quantization/dequant_fp8.cu had `int total = num_experts * rows * cols`
  and `int expert_stride = rows * cols`; 32 experts × 8k × 8k overflows
  (taking 8k as 8192, the total is 2^31, one past INT32_MAX). Same fix
  pattern as the MoE dense kernels in cfbd64d: 64-bit total / idx /
  expert_stride, and grid computed in long long.
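
The two fix patterns above can be sketched on the host side as follows. This is a minimal illustration, not the code from the commit: the helper names (flat_index, total_elems, fits_int32, grid_blocks, reduce_fixed_order) are hypothetical, 8k is taken to mean 8192, and rows and cols of the causal mask are assumed to both equal seq.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: flat offset into a [batch, rows, cols] tensor.
// Every operand is long long, so the products are evaluated in 64 bits.
constexpr long long flat_index(long long batch_idx, long long rows,
                               long long cols, long long row, long long col) {
    return batch_idx * rows * cols + row * cols + col;
}

// Hypothetical helper: element count of a [num_experts, rows, cols] tensor.
constexpr long long total_elems(long long num_experts, long long rows,
                                long long cols) {
    return num_experts * rows * cols;
}

// True when v is representable as a 32-bit signed int.
constexpr bool fits_int32(long long v) {
    return v >= std::numeric_limits<std::int32_t>::min() &&
           v <= std::numeric_limits<std::int32_t>::max();
}

// Number of blocks needed to cover `total` elements, computed in 64 bits.
inline long long grid_blocks(long long total, int threads_per_block) {
    assert(threads_per_block > 0);
    return (total + threads_per_block - 1) / threads_per_block;
}

// Sum per-warp partials in ascending warp-id order. Float addition is not
// associative, so fixing the order is what makes the result repeatable.
inline float reduce_fixed_order(const float* partials, int num_warps) {
    float acc = 0.0f;
    for (int w = 0; w < num_warps; ++w) {
        acc += partials[w];
    }
    return acc;
}

// The sizes quoted in the commit message do not fit in int32.
static_assert(!fits_int32(total_elems(32, 8192, 8192)),
              "32 x 8192 x 8192 exceeds INT32_MAX");
static_assert(!fits_int32(flat_index(2, 32768, 32768, 0, 0)),
              "batch_idx = 2 with a 32768 x 32768 mask exceeds INT32_MAX");
```

In a kernel, the same idea would apply to the per-thread index: widen blockIdx.x to long long before multiplying by blockDim.x, so the product is never formed in 32 bits.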

2026-07-01 15:13:07 +08:00

dequant_fp8.cu

cuda: fix remaining int32-address and nondeterministic-reduction bugs

2026-07-01 15:13:07 +08:00

mxfp4_gemm.cu

quantization: MXFP4 W4A16 expert weights (memory-optimization foundation)

2026-06-12 15:01:42 +08:00

quantize_fp8.cu

quantization: single strided-batched FP8 MoE GEMM — cut per-token launches ~768→48

2026-06-12 01:23:29 +08:00