Fix three CUDA bugs, found in the review after 5b350ee / cfbd64d, that
those commits missed:
- flash_attention.cu decode_attention_bf16_kernel used atomicAdd to
merge per-warp partials into smem_O. Floating-point atomicAdd
accumulates in an order that varies between runs, the same
nondeterminism pattern that 5b350ee already fixed in paged_attention.cu
and gemv.cu. This kernel is on the legacy forward_gpu_cache path plus
the speculative bench baseline, so verify/decode parity depended on it.
Replace with smem_O_warp[32][HEAD_DIM_MAX] partials reduced in fixed
warp-id order.
- causal_mask.cu computed the flat address as
`batch_idx * rows * cols + row * cols + col` in int; batch=128 heads=28
seq=32768 already overflows int32. Promote the index to long long.
- quantization/dequant_fp8.cu had `int total = num_experts * rows * cols`
and `int expert_stride = rows * cols`; 32 experts × 8k × 8k overflows.
Same fix pattern as the MoE dense kernels in cfbd64d — 64-bit total /
idx / expert_stride, and grid computed in long long.
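The index promotion behind the two overflow fixes can be sketched host-side as follows. This is a minimal illustration, not the code from causal_mask.cu or dequant_fp8.cu: the helper names flat_index, exceeds_int32 and grid_blocks are hypothetical, and only the address formula and the example shapes come from the notes above.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the repository): flat offset into a
// [batch, rows, cols] tensor with every multiplication done in 64 bits.
inline long long flat_index(int batch_idx, int rows, int cols, int row, int col) {
    // Widening the first operand makes the whole product evaluate as long long.
    return static_cast<long long>(batch_idx) * rows * cols
         + static_cast<long long>(row) * cols
         + col;
}

// True when the offset no longer fits in a 32-bit int, i.e. when the
// original all-int expression would have overflowed.
inline bool exceeds_int32(long long idx) {
    return idx > std::numeric_limits<std::int32_t>::max();
}

// Block count for a 1-D launch over `total` elements, also computed in 64 bits.
inline long long grid_blocks(long long total, int threads_per_block) {
    return (total + threads_per_block - 1) / threads_per_block;
}
```

With rows = cols = 32768, batch_idx = 2 already lands on 2^31, one past INT32_MAX, and 32 experts × 8192 × 8192 hits the same value.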
flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the
causal `continue`); an early tile entirely outside the sliding window was
still processed with every key masked to -inf, so row_max == -INFINITY.
Folding that into the online softmax computed expf(-inf - (-inf)) = NaN,
and the next valid tile's 0*NaN correction then poisoned the whole row.
Result: the gpt-oss prefill produced all-NaN logits for any query whose
sliding window (128) starts past the first KV tile — i.e. at longer
context — collapsing generation into a single repeated token (argmax of
all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was
the residual multi-turn/long-context collapse.
Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes
nothing to the softmax. The decode kernel already guards
local_max == -INFINITY, so it was unaffected.
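The failure and the guard can be reproduced with a scalar host-side sketch of tiled online softmax for one query row. This is an illustration under assumptions, not the kernel: online_softmax_row and its parameters are hypothetical names, each key carries one scalar value instead of a HEAD_DIM vector, and masking is represented by -INFINITY scores already being in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One query row of tiled online softmax (hypothetical reference, not the kernel).
// `scores` holds pre-masked logits (-INFINITY = masked key). Assumes at least
// one key in the row is visible, as is the case under a causal mask.
inline float online_softmax_row(const std::vector<float>& scores,
                                const std::vector<float>& values,
                                std::size_t tile, bool skip_masked_tiles) {
    float m = -INFINITY;  // running max
    float l = 0.0f;       // running denominator
    float o = 0.0f;       // running unnormalized weighted sum
    for (std::size_t start = 0; start < scores.size(); start += tile) {
        const std::size_t end = std::min(start + tile, scores.size());
        float row_max = -INFINITY;
        for (std::size_t j = start; j < end; ++j) row_max = std::max(row_max, scores[j]);
        // The guard: a tile with every key masked contributes nothing.
        if (skip_masked_tiles && row_max == -INFINITY) continue;
        const float m_new = std::max(m, row_max);
        // Without the guard, m == m_new == -inf here, so this is exp(NaN) = NaN.
        const float correction = std::exp(m - m_new);
        float tile_l = 0.0f, tile_o = 0.0f;
        for (std::size_t j = start; j < end; ++j) {
            const float p = std::exp(scores[j] - m_new);
            tile_l += p;
            tile_o += p * values[j];
        }
        l = l * correction + tile_l;
        o = o * correction + tile_o;
        m = m_new;
    }
    return o / l;
}
```

With the first tile fully masked and the guard disabled, the NaN from that tile survives the next tile's zero correction and the row result is NaN; with the guard enabled the same input yields the ordinary softmax average over the visible keys.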
Verified on dash5 (TP=2): the prefill that previously went all-NaN now
produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long
prior answer) completes correctly instead of emitting "!!!!".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel does) and supports an optional sliding-window mask matching HF gpt-oss.
Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation so that
prefill matches the reference math exactly.
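A host-side sketch of what "folding the sink into the denominator" means for one query row. This is a reference under assumptions, not the kernel: sink_softmax and its parameters are hypothetical names, the window convention (key k is visible when k > q_pos - window) is assumed to match the HF mask, and window <= 0 is an assumed way to disable the window.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Attention weights for the query at position q_pos over keys 0..n-1, with a
// per-head sink logit that adds mass to the denominator but has no value row.
// Hypothetical reference; window <= 0 disables the sliding window (assumption).
inline std::vector<float> sink_softmax(const std::vector<float>& scores,
                                       float sink, int q_pos, int window) {
    const int n = static_cast<int>(scores.size());
    auto visible = [&](int k) {
        const bool causal = k <= q_pos;
        const bool in_window = window <= 0 || k > q_pos - window;
        return causal && in_window;
    };
    // The sink takes part in the max, so m is finite even if no key is visible.
    float m = sink;
    for (int k = 0; k < n; ++k)
        if (visible(k)) m = std::max(m, scores[k]);
    float denom = std::exp(sink - m);
    for (int k = 0; k < n; ++k)
        if (visible(k)) denom += std::exp(scores[k] - m);
    std::vector<float> w(n, 0.0f);  // masked keys keep weight 0
    for (int k = 0; k < n; ++k)
        if (visible(k)) w[k] = std::exp(scores[k] - m) / denom;
    return w;
}
```

Because the sink absorbs part of the probability mass, the returned weights sum to less than 1, which is the difference from a plain softmax that a post-hoc correction has to approximate.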
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
pool backing) and offset-aware D2H/H2D copies used to move KV blocks
between the GPU pool and pinned host memory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three performance optimizations targeting decode throughput:
1. Decode Attention Kernel (csrc/attention/flash_attention.cu):
- Specialized kernel for Q_len=1 (decode step)
- 256 threads parallelize across KV sequence dimension
- Online softmax with block-level warp-shuffle reduction
- Replaces the FA2 kernel, which wasted 63/64 threads for decode
- flash_attention() auto-dispatches when q_len==1
2. Fused SiLU×Mul (csrc/activation/activations.cu):
- Single kernel: out = silu(gate) * up
- Saves 1 HBM read + 1 HBM write per FFN layer (N elements)
- Eliminates intermediate tensor allocation
3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu):
- Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual)
- Saves 1 full HBM round-trip per attention block
- Eliminates separate add + rmsnorm kernel pair
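For reference, the math of the two fused element-wise ops (items 2 and 3) can be written as a host-side sketch. The function names, the AddNormResult struct, and the weight and eps parameters are assumptions for illustration; the actual kernel signatures in activations.cu and rmsnorm.cu may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// out[i] = silu(gate[i]) * up[i] in a single pass, silu(x) = x / (1 + exp(-x)).
// Assumes gate and up have the same length.
inline std::vector<float> silu_mul(const std::vector<float>& gate,
                                   const std::vector<float>& up) {
    std::vector<float> out(gate.size());
    for (std::size_t i = 0; i < gate.size(); ++i)
        out[i] = gate[i] / (1.0f + std::exp(-gate[i])) * up[i];
    return out;
}

// Hypothetical result pair: (normed, sum) = (rmsnorm(x + residual), x + residual).
struct AddNormResult {
    std::vector<float> normed;
    std::vector<float> sum;
};

// Adds the residual once and reuses the sum for the norm, so x + residual is
// never written out and read back between two separate kernels.
// Assumes x, residual and weight have the same non-zero length.
inline AddNormResult add_rmsnorm(const std::vector<float>& x,
                                 const std::vector<float>& residual,
                                 const std::vector<float>& weight, float eps) {
    const std::size_t n = x.size();
    AddNormResult r{std::vector<float>(n), std::vector<float>(n)};
    float sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        r.sum[i] = x[i] + residual[i];
        sq += r.sum[i] * r.sum[i];
    }
    const float inv_rms = 1.0f / std::sqrt(sq / static_cast<float>(n) + eps);
    for (std::size_t i = 0; i < n; ++i)
        r.normed[i] = r.sum[i] * inv_rms * weight[i];
    return r;
}
```

A reference like this is mainly useful as a parity check against the fused kernels' output.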
Performance analysis:
- At current short sequences (max 79 tokens), these optimizations provide
marginal benefit because the bottleneck is cuBLAS GEMV overhead:
252 weight matrix reads × ~61.5 MB each (average) = 15.5 GB per decode step.
Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap).
- The fused kernels and decode attention will show larger gains at
longer sequences where attention and element-wise ops dominate.
- Next optimization target: CUDA Graphs to eliminate kernel launch
overhead, or custom GEMV kernels to replace cuBLAS for M=1.
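The bandwidth bound quoted above can be checked with a few lines of arithmetic. The helper names are hypothetical, and decimal units (1 GB = 1e9 bytes, 1 TB/s = 1e12 bytes/s) are assumed.

```cpp
#include <cassert>
#include <cmath>

// Lower bound on step time when every weight byte is read once from HBM.
inline double min_step_ms(double bytes_per_step, double bandwidth_bytes_per_s) {
    return bytes_per_step / bandwidth_bytes_per_s * 1e3;
}

// Ratio of the measured step time to that lower bound.
inline double gap(double actual_ms, double min_ms) {
    return actual_ms / min_ms;
}
```

For 15.5 GB at 1.79 TB/s this gives roughly 8.7 ms, and a measured ~78 ms step is about 9x above it.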
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>