xserv

Author	SHA1	Message	Date
Gahow Wang	5f060902f6	cuda: fix remaining int32-address and nondeterministic-reduction bugs Three CUDA bugs from the review after `5b350ee` / `cfbd64d` that were missed by those commits: - flash_attention.cu decode_attention_bf16_kernel used atomicAdd to merge per-warp partials into smem_O — same nondeterminism pattern that `5b350ee` already fixed in paged_attention.cu and gemv.cu. This kernel is on the legacy forward_gpu_cache path plus the speculative bench baseline, so verify/decode parity depended on it. Replace with smem_O_warp[32][HEAD_DIM_MAX] partials reduced in fixed warp-id order. - causal_mask.cu computed the flat address as `batch_idx * rows * cols + row * cols + col` in int; batch=128 heads=28 seq=32768 already overflows int32. Promote the index to long long. - quantization/dequant_fp8.cu had `int total = num_experts * rows * cols` and `int expert_stride = rows * cols`; 32 experts × 8k × 8k overflows. Same fix pattern as the MoE dense kernels in `cfbd64d` — 64-bit total / idx / expert_stride, and grid computed in long long.	2026-07-01 15:13:07 +08:00
Gahow Wang	5157b2cd30	kernels: fix NaN in flash-attention sinks on fully-masked window tiles flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the causal `continue`); an early tile entirely outside the sliding window was still processed with every key masked to -inf, so row_max == -INFINITY. Folding that into the online softmax computed expf(-inf - (-inf)) = NaN, and the next valid tile's 0*NaN correction then poisoned the whole row. Result: the gpt-oss prefill produced all-NaN logits for any query whose sliding window (128) starts past the first KV tile — i.e. at longer context — collapsing generation into a single repeated token (argmax of all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was the residual multi-turn/long-context collapse. Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes nothing to the softmax. The decode kernel already guards local_max == -INFINITY, so it was unaffected. Verified on dash5 (TP=2): the prefill that previously went all-NaN now produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long prior answer) completes correctly instead of emitting "!!!!". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 16:09:43 +08:00
Gahow Wang	9c98c169ff	kernels: flash attention with gpt-oss sinks + sliding window Add flash_attention_sinks_bf16 prefill kernel that folds the per-head attention sink into the softmax denominator (exactly as the decode sink kernel) and supports an optional sliding-window mask matching HF gpt-oss. Wire it through xserv-kernels (flash_attention_sinks) and use it in GptOss prefill, replacing the post-hoc sink approximation for an exact match against the reference math. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:10 +08:00
Gahow Wang	4c3f914459	kernels/cuda: paged-attention kernel, dispatch, pinned host memory CUDA layer for the paged-KV + swap work: - csrc: new paged_attention.cu plus updates across attention/gemm/norm/ activation/embedding/reduce kernels and common.cuh. - xserv-kernels: new dispatch module and kernel-binding updates. - xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap pool backing) and offset-aware D2H/H2D copies used to move KV blocks between the GPU pool and pinned host memory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:58:36 +08:00
Gahow Wang	9783fcf410	phase 15: decode attention kernel + fused silu_mul + fused add_rmsnorm Three performance optimizations targeting decode throughput: 1. Decode Attention Kernel (csrc/attention/flash_attention.cu): - Specialized kernel for Q_len=1 (decode step) - 256 threads parallelize across KV sequence dimension - Online softmax with block-level warp-shuffle reduction - Replaces FA2 kernel which wasted 63/64 threads for decode - flash_attention() auto-dispatches when q_len==1 2. Fused SiLU×Mul (csrc/activation/activations.cu): - Single kernel: out = silu(gate) * up - Saves 1 HBM read + 1 HBM write per FFN layer (N elements) - Eliminates intermediate tensor allocation 3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu): - Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual) - Saves 1 full HBM round-trip per attention block - Eliminates separate add + rmsnorm kernel pair Performance analysis: - At current short sequences (max 79 tokens), these optimizations provide marginal benefit because the bottleneck is cuBLAS GEMV overhead: 252 weight matrix reads × ~32MB each = 15.5 GB per decode step. Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap). - The fused kernels and decode attention will show larger gains at longer sequences where attention and element-wise ops dominate. - Next optimization target: CUDA Graphs to eliminate kernel launch overhead, or custom GEMV kernels to replace cuBLAS for M=1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 19:40:56 +08:00
Gahow Wang	d67dda404e	phase 14: Flash Attention 2 for SM120 (RTX 5090) Implement Flash Attention 2 forward kernel targeting SM120 (CC 12.0). FA4 requires TMEM (only on data-center Blackwell SM100), so FA2 is the correct target for consumer Blackwell GPUs like the RTX 5090. CUDA kernel (csrc/attention/flash_attention.cu): - Online softmax with tiled Q/K/V — O(1) extra memory, no S×S matrix - Tile sizes: BR=BC=64, head_dim up to 128 (runtime parameter) - BF16 input, FP32 accumulation, BF16 output - Native GQA: kv_head = q_head / (num_q_heads / num_kv_heads) - Causal mask with tile-level skip optimization - Shared memory: 32 KB (Q_tile 16KB + KV_tile 16KB, fits in 48KB default) - Grid: (q_tiles, batch × num_q_heads), Block: 128 threads Integration: - flash_attention() Rust wrapper in xserv-kernels with shape/dtype validation - Qwen3 forward_gpu_cache uses flash_attention directly (no repeat_kv_gpu) - Eliminates repeat_kv memory allocation + copy per layer per step - Naive attention() preserved for testing/comparison Validated on dash5 (RTX 5090, CUDA 12.9): - Correctness: 9/10 top-1 match vs HF (identical to pre-FA baseline) - Throughput: 12.9 tok/s (up from 10.3, +25% improvement) - Now at 35% of HF transformers baseline (up from 30%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 18:27:39 +08:00
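The online-softmax recurrence from d67dda404e and the fully-masked-tile guard from 5157b2cd30 can be illustrated on the host. This is a minimal sketch for a single query row with scalar values, not the kernel itself: `online_softmax_row`, its parameters, and the `skip_masked_tiles` flag are hypothetical names chosen for illustration, and a masked key is assumed to be represented by a score of -infinity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side sketch (hypothetical helper, not the CUDA kernel): online softmax
// over one query row, consuming the scores tile by tile. Masked keys carry a
// score of -infinity. Returns the softmax-weighted sum of `values`.
float online_softmax_row(const std::vector<float>& scores,
                         const std::vector<float>& values,
                         std::size_t tile,
                         bool skip_masked_tiles) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    float running_max = neg_inf;  // max score seen so far
    float denom = 0.0f;           // running softmax denominator
    float acc = 0.0f;             // running weighted sum of values

    for (std::size_t start = 0; start < scores.size(); start += tile) {
        const std::size_t end = std::min(start + tile, scores.size());

        float tile_max = neg_inf;
        for (std::size_t i = start; i < end; ++i)
            tile_max = std::fmax(tile_max, scores[i]);

        // Guard: a tile whose keys are all masked contributes nothing.
        if (skip_masked_tiles && tile_max == neg_inf) continue;

        const float new_max = std::fmax(running_max, tile_max);
        // Without the guard, both maxima can be -inf here, and
        // exp(-inf - (-inf)) evaluates to NaN.
        const float correction = std::exp(running_max - new_max);
        denom *= correction;
        acc *= correction;

        for (std::size_t i = start; i < end; ++i) {
            const float p = std::exp(scores[i] - new_max);
            denom += p;
            acc += p * values[i];
        }
        running_max = new_max;
    }
    // Assumes at least one key is visible; under a causal mask every query
    // sees itself, so denom is nonzero once masked tiles are skipped.
    return acc / denom;
}
```

With scores `{-inf, -inf, 0, 0}`, values `{1, 2, 3, 5}` and a tile size of 2, the unguarded path returns NaN because the first tile poisons the accumulators and the later `0 * NaN` correction cannot recover them. The guarded path averages the two visible values.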

6 Commits
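
The 64-bit index promotion described in 5f060902f6 can be sketched as plain C++ that would also compile as device code. The function names below (`flat_index`, `total_elements`, `grid_blocks`) are hypothetical and only mirror the arithmetic named in the commit message; the actual kernel signatures are not shown in this log.

```cpp
#include <cstdint>

// Flat offset into a [batch, rows, cols] tensor. Casting the first operand
// of each product to 64-bit promotes the whole expression, so no
// intermediate is ever evaluated in 32-bit int.
std::int64_t flat_index(int batch_idx, int row, int col, int rows, int cols) {
    return static_cast<std::int64_t>(batch_idx) * rows * cols
         + static_cast<std::int64_t>(row) * cols
         + col;
}

// Total element count for a [num_experts, rows, cols] weight tensor.
// 32 * 8192 * 8192 is 2^31, one past the largest int32 value (2^31 - 1).
std::int64_t total_elements(int num_experts, int rows, int cols) {
    return static_cast<std::int64_t>(num_experts) * rows * cols;
}

// Number of thread blocks needed to cover `total` elements, also computed
// in 64-bit so the rounding add cannot overflow.
std::int64_t grid_blocks(std::int64_t total, int threads_per_block) {
    return (total + threads_per_block - 1) / threads_per_block;
}
```

With `rows = cols = 32768`, the offset of the first element of batch index 2 is already 2^31, which is why the promotion has to happen before the first multiply rather than on the final sum.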