xserv

Author	SHA1	Message	Date
Gahow Wang	6da0972740	speculative: copy_kv_position primitive for tree drafting KV remap SGLang-style "write-all, copy-move on acceptance" approach: after tree verification, physically copy an accepted sibling's K/V from its physical cache slot to the canonical sequential position. New CUDA kernel: copy_kv_position_kernel in reshape_and_cache.cu. For one token (src_pos → dst_pos), copies head_dim × num_kv_heads BF16 elements in both K and V pools. Grid = num_kv_heads, block = head_dim. Cost for one token across 36 layers: ~5.3 MB D2D copy @ 900 GB/s = <6μs. Rust FFI: copy_kv_position(k_pool, v_pool, block_ids, src_pos, dst_pos, num_kv_heads, head_dim, block_size, stream). PagedKVCache method: copy_kv_position(slot, src_pos, dst_pos) — uploads block_ids for the sequence, calls the kernel per layer. This is the primitive needed by tree drafting: when a non-primary sibling at cache position P+2 is accepted as the "true" token for target position P+1, call copy_kv_position(slot, P+2, P+1) then truncate to P+2. Next: wire into bench-eagle3 tree drafting loop with top-2 siblings.	2026-07-01 23:09:35 +08:00
Gahow Wang	fd392f7fbb	attention: tree-aware paged_decode_attention_tree kernel + wrapper New CUDA kernel paged_decode_attention_tree_bf16_kernel: same as base paged_decode_attention but with a per-query mask over the newly-written K/V region. `tree_mask[i][j] != 0` iff query i attends to newly-written K/V at slot j. Positions before `tree_start` are always attended. Motivation: speculative decoding with tree drafting needs siblings at the same target position to attend to their own branch's history, not each other's K/V. Rust binding: paged_decode_attention_tree(...) mirrors paged_decode_attention plus tree_mask_ptr, tree_start, tree_len. Forward path: Qwen3::forward_verify_paged_decode_attention_tree_with_hidden takes explicit positions, kv_lens, and a flattened [N*N] tree_mask. Sanity check: bench-eagle3's γ_multi path now routes through the tree kernel with a causal mask (mask[i][j]=1 iff j<=i), producing bit- equivalent output to the non-tree variant. matched=false pattern + acceptance rate + speedup all identical to previous run within noise (11.3% acceptance, 1.00× speedup with the mask-check overhead). --tree CLI flag is parsed but reserved. Real tree drafting (siblings sharing a target position) is blocked by KV cache position rigidity: paged_cache stores K/V at cache-position ≡ target-position, so an accepted sibling at target position P+1 has its K/V physically at cache position P+2 (its unique slot in the batched write). Continuing decode at P+1 would see the WRONG K/V (top-1 sibling's, not accepted top-2 sibling's). Fix requires either KV-slot remap on acceptance or a virtual position layer. Infrastructure is in place, next step is tackling that remap.	2026-07-01 20:45:55 +08:00
Gahow Wang	5f060902f6	cuda: fix remaining int32-address and nondeterministic-reduction bugs Three CUDA bugs from the review after `5b350ee` / `cfbd64d` that were missed by those commits: - flash_attention.cu decode_attention_bf16_kernel used atomicAdd to merge per-warp partials into smem_O — same nondeterminism pattern that `5b350ee` already fixed in paged_attention.cu and gemv.cu. This kernel is on the legacy forward_gpu_cache path plus the speculative bench baseline, so verify/decode parity depended on it. Replace with smem_O_warp[32][HEAD_DIM_MAX] partials reduced in fixed warp-id order. - causal_mask.cu computed the flat address as `batch_idx * rows * cols + row * cols + col` in int; batch=128 heads=28 seq=32768 already overflows int32. Promote the index to long long. - quantization/dequant_fp8.cu had `int total = num_experts * rows * cols` and `int expert_stride = rows * cols`; 32 experts × 8k × 8k overflows. Same fix pattern as the MoE dense kernels in `cfbd64d` — 64-bit total / idx / expert_stride, and grid computed in long long.	2026-07-01 15:13:07 +08:00
Gahow Wang	5b350ee5f0	cuda: deterministic BF16 gemv + paged attention reductions BF16 greedy decode was sensitive to inter-block scheduling when logits were close, which broke speculative-decoding verify-vs-decode parity. - gemv.cu: write per-K-block partials, then reduce in fixed block order in a second kernel instead of atomicAdd across K-blocks. Scratch buffer size is now n * ceil(k / GEMV_TILE_K); gemv_scratch_elems() exposes this to callers, and decode_graph.rs sizes fp32_hidden/q/kv/ intermediate/vocab from it. - paged_attention.cu: replace atomicAdd merge of warp outputs with per-warp shared partials reduced in warp-id order for both the base and sinks kernels.	2026-07-01 14:16:28 +08:00
Gahow Wang	5157b2cd30	kernels: fix NaN in flash-attention sinks on fully-masked window tiles flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the causal `continue`); an early tile entirely outside the sliding window was still processed with every key masked to -inf, so row_max == -INFINITY. Folding that into the online softmax computed expf(-inf - (-inf)) = NaN, and the next valid tile's 0*NaN correction then poisoned the whole row. Result: the gpt-oss prefill produced all-NaN logits for any query whose sliding window (128) starts past the first KV tile — i.e. at longer context — collapsing generation into a single repeated token (argmax of all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was the residual multi-turn/long-context collapse. Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes nothing to the softmax. The decode kernel already guards local_max == -INFINITY, so it was unaffected. Verified on dash5 (TP=2): the prefill that previously went all-NaN now produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long prior answer) completes correctly instead of emitting "!!!!". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 16:09:43 +08:00
Gahow Wang	9c98c169ff	kernels: flash attention with gpt-oss sinks + sliding window Add flash_attention_sinks_bf16 prefill kernel that folds the per-head attention sink into the softmax denominator (exactly as the decode sink kernel) and supports an optional sliding-window mask matching HF gpt-oss. Wire it through xserv-kernels (flash_attention_sinks) and use it in GptOss prefill, replacing the post-hoc sink approximation for an exact match against the reference math. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:10 +08:00
Gahow Wang	9ad91a4a92	phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2 Add Mixture-of-Experts support for the gpt-oss-20b model (20.9B params, 32 experts × top-4 routing). Key additions: - ModelConfig: MoE fields (num_local_experts, layer_types, sliding_window, attention_bias, explicit head_dim, rope_scaling, swiglu_limit) - YaRN RoPE: RopeCache::new_yarn() with correct frequency interpolation and attention_scaling = 0.1*ln(factor)+1 - Custom GLU kernel: gpt_oss_glu_bf16 (clamped sigmoid gate activation) - Paged attention with sinks + sliding window kernel variant - GptOss model struct with expert-parallel TP (split 32 experts across ranks) - bench-gpt-oss binary for TP inference benchmarking Verified on dash5 with 2x RTX 5090: 63.6 tok/s decode, ~160ms TTFT. Model generates topically-coherent output (needs chat template for quality). Known issues: - Custom GEMV kernel produces NaN with small N (workaround: pad to M=2) - Prefill doesn't use attention sinks (uses standard flash attention) - Output quality requires chat template formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:18:01 +08:00
Gahow Wang	13ae3de69e	kernels: reshape_and_cache, GPU argmax, single-launch GEMV Three new CUDA kernels and one rewrite: - reshape_and_cache: scatter K/V into paged pool in a single kernel per layer, replacing the Rust-side per-token per-head cudaMemcpy loop. Includes both single-sequence (prefill) and batched (decode) variants. - argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy decode now only D2H-transfers B×4 bytes (token ids) instead of the full [B, vocab] logits tensor. - GEMV rewrite: fused zero-init inside the K-split kernel eliminates the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:17 +08:00
Gahow Wang	4c3f914459	kernels/cuda: paged-attention kernel, dispatch, pinned host memory CUDA layer for the paged-KV + swap work: - csrc: new paged_attention.cu plus updates across attention/gemm/norm/ activation/embedding/reduce kernels and common.cuh. - xserv-kernels: new dispatch module and kernel-binding updates. - xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap pool backing) and offset-aware D2H/H2D copies used to move KV blocks between the GPU pool and pinned host memory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:58:36 +08:00
Gahow Wang	986a289616	fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090 P0 fixes (blocking usability): - FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul) - FIX-16: EOS token no longer leaks into API responses - FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256) - FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400 P1 fixes (bugs & performance): - FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat) - FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety - FIX-09: tokenizer byte_fallback graceful degradation (was panic) - FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf) - FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm - FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches P2 fixes (improvements): - FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations - FIX-23: RoPE cache no longer artificially capped at 8192 positions Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 14:13:43 +08:00
Gahow Wang	9783fcf410	phase 15: decode attention kernel + fused silu_mul + fused add_rmsnorm Three performance optimizations targeting decode throughput: 1. Decode Attention Kernel (csrc/attention/flash_attention.cu): - Specialized kernel for Q_len=1 (decode step) - 256 threads parallelize across KV sequence dimension - Online softmax with block-level warp-shuffle reduction - Replaces FA2 kernel which wasted 63/64 threads for decode - flash_attention() auto-dispatches when q_len==1 2. Fused SiLU×Mul (csrc/activation/activations.cu): - Single kernel: out = silu(gate) * up - Saves 1 HBM read + 1 HBM write per FFN layer (N elements) - Eliminates intermediate tensor allocation 3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu): - Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual) - Saves 1 full HBM round-trip per attention block - Eliminates separate add + rmsnorm kernel pair Performance analysis: - At current short sequences (max 79 tokens), these optimizations provide marginal benefit because the bottleneck is cuBLAS GEMV overhead: 252 weight matrix reads × ~32MB each = 15.5 GB per decode step. Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap). - The fused kernels and decode attention will show larger gains at longer sequences where attention and element-wise ops dominate. - Next optimization target: CUDA Graphs to eliminate kernel launch overhead, or custom GEMV kernels to replace cuBLAS for M=1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 19:40:56 +08:00
Gahow Wang	d67dda404e	phase 14: Flash Attention 2 for SM120 (RTX 5090) Implement Flash Attention 2 forward kernel targeting SM120 (CC 12.0). FA4 requires TMEM (only on data-center Blackwell SM100), so FA2 is the correct target for consumer Blackwell GPUs like the RTX 5090. CUDA kernel (csrc/attention/flash_attention.cu): - Online softmax with tiled Q/K/V — O(1) extra memory, no S×S matrix - Tile sizes: BR=BC=64, head_dim up to 128 (runtime parameter) - BF16 input, FP32 accumulation, BF16 output - Native GQA: kv_head = q_head / (num_q_heads / num_kv_heads) - Causal mask with tile-level skip optimization - Shared memory: 32 KB (Q_tile 16KB + KV_tile 16KB, fits in 48KB default) - Grid: (q_tiles, batch × num_q_heads), Block: 128 threads Integration: - flash_attention() Rust wrapper in xserv-kernels with shape/dtype validation - Qwen3 forward_gpu_cache uses flash_attention directly (no repeat_kv_gpu) - Eliminates repeat_kv memory allocation + copy per layer per step - Naive attention() preserved for testing/comparison Validated on dash5 (RTX 5090, CUDA 12.9): - Correctness: 9/10 top-1 match vs HF (identical to pre-FA baseline) - Throughput: 12.9 tok/s (up from 10.3, +25% improvement) - Now at 35% of HF transformers baseline (up from 30%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 18:27:39 +08:00
Gahow Wang	6035ffdc0b	phase 5: naive multi-head attention - Batched GEMM via cublasGemmStridedBatchedEx - Causal mask CUDA kernel (F32 + BF16) - Element-wise scale CUDA kernel (F32 + BF16) - attention() composing: batched_matmul + scale + causal_mask + softmax - Fixed to_device/contiguous infinite recursion (GPU contiguous via CPU round-trip) - 5 attention tests passing (max_err < 3e-7 F32) - Total: 61 tests passing across all crates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:17:23 +08:00

13 Commits