xserv

Author	SHA1	Message	Date
Gahow Wang	6da0972740	speculative: copy_kv_position primitive for tree drafting KV remap SGLang-style "write-all, copy-move on acceptance" approach: after tree verification, physically copy an accepted sibling's K/V from its physical cache slot to the canonical sequential position. New CUDA kernel: copy_kv_position_kernel in reshape_and_cache.cu. For one token (src_pos → dst_pos), copies head_dim × num_kv_heads BF16 elements in both K and V pools. Grid = num_kv_heads, block = head_dim. Cost for one token across 36 layers: ~5.3 MB D2D copy @ 900 GB/s = <6μs. Rust FFI: copy_kv_position(k_pool, v_pool, block_ids, src_pos, dst_pos, num_kv_heads, head_dim, block_size, stream). PagedKVCache method: copy_kv_position(slot, src_pos, dst_pos) — uploads block_ids for the sequence, calls the kernel per layer. This is the primitive needed by tree drafting: when a non-primary sibling at cache position P+2 is accepted as the "true" token for target position P+1, call copy_kv_position(slot, P+2, P+1) then truncate to P+2. Next: wire into bench-eagle3 tree drafting loop with top-2 siblings.	2026-07-01 23:09:35 +08:00
Gahow Wang	fd392f7fbb	attention: tree-aware paged_decode_attention_tree kernel + wrapper New CUDA kernel paged_decode_attention_tree_bf16_kernel: same as base paged_decode_attention but with a per-query mask over the newly-written K/V region. `tree_mask[i][j] != 0` iff query i attends to newly-written K/V at slot j. Positions before `tree_start` are always attended. Motivation: speculative decoding with tree drafting needs siblings at the same target position to attend to their own branch's history, not each other's K/V. Rust binding: paged_decode_attention_tree(...) mirrors paged_decode_attention plus tree_mask_ptr, tree_start, tree_len. Forward path: Qwen3::forward_verify_paged_decode_attention_tree_with_hidden takes explicit positions, kv_lens, and a flattened [N*N] tree_mask. Sanity check: bench-eagle3's γ_multi path now routes through the tree kernel with a causal mask (mask[i][j]=1 iff j<=i), producing bit-equivalent output to the non-tree variant. matched=false pattern + acceptance rate + speedup all identical to previous run within noise (11.3% acceptance, 1.00× speedup with the mask-check overhead). --tree CLI flag is parsed but reserved. Real tree drafting (siblings sharing a target position) is blocked by KV cache position rigidity: paged_cache stores K/V at cache-position ≡ target-position, so an accepted sibling at target position P+1 has its K/V physically at cache position P+2 (its unique slot in the batched write). Continuing decode at P+1 would see the WRONG K/V (top-1 sibling's, not accepted top-2 sibling's). Fix requires either KV-slot remap on acceptance or a virtual position layer. Infrastructure is in place, next step is tackling that remap.	2026-07-01 20:45:55 +08:00
Gahow Wang	e5734b41fa	speculative: batched-GEMV kernel for verify path (Phase 24 step 1) Add launch_gemv_bf16_batched: runs M m=1 GEMVs in a single 3D grid launch (z = batch row) with numerically identical output to M sequential launch_gemv_bf16 calls — same K-block partial accumulation, same fixed-order reduction. Verified on dash5 with 10 prompts × 32 tokens: matched=true, verify_decode_mismatches=0. Expose as matmul_batched_gemv(a: [M,K], b: [K,N]) → [M,N] in xserv-kernels. Replace the old matmul_rows_gemv helper in qwen3 forward_verify_paged_decode_attention; the per-row loop over matmul_2d + concat_rows is replaced by a single matmul_batched_gemv call that allocates the partials buffer in one shot and launches 2 kernels instead of 2*M. Current speedup_e2e is 0.47× (same ballpark as Phase 23 0.44×); the batched launch saves ~3 ms overhead but this is small relative to the total 28 ms spec cost. The path forward (per docs/24 §4) is higher acceptance rate or cheaper draft, not further kernel optimization.	2026-07-01 16:13:37 +08:00
Gahow Wang	5f060902f6	cuda: fix remaining int32-address and nondeterministic-reduction bugs Three CUDA bugs from the review after `5b350ee` / `cfbd64d` that were missed by those commits: - flash_attention.cu decode_attention_bf16_kernel used atomicAdd to merge per-warp partials into smem_O — same nondeterminism pattern that `5b350ee` already fixed in paged_attention.cu and gemv.cu. This kernel is on the legacy forward_gpu_cache path plus the speculative bench baseline, so verify/decode parity depended on it. Replace with smem_O_warp[32][HEAD_DIM_MAX] partials reduced in fixed warp-id order. - causal_mask.cu computed the flat address as `batch_idx * rows * cols + row * cols + col` in int; batch=128 heads=28 seq=32768 already overflows int32. Promote the index to long long. - quantization/dequant_fp8.cu had `int total = num_experts * rows * cols` and `int expert_stride = rows * cols`; 32 experts × 8k × 8k overflows. Same fix pattern as the MoE dense kernels in `cfbd64d` — 64-bit total / idx / expert_stride, and grid computed in long long.	2026-07-01 15:13:07 +08:00
Gahow Wang	a67753f516	softmax: cap block size at 512 threads launch_softmax_{f32,bf16} clamped block to 1024 threads when cols was larger. Halving the ceiling to 512 keeps two blocks per SM resident on the large vocab kernels that dominate speculative verify workloads without changing rows/block indexing, and never exceeds cols.	2026-07-01 14:16:32 +08:00
Gahow Wang	5b350ee5f0	cuda: deterministic BF16 gemv + paged attention reductions BF16 greedy decode was sensitive to inter-block scheduling when logits were close, which broke speculative-decoding verify-vs-decode parity. - gemv.cu: write per-K-block partials, then reduce in fixed block order in a second kernel instead of atomicAdd across K-blocks. Scratch buffer size is now n * ceil(k / GEMV_TILE_K); gemv_scratch_elems() exposes this to callers, and decode_graph.rs sizes fp32_hidden/q/kv/intermediate/vocab from it. - paged_attention.cu: replace atomicAdd merge of warp outputs with per-warp shared partials reduced in warp-id order for both the base and sinks kernels.	2026-07-01 14:16:28 +08:00
Gahow Wang	cfbd64d206	cuda: fix int32 overflow in MoE dense kernels; surface launch errors in release The dense MoE kernels (moe_replicate, moe_bias_add_3d, moe_weighted_sum) computed total / expert_stride / element indices in int32. gpt-oss prefill runs the whole prompt through the dense path unchunked (SPARSE_MAX_TOKENS=8), so local_experts*num_tokens*hidden (and batch*num_tokens*dim, and local_id*expert_stride) overflow int32 at ~3.6k-23k prefill tokens (TP-dependent) — well inside the supported context window. The launch then fails silently because CUDA_CHECK_LAST_ERROR was ((void)0) under NDEBUG, so the bias / weighted-sum simply never runs and the forward pass is corrupted with no error reported. Fix: switch the three kernels and their launchers to long long, mirroring the (long long) indexing already used in moe_sparse.cu. Also make CUDA_CHECK_LAST_ERROR always-on — cudaGetLastError does not sync, so the per-launch host cost is negligible, and a silent launch failure is exactly the class of bug this one was. Verified on dash5 (RTX 5090): a direct kernel test at 2.21B elements (>2^31) for both moe_replicate and moe_bias_add_3d produces correct results with no launch error; bench-gpt-oss TP=2 holds at 5.9ms TPOT, output unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-01 12:37:21 +08:00
Gahow Wang	5343391dbd	review cleanups: pp+gpt-oss guard, sparse GEMV asserts, warnings - --pp with gpt-oss now fails with a clear message instead of a cryptic missing-weight panic inside the Qwen3-only PP engine. - Sparse GEMV wrappers assert K%16==0 (FP8) / K%32==0 (MXFP4) — the uint4-vectorized kernels would silently drop a tail otherwise. - Document the topk_ids buffer holding i32 under an F32 dtype label (DType has no I32). - Drop unused imports/locals and the cuBLASLt scale-mode constants orphaned by the strided-batched FP8 rework (`e631a71`). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 17:02:59 +08:00
Gahow Wang	1897b2e17a	gpt-oss: drop debug syncs from forward; GPU broadcast bias-add Decode carried three leftover cudaDeviceSynchronize (prefill one) from NaN debugging — the Qwen3 path has none and the logits D2H in sample() already orders against the null stream. add_bias for rows>1 round-tripped the bias through the CPU (D2H + host tile loop + H2D) on every call — 96 times per prefill across q/k/v/o. Replace with a bias_add_2d broadcast kernel. dash5, FP8 TP=2, warm-server: TTFT 35/49/94 -> 29/42/79 ms (short/medium/long), TPOT 7.19-7.32 -> 6.99-7.21 ms. Greedy tokens unchanged; GSM8K-50 94%. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 17:02:59 +08:00
Gahow Wang	fb20178992	moe: sparse top-k decode — compute only routed experts (1.8x, beats llama TP=2) Dense MoE replicated x across all 16 local experts and ran the full batched GEMM, reading every expert's weights per token; the weighted sum then discarded 12 of 16 results. Decode is memory-bound, so this was ~8x wasted expert bytes — the entire decode gap vs llama.cpp. New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read topk_ids on-device (no host sync) and early-return block-uniformly for experts other ranks own. FP8 runs W8A16 (activations stay BF16 — tensor cores are irrelevant at M=1, and activation quantization error disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the GEMV epilogue; slot-indexed weighted sum skips (never multiplies) unwritten non-local slots. Dense path retained for num_tokens > 8 (prefill) and via XSERV_DENSE_MOE=1 for A/B. dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms. Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms — first config where xserv wins decode outright. GSM8K-100: 96% (dense FP8: 91%). llama TP=1 (2.9 ms) remains ahead: next levers are decode CUDA graphs, non-expert quantization, sparse prefill (docs/20). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 16:29:10 +08:00
Gahow Wang	d33220498a	quantization: MXFP4 W4A16 expert weights (memory-optimization foundation) Weight-only 4-bit for the gpt-oss MoE experts: weights stored MXFP4 (E2M1 + per-32-element UE8M0 block scale, tools/quantize_mxfp4.py), a fused kernel reads the 4-bit weights and dequantizes on-chip to BF16. Decode (M=1) uses a fused dequant-GEMV (batched_gemv_mxfp4) with shared-memory activation tiling; prefill (M>1) dequantizes to BF16 then reuses the BF16 batched GEMM. MXFP4 is detected by the scale tensor's rank (3-D [E,N,K/32]) vs FP8's 1-D [E]. Verified on dash5 (gpt-oss-20b, TP=2, 5090): byte-identical greedy tokens to FP8/BF16, smallest footprint (13 GB vs 22 GB FP8, 39 GB BF16) — fits one 32 GB 5090 with room for KV cache. NOT a decode speedup: the hand-written W4A16 GEMV (no tensor cores) is less efficient than cuBLASLt's FP8 tensor-core GEMM, so even at half the weight bytes decode is 17.0 ms vs FP8 13.5 ms (faster than BF16 18.8 ms); prefill regresses (350 vs 134 ms, dequant fallback). Committed as a correct memory-optimization foundation. Beating FP8 on speed needs FP4 tensor cores (W4A4, cuBLASLt block-scaled MXFP4) or a Marlin-class kernel; see docs/benchmarks/mxfp4-and-llama-decode.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 15:01:42 +08:00
Gahow Wang	e631a71b68	quantization: single strided-batched FP8 MoE GEMM — cut per-token launches ~768→48 The plan-cache fix removed the per-expert heuristic churn but still issued one cublasLtMatmul per expert: ~768 tiny launches per decoded token (16 local experts × 2 GEMMs × 24 layers), which capped the FP8 decode win at ~1.05× over BF16. Collapse each MoE GEMM into ONE strided-batched cuBLASLt FP8 matmul (BATCH_COUNT + strided-batch offsets on all four layouts) → ~48 launches/token. A single strided call can't carry a per-batch scalar B-scale, so the per-expert weight scale moves out of the GEMM epilogue into a fused post-scale kernel (rowwise_scale_moe_bf16) that applies a_scale[token]·b_scale[expert] in one pass. This is precision-equivalent: BF16's relative error is scale-invariant, so scaling the unscaled GEMM output afterward loses nothing vs scaling in-epilogue. Measured on dash5 (gpt-oss-20b, TP=2, 5090), warm-server GSM8K: decode TPOT 17.45 → 13.08 ms (FP8 now 1.41× vs BF16 18.39 ms), throughput 57.3 → 76.4 tok/s, accuracy unchanged (FP8 91.0% vs BF16 90.0%). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 01:23:29 +08:00
Gahow Wang	76487b7963	quantization: W8A8 FP8 compute via cuBLASLt tensor cores Replace the W8A16 dequant→BF16-GEMM path with native FP8×FP8→BF16 GEMM using cuBLASLt on Blackwell (RTX 5090). Both weights (static FP8 E4M3) and activations (dynamically quantized per-row) are processed directly on FP8 tensor cores. Key implementation details: - cuBLASLt on Blackwell requires transA=T for FP8, so expert weights are transposed during model loading ([E,K,N] → [E,N,K]) - Per-row activation quantization kernel (absmax/448 → FP8 E4M3) - Post-GEMM row-wise rescaling recovers per-token precision - Per-expert loop (not batched) due to cuBLASLt FP8 scale constraints The same FP8 quantized model files work — no re-quantization needed. Activation quantization happens dynamically at inference time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-07 20:38:26 +08:00
Gahow Wang	9f1fbbb98b	quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights Store expert gate_up_proj and down_proj weights in FP8 E4M3 (1 byte/elem) with per-expert FP32 scale factors. At inference, a fused CUDA kernel dequantizes to BF16 before the existing cuBLAS batched GEMM. Results on gpt-oss-20b (50-problem GSM8K subset): - FP8 TP=1: 47/50 = 94.0% (single RTX 5090, ~25 GB VRAM) - BF16 TP=2: 47/50 = 94.0% (requires 2× RTX 5090, ~39 GB total) No measurable accuracy degradation. Model size: 41.8 GB → 22.7 GB (−46%). New files: - tools/quantize_fp8.py: offline BF16→FP8 conversion script - csrc/quantization/dequant_fp8.cu: per-expert-scale dequant kernel - crates/xserv-kernels/src/quantization.rs: Rust FFI wrapper - tools/eval_gsm8k_batch.sh: GSM8K accuracy evaluation harness Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-07 19:33:07 +08:00
Gahow Wang	3b9e32e6cd	kernels: fix uninitialized shared-memory read in M=1 decode GEMV gemv_bf16_fused_kernel returned early on out-of-range columns (`if (col >= N) return;`) BEFORE the cooperative load of x into shared memory and the `__syncthreads()`. When N is not a multiple of GEMV_TILE_N (128), the last column-block's out-of-range threads exited without loading their slice of x_shared, so the in-range threads then read uninitialized shared memory in the dot product — and __syncthreads with exited threads is itself UB. Result: intermittent huge/garbage outputs (~1e33) that, after the next RMSNorm, collapsed the whole forward pass to a degenerate logit distribution (argmax → vocab_size-1, or NaN), derailing generation. This hit every M=1 BF16 GEMV (n>=256) with n % 128 != 0 — i.e. gpt-oss decode o_proj and the MoE projections (n=2880). q/k/v (4096) and lm_head (201088) are 128-aligned and were unaffected, as is Qwen3 (hidden 4096), which is why this manifested as intermittent gpt-oss-only decode failures. Fix: all threads participate in the shared-memory load and reach the barrier; the col>=N check moves to AFTER __syncthreads. Verified on dash5 (TP=2): a prompt that reliably produced garbage ~70% of runs now yields clean logits 16/16; the multi-turn Chinese chat that collapsed mid-conversation completes coherently with 0 NaN warnings. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 17:18:37 +08:00
Gahow Wang	5157b2cd30	kernels: fix NaN in flash-attention sinks on fully-masked window tiles flash_attention_sinks_bf16_kernel skipped only fully-future KV tiles (the causal `continue`); an early tile entirely outside the sliding window was still processed with every key masked to -inf, so row_max == -INFINITY. Folding that into the online softmax computed expf(-inf - (-inf)) = NaN, and the next valid tile's 0*NaN correction then poisoned the whole row. Result: the gpt-oss prefill produced all-NaN logits for any query whose sliding window (128) starts past the first KV tile — i.e. at longer context — collapsing generation into a single repeated token (argmax of all-NaN logits: vocab_size-1 in bench, token 0 "!" in the chat). This was the residual multi-turn/long-context collapse. Fix: skip a fully-masked tile (row_max == -INFINITY) — it contributes nothing to the softmax. The decode kernel already guards local_max == -INFINITY, so it was unaffected. Verified on dash5 (TP=2): the prefill that previously went all-NaN now produces clean logits; multi-turn gpt-oss chat (e.g. a haiku after a long prior answer) completes correctly instead of emitting "!!!!". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 16:09:43 +08:00
Gahow Wang	ae08896f46	xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug - Add ChatModel enum dispatching between Qwen3 and GptOss based on config.is_moe(), following the TP engine pattern. - Add --tp N flag for tensor-parallel inference (required for 39GB gpt-oss-20b which doesn't fit on a single 32GB GPU). - Add gpt-oss harmony chat template with channel/message format. - Replace hardcoded is_stop_token() with tokenizer.is_eos() for multi-model EOS support. - Restore gpt-oss hardcoded prompt template in server api.rs, lost during the Jinja template refactor. - Fix GEMV race condition: the K-split kernel zeroed the FP32 accumulator inside the kernel (block k=0) while other blocks atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead. - Update benchmark docs with post-fix results. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-02 00:58:10 +08:00
Gahow Wang	4368e79695	model: fused GPU MoE kernel — eliminate CPU roundtrip Replace the per-token CPU-routed MoE forward with an all-GPU path: 1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax) 2. moe_replicate: broadcast input to all local experts 3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS) 4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU) Expert weights stored as contiguous 3D tensors for strided batched GEMM. Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer). Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight diagnostic at load time. GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-31 13:22:59 +08:00
Gahow Wang	9c98c169ff	kernels: flash attention with gpt-oss sinks + sliding window Add flash_attention_sinks_bf16 prefill kernel that folds the per-head attention sink into the softmax denominator (exactly as the decode sink kernel) and supports an optional sliding-window mask matching HF gpt-oss. Wire it through xserv-kernels (flash_attention_sinks) and use it in GptOss prefill, replacing the post-hoc sink approximation for an exact match against the reference math. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-31 00:56:10 +08:00
Gahow Wang	9ad91a4a92	phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2 Add Mixture-of-Experts support for the gpt-oss-20b model (20.9B params, 32 experts × top-4 routing). Key additions: - ModelConfig: MoE fields (num_local_experts, layer_types, sliding_window, attention_bias, explicit head_dim, rope_scaling, swiglu_limit) - YaRN RoPE: RopeCache::new_yarn() with correct frequency interpolation and attention_scaling = 0.1*ln(factor)+1 - Custom GLU kernel: gpt_oss_glu_bf16 (clamped sigmoid gate activation) - Paged attention with sinks + sliding window kernel variant - GptOss model struct with expert-parallel TP (split 32 experts across ranks) - bench-gpt-oss binary for TP inference benchmarking Verified on dash5 with 2x RTX 5090: 63.6 tok/s decode, ~160ms TTFT. Model generates topically-coherent output (needs chat template for quality). Known issues: - Custom GEMV kernel produces NaN with small N (workaround: pad to M=2) - Prefill doesn't use attention sinks (uses standard flash attention) - Output quality requires chat template formatting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:18:01 +08:00
Gahow Wang	13ae3de69e	kernels: reshape_and_cache, GPU argmax, single-launch GEMV Three new CUDA kernels and one rewrite: - reshape_and_cache: scatter K/V into paged pool in a single kernel per layer, replacing the Rust-side per-token per-head cudaMemcpy loop. Includes both single-sequence (prefill) and batched (decode) variants. - argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy decode now only D2H-transfers B×4 bytes (token ids) instead of the full [B, vocab] logits tensor. - GEMV rewrite: fused zero-init inside the K-split kernel eliminates the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 12:50:17 +08:00
Gahow Wang	4c3f914459	kernels/cuda: paged-attention kernel, dispatch, pinned host memory CUDA layer for the paged-KV + swap work: - csrc: new paged_attention.cu plus updates across attention/gemm/norm/activation/embedding/reduce kernels and common.cuh. - xserv-kernels: new dispatch module and kernel-binding updates. - xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap pool backing) and offset-aware D2H/H2D copies used to move KV blocks between the GPU pool and pinned host memory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:58:36 +08:00
Gahow Wang	986a289616	fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090 P0 fixes (blocking usability): - FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul) - FIX-16: EOS token no longer leaks into API responses - FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256) - FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400 P1 fixes (bugs & performance): - FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat) - FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety - FIX-09: tokenizer byte_fallback graceful degradation (was panic) - FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf) - FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm - FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches P2 fixes (improvements): - FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations - FIX-23: RoPE cache no longer artificially capped at 8192 positions Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 14:13:43 +08:00
Gahow Wang	e207523e21	phase 15: custom GEMV kernel — 46.6 tok/s serial (3.5x improvement, 130% of HF) Custom bandwidth-optimized GEMV kernel for M=1 BF16 decode, replacing cuBLAS which achieves only ~8% bandwidth utilization for tiny M=1 GEMMs. Kernel design (csrc/gemm/gemv.cu): - K-split tiled: TILE_N=128, TILE_K=256, Grid=(N/128, K/256)=512 blocks - High occupancy: 512 blocks / 170 SMs = ~3 blocks/SM - Coalesced memory access: adjacent threads read adjacent columns of W - Shared memory for x vector (avoids redundant global reads) - FP32 accumulation via atomicAdd (K-split partial sums) - Separate fp32→bf16 conversion kernel Integration: - matmul() auto-dispatches to custom GEMV when M==1 && dtype==BF16 - Batched decode (M>1) continues to use cuBLAS - Caching allocator provides FP32 temp buffer (pooled, no per-call malloc) Ablation results (dash5, RTX 5090, Qwen3-8B BF16): \| Config \| tok/s \| vs HF (36) \| vs roofline (112) \| \|--------\|-------\|-----------\|-------------------\| \| Phase 14 (cuBLAS M=1) \| 13.2 \| 37% \| 12% \| \| + Custom GEMV (M=1) \| 46.6 \| 130% \| 42% \| \| Concurrent batch=4 \| 28.2 \| 78% \| — \| Single-request throughput now EXCEEDS HuggingFace transformers by 30%. The custom GEMV achieves ~42% of the theoretical roofline (vs 12% before). Note: concurrent batch=4 (28.2 tok/s) is slower than serial (46.6 tok/s) because the per-seq attention/reshape overhead in batched decode outweighs the cuBLAS M=4 benefit when the custom GEMV already handles M=1 efficiently. Engine should prefer serial decode when custom GEMV is available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 22:22:31 +08:00
Gahow Wang	9783fcf410	phase 15: decode attention kernel + fused silu_mul + fused add_rmsnorm Three performance optimizations targeting decode throughput: 1. Decode Attention Kernel (csrc/attention/flash_attention.cu): - Specialized kernel for Q_len=1 (decode step) - 256 threads parallelize across KV sequence dimension - Online softmax with block-level warp-shuffle reduction - Replaces FA2 kernel which wasted 63/64 threads for decode - flash_attention() auto-dispatches when q_len==1 2. Fused SiLU×Mul (csrc/activation/activations.cu): - Single kernel: out = silu(gate) * up - Saves 1 HBM read + 1 HBM write per FFN layer (N elements) - Eliminates intermediate tensor allocation 3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu): - Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual) - Saves 1 full HBM round-trip per attention block - Eliminates separate add + rmsnorm kernel pair Performance analysis: - At current short sequences (max 79 tokens), these optimizations provide marginal benefit because the bottleneck is cuBLAS GEMV overhead: 252 weight matrix reads × ~32MB each = 15.5 GB per decode step. Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap). - The fused kernels and decode attention will show larger gains at longer sequences where attention and element-wise ops dominate. - Next optimization target: CUDA Graphs to eliminate kernel launch overhead, or custom GEMV kernels to replace cuBLAS for M=1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 19:40:56 +08:00
Gahow Wang	d67dda404e	phase 14: Flash Attention 2 for SM120 (RTX 5090) Implement Flash Attention 2 forward kernel targeting SM120 (CC 12.0). FA4 requires TMEM (only on data-center Blackwell SM100), so FA2 is the correct target for consumer Blackwell GPUs like the RTX 5090. CUDA kernel (csrc/attention/flash_attention.cu): - Online softmax with tiled Q/K/V — O(1) extra memory, no S×S matrix - Tile sizes: BR=BC=64, head_dim up to 128 (runtime parameter) - BF16 input, FP32 accumulation, BF16 output - Native GQA: kv_head = q_head / (num_q_heads / num_kv_heads) - Causal mask with tile-level skip optimization - Shared memory: 32 KB (Q_tile 16KB + KV_tile 16KB, fits in 48KB default) - Grid: (q_tiles, batch × num_q_heads), Block: 128 threads Integration: - flash_attention() Rust wrapper in xserv-kernels with shape/dtype validation - Qwen3 forward_gpu_cache uses flash_attention directly (no repeat_kv_gpu) - Eliminates repeat_kv memory allocation + copy per layer per step - Naive attention() preserved for testing/comparison Validated on dash5 (RTX 5090, CUDA 12.9): - Correctness: 9/10 top-1 match vs HF (identical to pre-FA baseline) - Throughput: 12.9 tok/s (up from 10.3, +25% improvement) - Now at 35% of HF transformers baseline (up from 30%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 18:27:39 +08:00
Gahow Wang	ee68d3565d	fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul Strict code review identified 30+ issues across correctness, performance, and architecture. This commit addresses 14 of them with verified fixes, restructures Phase 12 for honest continuous batching, and updates Phase 14 to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4). Bug fixes: - FIX-01: Global cuBLAS handle (thread-local singleton, was per-call) - FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels - FIX-03: Qwen3 ChatML template (was plain text concatenation) - FIX-04: EOS token from tokenizer (was hardcoded 151645) - FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0)) - FIX-06: unsqueeze stride preserves contiguous layout - FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding) - FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic) Feature additions: - FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible) - FIX-11: Correct usage statistics (prompt/completion/total tokens) - FIX-13: Temperature / top-k / top-p sampling with SamplingParams Performance improvements: - FIX-07: Caching allocator wired up (thread-local pool, pooled flag) - FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw) - FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip) Architecture: - Phase 12 engine restructured: prefill/decode separation, honest TODO for batched GPU forward (requires Flash Attention) - Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090) - Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096) Validated on dash5 (8x RTX 5090): - 52/52 API prompts pass (EN/CN/code), SSE streaming verified - Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap - 8 concurrent requests: 5.99x scheduling speedup (batch_size=4) - Throughput: 10.3 tok/s (serial), 30% of HF baseline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:53:28 +08:00
Gahow Wang	2be27d6d94	perf: GPU transpose/reshape/repeat_kv kernels (eliminate CPU round-trips) New CUDA kernels (csrc/embedding/transpose.cu): - reshape_heads_bf16: [S, HD] → [1, H, S, D] - merge_heads_bf16: [1, H, S, D] → [S, HD] - transpose_hsd_to_shd_bf16: [1, H, S, D] → [S, H, D] (for RoPE) - transpose_shd_to_hsd_bf16: [S, H, D] → [1, H, S, D] (from RoPE) - repeat_kv_bf16: [1, KV_H, S, D] → [1, KV_H*n_rep, S, D] Rust wrappers (xserv-kernels/src/transpose.rs): - reshape_heads_gpu, merge_heads_gpu, transpose_for/from_rope_gpu, repeat_kv_gpu Qwen3 forward_gpu_cache now uses all GPU kernels — zero CPU data round-trips. Result: 50/50 self-consistent, 3-5% faster (TBT 142→137ms) Remaining bottleneck: ~900 device::synchronize() calls + 252 cuBLAS handle creations per token (Phase 15 targets) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 12:01:07 +08:00
Gahow Wang	be5c64ea8a	phase 10: GPU add/mul kernels + BF16 precision analysis Kernel additions: - add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU) - Refactored activation.rs with dispatch_unary/dispatch_binary helpers - Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add) BF16 precision analysis (docs/benchmarks/phase10-qwen3.md): - Root cause: separate attention kernels materialize BF16 intermediates (QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path) - HF itself SDPA vs Eager also differs by ~0.125 logit - xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% cases - Industry standard for BF16: top-5 overlap (we achieve 100%) - Fix path: Flash Attention (Phase 14) to fuse attention in FP32 Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 11:35:26 +08:00
Gahow Wang	6035ffdc0b	phase 5: naive multi-head attention - Batched GEMM via cublasGemmStridedBatchedEx - Causal mask CUDA kernel (F32 + BF16) - Element-wise scale CUDA kernel (F32 + BF16) - attention() composing: batched_matmul + scale + causal_mask + softmax - Fixed to_device/contiguous infinite recursion (GPU contiguous via CPU round-trip) - 5 attention tests passing (max_err < 3e-7 F32) - Total: 61 tests passing across all crates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:17:23 +08:00
Gahow Wang	c8e8153702	phase 4: transformer core kernels CUDA kernels (csrc/): - common.cuh: shared warp_reduce_sum/max, block_reduce_sum/max - normalization/rmsnorm.cu: RMSNorm (F32 + BF16) - normalization/layernorm.cu: LayerNorm with Welford (F32 + BF16) - activation/activations.cu: GELU tanh-approx + SiLU (F32 + BF16) - reduce/softmax.cu: safe softmax, 3-pass (F32 + BF16) - embedding/embedding.cu: gather lookup (F32 + BF16) - embedding/rope.cu: RoPE in-place + precomputed cos/sin cache (F32 + BF16) Rust wrappers (xserv-kernels/src/): - rmsnorm.rs, layernorm.rs, activation.rs, softmax.rs, embedding.rs, rope.rs - RopeCache struct with GPU-side precomputation Tests: 12 new tests (ops_test.rs), all passing with good precision: - F32: max_err 1e-6 ~ 1e-9 - BF16: max_err 2e-3 ~ 7e-3 Total: 29 kernel tests + 27 prior = 56 tests passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:07:24 +08:00
Gahow Wang	d77f921a12	phase 3: GEMM kernels (naive, tiled, cuBLAS) - Naive GEMM kernel: one thread per output element (F32 + BF16) - Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16) - cuBLAS wrapper: cublasGemmEx with row-major trick - GemmBackend enum for runtime backend selection - CublasContext RAII handle - Made error::check public for cross-crate use - 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16 - Cross-backend consistency verified (naive vs tiled vs cuBLAS) - All 44 tests pass across all crates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 19:48:05 +08:00
Gahow Wang	9806b4db35	phase 0+1: project scaffold + xserv-cuda crate - Cargo workspace with xserv-cuda crate - CUDA FFI bindings (cudart: memory, stream, device, error) - GpuBuffer RAII wrapper with H2D/D2H/D2D copy - CudaStream wrapper with RAII Drop - CachingAllocator with size-bucketed free lists - PinnedBuffer for page-locked host memory - Device info query via cudaDeviceGetAttribute - Vector-add CUDA kernel smoke test - Integration test suite (11 tests) - build.rs: cc crate compiles .cu for SM 12.0 - sync-and-build.sh for remote build on dash5 - Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 18:40:22 +08:00

33 Commits
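
Several commits in this log (5b350ee5f0, 5f060902f6, e5734b41fa) replace atomicAdd merges with per-block partials that are reduced in a fixed order. The following is a minimal host-side Rust sketch of why that matters for verify-vs-decode parity; it is not code from the repository. All function names and input values are hypothetical, and the tile size of 256 is assumed from the TILE_K=256 mentioned in e207523e21 (the actual GEMV_TILE_K may differ).

```rust
/// Number of K-blocks when the reduction dimension `k` is split into tiles
/// of `tile_k`, i.e. the `ceil(k / GEMV_TILE_K)` term from 5b350ee5f0.
fn num_k_blocks(k: usize, tile_k: usize) -> usize {
    (k + tile_k - 1) / tile_k
}

/// One partial dot product per K-block, stored in that block's own slot.
fn block_partials(x: &[f32], w: &[f32], tile_k: usize) -> Vec<f32> {
    x.chunks(tile_k)
        .zip(w.chunks(tile_k))
        .map(|(xs, ws)| xs.iter().zip(ws).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Models an atomicAdd merge: partials are added in whatever order the
/// blocks happen to finish (`arrival` lists block ids in finish order).
fn reduce_in_arrival_order(partials: &[f32], arrival: &[usize]) -> f32 {
    arrival.iter().fold(0.0f32, |acc, &b| acc + partials[b])
}

/// Models the second-pass reduction: always block 0, 1, 2, ...
fn reduce_in_block_order(partials: &[f32]) -> f32 {
    partials.iter().fold(0.0f32, |acc, &p| acc + p)
}

fn main() {
    // Hypothetical inputs chosen so that f32 rounding depends on the
    // summation order: 1e8 + 1 rounds back to 1e8 in f32.
    let x = [1.0e8f32, 1.0, -1.0e8, 1.0];
    let w = [1.0f32; 4];
    let partials = block_partials(&x, &w, 1);

    // Two possible block finish orders give two different sums.
    let a = reduce_in_arrival_order(&partials, &[0, 1, 2, 3]);
    let b = reduce_in_arrival_order(&partials, &[0, 2, 1, 3]);
    println!("arrival-order sums: {a} vs {b}");

    // The fixed-order reduction does not depend on finish order at all.
    println!("fixed-order sum: {}", reduce_in_block_order(&partials));

    // Scratch sizing for n output columns: n * ceil(k / tile_k) floats.
    let (n, k, tile_k) = (2880usize, 2880usize, 256usize);
    println!("scratch elems: {}", n * num_k_blocks(k, tile_k));
}
```

Because each block writes only to its own slot, the order in which blocks are scheduled cannot change the result. The cost is a scratch buffer and a second reduction pass.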