xserv

Files

Gahow Wang 4368e79695 model: fused GPU MoE kernel — eliminate CPU roundtrip

Replace the per-token CPU-routed MoE forward with an all-GPU path:

  1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax)
  2. moe_replicate: broadcast input to all local experts
  3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS)
  4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→FP32→BF16→GPU)
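
For orientation, the four steps above can be mirrored by a small CPU reference in plain Python. This is only a sketch under assumptions, not the kernels in this commit: the function names `topk_softmax`, `matvec` and `moe_forward` are hypothetical, each expert is reduced to a single weight matrix, and the softmax is assumed to run over the k selected logits only (the commit does not state whether softmax comes before or after top-k).

```python
import math

def topk_softmax(router_logits, k):
    """Step 1 (assumed order): pick the k largest logits, then softmax over those k."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)           # subtract max for stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def matvec(w, x):
    """Row-major matrix times vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def moe_forward(x, router_logits, expert_weights, k):
    ids, probs = topk_softmax(router_logits, k)
    # Steps 2-3: every local expert receives the same input x (replicate),
    # and all expert matmuls are computed as one batch.
    expert_out = [matvec(w, x) for w in expert_weights]
    # Step 4: weighted sum of the selected experts' outputs.
    # Python floats are doubles; the GPU path accumulates in FP32.
    out = [0.0] * len(expert_out[0])
    for e, p in zip(ids, probs):
        for j, v in enumerate(expert_out[e]):
            out[j] += p * v
    return out

if __name__ == "__main__":
    experts = [
        [[1.0, 0.0], [0.0, 1.0]],   # identity
        [[2.0, 0.0], [0.0, 2.0]],   # 2 * identity
        [[0.0, 0.0], [0.0, 0.0]],   # zeros (never selected below)
    ]
    print(moe_forward([1.0, 2.0], [0.0, 0.0, -5.0], experts, k=2))
```

A reference like this computes every expert and discards the unselected ones, which matches the "replicate to all local experts, then batch" shape of steps 2 and 3 and is convenient for comparing against GPU output within a tolerance.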

Expert weights stored as contiguous 3D tensors for strided batched GEMM.
Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer).
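
To illustrate why a contiguous 3D layout suits a strided batched GEMM, here is a minimal sketch of the indexing, assuming a row-major `[num_experts, rows, cols]` buffer. The helper names are hypothetical and the actual memory order used in this commit is not stated (cuBLAS itself is column-major); the point is only that expert `e` starts at a fixed offset `e * stride`, so one stride value describes the whole batch.

```python
def expert_stride(rows, cols):
    """Elements between the starts of two consecutive expert matrices."""
    return rows * cols

def flat_index(e, r, c, rows, cols):
    """Offset of element (r, c) of expert e in the packed buffer (row-major assumed)."""
    return e * expert_stride(rows, cols) + r * cols + c

def pack_experts(expert_weights):
    """Concatenate per-expert 2D matrices into one contiguous flat buffer."""
    flat = []
    for w in expert_weights:
        for row in w:
            flat.extend(row)
    return flat

if __name__ == "__main__":
    experts = [
        [[0, 1, 2], [3, 4, 5]],
        [[10, 11, 12], [13, 14, 15]],
    ]
    buf = pack_experts(experts)
    print(expert_stride(2, 3), buf[flat_index(1, 1, 2, 2, 3)])
```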

Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight
diagnostic at load time.

GSM8K (30 problems): 11/30 → 23/30 (76.7%), vs llama.cpp 30/30 (100%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-31 13:22:59 +08:00

activation

phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2

2026-05-30 15:18:01 +08:00

attention

kernels: flash attention with gpt-oss sinks + sliding window

2026-05-31 00:56:10 +08:00

embedding

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

gemm

kernels: reshape_and_cache, GPU argmax, single-launch GEMV