xserv

Files

Gahow Wang 4368e79695 model: fused GPU MoE kernel — eliminate CPU roundtrip

Replace the per-token CPU-routed MoE forward with an all-GPU path:

  1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax)
  2. moe_replicate: broadcast input to all local experts
  3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS)
  4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU)

Expert weights stored as contiguous 3D tensors for strided batched GEMM.
Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer).

Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight
diagnostic at load time.

GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-31 13:22:59 +08:00

src

model: fused GPU MoE kernel — eliminate CPU roundtrip

2026-05-31 13:22:59 +08:00

tests

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

build.rs

model: fused GPU MoE kernel — eliminate CPU roundtrip

2026-05-31 13:22:59 +08:00

Cargo.toml

phase 3: GEMM kernels (naive, tiled, cuBLAS)

2026-05-21 19:48:05 +08:00