Dense MoE replicated x across all 16 local experts and ran the full batched GEMM, reading every expert's weights per token; the weighted sum then discarded 12 of 16 results. Decode is memory-bound, so this was ~8x wasted expert bytes: the entire decode gap vs llama.cpp.

New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read topk_ids on-device (no host sync) and early-return block-uniformly for experts other ranks own. FP8 runs W8A16 (activations stay BF16; tensor cores are irrelevant at M=1, and activation quantization error disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the GEMV epilogue; slot-indexed weighted sum skips (never multiplies) unwritten non-local slots.

Dense path retained for num_tokens > 8 (prefill) and via XSERV_DENSE_MOE=1 for A/B.

dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms. Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms, the first config where xserv wins decode outright. GSM8K-100: 96% (dense FP8: 91%).

llama TP=1 (2.9 ms) remains ahead: next levers are decode CUDA graphs, non-expert quantization, sparse prefill (docs/20).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
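The difference between the dense and sparse paths described in the commit message can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the code in csrc/moe/moe_sparse.cu: the function names (`gemv`, `moe_dense`, `moe_sparse`, `use_dense_path`, `make_experts`), the `first_local` ownership convention, and the toy dimensions are all hypothetical, and bias, scales, and quantization are left out. It shows that skipping non-selected and non-local experts yields the same weighted sum while reading far fewer expert weight matrices.

```rust
// CPU reference sketch; every name here is illustrative, not the xserv API.

/// y = W * x for one expert, with W stored row-major as [out][in].
fn gemv(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// y += a * v
fn axpy(y: &mut [f32], a: f32, v: &[f32]) {
    for (acc, val) in y.iter_mut().zip(v) {
        *acc += a * *val;
    }
}

/// Dense path: every local expert's weights are read, even when its routing
/// weight is zero. Returns the output and the number of experts read.
fn moe_dense(x: &[f32], experts: &[Vec<Vec<f32>>], route_w: &[f32]) -> (Vec<f32>, usize) {
    let mut y = vec![0.0f32; experts[0].len()];
    let mut experts_read = 0;
    for (w, &rw) in experts.iter().zip(route_w) {
        let e_out = gemv(w, x);
        experts_read += 1;
        axpy(&mut y, rw, &e_out);
    }
    (y, experts_read)
}

/// Sparse path: only experts named in `topk_ids` that this rank owns are read.
/// `first_local` is the (assumed) global id of this rank's first expert.
fn moe_sparse(
    x: &[f32],
    local_experts: &[Vec<Vec<f32>>],
    first_local: usize,
    topk_ids: &[usize],
    topk_w: &[f32],
) -> (Vec<f32>, usize) {
    let mut y = vec![0.0f32; local_experts[0].len()];
    let mut experts_read = 0;
    for (&id, &rw) in topk_ids.iter().zip(topk_w) {
        // Stand-in for the kernel's early return on experts other ranks own.
        if id < first_local || id >= first_local + local_experts.len() {
            continue;
        }
        let e_out = gemv(&local_experts[id - first_local], x);
        experts_read += 1;
        axpy(&mut y, rw, &e_out);
    }
    (y, experts_read)
}

/// Dispatch rule as stated in the commit message: dense for num_tokens > 8
/// or when the A/B override is set.
fn use_dense_path(num_tokens: usize, force_dense: bool) -> bool {
    force_dense || num_tokens > 8
}

/// Deterministic toy weights for the demo.
fn make_experts(n: usize, out_dim: usize, in_dim: usize) -> Vec<Vec<Vec<f32>>> {
    (0..n)
        .map(|e| {
            (0..out_dim)
                .map(|r| (0..in_dim).map(|c| (e * 7 + r * 3 + c) as f32 * 0.01).collect())
                .collect()
        })
        .collect()
}

fn main() {
    let x = vec![1.0f32, 2.0, 3.0, 4.0];
    let experts = make_experts(16, 2, 4); // 16 local experts on this rank
    let (topk_ids, topk_w) = ([3usize, 20, 9, 27], [0.4f32, 0.3, 0.2, 0.1]);

    // Dense routing weights: zero everywhere except the locally owned picks.
    let mut route_w = vec![0.0f32; 16];
    route_w[3] = 0.4;
    route_w[9] = 0.2;

    let force_dense = std::env::var("XSERV_DENSE_MOE").map(|v| v == "1").unwrap_or(false);
    println!("dense path for 1 token: {}", use_dense_path(1, force_dense));

    let (yd, read_dense) = moe_dense(&x, &experts, &route_w);
    let (ys, read_sparse) = moe_sparse(&x, &experts, 0, &topk_ids, &topk_w);
    println!("dense  read {read_dense} experts -> {yd:?}");
    println!("sparse read {read_sparse} experts -> {ys:?}");
}
```

In this toy routing, two of the four selected experts happen to be local, so the sparse path reads 2 expert matrices where the dense path reads 16. That matches the ~8x figure in the commit message only under the assumption that roughly half of each token's selected experts land on each rank.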
42 lines
1.6 KiB
Rust
use std::env;

fn main() {
    let cuda_path = env::var("CUDA_HOME")
        .or_else(|_| env::var("CUDA_PATH"))
        .unwrap_or_else(|_| "/usr/local/cuda".to_string());

    println!("cargo:rustc-link-search=native={cuda_path}/lib64");
    println!("cargo:rustc-link-lib=dylib=cudart");
    println!("cargo:rustc-link-lib=dylib=cublas");
    println!("cargo:rustc-link-lib=dylib=cublasLt");

    cc::Build::new()
        .cuda(true)
        .cudart("shared")
        .flag("-gencode=arch=compute_120,code=sm_120")
        .include("../../csrc")
        .file("../../csrc/gemm/naive.cu")
        .file("../../csrc/gemm/tiled.cu")
        .file("../../csrc/gemm/gemv.cu")
        .file("../../csrc/normalization/rmsnorm.cu")
        .file("../../csrc/normalization/layernorm.cu")
        .file("../../csrc/activation/activations.cu")
        .file("../../csrc/reduce/softmax.cu")
        .file("../../csrc/reduce/argmax.cu")
        .file("../../csrc/embedding/embedding.cu")
        .file("../../csrc/embedding/rope.cu")
        .file("../../csrc/attention/causal_mask.cu")
        .file("../../csrc/embedding/transpose.cu")
        .file("../../csrc/attention/flash_attention.cu")
        .file("../../csrc/attention/paged_attention.cu")
        .file("../../csrc/attention/reshape_and_cache.cu")
        .file("../../csrc/moe/moe_kernels.cu")
        .file("../../csrc/moe/moe_sparse.cu")
        .file("../../csrc/quantization/dequant_fp8.cu")
        .file("../../csrc/quantization/quantize_fp8.cu")
        .file("../../csrc/quantization/mxfp4_gemm.cu")
        .compile("xserv_kernels");

    println!("cargo:rerun-if-changed=../../csrc/");
}
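The CUDA root lookup in the build script above (CUDA_HOME first, then CUDA_PATH, then /usr/local/cuda) can be pulled out into a pure function so the fallback order is easy to check without touching the environment. This is a minimal sketch; `resolve_cuda_root` and `link_search_line` are hypothetical helper names, not functions in the repository.

```rust
/// Mirrors the build script's fallback order: CUDA_HOME, then CUDA_PATH,
/// then the conventional default install location.
fn resolve_cuda_root(cuda_home: Option<&str>, cuda_path: Option<&str>) -> String {
    cuda_home
        .or(cuda_path)
        .unwrap_or("/usr/local/cuda")
        .to_string()
}

/// Builds the Cargo directive the build script prints for the library search path.
fn link_search_line(root: &str) -> String {
    format!("cargo:rustc-link-search=native={root}/lib64")
}

fn main() {
    // Read the real environment, as the build script does.
    let home = std::env::var("CUDA_HOME").ok();
    let path = std::env::var("CUDA_PATH").ok();
    let root = resolve_cuda_root(home.as_deref(), path.as_deref());
    println!("{}", link_search_line(&root));
}
```

Like the original, this treats a variable that is set but empty as a valid value; whether that case should instead fall through to the next candidate is a design choice the build script leaves open.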