moe: sparse top-k decode — compute only routed experts (1.8x, beats llama TP=2)

Dense MoE replicated x across all 16 local experts and ran the full batched GEMM, reading every expert's weights per token; the weighted sum then discarded 12 of 16 results. Decode is memory-bound, so this was ~8x wasted expert bytes — the entire decode gap vs llama.cpp. New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read topk_ids on-device (no host sync) and early-return block-uniformly for experts other ranks own. FP8 runs W8A16 (activations stay BF16 — tensor cores are irrelevant at M=1, and activation quantization error disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the GEMV epilogue; slot-indexed weighted sum skips (never multiplies) unwritten non-local slots. Dense path retained for num_tokens > 8 (prefill) and via XSERV_DENSE_MOE=1 for A/B. dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms. Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms — first config where xserv wins decode outright. GSM8K-100: 96% (dense FP8: 91%). llama TP=1 (2.9 ms) remains ahead: next levers are decode CUDA graphs, non-expert quantization, sparse prefill (docs/20). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 16:29:10 +08:00
parent cf1e9e41db
commit fb20178992
6 changed files with 692 additions and 0 deletions
--- a/crates/xserv-kernels/build.rs
+++ b/crates/xserv-kernels/build.rs
@@ -31,6 +31,7 @@ fn main() {
        .file("../../csrc/attention/paged_attention.cu")
        .file("../../csrc/attention/reshape_and_cache.cu")
        .file("../../csrc/moe/moe_kernels.cu")
+        .file("../../csrc/moe/moe_sparse.cu")
        .file("../../csrc/quantization/dequant_fp8.cu")
        .file("../../csrc/quantization/quantize_fp8.cu")
        .file("../../csrc/quantization/mxfp4_gemm.cu")
--- a/crates/xserv-kernels/src/moe.rs
+++ b/crates/xserv-kernels/src/moe.rs
@@ -29,6 +29,29 @@ unsafe extern "C" {
        stream: *mut c_void,
    );

+    fn launch_moe_sparse_gemv_fp8_bf16(
+        x: *const c_void, w: *const c_void, w_scales: *const c_void,
+        bias: *const c_void, topk_ids: *const c_void, y: *mut c_void,
+        num_tokens: i32, n: i32, k: i32, top_k: i32,
+        expert_start: i32, local_experts: i32, x_per_slot: i32,
+        stream: *mut c_void,
+    );
+    fn launch_moe_sparse_gemv_mxfp4_bf16(
+        x: *const c_void, w_packed: *const c_void, w_scales: *const c_void,
+        bias: *const c_void, topk_ids: *const c_void, y: *mut c_void,
+        num_tokens: i32, n: i32, k: i32, top_k: i32,
+        expert_start: i32, local_experts: i32, x_per_slot: i32,
+        stream: *mut c_void,
+    );
+    fn launch_moe_weighted_sum_sparse_bf16(
+        down: *const c_void,
+        topk_ids: *const c_void, topk_weights: *const c_void,
+        out: *mut c_void,
+        num_tokens: i32, hidden: i32, top_k: i32,
+        expert_start: i32, local_experts: i32,
+        stream: *mut c_void,
+    );
+
    fn cublasGemmStridedBatchedEx(
        handle: CublasHandle,
        transa: i32, transb: i32,
@@ -158,6 +181,110 @@ pub fn moe_weighted_sum(
    out
 }

+/// Sparse MoE GEMV (FP8 W8A16): compute only the routed experts.
+///
+/// x:        [num_tokens, K] BF16 (x_per_slot=false, gate_up) or
+///           [num_tokens * top_k, K] BF16 (x_per_slot=true, down)
+/// w_fp8_t:  [local_experts, N, K] FP8E4M3 (transposed weight layout)
+/// w_scales: [local_experts] F32 per-expert scalar scales
+/// bias:     [local_experts, N] BF16 (fused into the epilogue)
+/// topk_ids: [num_tokens, top_k] i32 global expert ids (GPU)
+///
+/// Returns y [num_tokens, top_k, N] BF16. Slots routed to experts NOT
+/// owned by this rank are left UNWRITTEN (uninitialized memory) — the
+/// consumer must skip them (see moe_weighted_sum_sparse).
+#[allow(clippy::too_many_arguments)]
+pub fn moe_sparse_gemv_fp8(
+    x: &Tensor, w_fp8_t: &Tensor, w_scales: &Tensor, bias: &Tensor,
+    topk_ids: &Tensor, num_tokens: usize, top_k: usize,
+    expert_start: usize, local_experts: usize, x_per_slot: bool,
+) -> Tensor {
+    assert_eq!(x.dtype(), DType::BF16);
+    assert!(x.is_contiguous());
+    let n = w_fp8_t.shape()[1];
+    let k = w_fp8_t.shape()[2];
+    assert_eq!(x.shape()[x.ndim() - 1], k);
+    assert_eq!(x.shape()[0], if x_per_slot { num_tokens * top_k } else { num_tokens });
+
+    let y = Tensor::empty(&[num_tokens, top_k, n], DType::BF16, x.device());
+    unsafe {
+        launch_moe_sparse_gemv_fp8_bf16(
+            x.data_ptr() as *const c_void,
+            w_fp8_t.data_ptr() as *const c_void,
+            w_scales.data_ptr() as *const c_void,
+            bias.data_ptr() as *const c_void,
+            topk_ids.data_ptr() as *const c_void,
+            y.data_ptr() as *mut c_void,
+            num_tokens as i32, n as i32, k as i32, top_k as i32,
+            expert_start as i32, local_experts as i32, x_per_slot as i32,
+            std::ptr::null_mut(),
+        );
+    }
+    y
+}
+
+/// Sparse MoE GEMV (MXFP4 W4A16): same contract as moe_sparse_gemv_fp8,
+/// with packed 4-bit weights [E, N, K/2] + UE8M0 block scales [E, N, K/32].
+#[allow(clippy::too_many_arguments)]
+pub fn moe_sparse_gemv_mxfp4(
+    x: &Tensor, w_packed: &Tensor, w_scales: &Tensor, bias: &Tensor,
+    topk_ids: &Tensor, num_tokens: usize, top_k: usize, n: usize, k: usize,
+    expert_start: usize, local_experts: usize, x_per_slot: bool,
+) -> Tensor {
+    assert_eq!(x.dtype(), DType::BF16);
+    assert!(x.is_contiguous());
+    assert_eq!(x.shape()[x.ndim() - 1], k);
+    assert_eq!(x.shape()[0], if x_per_slot { num_tokens * top_k } else { num_tokens });
+
+    let y = Tensor::empty(&[num_tokens, top_k, n], DType::BF16, x.device());
+    unsafe {
+        launch_moe_sparse_gemv_mxfp4_bf16(
+            x.data_ptr() as *const c_void,
+            w_packed.data_ptr() as *const c_void,
+            w_scales.data_ptr() as *const c_void,
+            bias.data_ptr() as *const c_void,
+            topk_ids.data_ptr() as *const c_void,
+            y.data_ptr() as *mut c_void,
+            num_tokens as i32, n as i32, k as i32, top_k as i32,
+            expert_start as i32, local_experts as i32, x_per_slot as i32,
+            std::ptr::null_mut(),
+        );
+    }
+    y
+}
+
+/// Weighted sum over the slot axis of the sparse GEMV output.
+///
+/// down: [num_tokens, top_k, hidden] BF16 (non-local slots uninitialized
+/// and skipped, never multiplied by zero — NaN * 0 = NaN).
+pub fn moe_weighted_sum_sparse(
+    down: &Tensor,
+    topk_ids: &Tensor,
+    topk_weights: &Tensor,
+    expert_start: usize,
+    local_experts: usize,
+) -> Tensor {
+    assert_eq!(down.ndim(), 3);
+    assert_eq!(down.dtype(), DType::BF16);
+    let num_tokens = down.shape()[0];
+    let top_k = down.shape()[1];
+    let hidden = down.shape()[2];
+
+    let out = Tensor::empty(&[num_tokens, hidden], DType::BF16, down.device());
+    unsafe {
+        launch_moe_weighted_sum_sparse_bf16(
+            down.data_ptr() as *const c_void,
+            topk_ids.data_ptr() as *const c_void,
+            topk_weights.data_ptr() as *const c_void,
+            out.data_ptr() as *mut c_void,
+            num_tokens as i32, hidden as i32, top_k as i32,
+            expert_start as i32, local_experts as i32,
+            std::ptr::null_mut(),
+        );
+    }
+    out
+}
+
 /// Strided batched GEMM for MoE expert forward.
 /// C[b] = A[b] @ B[b]  for b in 0..batch
 ///
--- a/crates/xserv-model/src/gpt_oss.rs
+++ b/crates/xserv-model/src/gpt_oss.rs
@@ -549,6 +549,60 @@ impl GptOss {
            &router_logits, num_experts, top_k,
        );

+        // Sparse decode path: compute ONLY the routed experts. The dense path
+        // below reads every local expert's weights per forward; the sparse
+        // GEMVs read ~top_k/num_experts of the bytes, which dominates decode
+        // (memory-bound). Dense reads each weight once for ALL tokens, so it
+        // wins back at num_tokens ≈ local_experts / E[local hits] ≈ 8.
+        const SPARSE_MAX_TOKENS: usize = 8;
+        let quantized = layer.expert_gate_up_fp8.is_some() || layer.expert_gate_up_mxfp4.is_some();
+        if num_tokens <= SPARSE_MAX_TOKENS && quantized && !dense_moe_forced() {
+            let gate_up = if let Some((ref packed, ref scales)) = layer.expert_gate_up_mxfp4 {
+                let n = packed.shape()[1];
+                let k = packed.shape()[2] * 2;
+                xserv_kernels::moe::moe_sparse_gemv_mxfp4(
+                    x, packed, scales, &layer.expert_gate_up_bias, &topk_ids,
+                    num_tokens, top_k, n, k, expert_start, local_experts, false,
+                )
+            } else {
+                xserv_kernels::moe::moe_sparse_gemv_fp8(
+                    x, layer.expert_gate_up_fp8.as_ref().unwrap(),
+                    layer.expert_gate_up_scale.as_ref().unwrap(),
+                    &layer.expert_gate_up_bias, &topk_ids,
+                    num_tokens, top_k, expert_start, local_experts, false,
+                )
+            };
+
+            // GLU over all slots. Non-local slots hold unwritten memory; they
+            // are never consumed (the down GEMV and the weighted sum both skip
+            // slots whose expert this rank does not own).
+            let inter2 = gate_up.shape()[2];
+            let gate_up_flat = gate_up.reshape(&[num_tokens * top_k, inter2]);
+            let activated = gpt_oss_glu(&gate_up_flat, layer.glu_alpha, layer.glu_limit);
+
+            let down = if let Some((ref packed, ref scales)) = layer.expert_down_mxfp4 {
+                let n = packed.shape()[1];
+                let k = packed.shape()[2] * 2;
+                xserv_kernels::moe::moe_sparse_gemv_mxfp4(
+                    &activated, packed, scales, &layer.expert_down_bias, &topk_ids,
+                    num_tokens, top_k, n, k, expert_start, local_experts, true,
+                )
+            } else {
+                xserv_kernels::moe::moe_sparse_gemv_fp8(
+                    &activated, layer.expert_down_fp8.as_ref().unwrap(),
+                    layer.expert_down_scale.as_ref().unwrap(),
+                    &layer.expert_down_bias, &topk_ids,
+                    num_tokens, top_k, expert_start, local_experts, true,
+                )
+            };
+
+            let moe_out = xserv_kernels::moe::moe_weighted_sum_sparse(
+                &down, &topk_ids, &topk_weights, expert_start, local_experts,
+            );
+            self.all_reduce(&moe_out);
+            return moe_out;
+        }
+
        // 3. Replicate input: [tokens, hidden] → [local_experts, tokens, hidden]
        let x_rep = xserv_kernels::moe::moe_replicate(x, local_experts);

@@ -625,6 +679,12 @@ impl GptOss {

 // --- Helpers ---

+/// XSERV_DENSE_MOE=1 forces the dense all-expert path (A/B benchmarking).
+fn dense_moe_forced() -> bool {
+    static FORCED: std::sync::OnceLock<bool> = std::sync::OnceLock::new();
+    *FORCED.get_or_init(|| std::env::var("XSERV_DENSE_MOE").is_ok_and(|v| v != "0"))
+}
+
 fn matmul_2d(a: &Tensor, b: &Tensor) -> Tensor {
    assert_eq!(a.ndim(), 2);
    assert_eq!(b.ndim(), 2);
--- a/csrc/moe/moe_sparse.cu
+++ b/csrc/moe/moe_sparse.cu
@@ -0,0 +1,254 @@
+#include <cuda_bf16.h>
+#include <cuda_fp8.h>
+#include <cstdint>
+#include "../common.cuh"
+
+// ============================================================
+// Sparse MoE decode GEMVs — compute ONLY the routed experts.
+//
+// The dense path replicates x across all local experts and runs a
+// batched GEMM, reading every expert's weights per token. Decode is
+// memory-bound, so reading only the top-k routed experts' weights
+// (~2 of 16 local on average at TP=2) is a ~8x byte reduction.
+//
+// Each block handles one (token, slot) pair's tile of output columns.
+// It reads topk_ids[token, slot] from device memory (no host sync),
+// and exits early if the expert is not owned by this rank. The early
+// return is BLOCK-UNIFORM (every thread sees the same topk_ids value
+// and returns before the shared-memory staging + __syncthreads), so
+// it is safe — unlike the divergent-return bug fixed in gemv.cu.
+//
+// Outputs for non-local slots are NEVER written (uninitialized memory,
+// possibly NaN bit patterns). Downstream consumers must SKIP non-local
+// slots rather than multiply by zero (NaN * 0 = NaN).
+//
+// Per-expert weight scale and bias are fused into the epilogue:
+//   y[t, slot, n] = acc * w_scale[lid] + bias[lid, n]
+// which matches the dense path's GEMM -> moe_bias_add_3d sequence.
+//
+// Activation addressing (x_per_slot):
+//   gate_up: all slots of a token share x[token, :]        (x_per_slot=0)
+//   down:    each slot has its own activation row
+//            x[token * top_k + slot, :]                    (x_per_slot=1)
+// ============================================================
+
+#define SPARSE_TILE_N 8  // output columns per block (= warps per block)
+
+// Weights FP8 E4M3 [local_experts, N, K], activations BF16 (W8A16).
+// Decode is memory-bound (~2 FLOP/byte), so dequant-in-registers GEMV
+// loses nothing to tensor cores and skips activation quantization.
+__global__ void moe_sparse_gemv_fp8_bf16_kernel(
+    const __nv_bfloat16* __restrict__ x,     // [T, K] or [T*top_k, K]
+    const __nv_fp8_e4m3* __restrict__ w,     // [local_experts, N, K]
+    const float* __restrict__ w_scales,      // [local_experts]
+    const __nv_bfloat16* __restrict__ bias,  // [local_experts, N]
+    const int* __restrict__ topk_ids,        // [T, top_k] global expert ids
+    __nv_bfloat16* __restrict__ y,           // [T, top_k, N]
+    int N, int K, int top_k,
+    int expert_start, int local_experts,
+    int x_per_slot
+) {
+    int token = blockIdx.z;
+    int slot = blockIdx.y;
+    int eid = topk_ids[token * top_k + slot];
+    int lid = eid - expert_start;
+    if (lid < 0 || lid >= local_experts) return;  // block-uniform: safe
+
+    extern __shared__ float xs[];  // [K] activation row as float
+    const __nv_bfloat16* xrow =
+        x + (long long)(x_per_slot ? token * top_k + slot : token) * K;
+    for (int i = threadIdx.x; i < K; i += blockDim.x) {
+        xs[i] = __bfloat162float(xrow[i]);
+    }
+    __syncthreads();
+
+    int n = blockIdx.x * SPARSE_TILE_N + (threadIdx.x >> 5);
+    if (n >= N) return;  // after __syncthreads: safe
+    int lane = threadIdx.x & 31;
+
+    // One warp per output column; uint4 = 16 FP8 weights per lane, the
+    // warp covers 512 contiguous bytes per iteration (coalesced).
+    const uint8_t* wrow = (const uint8_t*)w + ((long long)lid * N + n) * K;
+    float acc = 0.0f;
+    for (int i = lane; i < (K >> 4); i += 32) {
+        uint4 packed = *(const uint4*)(wrow + (long long)i * 16);
+        const __nv_fp8_e4m3* pw = (const __nv_fp8_e4m3*)&packed;
+        const float* xk = xs + i * 16;
+        #pragma unroll
+        for (int j = 0; j < 16; j++) {
+            acc += xk[j] * float(pw[j]);
+        }
+    }
+
+    #pragma unroll
+    for (int o = 16; o > 0; o >>= 1) {
+        acc += __shfl_down_sync(0xffffffffu, acc, o);
+    }
+    if (lane == 0) {
+        float v = acc * w_scales[lid]
+                + __bfloat162float(bias[(long long)lid * N + n]);
+        y[((long long)token * top_k + slot) * N + n] = __float2bfloat16(v);
+    }
+}
+
+// MXFP4 W4A16 variant: packed E2M1 nibbles + per-32 UE8M0 block scale,
+// same structure as batched_gemv_mxfp4_bf16_kernel but expert-indexed
+// via topk_ids and with fused per-expert bias.
+#define MXFP4_BLOCK 32
+
+__device__ __constant__ float kSparseFp4Levels[8] =
+    {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};
+
+__device__ __forceinline__ float sparse_fp4_to_float(uint8_t code) {
+    float mag = kSparseFp4Levels[code & 0x7];
+    return (code & 0x8) ? -mag : mag;
+}
+
+__global__ void moe_sparse_gemv_mxfp4_bf16_kernel(
+    const __nv_bfloat16* __restrict__ x,     // [T, K] or [T*top_k, K]
+    const uint8_t* __restrict__ w_packed,    // [local_experts, N, K/2]
+    const uint8_t* __restrict__ w_scales,    // [local_experts, N, K/32]
+    const __nv_bfloat16* __restrict__ bias,  // [local_experts, N]
+    const int* __restrict__ topk_ids,        // [T, top_k]
+    __nv_bfloat16* __restrict__ y,           // [T, top_k, N]
+    int N, int K, int top_k,
+    int expert_start, int local_experts,
+    int x_per_slot
+) {
+    int token = blockIdx.z;
+    int slot = blockIdx.y;
+    int eid = topk_ids[token * top_k + slot];
+    int lid = eid - expert_start;
+    if (lid < 0 || lid >= local_experts) return;  // block-uniform: safe
+
+    extern __shared__ float xs[];
+    const __nv_bfloat16* xrow =
+        x + (long long)(x_per_slot ? token * top_k + slot : token) * K;
+    for (int i = threadIdx.x; i < K; i += blockDim.x) {
+        xs[i] = __bfloat162float(xrow[i]);
+    }
+    __syncthreads();
+
+    int n = blockIdx.x * SPARSE_TILE_N + (threadIdx.x >> 5);
+    if (n >= N) return;
+    int lane = threadIdx.x & 31;
+    int nblk = K / MXFP4_BLOCK;
+
+    const uint8_t* wp = w_packed + ((long long)lid * N + n) * (K >> 1);
+    const uint8_t* ws = w_scales + ((long long)lid * N + n) * nblk;
+
+    float acc = 0.0f;
+    for (int blk = lane; blk < nblk; blk += 32) {
+        float scale = exp2f((float)((int)ws[blk] - 127));
+        uint4 packed = *(const uint4*)(wp + (long long)blk * 16);  // 32 nibbles
+        const uint8_t* pb = (const uint8_t*)&packed;
+        const float* xk = xs + blk * MXFP4_BLOCK;
+        #pragma unroll
+        for (int i = 0; i < 16; i++) {
+            uint8_t b = pb[i];
+            acc += xk[2 * i]     * (sparse_fp4_to_float(b & 0xF) * scale);
+            acc += xk[2 * i + 1] * (sparse_fp4_to_float(b >> 4)  * scale);
+        }
+    }
+
+    #pragma unroll
+    for (int o = 16; o > 0; o >>= 1) {
+        acc += __shfl_down_sync(0xffffffffu, acc, o);
+    }
+    if (lane == 0) {
+        float v = acc + __bfloat162float(bias[(long long)lid * N + n]);
+        y[((long long)token * top_k + slot) * N + n] = __float2bfloat16(v);
+    }
+}
+
+// Weighted sum over the slot axis: out[t, d] = sum over local slots of
+// topk_weights[t, k] * down[t, k, d]. Non-local slots hold uninitialized
+// memory and are SKIPPED (not multiplied by zero).
+__global__ void moe_weighted_sum_sparse_bf16_kernel(
+    const __nv_bfloat16* __restrict__ down,  // [T, top_k, hidden]
+    const int* __restrict__ topk_ids,        // [T, top_k]
+    const float* __restrict__ topk_weights,  // [T, top_k]
+    __nv_bfloat16* __restrict__ out,         // [T, hidden]
+    int num_tokens, int hidden, int top_k,
+    int expert_start, int local_experts
+) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    int total = num_tokens * hidden;
+    if (idx >= total) return;
+
+    int token = idx / hidden;
+    int dim = idx % hidden;
+
+    float sum = 0.0f;
+    for (int k = 0; k < top_k; k++) {
+        int lid = topk_ids[token * top_k + k] - expert_start;
+        if (lid >= 0 && lid < local_experts) {
+            float w = topk_weights[token * top_k + k];
+            float v = __bfloat162float(
+                down[((long long)token * top_k + k) * hidden + dim]);
+            sum += w * v;
+        }
+    }
+    out[idx] = __float2bfloat16(sum);
+}
+
+extern "C" {
+
+void launch_moe_sparse_gemv_fp8_bf16(
+    const void* x, const void* w, const void* w_scales, const void* bias,
+    const void* topk_ids, void* y,
+    int num_tokens, int N, int K, int top_k,
+    int expert_start, int local_experts, int x_per_slot,
+    void* stream
+) {
+    dim3 grid((N + SPARSE_TILE_N - 1) / SPARSE_TILE_N, top_k, num_tokens);
+    int block = SPARSE_TILE_N * 32;
+    size_t smem = (size_t)K * sizeof(float);
+    moe_sparse_gemv_fp8_bf16_kernel<<<grid, block, smem, (cudaStream_t)stream>>>(
+        (const __nv_bfloat16*)x, (const __nv_fp8_e4m3*)w,
+        (const float*)w_scales, (const __nv_bfloat16*)bias,
+        (const int*)topk_ids, (__nv_bfloat16*)y,
+        N, K, top_k, expert_start, local_experts, x_per_slot
+    );
+    CUDA_CHECK_LAST_ERROR();
+}
+
+void launch_moe_sparse_gemv_mxfp4_bf16(
+    const void* x, const void* w_packed, const void* w_scales, const void* bias,
+    const void* topk_ids, void* y,
+    int num_tokens, int N, int K, int top_k,
+    int expert_start, int local_experts, int x_per_slot,
+    void* stream
+) {
+    dim3 grid((N + SPARSE_TILE_N - 1) / SPARSE_TILE_N, top_k, num_tokens);
+    int block = SPARSE_TILE_N * 32;
+    size_t smem = (size_t)K * sizeof(float);
+    moe_sparse_gemv_mxfp4_bf16_kernel<<<grid, block, smem, (cudaStream_t)stream>>>(
+        (const __nv_bfloat16*)x, (const uint8_t*)w_packed,
+        (const uint8_t*)w_scales, (const __nv_bfloat16*)bias,
+        (const int*)topk_ids, (__nv_bfloat16*)y,
+        N, K, top_k, expert_start, local_experts, x_per_slot
+    );
+    CUDA_CHECK_LAST_ERROR();
+}
+
+void launch_moe_weighted_sum_sparse_bf16(
+    const void* down, const void* topk_ids, const void* topk_weights,
+    void* out,
+    int num_tokens, int hidden, int top_k,
+    int expert_start, int local_experts,
+    void* stream
+) {
+    int total = num_tokens * hidden;
+    int block = 256;
+    int grid = (total + block - 1) / block;
+    moe_weighted_sum_sparse_bf16_kernel<<<grid, block, 0, (cudaStream_t)stream>>>(
+        (const __nv_bfloat16*)down,
+        (const int*)topk_ids, (const float*)topk_weights,
+        (__nv_bfloat16*)out,
+        num_tokens, hidden, top_k, expert_start, local_experts
+    );
+    CUDA_CHECK_LAST_ERROR();
+}
+
+}
--- a/docs/20-sparse-moe.md
+++ b/docs/20-sparse-moe.md
@@ -0,0 +1,160 @@
+# Phase 20: Sparse MoE Decode — 只算被路由到的专家
+
+> 目标:消除 dense MoE 的无效权重读取,decode TPOT 追上并超过 llama.cpp。
+> 前置:Phase 19(gpt-oss MoE 正确性)、FP8 W8A8 / MXFP4 W4A16 量化
+> (见 `docs/benchmarks/fp8-quantization.md`、`docs/benchmarks/mxfp4-and-llama-decode.md`)。
+
+## 1. 现状:dense MoE 在浪费什么
+
+gpt-oss-20b 是 32 专家 top-4 的 MoE:router 给每个 token 选 4 个专家,
+理论上每 token 只需要读 4/32 = 12.5% 的专家权重。但 `moe_forward`
+(`crates/xserv-model/src/gpt_oss.rs`)目前是 **dense** 实现:
+
+```text
+1. router GEMV            [T, 2880] → [T, 32]
+2. topk_softmax (GPU)     → topk_ids [T,4], topk_weights [T,4]
+3. moe_replicate          x 复制 16 份 → [16, T, 2880]        ← 浪费开始
+4. batched GEMM gate_up   全部 16 个本地专家都算                ← 读 16 份权重
+5. bias + GLU
+6. batched GEMM down      全部 16 个本地专家都算                ← 读 16 份权重
+7. bias
+8. moe_weighted_sum       只挑出 top-4 加权求和,其余 12 个全部丢弃
+9. all-reduce
+```
+
+为什么当初这么写:batched GEMM(cuBLAS strided-batched)要求规则的
+`[E, T, K]` 形状;top-4 的专家编号在 **GPU** 上(`topk_ids`),host 不知道
+该挑哪几个,挑了形状也不规则。dense 是"先把正确性做出来"的合理起点,
+但每 token 把 16 个专家的权重从 HBM 全部读一遍。
+
+### 字节账本(decode,每 token,TP=2 每卡 16 个本地专家)
+
+每层每专家:gate_up `[2880, 5760]` + down `[2880, 2880]` ≈ 24.9 M 参数。
+
+| 方案 | 每卡每 token 专家字节 | 相对量 |
+|---|---|---|
+| xserv dense FP8(现状) | 16 × 24.9 MB × 24 层 ≈ **9.6 GB** | 1× |
+| xserv sparse FP8(本阶段) | ~2 × 24.9 MB × 24 层 ≈ **1.2 GB** | 1/8 |
+| llama.cpp sparse MXFP4 | ~2 × 12.5 MB × 24 层 ≈ **0.6 GB** | 1/16 |
+
+(top-4 均匀散落在 2 张卡上,期望每卡 2 个命中;严格说每层取的是
+两卡命中数的 max,期望 ≈ 2.6,仍是 ~6-8× 的节省。)
+
+实测旁证:FP8 dense TP=2 TPOT 13.1 ms,其中专家 GEMM ≈ 9.6 GB ÷ ~1 TB/s
+≈ 9.5 ms,其余(attention、qkv/o、lm_head、48 次 PCIe all-reduce)≈ 3.5 ms。
+**专家权重读取占 TPOT 的 ~3/4,这就是与 llama.cpp(6.6 ms)的全部差距。**
+
+## 2. Roofline:M=1 时为什么"省字节 = 省时间"
+
+decode 的 GEMV(M=1)每读 1 字节 FP8 权重只做 2 FLOP(乘加)。
+RTX 5090:HBM ~1.8 TB/s,BF16 算力 ~210 TFLOPS —— 算强比(arithmetic
+intensity)需要 ~100 FLOP/byte 才能喂饱算力,GEMV 只有 2。结论:
+
+1. **decode 完全 memory-bound**,tensor core 帮不上忙 → 手写 W8A16 GEMV
+   (权重 FP8、激活保持 BF16)不会输给 cuBLASLt 的 W8A8 tensor-core GEMM,
+   还省掉激活量化 kernel,精度更好(激活不再有量化误差)。
+2. 优化只有一个方向:**少读字节**。sparse(×8)与 4-bit(×2)正交,
+   可叠加。本阶段先做 sparse,FP8 与 MXFP4 两种权重格式都支持。
+
+## 3. Sparse 设计:让 kernel 自己按 topk_ids 索引权重
+
+关键观察:`topk_ids` 本来就在 GPU 上。不需要 host 知道选了谁 ——
+**让 GEMV kernel 的每个 block 自己读 `topk_ids[token, slot]`,
+直接寻址到对应专家的权重**,不命中本卡就整块退出。零 host 同步,
+管线保持完全异步(这是之前排查过的:decode 循环无 per-layer sync)。
+
+新数据流(`num_tokens ≤ 8` 时启用):
+
+```text
+x [T, 2880]
+  ├─ router → topk_ids/weights [T, 4]               (不变)
+  ├─ sparse GEMV gate_up  → [T, 4, 5760]   bias 已融合,非本地 slot 不写
+  ├─ GLU                  → [T*4, 2880]
+  ├─ sparse GEMV down     → [T, 4, 2880]   bias 已融合,非本地 slot 不写
+  └─ weighted_sum_sparse  → [T, 2880]      只累加本地 slot
+all-reduce                                            (不变)
+```
+
+`moe_replicate` 和独立的 bias kernel 在 sparse 路径下消失;FP8 路径还省掉
+`quantize_bf16_to_fp8_rowwise`。
+
+### Kernel 设计(`csrc/moe/moe_sparse.cu`)
+
+`moe_sparse_gemv_{fp8,mxfp4}_bf16_kernel`:
+
+- **grid = (N/8, top_k, tokens)**,block = 8 warp × 32 lane。
+  每个 block 负责一个 (token, slot) 的 8 个输出列,**一个 warp 算一个输出**。
+- block 先读 `eid = topk_ids[token*top_k + slot]`,折算 `lid = eid - expert_start`;
+  不在 `[0, local_experts)` 就整块 return。
+- 命中的 block 把激活行(K=2880 个 BF16 → float)协作搬进 shared memory
+  (11.25 KB),`__syncthreads()`,然后每 warp 沿 K 维做点积:
+  每 lane 一次 `uint4` 读 16 字节权重(FP8 = 16 个权重,MXFP4 = 32 个 nibble),
+  warp 内 32 lane 连续 → 512B coalesced 事务。
+- epilogue(lane 0):`y = acc * w_scale[lid] + bias[lid, n]` —— per-expert
+  scale 和 bias 都融合在这里,与 dense 路径的"GEMM → bias add → 路由加权"
+  语义逐位等价(HF 参考实现也是先加 bias 再乘路由权重)。
+- gate_up 与 down 共用同一个 kernel,用 `x_per_slot` 区分激活寻址:
+  gate_up 时 4 个 slot 共享 `x[token]`;down 时各读自己的 `act[token*4+slot]`。
+
+### 两个容易写错的安全点
+
+1. **early-return 必须 block-uniform。** Phase 19 的 GEMV 垃圾输出 bug
+   (commit `3b9e32e`)正是"部分线程在 `__syncthreads()` 之前 return"导致
+   读未初始化 shared memory。这里的 return 发生在 smem 装载**之前**,且整个
+   block 基于同一个 `topk_ids` 值统一退出 —— 没有 divergence,合法且安全。
+2. **weighted-sum 对非本地 slot 必须"跳过",不能"乘 0"。** 非本地 slot 的
+   GEMV 输出从未被写入(未初始化显存,可能是 NaN 位型),GLU 也会在上面算出
+   垃圾。`NaN × 0 = NaN`,所以求和 kernel 用 `if (local) sum += w*v` 跳过,
+   垃圾永远不进入数据流(dense 路径的 `moe_weighted_sum` 同理)。
+
+## 4. 为什么 prefill 保持 dense
+
+dense batched GEMM 把 16 份权重读**一次**,服务全部 M 个 token;
+sparse GEMV 是**每 token** 重读自己的 ~2 份。字节交叉点:
+
+```text
+sparse 读 M × 2 份  vs  dense 读 16 份  →  M ≈ 8 (TP=2)
+```
+
+M > 8 后 dense 更省(且 GEMM 是 compute-bound,tensor core 开始有用)。
+所以 sparse 只在 `num_tokens ≤ 8` 启用 —— 覆盖 decode(连续批合并的
+多请求 decode 也是小 M)和极短的 re-prefill。真正的 sparse prefill
+(按专家对 token 做 permute/gather 的 grouped GEMM,vLLM 的做法)是
+后续阶段,主要收益在长 prompt TTFT。
+
+## 5. 实测结果(2026-06-12,完整数据见 `docs/benchmarks/sparse-moe.md`)
+
+In-process decode(bench-gpt-oss,greedy 96 tok):
+
+| | TPOT | tok/s |
+|---|---|---|
+| dense FP8 TP=2(基线) | 13.9 ms | 72 |
+| **sparse FP8 TP=2** | **7.6 ms(1.8×)** | **132** |
+| sparse MXFP4 TP=2 | 8.4 ms | 118 |
+| sparse FP8 TP=1(单卡) | 7.8 ms | 128 |
+
+Warm-server 对打 llama.cpp(`tools/xserv_vs_llama.py`):
+
+- **TP=2 vs TP=2:xserv 首次全面反超** —— TPOT 7.19-7.32 ms vs llama
+  7.54-8.42 ms;短/中 prompt TTFT 也领先(35/49 vs 63/65 ms)。
+- **TP=1 vs TP=1:llama 大胜**(2.88-3.22 ms vs 7.0-7.2 ms,347 vs 140
+  tok/s)。单卡才是 llama 的最优配置:它的跨卡 split 在 PCIe 上每 token
+  损失 ~5 ms,而单卡时它"全模型 4-bit + CUDA graph 整 token 回放"的
+  优势全部兑现。xserv 的残余 ~7 ms ≈ ~3 ms HBM(其中非专家权重还是
+  BF16,含 1.16 GB 的 lm_head)+ ~4 ms 启动开销(~200 个 kernel
+  launch/token,无 CUDA graph)。
+- **正确性:GSM8K-100 = 96%**(dense FP8 91% / BF16 90%,greedy 噪声内,
+  无回归)。
+
+教训:之前"CUDA graph ≈ 无用(~0.5-1.5ms)"的结论是相对 13 ms 的
+dense TPOT 而言;专家成本砍掉后,launch 开销变成了最大的单项。
+
+## 6. 下一阶段(按收益排序)
+
+1. **decode CUDA graph**(~2-4 ms):当前最大单项。
+2. **非专家权重量化**(~1-1.5 ms):qkv/o + lm_head 仍是 BF16,每 token
+   白读 ~2.3 GB;llama 是全模型 4-bit。
+3. **sparse prefill**(grouped GEMM):长 prompt TTFT 94-120 ms → llama
+   的 ~30 ms 量级。
+4. **W4A4 FP4 tensor core / 带宽调优的 MXFP4 GEMV**:让 4-bit 专家真正
+   快过 FP8(目前 8.4 vs 7.6 ms,GEMV 效率抵消了字节优势)。
--- a/docs/benchmarks/sparse-moe.md
+++ b/docs/benchmarks/sparse-moe.md
@@ -0,0 +1,90 @@
+# Sparse MoE decode — 1.8× over dense; beats llama.cpp at TP=2 (gpt-oss-20b, RTX 5090)
+
+Phase 20 (`docs/20-sparse-moe.md`): decode computes only the routed top-4
+experts via fused expert-indexed GEMVs (`csrc/moe/moe_sparse.cu`) instead of
+the dense all-local-expert batched GEMM. FP8 weights run W8A16 (weights FP8,
+activations BF16 — decode is memory-bound, tensor cores irrelevant at M=1);
+MXFP4 runs W4A16. Dense path retained for prefill / `num_tokens > 8` and via
+`XSERV_DENSE_MOE=1` for A/B.
+
+## In-process decode (bench-gpt-oss, greedy, 96 tokens)
+
+| config | TPOT | tok/s |
+|---|---|---|
+| dense FP8 TP=2 (baseline) | 13.9 ms | 72 |
+| **sparse FP8 TP=2** | **7.6 ms** | **132** |
+| sparse MXFP4 TP=2 | 8.4 ms | 118 |
+| sparse FP8 TP=1 (one 5090) | 7.8 ms | 128 |
+| sparse MXFP4 TP=1 | 8.9 ms | 113 |
+
+- Sparse FP8 = **1.8× over dense**. Greedy output stays coherent.
+- TP=1 ≈ TP=2: expert reads are now so small that PCIe all-reduce eats the
+  TP gain — single-GPU serving becomes the attractive deployment.
+- MXFP4 reads half the bytes of FP8 but stays slower: the 4-bit dequant GEMV
+  has lower effective bandwidth (same fixed inefficiency seen in the dense
+  MXFP4 experiments); at sparse sizes both are partly launch/latency-bound.
+
+## Head-to-head vs llama.cpp (tools/xserv_vs_llama.py, warm servers, TP=2, GPUs 0-1, 6 reps, 256 tok)
+
+| prompt | metric | xserv sparse FP8 | llama MXFP4 | xserv vs llama |
+|---|---|---|---|---|
+| short | TTFT | **35.3 ms** | 62.7 ms | 1.78× faster |
+| short | TPOT | **7.32 ms** | 8.42 ms | 1.15× faster |
+| medium | TTFT | **49.4 ms** | 65.0 ms | 1.32× faster |
+| medium | TPOT | **7.19 ms** | 7.54 ms | 1.05× faster |
+| medium | tok/s | **139.1** | 132.7 | |
+| long (1.6k) | TTFT | 94.1 ms | **44.7 ms** | 0.48× (llama wins) |
+| long | TPOT | **7.25 ms** | 7.64 ms | 1.05× faster |
+
+**Decode TPOT now beats llama.cpp at every prompt length** (was 2× slower:
+13.1 vs 6.6 ms before sparse). Remaining loss: long-prompt TTFT — prefill is
+still the dense all-expert GEMM; sparse/grouped prefill is the next phase.
+
+## TP=1 head-to-head (single 5090; server now routes gpt-oss tp=1 to the TP engine)
+
+| prompt | metric | xserv sparse FP8 | llama MXFP4 |
+|---|---|---|---|
+| short | TTFT / TPOT | 42.8 ms / 7.00 ms | **34.5 ms / 3.22 ms** |
+| medium | TTFT / TPOT | 57.1 ms / 7.19 ms | **37.3 ms / 2.89 ms** |
+| long | TTFT / TPOT | 119.6 ms / 7.20 ms | **27.8 ms / 2.88 ms** |
+| | tok/s | 139–143 | **311–347** |
+
+**Single-GPU is llama.cpp's sweet spot and it wins 2.2–2.5×.** Two structural
+reasons, both instructive:
+
+1. llama TP=2 (7.5–8.4 ms) is much WORSE than its TP=1 (2.9 ms): its PCIe
+   cross-GPU split costs ~5 ms/token. xserv's NCCL all-reduce is cheap enough
+   that TP=2 ≈ TP=1 (7.2 vs 7.0 ms) — but xserv's single-GPU floor is high.
+2. xserv TP=1 reads ~4.7 GB/token (experts FP8 2.4 GB + **non-expert weights
+   still BF16** ~2.3 GB, half of that the 201k-vocab lm_head) ≈ 3.1 ms of pure
+   HBM time; the other ~4 ms is launch overhead (~200 kernels/token, no CUDA
+   graphs) + BF16 GEMV efficiency. llama reads ~1.3 GB (everything MXFP4) and
+   replays the whole token as one CUDA graph.
+
+## Correctness
+
+- Greedy generations coherent across prompts (FP8/MXFP4, TP=1/2).
+- Sparse FP8 is W8A16 vs dense W8A8 — activations are no longer quantized, so
+  tokens are not expected to be byte-identical to dense; quality is checked by
+  GSM8K instead.
+- **GSM8K-100 (greedy, TP=2, `tools/eval_gsm8k_fast.py`): 96/100 = 96.0%** vs
+  dense FP8 91.0% / BF16 90.0% — no regression (within greedy-nondeterminism
+  noise; W8A16 removes activation-quantization error so ≥ dense is expected).
+  Avg 1.3 s/problem also reflects the decode speedup.
+
+## Remaining gaps / next levers (to catch llama TP=1 at 2.9 ms)
+
+Sparse MoE removed the dominant cost; the residual ~7 ms splits roughly into
+~3 ms HBM reads and ~4 ms fixed overhead. In impact order:
+
+1. **CUDA graphs for decode** (~2–4 ms): with experts down to ~1–2 ms, the
+   ~200 un-graphed launches/token are now the single largest cost. (The old
+   "graphs ≈ useless" conclusion was relative to a 13 ms dense TPOT — no
+   longer true.)
+2. **Quantize non-expert weights** (~1–1.5 ms): attn qkv/o + the 1.16 GB BF16
+   lm_head read every token; FP8/MXFP4 them like llama quantizes everything.
+3. **Sparse prefill** (permute tokens by expert + grouped GEMM): long-prompt
+   TTFT 94–120 ms → llama's ~30 ms territory.
+4. **W4A4 FP4 tensor cores / bandwidth-tuned MXFP4 GEMV**: make 4-bit experts
+   actually beat FP8 (today sparse MXFP4 is 8.4 ms vs FP8 7.6 ms — the 4-bit
+   GEMV's lower effective bandwidth still cancels its byte advantage).