Files

Gahow Wang fb20178992 moe: sparse top-k decode — compute only routed experts (1.8x, beats llama TP=2)

Dense MoE replicated x across all 16 local experts and ran the full
batched GEMM, reading every expert's weights per token; the weighted
sum then discarded 12 of 16 results. Decode is memory-bound, so this
was ~8x wasted expert bytes — the entire decode gap vs llama.cpp.

New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read
topk_ids on-device (no host sync) and early-return block-uniformly
for experts other ranks own. FP8 runs W8A16 (activations stay BF16 —
tensor cores are irrelevant at M=1, and activation quantization error
disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the
GEMV epilogue; slot-indexed weighted sum skips (never multiplies)
unwritten non-local slots. Dense path retained for num_tokens > 8
(prefill) and via XSERV_DENSE_MOE=1 for A/B.

dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms.
Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms —
first config where xserv wins decode outright. GSM8K-100: 96% (dense
FP8: 91%). llama TP=1 (2.9 ms) remains ahead: next levers are decode
CUDA graphs, non-expert quantization, sparse prefill (docs/20).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 16:29:10 +08:00

4.6 KiB

Raw Blame History

Sparse MoE decode — 1.8× over dense; beats llama.cpp at TP=2 (gpt-oss-20b, RTX 5090)

Phase 20 (docs/20-sparse-moe.md): decode computes only the routed top-4 experts via fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) instead of the dense all-local-expert batched GEMM. FP8 weights run W8A16 (weights FP8, activations BF16 — decode is memory-bound, tensor cores irrelevant at M=1); MXFP4 runs W4A16. Dense path retained for prefill / num_tokens > 8 and via XSERV_DENSE_MOE=1 for A/B.

In-process decode (bench-gpt-oss, greedy, 96 tokens)

config	TPOT	tok/s
dense FP8 TP=2 (baseline)	13.9 ms	72
sparse FP8 TP=2	7.6 ms	132
sparse MXFP4 TP=2	8.4 ms	118
sparse FP8 TP=1 (one 5090)	7.8 ms	128
sparse MXFP4 TP=1	8.9 ms	113

Sparse FP8 = 1.8× over dense. Greedy output stays coherent.
TP=1 ≈ TP=2: expert reads are now so small that PCIe all-reduce eats the TP gain — single-GPU serving becomes the attractive deployment.
MXFP4 reads half the bytes of FP8 but stays slower: the 4-bit dequant GEMV has lower effective bandwidth (same fixed inefficiency seen in the dense MXFP4 experiments); at sparse sizes both are partly launch/latency-bound.

Head-to-head vs llama.cpp (tools/xserv_vs_llama.py, warm servers, TP=2, GPUs 0-1, 6 reps, 256 tok)

prompt	metric	xserv sparse FP8	llama MXFP4	xserv vs llama
short	TTFT	35.3 ms	62.7 ms	1.78× faster
short	TPOT	7.32 ms	8.42 ms	1.15× faster
medium	TTFT	49.4 ms	65.0 ms	1.32× faster
medium	TPOT	7.19 ms	7.54 ms	1.05× faster
medium	tok/s	139.1	132.7
long (1.6k)	TTFT	94.1 ms	44.7 ms	0.48× (llama wins)
long	TPOT	7.25 ms	7.64 ms	1.05× faster

Decode TPOT now beats llama.cpp at every prompt length (was 2× slower: 13.1 vs 6.6 ms before sparse). Remaining loss: long-prompt TTFT — prefill is still the dense all-expert GEMM; sparse/grouped prefill is the next phase.

TP=1 head-to-head (single 5090; server now routes gpt-oss tp=1 to the TP engine)

prompt	metric	xserv sparse FP8	llama MXFP4
short	TTFT / TPOT	42.8 ms / 7.00 ms	34.5 ms / 3.22 ms
medium	TTFT / TPOT	57.1 ms / 7.19 ms	37.3 ms / 2.89 ms
long	TTFT / TPOT	119.6 ms / 7.20 ms	27.8 ms / 2.88 ms
	tok/s	139–143	311–347

Single-GPU is llama.cpp's sweet spot and it wins 2.2–2.5×. Two structural reasons, both instructive:

llama TP=2 (7.5–8.4 ms) is much WORSE than its TP=1 (2.9 ms): its PCIe cross-GPU split costs ~5 ms/token. xserv's NCCL all-reduce is cheap enough that TP=2 ≈ TP=1 (7.2 vs 7.0 ms) — but xserv's single-GPU floor is high.
xserv TP=1 reads ~4.7 GB/token (experts FP8 2.4 GB + non-expert weights still BF16 ~2.3 GB, half of that the 201k-vocab lm_head) ≈ 3.1 ms of pure HBM time; the other ~4 ms is launch overhead (~200 kernels/token, no CUDA graphs) + BF16 GEMV efficiency. llama reads ~1.3 GB (everything MXFP4) and replays the whole token as one CUDA graph.

Correctness

Greedy generations coherent across prompts (FP8/MXFP4, TP=1/2).
Sparse FP8 is W8A16 vs dense W8A8 — activations are no longer quantized, so tokens are not expected to be byte-identical to dense; quality is checked by GSM8K instead.
GSM8K-100 (greedy, TP=2, tools/eval_gsm8k_fast.py): 96/100 = 96.0% vs dense FP8 91.0% / BF16 90.0% — no regression (within greedy-nondeterminism noise; W8A16 removes activation-quantization error so ≥ dense is expected). Avg 1.3 s/problem also reflects the decode speedup.

Remaining gaps / next levers (to catch llama TP=1 at 2.9 ms)

Sparse MoE removed the dominant cost; the residual ~7 ms splits roughly into ~3 ms HBM reads and ~4 ms fixed overhead. In impact order:

CUDA graphs for decode (~2–4 ms): with experts down to ~1–2 ms, the ~200 un-graphed launches/token are now the single largest cost. (The old "graphs ≈ useless" conclusion was relative to a 13 ms dense TPOT — no longer true.)
Quantize non-expert weights (~1–1.5 ms): attn qkv/o + the 1.16 GB BF16 lm_head read every token; FP8/MXFP4 them like llama quantizes everything.
Sparse prefill (permute tokens by expert + grouped GEMM): long-prompt TTFT 94–120 ms → llama's ~30 ms territory.
W4A4 FP4 tensor cores / bandwidth-tuned MXFP4 GEMV: make 4-bit experts actually beat FP8 (today sparse MXFP4 is 8.4 ms vs FP8 7.6 ms — the 4-bit GEMV's lower effective bandwidth still cancels its byte advantage).

4.6 KiB Raw Blame History Unescape Escape