Dense MoE replicated x across all 16 local experts and ran the full batched GEMM, reading every expert's weights per token; the weighted sum then discarded 12 of 16 results. Decode is memory-bound, so this was ~8× wasted expert bytes: the entire decode gap vs llama.cpp.

New fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) read topk_ids on-device (no host sync) and early-return block-uniformly for experts other ranks own. FP8 runs W8A16 (activations stay BF16: tensor cores are irrelevant at M=1, and activation quantization error disappears); MXFP4 runs W4A16. Per-expert bias + scale fused into the GEMV epilogue; slot-indexed weighted sum skips (never multiplies) unwritten non-local slots. Dense path retained for num_tokens > 8 (prefill) and via XSERV_DENSE_MOE=1 for A/B.

dash5 (RTX 5090), gpt-oss-20b FP8, TP=2: decode TPOT 13.9 -> 7.6 ms. Warm-server vs llama.cpp MXFP4 TP=2: TPOT 7.19-7.32 vs 7.54-8.42 ms, the first config where xserv wins decode outright. GSM8K-100: 96% (dense FP8: 91%). llama TP=1 (2.9 ms) remains ahead: next levers are decode CUDA graphs, non-expert quantization, sparse prefill (docs/20).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sparse MoE decode — 1.8× over dense; beats llama.cpp at TP=2 (gpt-oss-20b, RTX 5090)
Phase 20 (docs/20-sparse-moe.md): decode computes only the routed top-4
experts via fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) instead of
the dense all-local-expert batched GEMM. FP8 weights run W8A16 (weights FP8,
activations BF16 — decode is memory-bound, tensor cores irrelevant at M=1);
MXFP4 runs W4A16. Dense path retained for prefill / num_tokens > 8 and via
XSERV_DENSE_MOE=1 for A/B.
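To make the dense vs sparse difference concrete, here is a minimal pure-Python reference sketch. It is not the kernel in csrc/moe/moe_sparse.cu: each expert is reduced to a single toy GEMV, `expert_forward`, `dense_moe` and `sparse_moe` are hypothetical names, and 32 total experts is an assumption derived from 16 local experts at TP=2. It only illustrates that skipping unrouted and non-local experts leaves the all-reduced output unchanged while cutting expert reads.

```python
import random

def expert_forward(w, x):
    """Toy expert: one GEMV, y = W @ x (a real expert is a gated MLP)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def dense_moe(weights, x, topk_ids, topk_w, local_experts):
    """Dense path: run every local expert, weight unrouted ones by zero."""
    gate = dict(zip(topk_ids, topk_w))
    out, reads = [0.0] * len(x), 0
    for e in local_experts:
        y = expert_forward(weights[e], x)  # weights are read even if unrouted
        reads += 1
        g = gate.get(e, 0.0)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, reads

def sparse_moe(weights, x, topk_ids, topk_w, local_experts):
    """Sparse path: touch only routed experts that this rank owns."""
    local = set(local_experts)
    out, reads = [0.0] * len(x), 0
    for e, g in zip(topk_ids, topk_w):
        if e not in local:
            continue  # slot owned by another rank: never computed, never multiplied
        y = expert_forward(weights[e], x)
        reads += 1
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, reads

random.seed(0)
NUM_EXPERTS, TOP_K, HIDDEN = 32, 4, 8  # 32 = 16 local experts x TP=2 (assumed)
weights = [[[random.uniform(-1, 1) for _ in range(HIDDEN)]
            for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]
x = [random.uniform(-1, 1) for _ in range(HIDDEN)]
topk_ids = random.sample(range(NUM_EXPERTS), TOP_K)
topk_w = [0.4, 0.3, 0.2, 0.1]
ranks = [range(0, 16), range(16, 32)]

dense_parts = [dense_moe(weights, x, topk_ids, topk_w, r) for r in ranks]
sparse_parts = [sparse_moe(weights, x, topk_ids, topk_w, r) for r in ranks]

# Stand-in for the all-reduce: sum the per-rank partial outputs.
dense_out = [a + b for a, b in zip(dense_parts[0][0], dense_parts[1][0])]
sparse_out = [a + b for a, b in zip(sparse_parts[0][0], sparse_parts[1][0])]
dense_reads = sum(p[1] for p in dense_parts)
sparse_reads = sum(p[1] for p in sparse_parts)
max_diff = max(abs(a - b) for a, b in zip(dense_out, sparse_out))
print(dense_reads, sparse_reads, max_diff)
```

Across both ranks the dense path reads 32 expert weight sets per token and the sparse path reads 4, which is the ~8× byte saving the commit message refers to.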
In-process decode (bench-gpt-oss, greedy, 96 tokens)
| config | TPOT | tok/s |
|---|---|---|
| dense FP8 TP=2 (baseline) | 13.9 ms | 72 |
| sparse FP8 TP=2 | 7.6 ms | 132 |
| sparse MXFP4 TP=2 | 8.4 ms | 118 |
| sparse FP8 TP=1 (one 5090) | 7.8 ms | 128 |
| sparse MXFP4 TP=1 | 8.9 ms | 113 |
- Sparse FP8 = 1.8× over dense. Greedy output stays coherent.
- TP=1 ≈ TP=2: expert reads are now so small that PCIe all-reduce eats the TP gain — single-GPU serving becomes the attractive deployment.
- MXFP4 reads half the bytes of FP8 but stays slower: the 4-bit dequant GEMV has lower effective bandwidth (same fixed inefficiency seen in the dense MXFP4 experiments); at sparse sizes both are partly launch/latency-bound.
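The tok/s column and the 1.8× figure follow directly from TPOT. A small sketch of that arithmetic, using only values from the table above (`tok_per_s` is a hypothetical helper name):

```python
def tok_per_s(tpot_ms):
    """Single-stream decode throughput implied by time per output token."""
    return 1000.0 / tpot_ms

# TPOT values copied from the in-process decode table.
tpot_ms = {
    "dense FP8 TP=2": 13.9,
    "sparse FP8 TP=2": 7.6,
    "sparse FP8 TP=1": 7.8,
}
for name, t in tpot_ms.items():
    print(f"{name}: {tok_per_s(t):.0f} tok/s")

speedup = tpot_ms["dense FP8 TP=2"] / tpot_ms["sparse FP8 TP=2"]
print(f"sparse over dense: {speedup:.1f}x")
```

Because the table's TPOT values are rounded to 0.1 ms, recomputing tok/s this way can land 1 tok/s away from the reported figure for some rows.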
Head-to-head vs llama.cpp (tools/xserv_vs_llama.py, warm servers, TP=2, GPUs 0-1, 6 reps, 256 tok)
| prompt | metric | xserv sparse FP8 | llama MXFP4 | xserv vs llama |
|---|---|---|---|---|
| short | TTFT | 35.3 ms | 62.7 ms | 1.78× faster |
| short | TPOT | 7.32 ms | 8.42 ms | 1.15× faster |
| medium | TTFT | 49.4 ms | 65.0 ms | 1.32× faster |
| medium | TPOT | 7.19 ms | 7.54 ms | 1.05× faster |
| medium | tok/s | 139.1 | 132.7 | 1.05× faster |
| long (1.6k) | TTFT | 94.1 ms | 44.7 ms | 0.48× (llama wins) |
| long | TPOT | 7.25 ms | 7.64 ms | 1.05× faster |
Decode TPOT now beats llama.cpp at every prompt length (was 2× slower: 13.1 vs 6.6 ms before sparse). Remaining loss: long-prompt TTFT — prefill is still the dense all-expert GEMM; sparse/grouped prefill is the next phase.
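The "xserv vs llama" column for latency metrics appears to be the ratio llama / xserv, so values above 1 mean xserv is faster and the 0.48× long-prompt TTFT row is a loss. A short sketch that recomputes the column from the table's own numbers (the helper name `xserv_vs_llama` is hypothetical and not taken from tools/xserv_vs_llama.py):

```python
def xserv_vs_llama(xserv_ms, llama_ms):
    """Latency ratio: > 1 means xserv is faster, < 1 means llama.cpp is faster."""
    return llama_ms / xserv_ms

# (prompt, metric, xserv ms, llama ms) copied from the TP=2 head-to-head table.
rows = [
    ("short", "TTFT", 35.3, 62.7),
    ("short", "TPOT", 7.32, 8.42),
    ("medium", "TTFT", 49.4, 65.0),
    ("medium", "TPOT", 7.19, 7.54),
    ("long", "TTFT", 94.1, 44.7),
    ("long", "TPOT", 7.25, 7.64),
]
ratios = {(p, m): round(xserv_vs_llama(x, l), 2) for p, m, x, l in rows}
for key, r in ratios.items():
    print(key, f"{r:.2f}x")
```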
TP=1 head-to-head (single 5090; server now routes gpt-oss tp=1 to the TP engine)
| prompt | metric | xserv sparse FP8 | llama MXFP4 |
|---|---|---|---|
| short | TTFT / TPOT | 42.8 ms / 7.00 ms | 34.5 ms / 3.22 ms |
| medium | TTFT / TPOT | 57.1 ms / 7.19 ms | 37.3 ms / 2.89 ms |
| long | TTFT / TPOT | 119.6 ms / 7.20 ms | 27.8 ms / 2.88 ms |
| all | tok/s | 139–143 | 311–347 |
Single-GPU is llama.cpp's sweet spot and it wins 2.2–2.5×. Two structural reasons, both instructive:
- llama TP=2 (7.5–8.4 ms) is much WORSE than its TP=1 (2.9 ms): its PCIe cross-GPU split costs ~5 ms/token. xserv's NCCL all-reduce is cheap enough that TP=2 ≈ TP=1 (7.2 vs 7.0 ms) — but xserv's single-GPU floor is high.
- xserv TP=1 reads ~4.7 GB/token (experts FP8 2.4 GB + non-expert weights still BF16 ~2.3 GB, half of that the 201k-vocab lm_head) ≈ 3.1 ms of pure HBM time; the other ~4 ms is launch overhead (~200 kernels/token, no CUDA graphs) + BF16 GEMV efficiency. llama reads ~1.3 GB (everything MXFP4) and replays the whole token as one CUDA graph.
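The byte budget in the second bullet can be checked with a back-of-the-envelope roofline. The sketch below makes several assumptions: an effective bandwidth of about 1.5 TB/s (backed out of the stated 4.7 GB ≈ 3.1 ms, not a measured figure), a vocabulary of 201,088 and a hidden size of 2880 for the lm_head estimate, and the measured server TPOT of 7.0 ms.

```python
def hbm_time_ms(gigabytes, bandwidth_gb_per_s):
    """Lower bound on per-token time if decode only streams weights from HBM."""
    return gigabytes / bandwidth_gb_per_s * 1000.0

EFFECTIVE_BW_GB_PER_S = 1500.0  # assumption: implied by 4.7 GB ~ 3.1 ms

xserv_gb = 2.4 + 2.3            # FP8 experts + BF16 non-expert weights
llama_gb = 1.3                  # everything MXFP4

xserv_hbm_ms = hbm_time_ms(xserv_gb, EFFECTIVE_BW_GB_PER_S)
llama_hbm_ms = hbm_time_ms(llama_gb, EFFECTIVE_BW_GB_PER_S)

# lm_head in BF16: vocab x hidden x 2 bytes (vocab and hidden are assumptions).
lm_head_gb = 201_088 * 2880 * 2 / 1e9

measured_tpot_ms = 7.0
overhead_ms = measured_tpot_ms - xserv_hbm_ms   # launch + GEMV inefficiency

print(f"xserv HBM floor  {xserv_hbm_ms:.1f} ms")
print(f"llama HBM floor  {llama_hbm_ms:.1f} ms")
print(f"lm_head size     {lm_head_gb:.2f} GB")
print(f"xserv overhead   {overhead_ms:.1f} ms")
```

Under these assumptions roughly 3.9 ms of the 7.0 ms TPOT is not explained by weight reads, which matches the "other ~4 ms" attributed to launch overhead and BF16 GEMV efficiency.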
Correctness
- Greedy generations coherent across prompts (FP8/MXFP4, TP=1/2).
- Sparse FP8 is W8A16 vs dense W8A8 — activations are no longer quantized, so tokens are not expected to be byte-identical to dense; quality is checked by GSM8K instead.
- GSM8K-100 (greedy, TP=2, tools/eval_gsm8k_fast.py): 96/100 = 96.0% vs dense FP8 91.0% / BF16 90.0%. No regression: the difference is within greedy-nondeterminism noise, and because W8A16 removes activation-quantization error, a score ≥ dense is expected. Avg 1.3 s/problem also reflects the decode speedup.
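A rough way to sanity-check the "within noise" reading is the binomial standard error on a 100-problem sample. This sketch treats the sparse and dense runs as independent samples, which is a simplifying assumption: both runs use the same problems, so a paired comparison would be the more careful test. `accuracy_std_err` is a hypothetical helper.

```python
import math

def accuracy_std_err(correct, total):
    """Binomial standard error of an accuracy estimate."""
    p = correct / total
    return math.sqrt(p * (1.0 - p) / total)

sparse_se = accuracy_std_err(96, 100)   # sparse FP8 (W8A16)
dense_se = accuracy_std_err(91, 100)    # dense FP8 (W8A8), 91.0% taken as 91/100

diff = 96 / 100 - 91 / 100
diff_se = math.sqrt(sparse_se ** 2 + dense_se ** 2)
z = diff / diff_se

print(f"sparse 96.0% +/- {100 * sparse_se:.1f} pts")
print(f"dense  91.0% +/- {100 * dense_se:.1f} pts")
print(f"difference is {z:.2f} standard errors")
```

The 5-point gap comes out at about 1.4 standard errors, so on 100 problems it supports "no regression" but is not strong evidence that sparse is more accurate.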
Remaining gaps / next levers (to catch llama TP=1 at 2.9 ms)
Sparse MoE removed the dominant cost; the residual ~7 ms splits roughly into ~3 ms HBM reads and ~4 ms fixed overhead. In impact order:
- CUDA graphs for decode (~2–4 ms): with experts down to ~1–2 ms, the ~200 un-graphed launches/token are now the single largest cost. (The old "graphs ≈ useless" conclusion was relative to a 13 ms dense TPOT — no longer true.)
- Quantize non-expert weights (~1–1.5 ms): attn qkv/o + the 1.16 GB BF16 lm_head are read every token; quantize them to FP8/MXFP4, as llama.cpp does for all of its weights.
- Sparse prefill (permute tokens by expert + grouped GEMM): long-prompt TTFT 94–120 ms → llama's ~30 ms territory.
- W4A4 FP4 tensor cores / bandwidth-tuned MXFP4 GEMV: make 4-bit experts actually beat FP8 (today sparse MXFP4 is 8.4 ms vs FP8 7.6 ms — the 4-bit GEMV's lower effective bandwidth still cancels its byte advantage).
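The "permute tokens by expert" step of the sparse prefill lever can be sketched as a counting sort over (token, slot) pairs. This is an illustrative pure-Python outline under assumed names (`group_tokens_by_expert`, `perm`, `offsets`), not the planned implementation; a grouped GEMM would then process rows `offsets[e]` to `offsets[e + 1]` of the permuted activations with expert `e`'s weights.

```python
def group_tokens_by_expert(topk_ids, num_experts):
    """Counting sort of (token, slot) pairs by expert id.

    topk_ids[t] lists the experts routed for token t.
    Returns (perm, offsets): perm[offsets[e]:offsets[e + 1]] holds the
    (token, slot) pairs assigned to expert e.
    """
    counts = [0] * num_experts
    for ids in topk_ids:
        for e in ids:
            counts[e] += 1

    # Exclusive prefix sum gives each expert's segment start.
    offsets = [0] * (num_experts + 1)
    for e in range(num_experts):
        offsets[e + 1] = offsets[e] + counts[e]

    cursor = offsets[:-1]            # slicing copies, so offsets stays intact
    perm = [None] * offsets[-1]
    for t, ids in enumerate(topk_ids):
        for slot, e in enumerate(ids):
            perm[cursor[e]] = (t, slot)
            cursor[e] += 1
    return perm, offsets

# Three tokens, top-2 routing over four experts (toy sizes).
perm, offsets = group_tokens_by_expert([[0, 2], [2, 1], [0, 1]], num_experts=4)
print(offsets)
print(perm)
```

Experts with no routed tokens (expert 3 here) get an empty segment, so their weights are never read during prefill.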