
Gahow Wang 2a92f268a9 docs: fill the Phase 19 gap, refresh README/roadmap to actual state

- docs/19-gpt-oss-moe.md: the numbered series jumped 18->20; write up
  gpt-oss arch deltas, harmony pitfalls, and the two CUDA debugging
  postmortems (fully-masked-tile NaN in flash-attention sinks;
  pre-__syncthreads early return reading uninitialized smem in the
  decode GEMV) — the highest-value learning content of that phase.
- README: models/perf/capabilities were frozen at the Qwen3-only era;
  now lists gpt-oss MoE, TP/PP, FP8/MXFP4, sparse MoE, and the
  llama.cpp standing.
- Roadmap: record where reality diverged from the plan at Phase 18+,
  add milestone entries and the ranked next-phase candidates
  (21 CUDA-graph MoE decode, 22 non-expert quant, 23 sparse prefill).
- sparse-moe benchmark doc: post-review-fix numbers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-12 17:02:59 +08:00

4.9 KiB


Sparse MoE decode — 1.8× over dense; beats llama.cpp at TP=2 (gpt-oss-20b, RTX 5090)

Phase 20 (docs/20-sparse-moe.md): decode computes only the routed top-4 experts via fused expert-indexed GEMVs (csrc/moe/moe_sparse.cu) instead of the dense all-local-expert batched GEMM. FP8 weights run W8A16 (weights FP8, activations BF16; decode is memory-bound, so tensor cores are irrelevant at M=1); MXFP4 runs W4A16. The dense path is retained for prefill / num_tokens > 8, and can be selected with XSERV_DENSE_MOE=1 for A/B comparison.
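The difference between the two decode paths can be illustrated with a toy model. This is a minimal sketch, not the kernel in csrc/moe/moe_sparse.cu: each expert is reduced to a single matrix (a real expert is a full MLP), and the expert count of 32, the hidden size of 8, and the softmax-over-top-k gating are assumptions made for illustration. It only shows that skipping unrouted experts gives the same result while touching a fraction of the weights.

```python
import math
import random

NUM_EXPERTS = 32   # assumption: the expert count is not stated in this document
TOP_K = 4          # from the text: decode routes each token to its top-4 experts
HIDDEN = 8         # toy size, for illustration only

def gemv(weight, x):
    """y = W @ x for a row-major list-of-lists matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def route(scores, k):
    """Return the top-k expert ids (ascending) and softmax gates over those k scores."""
    best = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
    routed = sorted(best)
    peak = max(scores[e] for e in routed)
    exps = {e: math.exp(scores[e] - peak) for e in routed}
    total = sum(exps.values())
    return routed, {e: v / total for e, v in exps.items()}

def moe_dense(experts, gates, x):
    """Dense path: every expert's weights are read; unrouted experts get gate 0."""
    out = [0.0] * HIDDEN
    for e, weight in enumerate(experts):
        g = gates.get(e, 0.0)
        out = [o + g * y for o, y in zip(out, gemv(weight, x))]
    return out

def moe_sparse(experts, gates, routed, x):
    """Sparse path: index into the expert table and read only the routed weights."""
    out = [0.0] * HIDDEN
    for e in routed:
        out = [o + gates[e] * y for o, y in zip(out, gemv(experts[e], x))]
    return out

rng = random.Random(0)
experts = [[[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
           for _ in range(NUM_EXPERTS)]
x = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
scores = [rng.uniform(-1, 1) for _ in range(NUM_EXPERTS)]

routed, gates = route(scores, TOP_K)
dense_out = moe_dense(experts, gates, x)
sparse_out = moe_sparse(experts, gates, routed, x)
weight_fraction_read = len(routed) / NUM_EXPERTS
print(routed, weight_fraction_read)
```

Because decode is memory-bound, the fraction of expert weights read is a reasonable first-order proxy for the expert cost in this toy setting; the measured end-to-end gain is smaller because non-expert weights and launch overhead are unchanged.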

In-process decode (bench-gpt-oss, greedy, 96 tokens)

config	TPOT	tok/s
dense FP8 TP=2 (baseline)	13.9 ms	72
sparse FP8 TP=2	7.6 ms	132
sparse MXFP4 TP=2	8.4 ms	118
sparse FP8 TP=1 (one 5090)	7.8 ms	128
sparse MXFP4 TP=1	8.9 ms	113

Sparse FP8 is 1.8× faster than dense at TP=2 (13.9 → 7.6 ms TPOT). Greedy output stays coherent.
TP=1 ≈ TP=2 (7.8 vs 7.6 ms for FP8): expert reads are now so small that the all-reduce over PCIe eats the TP gain, so single-GPU serving becomes the attractive deployment.
MXFP4 reads half the bytes of FP8 but is still slower: the 4-bit dequant GEMV achieves lower effective bandwidth (the same fixed inefficiency seen in the dense MXFP4 experiments), and at sparse sizes both formats are partly launch/latency-bound.
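The tok/s column and the headline speedup follow directly from TPOT. A small sketch of that arithmetic, using the TPOT values from the table above (the helper names are illustrative, not part of bench-gpt-oss):

```python
def tokens_per_second(tpot_ms):
    """Decode throughput implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms

def speedup(baseline_ms, new_ms):
    """How many times faster new_ms is than baseline_ms."""
    return baseline_ms / new_ms

# TPOT values copied from the in-process decode table.
dense_fp8_tp2_ms = 13.9
sparse_fp8_tp2_ms = 7.6
sparse_fp8_tp1_ms = 7.8

sparse_vs_dense = speedup(dense_fp8_tp2_ms, sparse_fp8_tp2_ms)
tp2_vs_tp1 = speedup(sparse_fp8_tp1_ms, sparse_fp8_tp2_ms)

print(round(tokens_per_second(sparse_fp8_tp2_ms)))
print(round(sparse_vs_dense, 2), round(tp2_vs_tp1, 2))
```

The second ratio makes the TP=1 ≈ TP=2 observation concrete: adding a second GPU buys only a few percent on sparse FP8 decode.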

Head-to-head vs llama.cpp (tools/xserv_vs_llama.py, warm servers, TP=2, GPUs 0-1, 6 reps, 256 tok)

prompt	metric	xserv sparse FP8	llama MXFP4	xserv vs llama
short	TTFT	35.3 ms	62.7 ms	1.78× faster
short	TPOT	7.32 ms	8.42 ms	1.15× faster
medium	TTFT	49.4 ms	65.0 ms	1.32× faster
medium	TPOT	7.19 ms	7.54 ms	1.05× faster
medium	tok/s	139.1	132.7	1.05× faster
long (1.6k)	TTFT	94.1 ms	44.7 ms	0.48× (llama wins)
long	TPOT	7.25 ms	7.64 ms	1.05× faster

At TP=2, decode TPOT now beats llama.cpp at every prompt length (it was 2× slower before sparse: 13.1 vs 6.6 ms). Remaining loss: long-prompt TTFT. Prefill is still the dense all-expert GEMM; sparse/grouped prefill is the planned fix (see the next levers below).

Post-review fixes (same harness, rerun): removing three leftover cudaDeviceSynchronize calls from the decode hot path and replacing the CPU-tiled prefill bias-add (96 D2H/H2D round-trips per prefill) with a GPU broadcast kernel improved both axes: TPOT 7.19–7.32 → 6.99–7.21 ms, and TTFT short/medium/long 35/49/94 → 29/42/79 ms. GSM8K-50: 94% (unchanged).

TP=1 head-to-head (single 5090; server now routes gpt-oss tp=1 to the TP engine)

prompt	metric	xserv sparse FP8	llama MXFP4
short	TTFT / TPOT	42.8 ms / 7.00 ms	34.5 ms / 3.22 ms
medium	TTFT / TPOT	57.1 ms / 7.19 ms	37.3 ms / 2.89 ms
long	TTFT / TPOT	119.6 ms / 7.20 ms	27.8 ms / 2.88 ms
all	tok/s	139–143	311–347

Single-GPU is llama.cpp's sweet spot, and it wins decode TPOT by 2.2–2.5× (TTFT by 1.2–4.3×). Two structural reasons, both instructive:

llama TP=2 (7.5–8.4 ms) is much worse than its TP=1 (2.9 ms): its PCIe cross-GPU split costs ~5 ms/token. xserv's NCCL all-reduce is cheap enough that TP=2 ≈ TP=1 (7.2 vs 7.0 ms), but xserv's single-GPU floor is high.
xserv TP=1 reads ~4.7 GB/token (FP8 experts 2.4 GB + non-expert weights still in BF16 ~2.3 GB, half of which is the 201k-vocab lm_head), which is ≈ 3.1 ms of pure HBM time; the other ~4 ms is launch overhead (~200 kernels/token, no CUDA graphs) plus BF16 GEMV inefficiency. llama reads ~1.3 GB (everything MXFP4) and replays the whole token as one CUDA graph.
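The HBM-time estimate above is a bytes-over-bandwidth calculation. The sketch below backs out the effective bandwidth implied by the stated figures (4.7 GB in 3.1 ms) instead of assuming a datasheet peak, then applies it to llama's ~1.3 GB. The hidden size of 2880 used to check the lm_head size is an assumption not stated in this document, and the vocabulary is rounded to 201k as in the text.

```python
def hbm_time_ms(gigabytes, bandwidth_gb_per_s):
    """Time to stream `gigabytes` from HBM at a given effective bandwidth."""
    return gigabytes / bandwidth_gb_per_s * 1000.0

# Figures quoted in the text for xserv TP=1.
xserv_gb_per_token = 4.7
xserv_hbm_ms = 3.1
xserv_tpot_ms = 7.0
llama_gb_per_token = 1.3

# Effective bandwidth implied by "4.7 GB ≈ 3.1 ms" (GB/s).
implied_bandwidth = xserv_gb_per_token / (xserv_hbm_ms / 1000.0)

# Whatever is not HBM time is launch overhead plus GEMV inefficiency.
overhead_ms = xserv_tpot_ms - xserv_hbm_ms

# Same bandwidth applied to llama's smaller per-token read.
llama_hbm_ms = hbm_time_ms(llama_gb_per_token, implied_bandwidth)

# lm_head size check: vocab x hidden x 2 bytes (BF16).
VOCAB = 201_000      # "201k-vocab" from the text, rounded
HIDDEN = 2880        # assumption: hidden size is not given in this document
lm_head_gb = VOCAB * HIDDEN * 2 / 1e9

print(round(implied_bandwidth), round(overhead_ms, 1),
      round(llama_hbm_ms, 2), round(lm_head_gb, 2))
```

Under these assumptions llama's pure read time is well under 1 ms, so its 2.9 ms TPOT is also not purely bandwidth-limited; the gap to xserv comes from both fewer bytes and fewer launches.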

Correctness

Greedy generations coherent across prompts (FP8/MXFP4, TP=1/2).
Sparse FP8 is W8A16 whereas dense is W8A8: activations are no longer quantized, so tokens are not expected to be byte-identical to dense; quality is checked with GSM8K instead.
GSM8K-100 (greedy, TP=2, tools/eval_gsm8k_fast.py): 96/100 = 96.0% vs dense FP8 91.0% / BF16 90.0%, i.e. no regression. The gap is within greedy-nondeterminism noise, and since W8A16 removes activation-quantization error, a score ≥ dense is expected. The average of 1.3 s/problem also reflects the decode speedup.
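As a rough sanity check on how much a 100-problem score can move, the sketch below computes the binomial standard error of the 96% vs 91% gap. It treats the two runs as independent samples, which is a simplifying assumption (both runs use the same problems), so it should be read as an order-of-magnitude estimate rather than a proper significance test.

```python
import math

def binomial_se(correct, total):
    """Standard error of an accuracy estimated from `total` pass/fail trials."""
    p = correct / total
    return math.sqrt(p * (1.0 - p) / total)

# Scores from the text: sparse FP8 96/100, dense FP8 91/100.
sparse_correct, dense_correct, n = 96, 91, 100

gap = (sparse_correct - dense_correct) / n
se_gap = math.sqrt(binomial_se(sparse_correct, n) ** 2
                   + binomial_se(dense_correct, n) ** 2)
z = gap / se_gap

print(round(gap * 100, 1), round(se_gap * 100, 1), round(z, 2))
```

A gap of about 1.4 standard errors is not strong evidence of a real accuracy difference in either direction, which is consistent with reading the result as "no regression" rather than as an improvement.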

Remaining gaps / next levers (to catch llama TP=1 at 2.9 ms)

Sparse MoE removed the dominant cost; the residual ~7 ms splits roughly into ~3 ms HBM reads and ~4 ms fixed overhead. In impact order:

CUDA graphs for decode (~2–4 ms): with experts down to ~1–2 ms, the ~200 un-graphed launches/token are now the single largest cost. (The old "graphs ≈ useless" conclusion was relative to a 13 ms dense TPOT and no longer holds.)
Quantize non-expert weights (~1–1.5 ms): attn qkv/o and the 1.16 GB BF16 lm_head are read every token; quantize them to FP8/MXFP4, as llama.cpp quantizes everything.
Sparse prefill (permute tokens by expert + grouped GEMM): bring long-prompt TTFT from 94–120 ms toward llama's ~30 ms territory.
W4A4 FP4 tensor cores / bandwidth-tuned MXFP4 GEMV: make 4-bit experts actually beat FP8 (today sparse MXFP4 is 8.4 ms vs FP8 7.6 ms; the 4-bit GEMV's lower effective bandwidth still cancels its byte advantage).
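The two decode levers with time estimates can be combined into a projected TPOT range. This sketch assumes the savings are additive and independent, which the document does not claim; it is only meant to show where the estimates land relative to llama's 2.9 ms.

```python
def projected_tpot_ms(current_ms, savings_ms):
    """TPOT after subtracting each estimated saving (assumed additive)."""
    return current_ms - sum(savings_ms)

current_tpot_ms = 7.0    # xserv sparse FP8, TP=1, from the head-to-head table
llama_tp1_ms = 2.9       # llama.cpp MXFP4, TP=1

# (low, high) savings estimates quoted in the list above.
cuda_graphs_ms = (2.0, 4.0)
non_expert_quant_ms = (1.0, 1.5)

pessimistic = projected_tpot_ms(current_tpot_ms,
                                [cuda_graphs_ms[0], non_expert_quant_ms[0]])
optimistic = projected_tpot_ms(current_tpot_ms,
                               [cuda_graphs_ms[1], non_expert_quant_ms[1]])

print(pessimistic, optimistic, optimistic < llama_tp1_ms < pessimistic)
```

Under the additive assumption llama's TP=1 figure falls inside the projected range, so these two levers alone may or may not close the gap depending on where the real savings land.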
