xserv/docs/benchmarks/mxfp4-and-llama-decode.md

# MXFP4 W4A16 + decode-speed vs llama.cpp (gpt-oss-20b, 2×RTX 5090)

## xserv vs llama.cpp — single-stream decode (TP=2, same GPUs)

`tools/xserv_vs_llama.py` streams identical prompts through each server's
OpenAI endpoint (counting llama's `reasoning_content` as real decode tokens).

| metric | xserv FP8 | llama MXFP4 |
|---|---|---|
| Decode TPOT (medium) | 13.1 ms | **6.6 ms** (2.0× faster) |
| Throughput | 76 tok/s | **151 tok/s** |
| TTFT (short/medium) | 35–50 ms | 60–63 ms |
| TTFT (long, 1.6k tok) | 94 ms | **35 ms** |

llama.cpp decodes ~2× faster; prefill is comparable-to-better.

## Why — decode is memory/comm-bound, not launch-bound

Traced + measured (not assumed):

- The 24-layer decode loop is already fully async (no per-layer syncs), so kernel
  launches hide behind GPU work — a CUDA graph would buy ~0.5–1.5 ms, not 2×.
- **TP=2→TP=4 probe**: TPOT 13.5→10.2 ms (FP8) with the *same* launch count and
  *more* NCCL — confirms the bottleneck is **expert HBM traffic + all-reduce**,
  not launch overhead.
- Even FP8 TP=4 (10.2 ms) can't catch llama TP=2 (6.6 ms): the gap is
  *algorithmic*. llama is **sparse (top-4 of 32 experts) + 4-bit (MXFP4)**;
  xserv is **dense (all 16 local experts) + 8-bit (FP8)** → ~8× the expert bytes
  per token. Dense also makes xserv's long-prefill TTFT worse.

The two levers that close it: **sparse top-k MoE** (≈4×, the bigger structural
change) and **4-bit weights** (≈2×).

## MXFP4 W4A16 (this change) — correct, smallest, not yet faster than FP8

Weight-only 4-bit: expert weights are MXFP4 (E2M1 + per-32 UE8M0 scale,
`tools/quantize_mxfp4.py`); a fused kernel reads the 4-bit weights and
dequantizes on-chip to BF16. Decode uses `batched_gemv_mxfp4`; prefill (M>1)
dequantizes to BF16 then reuses the BF16 batched GEMM.

| | MXFP4 W4A16 | FP8 W8A8 | BF16 |
|---|---|---|---|
| Model size | **13 GB** | 22 GB | 39 GB |
| Greedy tokens | identical | identical | baseline |
| Decode TPOT (TP=2) | 17.0 ms | **13.5 ms** | 18.8 ms |
| Decode TPOT (TP=4) | 11.8 ms | **10.2 ms** | — |
| Prefill TTFT | 350 ms | **134 ms** | 135 ms |

- **Correct** (byte-identical greedy tokens to FP8/BF16) and **smallest
  footprint** — fits one 32 GB 5090 with ample room for KV cache.
- **Not faster than FP8**: the hand-written W4A16 dequant-GEMV (no tensor cores)
  is less efficient than cuBLASLt's FP8 tensor-core GEMM, so even reading half
  the bytes it stays ~2–3.5 ms behind FP8 at every TP. The TP=4 scaling
  (17→11.8) shows it *is* partly memory-bound; a fixed per-GEMM inefficiency
  dominates. Vectorized loads, hoisted scale, warp reduction, and shared-memory
  activation tiling did not change it.
- **Prefill regresses** (350 vs 134 ms) — the dequant-to-BF16 fallback.

Committed as a **memory-optimization foundation**, not a decode speedup.

## To make 4-bit actually win

- **FP4 tensor cores (W4A4)** — cuBLASLt block-scaled MXFP4 GEMM
  (`CUDA_R_4F_E2M1` + `CUBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0`, available on
  sm_120). Tensor-core throughput *at* 4-bit would beat FP8. Risk: the scale
  swizzle layout.
- A **Marlin-class W4A16 kernel** (register-blocked, async-copy pipelined).
- **Sparse top-k MoE** for the larger, llama-matching win.

FP8 (the plan-cache fix + strided-batched optimization, 1.41× over BF16) remains
xserv's best-performing quantization today.