Gahow Wang 24c49c31c2 tools: warm-server FP8 vs BF16 benchmark + results doc

fp8_compare.py launches one xserv-server per model (same GPUs / TP for a
fair comparison), gates readiness on a real generation (not /health),
and streams GSM8K through /v1/chat/completions measuring per-request
TTFT (time to first token) and TPOT (mean inter-token latency) plus
exact-match accuracy. docs/benchmarks/fp8-quantization.md records the
quantization scheme, the perf-bug fix, and the dash5 results.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-12 00:58:46 +08:00

FP8 W8A8 quantization — gpt-oss-20b (dash5, 8× RTX 5090)

Operator-level FP8 E4M3 quantization of the MoE expert weights, with real cuBLASLt FP8 tensor-core GEMM (W8A8: FP8 weights × dynamically-quantized FP8 activations). All other tensors (attention, router, embeddings, norms, biases) stay BF16.

Scheme

Weights (tools/quantize_fp8.py): expert gate_up_proj / down_proj quantized BF16 → FP8 E4M3 with a per-expert scalar scale (absmax/448). Stored transposed [E, N, K] because cuBLASLt FP8 on Blackwell (sm120) requires transA=T.
Activations: quantized dynamically at runtime with one scale per token (per-row absmax); the GEMM output is rescaled by that row scale afterwards.
Compute: batched_gemm_fp8 (crates/xserv-kernels/src/quantization.rs) runs one cuBLASLt FP8 matmul per expert; the per-expert weight scale is supplied via the cuBLASLt B-scale device pointer (FP32 epilogue, so precision matches folding it into alpha).
Model size: 22 GB (FP8) vs 39 GB (BF16). The FP8 model fits on a single 32 GB 5090; BF16 needs ≥ 2.
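The scale bookkeeping in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in tools/quantize_fp8.py or batched_gemm_fp8: the function names are hypothetical, E4M3 rounding is simulated on ordinary floats, and both scales are applied after the accumulation.

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def round_e4m3(x):
    """Round a float to the nearest E4M3 grid point, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)        # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)     # 3 mantissa bits
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)


def quantize_expert(w):
    """One scalar scale for a whole expert matrix: absmax / 448."""
    absmax = max(abs(v) for row in w for v in row)
    scale = absmax / E4M3_MAX if absmax > 0 else 1.0
    return [[round_e4m3(v / scale) for v in row] for row in w], scale


def quantize_tokens(x):
    """One scale per token (row): row absmax / 448."""
    quantized, scales = [], []
    for row in x:
        absmax = max(abs(v) for v in row)
        s = absmax / E4M3_MAX if absmax > 0 else 1.0
        quantized.append([round_e4m3(v / s) for v in row])
        scales.append(s)
    return quantized, scales


def fp8_matmul(x, w):
    """y[m][n] = sum_k x[m][k] * w[n][k], with weights laid out as [N, K]."""
    xq, x_scales = quantize_tokens(x)
    wq, w_scale = quantize_expert(w)
    out = []
    for row, xs in zip(xq, x_scales):
        acc = [sum(a * b for a, b in zip(row, wrow)) for wrow in wq]
        # Undo both scales after the accumulation.
        out.append([v * xs * w_scale for v in acc])
    return out
```

Calling `fp8_matmul(x, w)` on one expert's `[N, K]` weight matrix returns an ordinary float result that differs from the unquantized product only by the E4M3 rounding of the inputs.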

The performance bug that was fixed

batched_gemm_fp8 originally rebuilt the entire cuBLASLt plan per expert, per GEMM, per layer, on every forward pass — running the algo heuristic search, creating/destroying the descriptor + 4 layouts + preference, and cudaMalloc-ing a 4-byte scale buffer — roughly 1500 heuristic searches per decoded token. This made FP8 slower than BF16:

metric	FP8 (buggy)	FP8 (fixed)	BF16
Decode TPOT	27.0 ms	17.9 ms	18.8 ms
Throughput	37 tok/s	55.8 tok/s	53.2 tok/s

Fix: cache the cuBLASLt plan (descriptor + layouts + heuristically-chosen algo) in a thread-local map keyed by (M, N, K) so the heuristic runs once per shape; allocate the scale buffer once; pass per-expert weight scales by device pointer. The per-expert loop now issues only cublasLtMatmul.
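The caching pattern behind the fix can be sketched as follows. This is an illustrative Python model only: the real code is Rust calling cuBLASLt, and `PlanCache`, `thread_cache` and `expert_loop` are hypothetical names, with a counter standing in for the heuristic search.

```python
import threading


class PlanCache:
    """Caches one plan per GEMM shape so the expensive setup runs once."""

    def __init__(self):
        self.plans = {}
        self.heuristic_searches = 0  # stand-in for the costly algo search

    def _build_plan(self, m, n, k):
        self.heuristic_searches += 1
        return {"shape": (m, n, k), "algo": "chosen-once"}

    def get(self, m, n, k):
        key = (m, n, k)
        if key not in self.plans:
            self.plans[key] = self._build_plan(m, n, k)
        return self.plans[key]


_tls = threading.local()


def thread_cache():
    """Return this thread's cache, creating it on first use."""
    if not hasattr(_tls, "cache"):
        _tls.cache = PlanCache()
    return _tls.cache


def expert_loop(num_experts, m, n, k):
    """Look the plan up once, then issue one matmul per expert."""
    plan = thread_cache().get(m, n, k)
    return [("matmul", plan["shape"], expert) for expert in range(num_experts)]
```

Repeated calls to `expert_loop` with the same `(m, n, k)` leave `thread_cache().heuristic_searches` at 1; a new shape adds exactly one more search.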

Results — GSM8K (200 problems, greedy, TP=2 on the same 2 GPUs)

Harness: tools/fp8_compare.py — a warm xserv-server per model, GSM8K streamed through /v1/chat/completions; TTFT = time to first token, TPOT = mean inter-token latency, per request.

metric	FP8 W8A8	BF16
GSM8K accuracy	93.0 %	90.5 %
TTFT median	67.4 ms	68.8 ms
TTFT p90	90.4 ms	96.7 ms
TPOT median	17.45 ms	18.26 ms
TPOT p90	17.65 ms	18.38 ms
Throughput	57.3 tok/s	54.8 tok/s
Mean output tokens	288	293
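The per-request metrics in this table can be derived from token arrival timestamps roughly as below. This is a sketch, not an excerpt from tools/fp8_compare.py: the function names are hypothetical, and the nearest-rank percentile is an assumption, since the harness's exact percentile method is not stated here.

```python
import math


def request_latency(start, token_times):
    """Return (TTFT, TPOT) in the same unit as the timestamps.

    TTFT is the delay from sending the request to the first streamed token.
    TPOT is the mean gap between consecutive tokens after the first.
    """
    ttft = token_times[0] - start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else float("nan")
    return ttft, tpot


def percentile(values, pct):
    """Nearest-rank percentile (assumed method), pct given as 0-100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting the per-request TTFT and TPOT values into two lists and calling `percentile(values, 50)` and `percentile(values, 90)` would give the median and p90 rows.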

Accuracy: unchanged. FP8 is nominally +2.5 pts, but with n=200 the standard error of each accuracy figure is about 2 pts (about 2.7 pts on the difference if the two runs are treated as independent), so the two are statistically indistinguishable. The takeaway is that FP8 did not degrade accuracy.
Decode: FP8 is about 4 to 5 % faster (TPOT 17.45 vs 18.26 ms; throughput 57.3 vs 54.8 tok/s), reproducible across runs, with a lower p90 (17.65 vs 18.38 ms). The gain is modest because the dense-MoE path loads all experts every token and FP8 only halves the expert bytes; the per-expert M=1 launches and M=1 tensor-core inefficiency absorb much of the bandwidth saving.
Prefill (TTFT): comparable. A multi-length sweep (113 / 561 / 1681 tokens) gave FP8 480 / 362 / 2451 ms vs BF16 558 / 282 / 2287 ms. Neither series is monotonic in prompt length, which suggests that at these lengths TTFT is dominated by fixed overhead (cuBLAS lazy init plus FP8's one-time per-shape heuristic) rather than by prefill compute.
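The accuracy comparison can be checked with a binomial standard-error calculation. The sketch below uses only the figures reported above; treating the two runs as independent samples is an assumption (both models saw the same 200 problems, so a paired test would be stricter), and the function names are illustrative.

```python
import math


def proportion_se(p, n):
    """Standard error of an accuracy p measured on n problems."""
    return math.sqrt(p * (1.0 - p) / n)


def difference_se(p1, p2, n):
    """Standard error of p1 - p2, assuming two independent runs of size n."""
    return math.sqrt(proportion_se(p1, n) ** 2 + proportion_se(p2, n) ** 2)


n = 200
fp8_acc, bf16_acc = 0.930, 0.905

se_fp8 = proportion_se(fp8_acc, n)               # about 0.018
se_bf16 = proportion_se(bf16_acc, n)             # about 0.021
se_diff = difference_se(fp8_acc, bf16_acc, n)    # about 0.027
z = (fp8_acc - bf16_acc) / se_diff               # below 1.96, so not significant at 5 %
```

With a z-score below 1, the 2.5-point gap is well inside what sampling noise alone would produce at n=200.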

Single-GPU (TP=1)

FP8 runs gpt-oss-20b on one 5090 (bench-gpt-oss --tp 1, GPU6): TTFT 538 ms, TPOT 29.0 ms, 34.5 tok/s. BF16 cannot (39 GB > 32 GB). This — fitting a model that otherwise needs two GPUs onto one — is the largest practical win.

Follow-ups (not done)

Strided-batched FP8 (one call instead of ~768 per-expert launches per token) — requires folding the per-expert weight scale into the post-scale kernel, at a BF16-intermediate precision cost.
Per-channel (per-output-row) weight scales, for better accuracy headroom than the current per-expert scalar (per-tensor) scale.
Warm common prefill shapes at load to hide the first-request heuristic stall.
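To illustrate why per-channel scales give more headroom, the sketch below compares a single per-tensor scale with per-row scales on a made-up 2×2 matrix that has one large row and one tiny row. The matrix and helper names are hypothetical; the only format facts used are E4M3's maximum of 448 and its smallest subnormal of 2^-9.

```python
E4M3_MAX = 448.0
SMALLEST_SUBNORMAL = 2.0 ** -9  # smallest positive E4M3 value


def per_tensor_scale(w):
    """One scale for the whole matrix, as in the current scheme."""
    return max(abs(v) for row in w for v in row) / E4M3_MAX


def per_row_scales(w):
    """One scale per output row (the proposed follow-up)."""
    return [max(abs(v) for v in row) / E4M3_MAX for row in w]


def flushed_to_zero(value, scale):
    """True if value / scale is under half the smallest subnormal,
    so round-to-nearest would store it as zero."""
    return abs(value / scale) < SMALLEST_SUBNORMAL / 2


# Hypothetical weights: row 0 is large, row 1 is six orders of magnitude smaller.
w = [[100.0, -50.0], [0.0001, 0.0002]]

tensor_scale = per_tensor_scale(w)
row_scales = per_row_scales(w)

lost_per_tensor = [flushed_to_zero(v, tensor_scale) for v in w[1]]
lost_per_row = [flushed_to_zero(v, row_scales[1]) for v in w[1]]
```

In this toy case the shared scale is set by the large row, so both entries of the small row fall below the representable range, while a per-row scale maps them onto the full E4M3 range.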
