tools: warm-server FP8 vs BF16 benchmark + results doc

fp8_compare.py launches one xserv-server per model (same GPUs / TP for a fair comparison), gates readiness on a real generation (not /health), and streams GSM8K through /v1/chat/completions, measuring per-request TTFT (time to first token) and TPOT (mean inter-token latency) plus exact-match accuracy. docs/benchmarks/fp8-quantization.md records the quantization scheme, the perf-bug fix, and the dash5 results.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 00:58:46 +08:00
parent 5a16225c1f
commit 24c49c31c2
2 changed files with 366 additions and 0 deletions
--- a/docs/benchmarks/fp8-quantization.md
+++ b/docs/benchmarks/fp8-quantization.md
@@ -0,0 +1,83 @@
+# FP8 W8A8 quantization — gpt-oss-20b (dash5, 8× RTX 5090)
+
+Operator-level FP8 E4M3 quantization of the MoE expert weights, with real
+cuBLASLt FP8 tensor-core GEMM (W8A8: FP8 weights × dynamically-quantized FP8
+activations). All other tensors (attention, router, embeddings, norms, biases)
+stay BF16.
+
+## Scheme
+
+- **Weights** (`tools/quantize_fp8.py`): expert `gate_up_proj` / `down_proj`
+  quantized BF16 → FP8 E4M3 with a **per-expert scalar** scale (`absmax/448`).
+  Stored transposed `[E, N, K]` because cuBLASLt FP8 on Blackwell (sm120)
+  requires `transA=T`.
+- **Activations**: quantized dynamically at runtime, **per-token** (per-row
+  absmax), recovered by a post-GEMM row scale.
+- **Compute**: `batched_gemm_fp8` (`crates/xserv-kernels/src/quantization.rs`)
+  runs one cuBLASLt FP8 matmul per expert; the per-expert weight scale is
+  supplied via the cuBLASLt B-scale device pointer (FP32 epilogue, so precision
+  matches folding it into `alpha`).
+- Model size: **22 GB** (FP8) vs **39 GB** (BF16). The FP8 model fits on a
+  single 32 GB 5090; BF16 needs ≥ 2.
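The scaling arithmetic in the scheme above can be sketched in plain Python. This is an illustration only, not the code in `tools/quantize_fp8.py` or `batched_gemm_fp8`: the helper names are hypothetical, the FP8 cast is a crude stand-in (clip to ±448 and keep 4 significant bits, ignoring subnormals), and the matmul runs in ordinary floats rather than on tensor cores.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude


def fake_e4m3(v):
    """Crude E4M3 stand-in: clip to +-448 and keep 4 significant bits."""
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    mant, exp = math.frexp(v)  # v = mant * 2**exp, with 0.5 <= |mant| < 1
    return math.ldexp(round(mant * 16) / 16, exp)


def quantize_weight(w):
    """One expert's weight [N][K] -> (quantized weight, single scalar scale)."""
    scale = max(abs(v) for row in w for v in row) / FP8_E4M3_MAX or 1.0
    return [[fake_e4m3(v / scale) for v in row] for row in w], scale


def quantize_activations(x):
    """Activations [M][K] -> (quantized rows, one scale per token/row)."""
    scales = [max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0 for row in x]
    x_q = [[fake_e4m3(v / s) for v in row] for row, s in zip(x, scales)]
    return x_q, scales


def gemm_w8a8(x, w):
    """y[m][n] = (sum_k x_q[m][k] * w_q[n][k]) * s_x[m] * s_w."""
    x_q, s_x = quantize_activations(x)
    w_q, s_w = quantize_weight(w)
    return [[sum(a * b for a, b in zip(xr, wr)) * sx * s_w for wr in w_q]
            for xr, sx in zip(x_q, s_x)]


# Small random problem (sizes are arbitrary, chosen for illustration).
random.seed(0)
M, N, K = 3, 8, 64
x = [[random.gauss(0, 1) for _ in range(K)] for _ in range(M)]
w = [[random.gauss(0, 0.05) for _ in range(K)] for _ in range(N)]

y = gemm_w8a8(x, w)
ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]

# Relative Frobenius-norm error of the quantized result vs. the float result.
num = math.sqrt(sum((a - b) ** 2 for yr, rr in zip(y, ref) for a, b in zip(yr, rr)))
den = math.sqrt(sum(b ** 2 for rr in ref for b in rr))
rel_err = num / den
print(f"relative error: {rel_err:.3f}")
```

Dividing by `absmax/448` maps the largest magnitude onto the top of the E4M3 range, and the two scales are multiplied back in after the accumulation, which is the role the post-GEMM row scale and the B-scale pointer play in the scheme described above.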
+
+## The performance bug that was fixed
+
+`batched_gemm_fp8` originally rebuilt the entire cuBLASLt plan **per expert,
+per GEMM, per layer, on every forward pass** — running the algo heuristic
+search, creating/destroying the descriptor + 4 layouts + preference, and
+`cudaMalloc`-ing a 4-byte scale buffer — roughly 1500 heuristic searches per
+decoded token. This made FP8 **slower than BF16**:
+
+| | FP8 (buggy) | FP8 (fixed) | BF16 |
+|---|---|---|---|
+| Decode TPOT | 27.0 ms | **17.9 ms** | 18.8 ms |
+| Throughput | 37 tok/s | **55.8 tok/s** | 53.2 tok/s |
+
+Fix: cache the cuBLASLt plan (descriptor + layouts + heuristically-chosen algo)
+in a thread-local map keyed by `(M, N, K)` so the heuristic runs once per shape;
+allocate the scale buffer once; pass per-expert weight scales by device pointer.
+The per-expert loop now issues only `cublasLtMatmul`.
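The caching pattern behind the fix can be sketched as follows. This is a language-neutral illustration in Python, not the Rust in `crates/xserv-kernels/src/quantization.rs`: `build_plan`, `get_plan`, `run_experts`, the counters, and all sizes are hypothetical stand-ins, and the "plan" is a plain dict rather than real cuBLASLt handles.

```python
import threading

_tls = threading.local()  # each thread keeps its own plan cache
stats = {"heuristic_searches": 0, "matmul_launches": 0}


def build_plan(m, n, k):
    """Stand-in for the expensive setup: descriptor, layouts, heuristic search."""
    stats["heuristic_searches"] += 1
    return {"shape": (m, n, k), "algo": "chosen-by-heuristic"}


def get_plan(m, n, k):
    """Return the cached plan for this shape, building it on first use."""
    cache = getattr(_tls, "plans", None)
    if cache is None:
        cache = _tls.plans = {}
    key = (m, n, k)
    plan = cache.get(key)
    if plan is None:
        plan = cache[key] = build_plan(m, n, k)
    return plan


def run_experts(num_experts, m, n, k):
    """Per-expert loop: one plan lookup, then only matmul launches."""
    plan = get_plan(m, n, k)
    for _ in range(num_experts):
        stats["matmul_launches"] += 1  # stand-in for one cublasLtMatmul call
    return plan


# Illustrative decode loop: 4 layers, 8 experts, two GEMM shapes per layer.
for _layer in range(4):
    run_experts(8, 1, 512, 256)   # hypothetical gate_up_proj shape
    run_experts(8, 1, 256, 256)   # hypothetical down_proj shape

print(stats)
```

With the cache in place, the number of heuristic searches depends on the number of distinct `(M, N, K)` shapes rather than on layers × experts × GEMMs, while the launch count is unchanged.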
+
+## Results — GSM8K (200 problems, greedy, TP=2 on the same 2 GPUs)
+
+Harness: `tools/fp8_compare.py` — a warm `xserv-server` per model, GSM8K streamed
+through `/v1/chat/completions`; TTFT = time to first token, TPOT = mean
+inter-token latency, per request.
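The two latency metrics can be computed from per-token arrival times as in the sketch below. It assumes the harness records a send time and one timestamp per streamed token; the function name and the sample timestamps are made up for illustration and are not taken from `tools/fp8_compare.py`.

```python
def latency_metrics(t_send, token_times):
    """Return (ttft, tpot) in seconds for one streamed request.

    ttft: delay from sending the request to the first token.
    tpot: mean gap between consecutive tokens (None if fewer than 2 tokens).
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - t_send
    if len(token_times) < 2:
        return ttft, None
    # The mean of the gaps equals (last - first) / (count - 1).
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Synthetic example: first token after 70 ms, then one token every 18 ms.
ttft, tpot = latency_metrics(0.0, [0.070, 0.088, 0.106, 0.124])
decode_tok_per_s = 1.0 / tpot
print(f"TTFT {ttft * 1e3:.1f} ms, TPOT {tpot * 1e3:.1f} ms, "
      f"{decode_tok_per_s:.1f} tok/s")
```

Under this definition decode throughput is simply the reciprocal of TPOT, which is consistent with the reported pairs (for example, 17.45 ms corresponds to about 57.3 tok/s).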
+
+| metric | FP8 W8A8 | BF16 |
+|---|---|---|
+| GSM8K accuracy | **93.0 %** | 90.5 % |
+| TTFT median | 67.4 ms | 68.8 ms |
+| TTFT p90 | 90.4 ms | 96.7 ms |
+| TPOT median | **17.45 ms** | 18.26 ms |
+| TPOT p90 | 17.65 ms | 18.38 ms |
+| Throughput | **57.3 tok/s** | 54.8 tok/s |
+| Mean output tokens | 288 | 293 |
+
+- **Accuracy: unchanged.** FP8 is nominally +2.5 pts, but with n=200 each
+  score has a standard error of ~2 pts, so the two are statistically
+  indistinguishable. The takeaway is that FP8 did **not** degrade accuracy.
+- **Decode: FP8 ~5 % faster** (TPOT 17.45 vs 18.26 ms), reproducible across
+  runs, with a tighter p90. Modest because the dense-MoE path loads *all*
+  experts every token and FP8 only halves the *expert* bytes; the per-expert
+  M=1 launches and M=1 tensor-core inefficiency absorb much of the bandwidth
+  saving.
+- **Prefill (TTFT): comparable.** A multi-length sweep (113 / 561 / 1681 tokens)
+  gave FP8 480 / 362 / 2451 ms vs BF16 558 / 282 / 2287 ms — non-monotonic, i.e.
+  dominated by fixed overhead (cuBLAS lazy init + FP8's one-time per-shape
+  heuristic), not prefill compute, at these lengths.
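The accuracy comparison above can be checked with a short binomial calculation. This sketch treats the two runs as independent samples, which is an approximation: both models answered the same 200 problems, so a paired test would be the stricter analysis.

```python
import math


def binomial_se(p, n):
    """Standard error of an accuracy p measured on n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)


n = 200
acc_fp8, acc_bf16 = 0.930, 0.905

se_fp8 = binomial_se(acc_fp8, n)
se_bf16 = binomial_se(acc_bf16, n)
se_diff = math.sqrt(se_fp8 ** 2 + se_bf16 ** 2)  # unpaired approximation
z = (acc_fp8 - acc_bf16) / se_diff

print(f"SE FP8 {se_fp8 * 100:.1f} pts, SE BF16 {se_bf16 * 100:.1f} pts")
print(f"SE of difference {se_diff * 100:.1f} pts, z = {z:.2f}")
```

The per-model standard errors come out at roughly 1.8 and 2.1 points, and the 2.5-point gap is less than one standard error of the difference, so it is well inside sampling noise.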
+
+## Single-GPU (TP=1)
+
+FP8 runs gpt-oss-20b on **one** 5090 (`bench-gpt-oss --tp 1`, GPU6): TTFT 538 ms,
+TPOT 29.0 ms, 34.5 tok/s. BF16 cannot (39 GB > 32 GB). This — fitting a model
+that otherwise needs two GPUs onto one — is the largest practical win.
+
+## Follow-ups (not done)
+
+- Strided-batched FP8 (one call instead of ~768 per-expert launches per token) —
+  requires folding the per-expert weight scale into the post-scale kernel, at a
+  BF16-intermediate precision cost.
+- Per-channel (per-output-row) weight scales for better accuracy headroom than
+  the current per-expert scalar.
+- Warm common prefill shapes at load to hide the first-request heuristic stall.