The plan-cache fix removed the per-expert heuristic churn but still issued one
cublasLtMatmul per expert: ~768 tiny launches per decoded token (16 local
experts × 2 GEMMs × 24 layers), which capped the FP8 decode win at ~1.05× over
BF16. Collapse each MoE GEMM into ONE strided-batched cuBLASLt FP8 matmul
(BATCH_COUNT + strided-batch offsets on all four layouts) → ~48 launches/token.
A single strided call can't carry a per-batch scalar B-scale, so the per-expert
weight scale moves out of the GEMM epilogue into a fused post-scale kernel
(rowwise_scale_moe_bf16) that applies a_scale[token]·b_scale[expert] in one
pass. This is precision-equivalent: BF16's relative error is scale-invariant, so
scaling the unscaled GEMM output afterward loses nothing vs scaling in-epilogue.
Measured on dash5 (gpt-oss-20b, TP=2, 5090), warm-server GSM8K:
decode TPOT 17.45 → 13.08 ms (FP8 now 1.41× vs BF16 18.39 ms),
throughput 57.3 → 76.4 tok/s, accuracy unchanged (FP8 91.0% vs BF16 90.0%).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
fp8_compare.py launches one xserv-server per model (same GPUs / TP for a
fair comparison), gates readiness on a real generation (not /health),
and streams GSM8K through /v1/chat/completions measuring per-request
TTFT (time to first token) and TPOT (mean inter-token latency) plus
exact-match accuracy. docs/benchmarks/fp8-quantization.md records the
quantization scheme, the perf-bug fix, and the dash5 results.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>