|
|
3a530956af
|
tools: add FP8 vs BF16 benchmark and GSM8K eval harness

bench_fp8.py: head-to-head comparison of FP8 and BF16 models on
GSM8K / AIME2025 accuracy, plus TTFT/TPOT performance measurement.

eval_gsm8k_batch.sh: lightweight GSM8K accuracy evaluator that
pipes one problem per xserv-chat invocation and scores with
\boxed{} / last-number extraction.

Benchmark results (gpt-oss-20b, 50-problem GSM8K):

  FP8 W8A8  TP1: 94.0% (single RTX 5090, 25 GB)
  FP8 W8A16 TP1: 94.0%
  BF16      TP2: 94.0% (requires 2× RTX 5090)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-06-08 15:43:04 +08:00 |
|
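The commit message describes scoring by "\boxed{} / last-number extraction" but does not show the rule itself. The following is a minimal Python sketch of one plausible reading: prefer the last number inside the final \boxed{...}, otherwise fall back to the last number anywhere in the response. It is not taken from eval_gsm8k_batch.sh; the names `extract_answer` and `is_correct`, the regular expressions, and the numeric tolerance are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the real script may match differently.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")      # non-nested \boxed{...}
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")    # 42, -3.5, 1,234


def extract_answer(text):
    """Return the predicted number as a string, or None if none is found."""
    boxed = BOXED_RE.findall(text)
    # Try the last \boxed{} payload first, then the whole response.
    candidates = ([boxed[-1]] if boxed else []) + [text]
    for candidate in candidates:
        numbers = NUMBER_RE.findall(candidate)
        if numbers:
            # Drop thousands separators so "1,234" compares equal to "1234".
            return numbers[-1].replace(",", "")
    return None


def is_correct(response, reference):
    """Compare the extracted answer numerically with the reference answer."""
    predicted = extract_answer(response)
    if predicted is None:
        return False
    return abs(float(predicted) - float(reference)) < 1e-6


if __name__ == "__main__":
    print(extract_answer(r"So the total is \boxed{1,234}."))
    print(is_correct("She ends up with 18 eggs.", "18"))
```

Under this scheme, accuracy is simply the fraction of problems for which `is_correct` returns true; on a 50-problem set, 94.0% corresponds to 47 correct answers.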