Gahow Wang
3a530956af
tools: add FP8 vs BF16 benchmark and GSM8K eval harness
bench_fp8.py: head-to-head comparison of FP8 and BF16 models on
GSM8K / AIME2025 accuracy, plus TTFT (time to first token) and
TPOT (time per output token) performance measurement.
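For readers unfamiliar with the two latency metrics, here is a minimal sketch of how TTFT and TPOT are commonly derived from per-token arrival timestamps. The commit does not show how bench_fp8.py measures them, so the function name `ttft_tpot` and its inputs are hypothetical and only illustrate the usual definitions.

```python
from typing import Sequence, Tuple


def ttft_tpot(request_start: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Compute (TTFT, TPOT) in seconds from wall-clock timestamps.

    request_start: time the request was sent (hypothetical input).
    token_times:   arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")

    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_start

    # TPOT: mean gap between consecutive tokens after the first one.
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

Excluding the first token from TPOT keeps prefill cost (captured by TTFT) separate from steady-state decode speed, which is the number that usually differs between quantization schemes.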
eval_gsm8k_batch.sh: lightweight GSM8K accuracy evaluator that
pipes one problem per xserv-chat invocation and scores with
\boxed{} / last-number extraction.
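The \boxed{} / last-number scoring can be sketched as below. This is an illustration in Python, not the shell logic from eval_gsm8k_batch.sh: the helper names are hypothetical, and the ordering (prefer the last \boxed{} answer, otherwise fall back to the last number in the response) is an assumption about how the two extractors are combined.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
_BOXED = "\\boxed{"


def extract_boxed(text: str) -> Optional[str]:
    # Return the contents of the last \boxed{...}, tracking nested braces.
    start = text.rfind(_BOXED)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(_BOXED):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces: treat as "no boxed answer"


def last_number(text: str) -> Optional[str]:
    # Return the last number in the text with commas stripped.
    matches = _NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None


def extract_answer(response: str) -> Optional[str]:
    # Assumed order: boxed answer first, then last number anywhere.
    boxed = extract_boxed(response)
    if boxed is not None:
        number = last_number(boxed)
        if number is not None:
            return number
    return last_number(response)


def is_correct(response: str, gold: str) -> bool:
    # Compare numerically so that "18.0" matches a gold answer of "18".
    pred = extract_answer(response)
    if pred is None:
        return False
    try:
        return abs(float(pred) - float(gold.replace(",", ""))) < 1e-6
    except ValueError:
        return False
```

For example, `is_correct("... so the answer is \\boxed{18.0}", "18")` is true, and a response with no boxed answer such as `"The total is 1,234 dollars."` yields `"1234"` from the fallback.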
Benchmark results (gpt-oss-20b, 50-problem GSM8K):
  FP8 W8A8  TP1: 94.0% (single RTX 5090, 25 GB)
  FP8 W8A16 TP1: 94.0%
  BF16      TP2: 94.0% (requires 2× RTX 5090)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-08 15:43:04 +08:00