Gahow Wang
9f1fbbb98b
quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights
Store expert gate_up_proj and down_proj weights in FP8 E4M3 (1 byte/elem)
with per-expert FP32 scale factors. At inference, a fused CUDA kernel
dequantizes to BF16 before the existing cuBLAS batched GEMM.
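The per-expert scaling described above can be sketched in plain Python. This is a minimal illustration, not the code in tools/quantize_fp8.py or dequant_fp8.cu. It assumes the E4M3 variant with a maximum finite value of 448 and a scale chosen as amax / 448 per expert. The helper names (round_to_e4m3, quantize_expert, dequantize_expert) are hypothetical, and values are kept as Python floats rather than packed into bytes or converted to BF16.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value (assumed variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    mag = min(abs(x), E4M3_MAX)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    # Exponents below -6 fall in the subnormal range and share the same step.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    q = min(round(mag / step) * step, E4M3_MAX)  # round() is ties-to-even
    return math.copysign(q, x)


def quantize_expert(weights):
    """Quantize one expert's weights; returns (E4M3 values, FP32-style scale)."""
    amax = max((abs(w) for w in weights), default=0.0)
    scale = amax / E4M3_MAX if amax > 0.0 else 1.0  # assumed scaling rule
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize_expert(codes, scale):
    """Inverse step: multiply each stored value by the expert's scale."""
    return [q * scale for q in codes]


if __name__ == "__main__":
    weights = [0.5, -0.25, 0.1, 0.0]
    codes, scale = quantize_expert(weights)
    print(codes, scale)
    print(dequantize_expert(codes, scale))
```

With three mantissa bits, the relative rounding error for values in the normal range is at most 2^-4 (6.25%), which is why a single scale per expert that maps the largest weight to 448 keeps most weights well represented.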
Results on gpt-oss-20b (50-problem GSM8K subset):
- FP8 TP=1: 47/50 = 94.0% (single RTX 5090, ~25 GB VRAM)
- BF16 TP=2: 47/50 = 94.0% (requires 2× RTX 5090, ~39 GB total)
No accuracy difference on this subset (47/50 in both configurations). Model size: 41.8 GB → 22.7 GB (−46%).
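The size figures can be checked with a few lines of arithmetic. The sketch below also estimates how much of the BF16 checkpoint the expert weights account for. That estimate assumes only the expert weights changed from 2 bytes to 1 byte per element and ignores the small per-expert FP32 scales, so it is an inference from the numbers above rather than a measured value.

```python
bf16_gb = 41.8  # reported BF16 model size
fp8_gb = 22.7   # reported size with FP8 expert weights

saved_gb = bf16_gb - fp8_gb
reduction = saved_gb / bf16_gb

# Assumption: halving bytes per element saves half of the expert storage,
# so the experts occupied roughly twice the saved amount in BF16.
expert_bf16_gb = 2 * saved_gb
expert_share = expert_bf16_gb / bf16_gb

print(f"saved: {saved_gb:.1f} GB ({reduction:.1%})")
print(f"estimated BF16 expert weights: {expert_bf16_gb:.1f} GB ({expert_share:.1%})")
```

Under that assumption the reduction falls short of 50% because the non-expert weights stay in BF16.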
New files:
- tools/quantize_fp8.py: offline BF16→FP8 conversion script
- csrc/quantization/dequant_fp8.cu: per-expert-scale dequant kernel
- crates/xserv-kernels/src/quantization.rs: Rust FFI wrapper
- tools/eval_gsm8k_batch.sh: GSM8K accuracy evaluation harness
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-07 19:33:07 +08:00