xserv

gahow/xserv

Fork 0

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	9f1fbbb98b	quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights Store expert gate_up_proj and down_proj weights in FP8 E4M3 (1 byte/elem) with per-expert FP32 scale factors. At inference, a fused CUDA kernel dequantizes to BF16 before the existing cuBLAS batched GEMM. Results on gpt-oss-20b (50-problem GSM8K subset): - FP8 TP=1: 47/50 = 94.0% (single RTX 5090, ~25 GB VRAM) - BF16 TP=2: 47/50 = 94.0% (requires 2× RTX 5090, ~39 GB total) No measurable accuracy degradation. Model size: 41.8 GB → 22.7 GB (−46%). New files: - tools/quantize_fp8.py: offline BF16→FP8 conversion script - csrc/quantization/dequant_fp8.cu: per-expert-scale dequant kernel - crates/xserv-kernels/src/quantization.rs: Rust FFI wrapper - tools/eval_gsm8k_batch.sh: GSM8K accuracy evaluation harness Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-07 19:33:07 +08:00

Author

SHA1

Message

Date

Gahow Wang

9f1fbbb98b

quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights

Store expert gate_up_proj and down_proj weights in FP8 E4M3 (1 byte/elem)
with per-expert FP32 scale factors. At inference, a fused CUDA kernel
dequantizes to BF16 before the existing cuBLAS batched GEMM.

Results on gpt-oss-20b (50-problem GSM8K subset):
  - FP8 TP=1: 47/50 = 94.0% (single RTX 5090, ~25 GB VRAM)
  - BF16 TP=2: 47/50 = 94.0% (requires 2× RTX 5090, ~39 GB total)

No measurable accuracy degradation. Model size: 41.8 GB → 22.7 GB (−46%).

New files:
  - tools/quantize_fp8.py: offline BF16→FP8 conversion script
  - csrc/quantization/dequant_fp8.cu: per-expert-scale dequant kernel
  - crates/xserv-kernels/src/quantization.rs: Rust FFI wrapper
  - tools/eval_gsm8k_batch.sh: GSM8K accuracy evaluation harness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-06-07 19:33:07 +08:00

1 Commits