xserv

Gahow Wang 76487b7963 quantization: W8A8 FP8 compute via cuBLASLt tensor cores

Replace the W8A16 dequant→BF16-GEMM path with native FP8×FP8→BF16 GEMM
using cuBLASLt on Blackwell (RTX 5090). Both weights (static FP8 E4M3)
and activations (dynamically quantized per-row) are processed directly
on FP8 tensor cores.

Key implementation details:
- cuBLASLt on Blackwell requires transA=T for FP8, so expert weights
  are transposed during model loading ([E,K,N] → [E,N,K])
- Per-row activation quantization kernel (scale = row absmax / 448, the
  FP8 E4M3 maximum; values are divided by the scale and cast to FP8 E4M3)
- Post-GEMM row-wise rescaling recovers per-token precision
- Per-expert loop (not batched) due to cuBLASLt FP8 scale constraints
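
The per-row quantization and post-GEMM rescaling steps listed above can be sketched in plain Python. This is a CPU simulation of the arithmetic only, not the CUDA kernel or the cuBLASLt call from this commit. The function names (`round_to_e4m3`, `quantize_rows`, `fp8_gemm_rescaled`) are hypothetical, and the single per-tensor weight scale `w_scale` is an assumption; the commit does not state the weight scale granularity.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def round_to_e4m3(x):
    """Simulate a saturating cast of a float to FP8 E4M3 (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)            # saturate instead of overflowing
    # floor(log2(mag)), clamped to the smallest normal exponent (-6);
    # below that, values fall on the subnormal grid with step 2**-9.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)                    # spacing between neighbours
    return math.copysign(round(mag / step) * step, x)


def quantize_rows(a):
    """Per-row dynamic quantization: scale = absmax / 448, q = x / scale."""
    q, scales = [], []
    for row in a:
        absmax = max(abs(v) for v in row)
        scale = absmax / FP8_E4M3_MAX if absmax > 0 else 1.0  # avoid 0-division
        q.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return q, scales


def matmul(a, b):
    """Reference [M,K] x [K,N] matrix product, standing in for the FP8 GEMM."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def fp8_gemm_rescaled(a, w_q, w_scale):
    """Quantize activations per row, multiply, then undo the scales.

    w_q holds already-quantized weights; w_scale is an assumed
    per-tensor weight scale.
    """
    a_q, a_scales = quantize_rows(a)
    acc = matmul(a_q, w_q)
    # Row-wise rescale: each output row is multiplied by its own activation
    # scale, which is what restores per-token dynamic range.
    return [[v * s * w_scale for v in row] for row, s in zip(acc, a_scales)]
```

Because each row carries its own scale, a token with small activations still uses the full FP8 range, and the scale factors out of the dot product so it can be applied once per output row after the GEMM.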

The same FP8 quantized model files work — no re-quantization needed.
Activation quantization happens dynamically at inference time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-06-07 20:38:26 +08:00

activation

phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2

2026-05-30 15:18:01 +08:00

attention

kernels: fix NaN in flash-attention sinks on fully-masked window tiles

2026-06-02 16:09:43 +08:00

embedding

kernels/cuda: paged-attention kernel, dispatch, pinned host memory

2026-05-28 19:58:36 +08:00

gemm

kernels: fix uninitialized shared-memory read in M=1 decode GEMV