xserv

Gahow Wang be5c64ea8a phase 10: GPU add/mul kernels + BF16 precision analysis

Kernel additions:
- add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU)
- Refactored activation.rs with dispatch_unary/dispatch_binary helpers
- Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips
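
The binary dispatch idea above can be sketched roughly as follows. This is only an illustration, not the actual contents of activation.rs: the `DType` and `BinaryOp` enums and the signature of `dispatch_binary` are assumptions, and the kernel launch is replaced by returning the kernel name that would be selected. Only the four kernel names come from the commit message.

```rust
/// Element types covered by the kernels listed above (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DType {
    F32,
    BF16,
}

/// Element-wise binary operations (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BinaryOp {
    Add,
    Mul,
}

/// Map an (op, dtype) pair to the kernel symbol named in the commit message.
pub fn kernel_name(op: BinaryOp, dtype: DType) -> &'static str {
    match (op, dtype) {
        (BinaryOp::Add, DType::F32) => "add_f32",
        (BinaryOp::Add, DType::BF16) => "add_bf16",
        (BinaryOp::Mul, DType::F32) => "mul_f32",
        (BinaryOp::Mul, DType::BF16) => "mul_bf16",
    }
}

/// Validate the operands (dtype, element count) and pick a kernel.
/// A real helper would launch the kernel here; this sketch only returns its name.
pub fn dispatch_binary(
    op: BinaryOp,
    lhs: (DType, usize),
    rhs: (DType, usize),
) -> Result<&'static str, String> {
    if lhs.0 != rhs.0 {
        return Err(format!("dtype mismatch: {:?} vs {:?}", lhs.0, rhs.0));
    }
    if lhs.1 != rhs.1 {
        return Err(format!("length mismatch: {} vs {}", lhs.1, rhs.1));
    }
    Ok(kernel_name(op, lhs.0))
}
```

Keeping the checks in one helper means each new element-wise op only adds a match arm instead of repeating the validation.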

GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add)
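
A CPU reference for the "tile + add" broadcast may help make the data flow concrete. It is a sketch under assumptions: the function names `tile_bias`, `add_elementwise`, and `add_bias` are hypothetical, the activations are assumed to be a row-major `[rows, hidden]` buffer, and both steps run on the host here, whereas the commit runs them on the GPU.

```rust
/// Repeat `bias` once per row so it matches a row-major [rows, hidden] buffer.
pub fn tile_bias(bias: &[f32], rows: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(bias.len() * rows);
    for _ in 0..rows {
        out.extend_from_slice(bias);
    }
    out
}

/// Plain element-wise add over two equally sized buffers.
pub fn add_elementwise(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "operands must have the same length");
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Broadcast add: tile the bias to the activation shape, then reuse the same add.
pub fn add_bias(x: &[f32], bias: &[f32]) -> Vec<f32> {
    assert!(!bias.is_empty() && x.len() % bias.len() == 0);
    let rows = x.len() / bias.len();
    add_elementwise(x, &tile_bias(bias, rows))
}
```

Tiling costs an extra buffer the size of the activations, but it lets the bias path reuse the same element-wise add rather than needing a dedicated broadcast kernel.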

BF16 precision analysis (docs/benchmarks/phase10-qwen3.md):
- Root cause: separate attention kernels materialize BF16 intermediates
  (QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path)
- HF's own SDPA and Eager paths also differ by ~0.125 logit
- xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% of cases
- Industry standard for BF16: top-5 overlap (we achieve 100%)
- Fix path: Flash Attention (Phase 14) to fuse attention in FP32
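
The effect of materializing BF16 intermediates can be reproduced on the CPU with a small sketch. This is not xserv's attention code. `round_bf16` emulates a round-to-nearest-even conversion from FP32 to BF16, and `softmax_row` is a hypothetical helper that either keeps FP32 throughout or rounds after each stage. The mask step is omitted for brevity.

```rust
/// Round an f32 to the nearest BF16 value (ties to even), returned as f32.
pub fn round_bf16(x: f32) -> f32 {
    if x.is_nan() {
        return x;
    }
    let bits = x.to_bits();
    // BF16 keeps the top 16 bits; add a rounding bias before truncating.
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    f32::from_bits(rounded & 0xFFFF_0000)
}

/// Scale then softmax over one row of attention scores.
/// With `bf16_intermediates`, every stage output is rounded to BF16,
/// mimicking separate kernels that write BF16 tensors between stages.
pub fn softmax_row(scores: &[f32], scale: f32, bf16_intermediates: bool) -> Vec<f32> {
    let q = |v: f32| if bf16_intermediates { round_bf16(v) } else { v };
    // Stage 1: QK^T output (rounded), then scale (rounded again).
    let scaled: Vec<f32> = scores.iter().map(|&s| q(q(s) * scale)).collect();
    // Stage 2: numerically stable softmax, output rounded once more.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| q(e / sum)).collect()
}

/// Largest absolute difference between the two paths for one row.
pub fn max_abs_diff(scores: &[f32], scale: f32) -> f32 {
    let fp32 = softmax_row(scores, scale, false);
    let bf16 = softmax_row(scores, scale, true);
    fp32.iter()
        .zip(&bf16)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0, f32::max)
}
```

Each rounding step contributes a relative error of up to about 2^-9, so a chain of separately materialized stages can accumulate more error than a fused path that rounds only once at the end.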

Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 11:35:26 +08:00

activation

phase 10: GPU add/mul kernels + BF16 precision analysis

2026-05-22 11:35:26 +08:00

attention

phase 5: naive multi-head attention

2026-05-21 21:17:23 +08:00

embedding

phase 4: transformer core kernels

2026-05-21 21:07:24 +08:00

gemm

phase 3: GEMM kernels (naive, tiled, cuBLAS)