Files
xserv/csrc/embedding
Gahow Wang c8e8153702 phase 4: transformer core kernels
CUDA kernels (csrc/):
- common.cuh: shared warp_reduce_sum/max, block_reduce_sum/max
- normalization/rmsnorm.cu: RMSNorm (F32 + BF16)
- normalization/layernorm.cu: LayerNorm with Welford (F32 + BF16)
- activation/activations.cu: GELU tanh-approx + SiLU (F32 + BF16)
- reduce/softmax.cu: safe softmax, 3-pass (F32 + BF16)
- embedding/embedding.cu: gather lookup (F32 + BF16)
- embedding/rope.cu: RoPE in-place + precomputed cos/sin cache (F32 + BF16)

Rust wrappers (xserv-kernels/src/):
- rmsnorm.rs, layernorm.rs, activation.rs, softmax.rs, embedding.rs, rope.rs
- RopeCache struct with GPU-side precomputation

Tests: 12 new tests (ops_test.rs), all passing with good precision:
- F32: max_err 1e-6 ~ 1e-9
- BF16: max_err 2e-3 ~ 7e-3
Total: 29 kernel tests + 27 prior = 56 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:07:24 +08:00
..
2026-05-21 21:07:24 +08:00
2026-05-21 21:07:24 +08:00