2be27d6d9454352e73ed133318521c7373e93f5c
New CUDA kernels (csrc/embedding/transpose.cu):
- reshape_heads_bf16: [S, H*D] → [1, H, S, D]
- merge_heads_bf16: [1, H, S, D] → [S, H*D]
- transpose_hsd_to_shd_bf16: [1, H, S, D] → [S, H, D] (for RoPE)
- transpose_shd_to_hsd_bf16: [S, H, D] → [1, H, S, D] (from RoPE)
- repeat_kv_bf16: [1, KV_H, S, D] → [1, KV_H*n_rep, S, D]

Rust wrappers (xserv-kernels/src/transpose.rs):
- reshape_heads_gpu, merge_heads_gpu, transpose_for/from_rope_gpu, repeat_kv_gpu

Qwen3 forward_gpu_cache now uses all GPU kernels, with zero CPU data round-trips.

Result: 50/50 self-consistent, 3-5% faster (TBT 142→137ms)

Remaining bottleneck: ~900 device::synchronize() calls + 252 cuBLAS handle creations per token (Phase 15 targets)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
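
The layout changes listed above can be checked against a small CPU reference. The sketch below is not the code from this commit: the function names (`reshape_heads_ref`, `merge_heads_ref`, `repeat_kv_ref`) are hypothetical, it uses `f32` rather than bf16, and it assumes contiguous row-major tensors. For `repeat_kv` it also assumes each KV head is repeated `n_rep` times consecutively (the usual repeat-interleave convention), which the commit message does not state. Under the contiguous-layout assumption, [S, H*D] and [S, H, D] share the same memory order, so `merge_heads_ref` also describes the index mapping of the [1, H, S, D] → [S, H, D] transpose.

```rust
/// CPU reference (assumed semantics): [S, H*D] -> [1, H, S, D], row-major.
fn reshape_heads_ref(src: &[f32], s: usize, h: usize, d: usize) -> Vec<f32> {
    assert_eq!(src.len(), s * h * d);
    let mut dst = vec![0.0; s * h * d];
    for si in 0..s {
        for hi in 0..h {
            for di in 0..d {
                // Source offset in [S, H*D]; destination offset in [H, S, D].
                dst[(hi * s + si) * d + di] = src[si * h * d + hi * d + di];
            }
        }
    }
    dst
}

/// CPU reference (assumed semantics): [1, H, S, D] -> [S, H*D], the inverse mapping.
fn merge_heads_ref(src: &[f32], s: usize, h: usize, d: usize) -> Vec<f32> {
    assert_eq!(src.len(), s * h * d);
    let mut dst = vec![0.0; s * h * d];
    for hi in 0..h {
        for si in 0..s {
            for di in 0..d {
                dst[si * h * d + hi * d + di] = src[(hi * s + si) * d + di];
            }
        }
    }
    dst
}

/// CPU reference (assumed ordering): [1, KV_H, S, D] -> [1, KV_H*n_rep, S, D].
/// Output head `o` is a copy of KV head `o / n_rep`.
fn repeat_kv_ref(src: &[f32], kv_h: usize, n_rep: usize, s: usize, d: usize) -> Vec<f32> {
    assert_eq!(src.len(), kv_h * s * d);
    let block = s * d; // elements per head
    let mut dst = Vec::with_capacity(kv_h * n_rep * block);
    for kh in 0..kv_h {
        for _ in 0..n_rep {
            dst.extend_from_slice(&src[kh * block..(kh + 1) * block]);
        }
    }
    dst
}
```

A reference like this is mainly useful for comparing GPU output element by element on small shapes; `merge_heads_ref(&reshape_heads_ref(x, s, h, d), s, h, d)` should return `x` unchanged.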
Languages: Rust 67.5%, Python 15.1%, CUDA 13.5%, Shell 3.9%