perf: GPU transpose/reshape/repeat_kv kernels (eliminate CPU round-trips)
New CUDA kernels (csrc/embedding/transpose.cu):
- reshape_heads_bf16: [S, H*D] → [1, H, S, D]
- merge_heads_bf16: [1, H, S, D] → [S, H*D]
- transpose_hsd_to_shd_bf16: [1, H, S, D] → [S, H, D] (for RoPE)
- transpose_shd_to_hsd_bf16: [S, H, D] → [1, H, S, D] (from RoPE)
- repeat_kv_bf16: [1, KV_H, S, D] → [1, KV_H*n_rep, S, D]
Rust wrappers (xserv-kernels/src/transpose.rs):
- reshape_heads_gpu, merge_heads_gpu, transpose_for/from_rope_gpu, repeat_kv_gpu
Qwen3 forward_gpu_cache now uses all GPU kernels — zero CPU data round-trips.
Result: 50/50 self-consistent, 3-5% faster (TBT 142→137ms)
Remaining bottleneck: ~900 device::synchronize() calls + 252 cuBLAS handle
creations per token (Phase 15 targets)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>