Files
xserv/csrc
Gahow Wang 2be27d6d94 perf: GPU transpose/reshape/repeat_kv kernels (eliminate CPU round-trips)
New CUDA kernels (csrc/embedding/transpose.cu):
- reshape_heads_bf16: [S, H*D] → [1, H, S, D]
- merge_heads_bf16: [1, H, S, D] → [S, H*D]
- transpose_hsd_to_shd_bf16: [1, H, S, D] → [S, H, D] (for RoPE)
- transpose_shd_to_hsd_bf16: [S, H, D] → [1, H, S, D] (from RoPE)
- repeat_kv_bf16: [1, KV_H, S, D] → [1, KV_H*n_rep, S, D]

Rust wrappers (xserv-kernels/src/transpose.rs):
- reshape_heads_gpu, merge_heads_gpu, transpose_for/from_rope_gpu, repeat_kv_gpu

Qwen3 forward_gpu_cache now uses all GPU kernels — zero CPU data round-trips.

Result: 50/50 self-consistent, 3-5% faster (TBT 142→137ms)
Remaining bottleneck: ~900 device::synchronize() calls + 252 cuBLAS handle
creations per token (Phase 15 targets)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 12:01:07 +08:00
..
2026-05-21 21:07:24 +08:00
2026-05-21 21:07:24 +08:00