Files
xserv/crates
Gahow Wang d5532ef209 phase 15: Tensor::empty + CUDA Graph infra — 50.3 tok/s (140% of HF, 45% roofline)
Two optimizations:

1. Tensor::empty() — skip cudaMemset for output tensors
   All kernel wrappers that fully overwrite their output now use
   Tensor::empty() instead of Tensor::zeros(). Eliminates ~756
   cudaMemset calls per decode step (21 per layer × 36 layers).
   Improvement: 46.6 → 50.3 tok/s (+8%).

2. CUDA Graph infrastructure (for future use)
   Added FFI bindings (cudaStreamBeginCapture, cudaGraphInstantiate,
   cudaGraphLaunch) and RAII CudaGraph wrapper. Not yet used in the
   forward pass due to variable kv_len, but provides foundation for
   future graph-based decode optimization.

Ablation (dash5, RTX 5090, Qwen3-8B BF16, serial decode):

| Optimization | tok/s | vs HF | Roofline |
|-------------|-------|-------|----------|
| Phase 14 baseline | 12.9 | 36% | 12% |
| + Fused kernels | 13.2 | 37% | 12% |
| + Batched decode | 13.2 (serial) | 37% | 12% |
| + Custom GEMV | 46.6 | 130% | 42% |
| + Tensor::empty | 50.3 | 140% | 45% |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 23:57:34 +08:00
..