phase 15: Tensor::empty + CUDA Graph infra — 50.3 tok/s (140% of HF, 45% roofline)

Two optimizations:

1. Tensor::empty() — skip cudaMemset for output tensors
Every kernel wrapper that fully overwrites its output now allocates it
with Tensor::empty() instead of Tensor::zeros(), since the zero-fill is
redundant when the kernel writes every element. This eliminates ~756
cudaMemset calls per decode step (21 per layer × 36 layers).
Improvement: 46.6 → 50.3 tok/s (+8%).
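
A minimal host-side sketch of the idea, not the actual implementation:
`Tensor` here wraps a `Vec<f32>` instead of device memory, the `memsets`
counter stands in for cudaMemset calls, and `add_kernel` and
`memsets_per_decode_step` are hypothetical names. It assumes each of the
21 per-layer memsets corresponds to one output allocation.

```rust
/// Hypothetical host-side stand-in for a device tensor.
pub struct Tensor {
    data: Vec<f32>,
}

impl Tensor {
    /// Allocate and zero-fill. The counter models one cudaMemset call.
    pub fn zeros(len: usize, memsets: &mut usize) -> Self {
        *memsets += 1;
        Tensor { data: vec![0.0; len] }
    }

    /// Allocate only. The caller promises to overwrite every element.
    pub fn empty(len: usize) -> Self {
        Tensor { data: Vec::with_capacity(len) }
    }
}

/// Stand-in for a kernel wrapper that fully overwrites its output.
pub fn add_kernel(a: &[f32], b: &[f32], out: &mut Tensor) {
    out.data.clear();
    out.data.extend(a.iter().zip(b.iter()).map(|(x, y)| x + y));
}

/// Count zero-fills for one simulated decode step (36 layers × 21 outputs).
pub fn memsets_per_decode_step(use_empty: bool) -> usize {
    const LAYERS: usize = 36;
    const OUTPUTS_PER_LAYER: usize = 21;
    let (a, b) = ([1.0f32, 2.0], [3.0f32, 4.0]);
    let mut memsets = 0;
    for _ in 0..LAYERS * OUTPUTS_PER_LAYER {
        let mut out = if use_empty {
            Tensor::empty(a.len())
        } else {
            Tensor::zeros(a.len(), &mut memsets)
        };
        add_kernel(&a, &b, &mut out);
        // The result is identical either way: the kernel wrote every element.
        assert_eq!(out.data, vec![4.0f32, 6.0]);
    }
    memsets
}
```

The switch is only safe for wrappers whose kernel writes every output
element; anything that accumulates into its output still needs zeros().
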
2. CUDA Graph infrastructure (for future use)
Added FFI bindings (cudaStreamBeginCapture, cudaGraphInstantiate,
cudaGraphLaunch) and an RAII CudaGraph wrapper. Not yet used in the
forward pass because kv_len changes on every decode step, but it
provides the foundation for future graph-based decode optimization.
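
A mock sketch of the RAII shape and of one likely reason variable kv_len
gets in the way: a captured graph replays kernel launches with the
arguments recorded at capture time. Nothing here calls CUDA.
`MockRuntime`, `AttentionLaunch`, and `replay_after_growth` are
hypothetical, and this `CudaGraph` is an illustration, not the wrapper
added in this commit.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for the CUDA runtime. It only tracks how many
/// graphs are alive so that the RAII behaviour is observable.
#[derive(Default)]
pub struct MockRuntime {
    live_graphs: Cell<usize>,
}

/// One recorded kernel launch. Arguments are copied at capture time.
#[derive(Clone, Copy)]
pub struct AttentionLaunch {
    pub kv_len: usize,
}

/// RAII wrapper: the graph is created in `capture` and released in `Drop`.
pub struct CudaGraph {
    rt: Rc<MockRuntime>,
    launches: Vec<AttentionLaunch>,
}

impl CudaGraph {
    /// Mock analogue of begin capture, end capture, and instantiate.
    pub fn capture(
        rt: Rc<MockRuntime>,
        record: impl FnOnce(&mut Vec<AttentionLaunch>),
    ) -> Self {
        let mut launches = Vec::new();
        record(&mut launches);
        rt.live_graphs.set(rt.live_graphs.get() + 1);
        CudaGraph { rt, launches }
    }

    /// Mock analogue of graph launch: replays the recorded launches
    /// unchanged and returns the kv_len each replayed kernel would see.
    pub fn launch(&self) -> Vec<usize> {
        self.launches.iter().map(|l| l.kv_len).collect()
    }
}

impl Drop for CudaGraph {
    fn drop(&mut self) {
        // Mock analogue of destroying the graph and its executable.
        self.rt.live_graphs.set(self.rt.live_graphs.get() - 1);
    }
}

/// Capture at kv_len = 10, grow the cache by one token, then replay.
/// Returns (kv_len seen by replay, current kv_len, live graphs while the
/// wrapper is in scope, live graphs after it is dropped).
pub fn replay_after_growth() -> (Vec<usize>, usize, usize, usize) {
    let rt = Rc::new(MockRuntime::default());
    let mut kv_len = 10;
    let (seen, live_inside) = {
        let graph = CudaGraph::capture(Rc::clone(&rt), |g| {
            g.push(AttentionLaunch { kv_len })
        });
        kv_len += 1; // next decode step: the KV cache grew by one token
        (graph.launch(), rt.live_graphs.get())
    }; // `graph` is dropped here, releasing the mock resource
    (seen, kv_len, live_inside, rt.live_graphs.get())
}
```

In this mock the replay still sees kv_len = 10 after the cache has grown
to 11, so a graph-based decode path would need to update the recorded
arguments or read kv_len from device memory on each step.
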

Ablation (dash5, RTX 5090, Qwen3-8B BF16, serial decode):

| Optimization | tok/s | vs HF | Roofline |
|-------------|-------|-------|----------|
| Phase 14 baseline | 12.9 | 36% | 12% |
| + Fused kernels | 13.2 | 37% | 12% |
| + Batched decode | 13.2 (serial) | 37% | 12% |
| + Custom GEMV | 46.6 | 130% | 42% |
| + Tensor::empty | 50.3 | 140% | 45% |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>