kernel-lab/kernels
Gahow Wang 165a1b0bd5 Implement all 5 Triton kernel labs
- vector_add: basic masked load/store with block indexing
- row_softmax: single-pass numerically stable softmax per row
- tiled_matmul: K-dimension tile loop with edge masking (IEEE precision)
- online_softmax: two-pass running max/sum recurrence across blocks
- flash_attention_fwd: blockwise Q/K/V with online softmax, causal support
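
The running max/sum recurrence behind online_softmax and flash_attention_fwd can be sketched in plain Python. This is a CPU illustration under stated assumptions, not the Triton kernels from this commit: the function name `online_softmax` and the `block_size` parameter are hypothetical, and a Python list stands in for a row read block by block.

```python
import math


def online_softmax(row, block_size=4):
    """Softmax of `row`, reading the input in chunks of `block_size` (illustrative only)."""
    running_max = float("-inf")
    running_sum = 0.0

    # Pass 1: fold each block into the running max and the rescaled running sum.
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]  # last block may be shorter (edge case)
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max before adding this block's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max

    # Pass 2: normalize every element with the final max and sum.
    return [math.exp(x - running_max) / running_sum for x in row]
```

Subtracting the running max before exponentiating keeps every exponent at or below zero, so large inputs do not overflow. The rescale factor `exp(running_max - new_max)` is what allows the sum to be corrected without revisiting earlier blocks.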

All 26 tests pass on RTX 5090 (CUDA 12.8, Triton 3.6).
2026-05-15 20:46:04 +08:00