|
|
165a1b0bd5
|
Implement all 5 Triton kernel labs
- vector_add: basic masked load/store with block indexing
- row_softmax: single-pass numerically stable softmax per row
- tiled_matmul: K-dimension tile loop with edge masking (IEEE precision)
- online_softmax: two-pass running max/sum recurrence across blocks
- flash_attention_fwd: blockwise Q/K/V with online softmax, causal support
All 26 tests pass on RTX 5090 (CUDA 12.8, Triton 3.6).
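Note on the online_softmax recurrence: the sketch below is a plain-Python illustration of the running max/sum update, not the Triton kernel from this commit. The function name `online_softmax` and the `block_size` parameter are chosen here for illustration. The first pass scans the row block by block and rescales the accumulated sum whenever the running max grows; the second pass normalizes.

```python
import math


def online_softmax(row, block_size=4):
    """Softmax of one row, scanned block by block (illustrative sketch)."""
    m = float("-inf")  # running max over the blocks seen so far
    s = 0.0            # running sum of exp(x - m) over the blocks seen so far

    # Pass 1: update the running max and sum one block at a time.
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        m_new = max(m, max(block))
        # Rescale the old sum to the new max, then add this block's terms.
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in block)
        m = m_new

    # Pass 2: normalize with the final max and sum.
    return [math.exp(x - m) / s for x in row]


if __name__ == "__main__":
    probs = online_softmax([1.0, 2.0, 3.0, 4.0, 5.0], block_size=2)
    print(probs, sum(probs))
```

The rescaling factor exp(m - m_new) is what lets each block be processed without knowing the global max in advance; flash_attention_fwd applies the same correction to its output accumulator.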
|
2026-05-15 20:46:04 +08:00 |
|