kernel-lab

Files

Gahow Wang 165a1b0bd5 Implement all 5 Triton kernel labs

- vector_add: basic masked load/store with block indexing
- row_softmax: single-pass numerically stable softmax per row
- tiled_matmul: K-dimension tile loop with edge masking (IEEE precision)
- online_softmax: two-pass running max/sum recurrence across blocks
- flash_attention_fwd: blockwise Q/K/V with online softmax, causal support

All 26 tests pass on RTX 5090 (CUDA 12.8, Triton 3.6).
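
The running max/sum recurrence named for online_softmax (and reused in flash_attention_fwd) can be sketched in plain Python. This is a CPU reference under stated assumptions, not the repository's Triton kernel: the function name `online_softmax_reference` and the `block_size` parameter are hypothetical, chosen only for illustration.

```python
import math


def online_softmax_reference(row, block_size=4):
    """Softmax of one row using a blockwise running max/sum recurrence.

    Hypothetical CPU reference; the name and block_size are illustrative
    assumptions, not taken from the repository.
    """
    running_max = float("-inf")
    running_sum = 0.0

    # Pass 1: fold each block into the running statistics.
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]  # the last block may be shorter
        new_max = max(running_max, max(block))
        # Rescale the previous sum to the new max, then add this block's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max

    # Pass 2: normalize every element with the final max and sum.
    return [math.exp(x - running_max) / running_sum for x in row]
```

Rescaling the previous sum by `exp(running_max - new_max)` keeps every exponent at or below zero, so the sum cannot overflow even when a later block contains a much larger value.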

2026-05-15 20:46:04 +08:00

Name         Last commit                          Date
cuda         Initial project scaffold             2026-04-10 13:22:19 +00:00
triton       Implement all 5 Triton kernel labs   2026-05-15 20:46:04 +08:00
__init__.py  Initial project scaffold             2026-04-10 13:22:19 +00:00