kernel-lab

2 Commits 2 Branches 0 Tags

Author	SHA1	Message	Date
Gahow Wang	165a1b0bd5	Implement all 5 Triton kernel labs	2026-05-15 20:46:04 +08:00
Gahow Wang	7fa69b1354	Initial project scaffold	2026-04-10 13:22:19 +00:00

Implement all 5 Triton kernel labs

- vector_add: basic masked load/store with block indexing
- row_softmax: single-pass numerically stable softmax per row
- tiled_matmul: K-dimension tile loop with edge masking (IEEE precision)
- online_softmax: two-pass running max/sum recurrence across blocks
- flash_attention_fwd: blockwise Q/K/V with online softmax, causal support

All 26 tests pass on RTX 5090 (CUDA 12.8, Triton 3.6).
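
The running max/sum recurrence named in the online_softmax and flash_attention_fwd items can be sketched in plain Python. This is an illustrative sketch only, not the repository's Triton kernel: the function names `stable_softmax` and `online_softmax` and the `block_size` parameter are assumptions made for this example, and a Python list stands in for one row of a tensor.

```python
import math


def stable_softmax(xs):
    """Reference softmax: subtract the row max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax(xs, block_size=4):
    """Blockwise softmax with a running max and a rescaled running sum.

    Hypothetical helper for illustration; block_size plays the role of
    the tile width a GPU kernel would process per step.
    """
    running_max = float("-inf")
    running_sum = 0.0
    # Pass 1: fold each block into the running statistics.
    for start in range(0, len(xs), block_size):
        block = xs[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the old sum to the new max, then add this block's terms.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(x - new_max) for x in block
        )
        running_max = new_max
    # Pass 2: normalize every element with the final max and sum.
    return [math.exp(x - running_max) / running_sum for x in xs]


row = [1.0, 3.5, -2.0, 1000.0, 999.0, 0.25, 7.0]
blockwise = online_softmax(row, block_size=3)
reference = stable_softmax(row)
```

Under these assumptions `blockwise` and `reference` agree to floating-point tolerance even though the row contains values near 1000, because no exponent is ever taken of a positive number. A blockwise attention kernel can apply the same rescaling factor to its partial output accumulator as well as to the running sum.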
