Initial project scaffold
# FlashAttention Notes

FlashAttention-style kernels are useful because the naive attention pipeline materializes the full `N × N` score matrix off chip and spends most of its memory bandwidth moving that tensor around rather than computing with it.
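
For concreteness, here is a minimal NumPy sketch of that naive pipeline (the function name and shapes are illustrative, not from this repo). Note that both `S` and `P` below are full `(N, N)` tensors:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Naive attention: materializes the full (N, N) score matrix.

    Q, K, V: float arrays of shape (N, d). Both S and P below are
    full (N, N) tensors -- the traffic the blockwise approach avoids.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)               # (N, N) scores, written out in full
    S = S - S.max(axis=-1, keepdims=True)  # subtract row max for stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)  # (N, N) softmax, written out in full
    return P @ V                           # (N, d) output
```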
## The Core Idea

Instead of:

1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`

you process attention block by block and keep more intermediate state on chip.
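
A NumPy sketch of that blockwise structure follows; `block_attention`, `Br`, and `Bc` are illustrative names (not from this repo's skeletons), and the per-row state `m`, `l`, `acc` stands in for what a real kernel keeps in registers or shared memory:

```python
import numpy as np

def block_attention(Q, K, V, Br=16, Bc=16):
    """Blockwise attention forward pass (a FlashAttention-style sketch).

    For each query block, K/V are streamed in column blocks. Only the
    running max `m`, running denominator `l`, and output accumulator
    `acc` persist per row -- the (N, N) score matrix never exists.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.empty_like(Q)
    for i in range(0, N, Br):
        Qi = Q[i:i + Br]                        # (Br, d) query tile
        m = np.full(len(Qi), -np.inf)           # running row max
        l = np.zeros(len(Qi))                   # running sum of exponentials
        acc = np.zeros_like(Qi)                 # unnormalized output
        for j in range(0, N, Bc):
            S = Qi @ K[j:j + Bc].T * scale      # only a (Br, Bc) score tile
            m_new = np.maximum(m, S.max(axis=-1))
            alpha = np.exp(m - m_new)           # rescale factor for old state
            P = np.exp(S - m_new[:, None])      # tile's unnormalized softmax
            l = l * alpha + P.sum(axis=-1)
            acc = acc * alpha[:, None] + P @ V[j:j + Bc]
            m = m_new
        O[i:i + Br] = acc / l[:, None]          # normalize once at the end
    return O
```

On small random shapes this should match `naive_attention` above up to floating-point error, which is exactly the correctness-first check this lab starts from.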
## Why Online Softmax Matters

Blockwise processing changes the normalization problem: softmax needs the max and the sum of exponentials over the full row of scores, but each block only shows you a slice of that row. The running max / running sum recurrence lets you update the normalization state incrementally as blocks arrive, without losing numerical stability.
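
Stripped down to a single row of scores arriving in chunks, the recurrence looks like this (a sketch; the function name is mine and `chunks` is any iterable of 1-D score arrays):

```python
import numpy as np

def online_softmax_state(chunks):
    """Running-max / running-sum recurrence over chunks of one score row.

    Invariant after each step: l == sum(exp(x - m)) over everything seen,
    with m == max seen so far. When a chunk raises the max, the old sum
    is rescaled by exp(m_old - m_new) so it stays relative to the new max.
    """
    m, l = -np.inf, 0.0
    for x in chunks:
        m_new = max(m, x.max())
        l = l * np.exp(m - m_new) + np.exp(x - m_new).sum()
        m = m_new
    return m, l

# Matches the one-shot computation on the concatenated row:
#   row = np.concatenate(chunks)
#   assert np.isclose(m, row.max())
#   assert np.isclose(l, np.exp(row - row.max()).sum())
```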
## What This Lab Covers

- forward pass only
- small-shape correctness first
- optional causal masking (see the masking sketch after this list)
- side-by-side Triton and CUDA skeletons
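
One way the causal variant falls out of the blockwise loop (illustrative, continuing the `block_attention` sketch above): key tiles that lie entirely in the future can be skipped outright, and only tiles straddling the diagonal need elementwise masking.

```python
import numpy as np

def causal_tile_mask(i, j, Br, Bc):
    """Boolean mask for a (Br, Bc) score tile whose query rows start at i
    and key columns start at j: entry (r, c) is kept iff key index j + c
    is at most query index i + r. (Assumes full tiles for simplicity.)"""
    rows = i + np.arange(Br)[:, None]   # absolute query indices, (Br, 1)
    cols = j + np.arange(Bc)[None, :]   # absolute key indices, (1, Bc)
    return cols <= rows

# Inside block_attention's inner loop over j:
#   if j > i + Br - 1:
#       break                           # this tile and all later ones are future
#   S = np.where(causal_tile_mask(i, j, Br, Bc), S, -np.inf)
#   ... then the online-softmax update proceeds unchanged.
```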
This repo intentionally stops short of a polished, production-grade FlashAttention implementation. The point is to expose the algorithmic structure.