Initial project scaffold

2026-04-10 13:22:19 +00:00
commit 7fa69b1354
94 changed files with 3964 additions and 0 deletions

# FlashAttention Notes
FlashAttention-style kernels are useful because the naive attention pipeline materializes the full score matrix in off-chip memory and spends most of its memory bandwidth moving it, rather than on the matrix multiplies themselves.
## The Core Idea
Instead of:
1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`
you process attention block by block and keep more intermediate state on chip.
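The contrast can be sketched in NumPy. This is an illustrative sketch, not the lab's code: the function names, the `block` parameter, and the single-head 2-D shapes are all assumptions made for clarity.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) score matrix -- the bandwidth cost described above.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def blockwise_attention(Q, K, V, block=4):
    # Processes keys/values one block at a time; only (n, block) scores exist at once.
    n, d = Q.shape
    out = np.zeros_like(V, dtype=np.float64)
    m = np.full(n, -np.inf)        # running row max
    l = np.zeros(n)                # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, n, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                   # scores for this key block only
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)              # rescales previously accumulated state
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=-1)
        out = out * alpha[:, None] + P @ Vj
        m = m_new
    return out / l[:, None]
```

Both functions compute the same result; the blockwise version just never holds more than one block of scores at a time, which is what makes the on-chip version of the loop possible.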
## Why Online Softmax Matters
Blockwise processing changes the normalization problem: a row's softmax denominator depends on every score in that row, so after seeing only one block you cannot normalize. The running max / running sum recurrence lets you update the normalization state incrementally, rescaling previously accumulated terms whenever a new maximum appears, without losing numerical stability.
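The recurrence for a single row can be written in a few lines. A minimal sketch, assuming the standard running-max / running-sum formulation (the function name is illustrative):

```python
import numpy as np

def online_softmax_stats(xs):
    # Streams through scores one at a time, maintaining (m, l) such that the
    # final softmax probabilities are exp(x_i - m) / l.
    m, l = -np.inf, 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum into the new max's reference frame, add the new term.
        l = l * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return m, l
```

Because every exponent is of the form `x - m` with `m >= x`, nothing overflows, and the `exp(m - m_new)` factor is exactly the `alpha` correction a blockwise kernel applies to its partial sums and partial outputs.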
## What This Lab Covers
- forward pass only
- small-shape correctness first
- optional causal masking
- side-by-side Triton and CUDA skeletons
This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.
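For the optional causal masking mentioned above, the only subtlety in a blockwise kernel is converting block-local indices to global ones before comparing. A hypothetical helper, not taken from the repo:

```python
import numpy as np

def causal_mask_block(S, q_start, k_start):
    # S has shape (q_block, k_block); q_start/k_start are the global offsets
    # of this tile. Scores where key index > query index are set to -inf so
    # they contribute nothing after exponentiation.
    qi = q_start + np.arange(S.shape[0])[:, None]
    kj = k_start + np.arange(S.shape[1])[None, :]
    return np.where(kj <= qi, S, -np.inf)
```

In a real kernel you would also skip key blocks that lie entirely above the diagonal (`k_start > q_start + q_block - 1`) rather than masking them, since their contribution is identically zero.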