# FlashAttention Notes

FlashAttention-style kernels are useful because the naive attention pipeline materializes large score tensors and spends too much memory bandwidth moving them between HBM and on-chip memory.

## The Core Idea

Instead of:

1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`

you process attention block by block and keep more intermediate state on chip.

## Why Online Softmax Matters

Blockwise processing changes the normalization problem: you cannot assume you have seen the full row of scores when you normalize. The running max / running sum recurrence lets you update the normalization state incrementally without losing numerical stability (see the sketches at the end of these notes).

## What This Lab Covers

- forward pass only
- small-shape correctness first
- optional causal masking
- side-by-side Triton and CUDA skeletons

This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.
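## Appendix: Reference Sketches

A minimal NumPy sketch of the running max / running sum recurrence follows. It is a correctness aid, not one of the lab skeletons; the function name, the default block size, and the final full-array pass are illustrative choices, not anything defined by the repo.

```python
import numpy as np

def online_softmax(row, block_size=4):
    """Softmax over `row`, computed block by block with a running max and sum."""
    m = -np.inf   # running max of all scores seen so far
    s = 0.0       # running sum of exp(score - m), kept in the current max's frame
    for start in range(0, len(row), block_size):
        blk = row[start:start + block_size]
        m_new = max(m, blk.max())
        # rescale the old sum into the new max's frame, then fold in the new block
        s = s * np.exp(m - m_new) + np.exp(blk - m_new).sum()
        m = m_new
    # one final pass to emit normalized values; a fused kernel avoids this by
    # rescaling its partial output accumulator with the same correction factors
    return np.exp(row - m) / s

# small-shape correctness check against the naive two-pass softmax
row = np.random.randn(10)
ref = np.exp(row - row.max()) / np.exp(row - row.max()).sum()
assert np.allclose(online_softmax(row), ref)
```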
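Building on that recurrence, here is a small-shape reference for the blockwise forward pass, including the optional causal mask. It processes one query row at a time purely for readability; the actual Triton/CUDA skeletons tile queries as well. Function names, shapes, and the block size are illustrative assumptions, not part of the repo's code.

```python
import numpy as np

def naive_attention(Q, K, V, causal=False):
    """Reference: materialize the full score matrix, then softmax row-wise."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        S = np.where(np.tril(np.ones_like(S, dtype=bool)), S, -np.inf)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def blockwise_attention(Q, K, V, block=4, causal=False):
    """Blockwise forward pass: per query row, stream over K/V blocks while
    keeping only a running max, running sum, and partial output accumulator."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    for i in range(n):                           # one query row at a time, for clarity
        m, s = -np.inf, 0.0
        acc = np.zeros(d)
        for start in range(0, n, block):         # stream over key/value blocks
            k_idx = np.arange(start, min(start + block, n))
            scores = (Q[i] @ K[k_idx].T) * scale
            if causal:
                scores = np.where(k_idx <= i, scores, -np.inf)
            m_new = max(m, scores.max())
            if m_new == -np.inf:                 # fully masked block: nothing to add
                continue
            p = np.exp(scores - m_new)
            correction = np.exp(m - m_new)       # rescale old state into the new frame
            s = s * correction + p.sum()
            acc = acc * correction + p @ V[k_idx]
            m = m_new
        O[i] = acc / s
    return O

# small-shape correctness checks, with and without the causal mask
Q, K, V = (np.random.randn(8, 5) for _ in range(3))
assert np.allclose(blockwise_attention(Q, K, V), naive_attention(Q, K, V))
assert np.allclose(blockwise_attention(Q, K, V, causal=True),
                   naive_attention(Q, K, V, causal=True))
```

The design point the sketch tries to expose: the only state carried across key/value blocks is `(m, s, acc)`, which is why the full score matrix never has to leave on-chip memory in a fused kernel.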