# FlashAttention Notes

FlashAttention-style kernels are useful because the naive attention pipeline materializes large score tensors and spends most of its memory bandwidth moving them to and from off-chip memory instead of doing useful compute.

## The Core Idea
Instead of:

1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`

you process attention block by block and keep the intermediate state (the running max, running sum, and partial output) on chip, so the full score matrix never has to leave fast memory.
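For a concrete baseline, here is the naive pipeline as a small NumPy sketch; the function name `naive_attention` and the single-head `(seq_len, head_dim)` shapes are illustrative, not part of this repo's skeletons:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention for one head: Q, K, V are (seq_len, head_dim)."""
    d = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d)                 # full score matrix, seq_len x seq_len
    S = S - S.max(axis=-1, keepdims=True)      # shift rows for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)      # row-wise softmax
    return P @ V                               # weighted sum of values
```

This is correct, but `S` and `P` are full `seq_len x seq_len` tensors, which is exactly the traffic the blockwise version avoids.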
## Why Online Softmax Matters

Blockwise processing changes the normalization problem: a block never sees the full row of scores, so the softmax denominator cannot be computed in a single pass. The running max / running sum recurrence lets you update the normalization state incrementally as each block arrives, without losing numerical stability.
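A minimal sketch of that recurrence, again in NumPy with illustrative names (`m` is the running row max, `l` the running sum of exponentials, `acc` the unnormalized output); a real kernel keeps this state in registers or shared memory rather than in arrays:

```python
import numpy as np

def blockwise_attention(Q, K, V, block_size=32):
    """Forward pass processed in (query block, key block) tiles, no mask."""
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=float)

    for q0 in range(0, seq_len, block_size):
        q = Q[q0:q0 + block_size]
        m = np.full(q.shape[0], -np.inf)       # running row max
        l = np.zeros(q.shape[0])               # running sum of exp(scores - m)
        acc = np.zeros((q.shape[0], d))        # unnormalized output accumulator

        for k0 in range(0, seq_len, block_size):
            k = K[k0:k0 + block_size]
            v = V[k0:k0 + block_size]
            s = (q @ k.T) * scale              # scores for this tile only
            m_new = np.maximum(m, s.max(axis=-1))
            alpha = np.exp(m - m_new)          # rescale old state to the new max
            p = np.exp(s - m_new[:, None])
            l = l * alpha + p.sum(axis=-1)
            acc = acc * alpha[:, None] + p @ v
            m = m_new

        out[q0:q0 + block_size] = acc / l[:, None]
    return out
```

On small shapes the result should match the naive reference above to floating-point tolerance, which is the kind of check the lab starts with.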
## What This Lab Covers

- forward pass only
- small-shape correctness first
- optional causal masking (see the sketch after this list)
- side-by-side Triton and CUDA skeletons
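One possible shape for the optional causal mask, assuming each tile knows its absolute query and key offsets; the helper name is made up for this note and is not part of the skeletons:

```python
import numpy as np

def causal_block_mask(q0, k0, q_len, k_len):
    """True where key position <= query position for one tile.

    Apply before the online-softmax update, e.g.
    s = np.where(causal_block_mask(q0, k0, *s.shape), s, -np.inf)
    """
    q_idx = np.arange(q0, q0 + q_len)[:, None]   # absolute query positions
    k_idx = np.arange(k0, k0 + k_len)[None, :]   # absolute key positions
    return k_idx <= q_idx
```

In a real kernel you would also skip key blocks that lie entirely in the future rather than masking them.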
This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.