Initial project scaffold
# FlashAttention Notes

FlashAttention-style kernels are useful because the naive attention pipeline materializes the full `N × N` score matrix off chip and spends most of its memory bandwidth moving that tensor around rather than computing with it.
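
For concreteness, here is a minimal NumPy sketch of that naive pipeline (the function name and shapes are illustrative, not from this repo). Note that both `S` and `P` below are full `(N, N)` tensors:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Naive attention: materializes the full (N, N) score matrix.

    Q, K, V: float arrays of shape (N, d). Both S and P below are
    full (N, N) tensors -- the traffic the blockwise approach avoids.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)               # (N, N) scores, written out in full
    S = S - S.max(axis=-1, keepdims=True)  # subtract row max for stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)  # (N, N) softmax, written out in full
    return P @ V                           # (N, d) output
```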
## The Core Idea

Instead of:

1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`

you process attention block by block and keep more intermediate state on chip.
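
A NumPy sketch of that blockwise structure follows; `block_attention`, `Br`, and `Bc` are illustrative names (not from this repo's skeletons), and the per-row state `m`, `l`, `acc` stands in for what a real kernel keeps in registers or shared memory:

```python
import numpy as np

def block_attention(Q, K, V, Br=16, Bc=16):
    """Blockwise attention forward pass (a FlashAttention-style sketch).

    For each query block, K/V are streamed in column blocks. Only the
    running max `m`, running denominator `l`, and output accumulator
    `acc` persist per row -- the (N, N) score matrix never exists.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.empty_like(Q)
    for i in range(0, N, Br):
        Qi = Q[i:i + Br]                        # (Br, d) query tile
        m = np.full(len(Qi), -np.inf)           # running row max
        l = np.zeros(len(Qi))                   # running sum of exponentials
        acc = np.zeros_like(Qi)                 # unnormalized output
        for j in range(0, N, Bc):
            S = Qi @ K[j:j + Bc].T * scale      # only a (Br, Bc) score tile
            m_new = np.maximum(m, S.max(axis=-1))
            alpha = np.exp(m - m_new)           # rescale factor for old state
            P = np.exp(S - m_new[:, None])      # tile's unnormalized softmax
            l = l * alpha + P.sum(axis=-1)
            acc = acc * alpha[:, None] + P @ V[j:j + Bc]
            m = m_new
        O[i:i + Br] = acc / l[:, None]          # normalize once at the end
    return O
```

On small random shapes this should match `naive_attention` above up to floating-point error, which is exactly the correctness-first check this lab starts from.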
## Why Online Softmax Matters

Blockwise processing changes the normalization problem: softmax needs the max and the sum of exponentials over the full row of scores, but each block only shows you a slice of that row. The running max / running sum recurrence lets you update the normalization state incrementally as blocks arrive, without losing numerical stability.
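
Stripped down to a single row of scores arriving in chunks, the recurrence looks like this (a sketch; the function name is mine and `chunks` is any iterable of 1-D score arrays):

```python
import numpy as np

def online_softmax_state(chunks):
    """Running-max / running-sum recurrence over chunks of one score row.

    Invariant after each step: l == sum(exp(x - m)) over everything seen,
    with m == max seen so far. When a chunk raises the max, the old sum
    is rescaled by exp(m_old - m_new) so it stays relative to the new max.
    """
    m, l = -np.inf, 0.0
    for x in chunks:
        m_new = max(m, x.max())
        l = l * np.exp(m - m_new) + np.exp(x - m_new).sum()
        m = m_new
    return m, l

# Matches the one-shot computation on the concatenated row:
#   row = np.concatenate(chunks)
#   assert np.isclose(m, row.max())
#   assert np.isclose(l, np.exp(row - row.max()).sum())
```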
## What This Lab Covers

- forward pass only
- small-shape correctness first
- optional causal masking (see the masking sketch after this list)
- side-by-side Triton and CUDA skeletons
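
One way the causal variant falls out of the blockwise loop (illustrative, continuing the `block_attention` sketch above): key tiles that lie entirely in the future can be skipped outright, and only tiles straddling the diagonal need elementwise masking.

```python
import numpy as np

def causal_tile_mask(i, j, Br, Bc):
    """Boolean mask for a (Br, Bc) score tile whose query rows start at i
    and key columns start at j: entry (r, c) is kept iff key index j + c
    is at most query index i + r. (Assumes full tiles for simplicity.)"""
    rows = i + np.arange(Br)[:, None]   # absolute query indices, (Br, 1)
    cols = j + np.arange(Bc)[None, :]   # absolute key indices, (1, Bc)
    return cols <= rows

# Inside block_attention's inner loop over j:
#   if j > i + Br - 1:
#       break                           # this tile and all later ones are future
#   S = np.where(causal_tile_mask(i, j, Br, Bc), S, -np.inf)
#   ... then the online-softmax update proceeds unchanged.
```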
This repo intentionally stops short of a polished, production-grade FlashAttention implementation. The point is to expose the algorithmic structure.