# FlashAttention Notes
FlashAttention-style kernels are useful because the naive attention pipeline materializes the full score matrix in off-chip memory and burns memory bandwidth moving it back and forth instead of doing useful compute.
## The Core Idea
Instead of:
1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`
you process attention one block of keys and values at a time and keep the intermediate state on chip (the naive pipeline is written out as a reference just below).
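A minimal NumPy sketch of that naive pipeline, for reference. This is illustrative only, not code from this repo; the name `naive_attention` is a placeholder.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention: materializes the full (n, n) score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                      # step 1: full score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True)) # steps 2-4: softmax over full rows
    P = P / P.sum(axis=-1, keepdims=True)
    return P @ V                                  # step 5: weighted sum of values
```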
## Why Online Softmax Matters
Blockwise processing changes the normalization problem: softmax needs the maximum and the sum of exponentials over a full row of scores, but a blockwise pass never sees the full row at once. The running max / running sum recurrence lets you update the normalization state incrementally, rescaling what you have already accumulated whenever a new block raises the row maximum, without losing numerical stability.
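A minimal NumPy sketch of the blockwise pass with that recurrence. Again this is illustrative, not the repo's Triton or CUDA code; `blocked_attention` and the `block` size are placeholder names, and `naive_attention` refers to the reference sketch above.

```python
import numpy as np

def blocked_attention(Q, K, V, block=64):
    """Blockwise attention forward pass using the online softmax recurrence.

    For each query row we keep a running max m, a running softmax
    denominator l, and an unnormalized output accumulator acc, updated
    one K/V block at a time.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)            # running row max
    l = np.zeros(n)                    # running softmax denominator
    acc = np.zeros((n, d))             # running unnormalized output

    for start in range(0, n, block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        S = (Q @ Kb.T) * scale         # scores for this block only

        m_new = np.maximum(m, S.max(axis=-1))
        # rescale previously accumulated state to the new max,
        # then add this block's contribution
        correction = np.exp(m - m_new)
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=-1)
        acc = acc * correction[:, None] + P @ Vb
        m = m_new

    return acc / l[:, None]

# Sanity check against the naive reference on a small shape.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
assert np.allclose(blocked_attention(Q, K, V), naive_attention(Q, K, V))
```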
## What This Lab Covers
- forward pass only
- small-shape correctness first
- optional causal masking (a block-level sketch appears at the end of these notes)
- side-by-side Triton and CUDA skeletons
This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.
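For completeness, a hedged sketch of how causal masking slots into the blockwise loop above. This is illustrative NumPy, not this repo's kernels; in a real kernel each query block would skip key blocks that lie entirely in its future rather than masking them.

```python
import numpy as np

def blocked_causal_attention(Q, K, V, block=64):
    """The blocked_attention sketch above, extended with a causal mask."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    q_idx = np.arange(n)[:, None]      # query positions
    m = np.full(n, -np.inf)
    l = np.zeros(n)
    acc = np.zeros((n, d))

    for start in range(0, n, block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        k_idx = np.arange(start, start + Kb.shape[0])[None, :]
        S = (Q @ Kb.T) * scale
        # Mask keys that lie in the query's future; a real per-query-block
        # kernel would skip entirely-masked key blocks instead.
        S = np.where(k_idx > q_idx, -np.inf, S)

        m_new = np.maximum(m, S.max(axis=-1))
        correction = np.exp(m - m_new)
        P = np.exp(S - m_new[:, None])
        l = l * correction + P.sum(axis=-1)
        acc = acc * correction[:, None] + P @ Vb
        m = m_new

    return acc / l[:, None]
```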