# FlashAttention Notes

FlashAttention-style kernels are useful because the naive attention pipeline materializes large score tensors and spends most of its memory bandwidth moving them to and from off-chip memory instead of doing useful compute.

## The Core Idea
Instead of:

1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`

you process attention block by block and keep the intermediate state (the running max, running sum, and partial output) on chip, so the full score matrix never has to leave fast memory.
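For a concrete baseline, here is the naive pipeline as a small NumPy sketch; the function name `naive_attention` and the single-head `(seq_len, head_dim)` shapes are illustrative, not part of this repo's skeletons:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention for one head: Q, K, V are (seq_len, head_dim)."""
    d = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d)                 # full score matrix, seq_len x seq_len
    S = S - S.max(axis=-1, keepdims=True)      # shift rows for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)      # row-wise softmax
    return P @ V                               # weighted sum of values
```

This is correct, but `S` and `P` are full `seq_len x seq_len` tensors, which is exactly the traffic the blockwise version avoids.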
## Why Online Softmax Matters

Blockwise processing changes the normalization problem: a block never sees the full row of scores, so the softmax denominator cannot be computed in a single pass. The running max / running sum recurrence lets you update the normalization state incrementally as each block arrives, without losing numerical stability.
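A minimal sketch of that recurrence, again in NumPy with illustrative names (`m` is the running row max, `l` the running sum of exponentials, `acc` the unnormalized output); a real kernel keeps this state in registers or shared memory rather than in arrays:

```python
import numpy as np

def blockwise_attention(Q, K, V, block_size=32):
    """Forward pass processed in (query block, key block) tiles, no mask."""
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=float)

    for q0 in range(0, seq_len, block_size):
        q = Q[q0:q0 + block_size]
        m = np.full(q.shape[0], -np.inf)       # running row max
        l = np.zeros(q.shape[0])               # running sum of exp(scores - m)
        acc = np.zeros((q.shape[0], d))        # unnormalized output accumulator

        for k0 in range(0, seq_len, block_size):
            k = K[k0:k0 + block_size]
            v = V[k0:k0 + block_size]
            s = (q @ k.T) * scale              # scores for this tile only
            m_new = np.maximum(m, s.max(axis=-1))
            alpha = np.exp(m - m_new)          # rescale old state to the new max
            p = np.exp(s - m_new[:, None])
            l = l * alpha + p.sum(axis=-1)
            acc = acc * alpha[:, None] + p @ v
            m = m_new

        out[q0:q0 + block_size] = acc / l[:, None]
    return out
```

On small shapes the result should match the naive reference above to floating-point tolerance, which is the kind of check the lab starts with.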
## What This Lab Covers

- forward pass only
- small-shape correctness first
- optional causal masking (see the sketch after this list)
- side-by-side Triton and CUDA skeletons
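One possible shape for the optional causal mask, assuming each tile knows its absolute query and key offsets; the helper name is made up for this note and is not part of the skeletons:

```python
import numpy as np

def causal_block_mask(q0, k0, q_len, k_len):
    """True where key position <= query position for one tile.

    Apply before the online-softmax update, e.g.
    s = np.where(causal_block_mask(q0, k0, *s.shape), s, -np.inf)
    """
    q_idx = np.arange(q0, q0 + q_len)[:, None]   # absolute query positions
    k_idx = np.arange(k0, k0 + k_len)[None, :]   # absolute key positions
    return k_idx <= q_idx
```

In a real kernel you would also skip key blocks that lie entirely in the future rather than masking them.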
This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.