FlashAttention Notes

FlashAttention-style kernels are useful because the naive attention pipeline materializes the full score matrix in off-chip memory and spends most of its bandwidth moving that tensor back and forth rather than computing on it.
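
For reference, here is a minimal NumPy sketch of that naive pipeline (single head, no masking; the function name and shapes are illustrative, not from this repo):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Naive attention: materializes the full (N, N) score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                 # (N, N) scores -- the tensor that
                                             # gets written out and read back
    S = S - S.max(axis=-1, keepdims=True)    # stabilize softmax
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)    # (N, N) probabilities
    return P @ V                             # (N, d) output
```

For sequence length N, the S and P tensors are O(N²) while Q, K, V, and the output are only O(N·d), so at long sequence lengths the N² traffic dominates.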

The Core Idea

Instead of:

  1. computing the full score matrix
  2. writing it out
  3. running softmax
  4. reading it back
  5. multiplying by V

you process attention block by block and keep the intermediate state on chip (sketched below).
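
A minimal sketch of that blockwise restructuring, again in NumPy (single head, no masking; block sizes and names are illustrative, and a real kernel keeps the per-block state in shared memory or registers rather than NumPy arrays):

```python
import numpy as np

def blockwise_attention(Q, K, V, block_q=64, block_k=64):
    """Blockwise attention forward pass with online softmax normalization."""
    N, d = Q.shape
    O = np.zeros((N, d))
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, N, block_q):
        q = Q[qs:qs + block_q] * scale              # (bq, d) query block
        m = np.full(q.shape[0], -np.inf)            # running row max
        l = np.zeros(q.shape[0])                    # running sum of exp
        acc = np.zeros((q.shape[0], d))             # unnormalized output
        for ks in range(0, N, block_k):
            s = q @ K[ks:ks + block_k].T            # (bq, bk) score block
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])          # probabilities vs new max
            alpha = np.exp(m - m_new)               # rescale old state
            l = alpha * l + p.sum(axis=-1)
            acc = alpha[:, None] * acc + p @ V[ks:ks + block_k]
            m = m_new
        O[qs:qs + block_q] = acc / l[:, None]       # normalize once at the end
    return O
```

For random inputs its output should match naive_attention above to within floating-point tolerance, which is exactly the small-shape correctness check this lab starts from.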

Why Online Softmax Matters

Blockwise processing changes the normalization problem: softmax normalizes each row over the full set of scores, but a block only ever sees part of that row. The running-max / running-sum recurrence lets you update the normalization state incrementally without losing numerical stability.
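
Written out (this is the standard statement of the online-softmax recurrence, matching the sketch above): with $m$ the running row max, $\ell$ the running sum of exponentials, and $o$ the unnormalized output accumulator, a new block of scores $s$ with values $v$ updates the state as

$$
m' = \max\Big(m,\ \max_j s_j\Big), \qquad
\ell' = e^{m - m'}\,\ell + \sum_j e^{s_j - m'}, \qquad
o' = e^{m - m'}\,o + \sum_j e^{s_j - m'}\, v_j,
$$

and the final output is $o / \ell$ after the last block. The $e^{m - m'}$ factor rescales everything accumulated so far whenever a new block raises the running max, which is what preserves stability.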

What This Lab Covers

  • forward pass only
  • small-shape correctness first
  • optional causal masking (see the masking sketch after this list)
  • side-by-side Triton and CUDA skeletons
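
For the causal case, one common approach (the helper below is a hypothetical sketch, not this repo's code) is to mask each score block so that keys after their query contribute $-\infty$ before the exponentiation:

```python
import numpy as np

def causal_mask_block(s, qs, ks):
    """Mask a (bq, bk) block of scores for causal attention.

    qs, ks: global offsets of the query and key blocks. Entries where the
    key position exceeds the query position are set to -inf so they
    vanish under exp().
    """
    qi = np.arange(qs, qs + s.shape[0])[:, None]   # global query indices
    ki = np.arange(ks, ks + s.shape[1])[None, :]   # global key indices
    return np.where(ki <= qi, s, -np.inf)
```

Applied to each score block before the exponentiation in the inner loop; key blocks whose offset satisfies ks > qs + bq - 1 lie entirely in the future and can be skipped outright.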

This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.