FlashAttention Notes

FlashAttention-style kernels are useful because the naive attention pipeline materializes the full score matrix in off-chip memory and spends most of its bandwidth moving that tensor back and forth rather than computing on it.
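
For reference, here is a minimal NumPy sketch of that naive pipeline (single head, no masking; the function name and shapes are illustrative, not from this repo):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Naive attention: materializes the full (N, N) score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                 # (N, N) scores -- the tensor that
                                             # gets written out and read back
    S = S - S.max(axis=-1, keepdims=True)    # stabilize softmax
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)    # (N, N) probabilities
    return P @ V                             # (N, d) output
```

For sequence length N, the S and P tensors are O(N²) while Q, K, V, and the output are only O(N·d), so at long sequence lengths the N² traffic dominates.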

The Core Idea

Instead of:

  1. computing the full score matrix
  2. writing it out
  3. running softmax
  4. reading it back
  5. multiplying by V

you process attention block by block and keep the intermediate state on chip (sketched below).
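
A minimal sketch of that blockwise restructuring, again in NumPy (single head, no masking; block sizes and names are illustrative, and a real kernel keeps the per-block state in shared memory or registers rather than NumPy arrays):

```python
import numpy as np

def blockwise_attention(Q, K, V, block_q=64, block_k=64):
    """Blockwise attention forward pass with online softmax normalization."""
    N, d = Q.shape
    O = np.zeros((N, d))
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, N, block_q):
        q = Q[qs:qs + block_q] * scale              # (bq, d) query block
        m = np.full(q.shape[0], -np.inf)            # running row max
        l = np.zeros(q.shape[0])                    # running sum of exp
        acc = np.zeros((q.shape[0], d))             # unnormalized output
        for ks in range(0, N, block_k):
            s = q @ K[ks:ks + block_k].T            # (bq, bk) score block
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])          # probabilities vs new max
            alpha = np.exp(m - m_new)               # rescale old state
            l = alpha * l + p.sum(axis=-1)
            acc = alpha[:, None] * acc + p @ V[ks:ks + block_k]
            m = m_new
        O[qs:qs + block_q] = acc / l[:, None]       # normalize once at the end
    return O
```

For random inputs its output should match naive_attention above to within floating-point tolerance, which is exactly the small-shape correctness check this lab starts from.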

Why Online Softmax Matters

Blockwise processing changes the normalization problem: softmax normalizes each row over the full set of scores, but a block only ever sees part of that row. The running-max / running-sum recurrence lets you update the normalization state incrementally without losing numerical stability.
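
Written out (this is the standard statement of the online-softmax recurrence, matching the sketch above): with $m$ the running row max, $\ell$ the running sum of exponentials, and $o$ the unnormalized output accumulator, a new block of scores $s$ with values $v$ updates the state as

$$
m' = \max\Big(m,\ \max_j s_j\Big), \qquad
\ell' = e^{m - m'}\,\ell + \sum_j e^{s_j - m'}, \qquad
o' = e^{m - m'}\,o + \sum_j e^{s_j - m'}\, v_j,
$$

and the final output is $o / \ell$ after the last block. The $e^{m - m'}$ factor rescales everything accumulated so far whenever a new block raises the running max, which is what preserves stability.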

What This Lab Covers

  • forward pass only
  • small-shape correctness first
  • optional causal masking (see the masking sketch after this list)
  • side-by-side Triton and CUDA skeletons
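
For the causal case, one common approach (the helper below is a hypothetical sketch, not this repo's code) is to mask each score block so that keys after their query contribute $-\infty$ before the exponentiation:

```python
import numpy as np

def causal_mask_block(s, qs, ks):
    """Mask a (bq, bk) block of scores for causal attention.

    qs, ks: global offsets of the query and key blocks. Entries where the
    key position exceeds the query position are set to -inf so they
    vanish under exp().
    """
    qi = np.arange(qs, qs + s.shape[0])[:, None]   # global query indices
    ki = np.arange(ks, ks + s.shape[1])[None, :]   # global key indices
    return np.where(ki <= qi, s, -np.inf)
```

Applied to each score block before the exponentiation in the inner loop; key blocks whose offset satisfies ks > qs + bq - 1 lie entirely in the future and can be skipped outright.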

This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.