Design doc for the hand-written, single fused flash-attention kernel.

Forward: online softmax tiled over KV, never materializing the [bh,S,S] score matrix. Backward: flash-style, recomputing scores from the saved logsumexp and D = Σ dO·O (summed over the head dimension, one value per query row) to produce dQ/dK/dV.

Opt-in via --flash; the composed T10 path stays the default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
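
For reference, the scheme summarized above can be sketched in NumPy for a single batch*head slice. This is an illustrative sketch only, not the kernel itself: the function names (`flash_forward`, `flash_backward`, `dense_reference`), the `block` parameter, and the 1/sqrt(d) score scaling are assumptions introduced here, and masking and dropout are omitted. It shows the forward pass keeping only a running max, running sum, and output accumulator per row, and the backward pass recomputing each probability tile from the saved logsumexp and D.

```python
import numpy as np

def flash_forward(q, k, v, block=4):
    """Tiled forward for one batch*head slice; q, k, v are [S, d] arrays."""
    S, d = q.shape
    scale = 1.0 / np.sqrt(d)               # assumed scaling, not stated in the source
    m = np.full(S, -np.inf)                # running row max
    l = np.zeros(S)                        # running row sum of exp(score - m)
    acc = np.zeros((S, v.shape[1]))        # unnormalized output accumulator
    for j in range(0, k.shape[0], block):
        s = (q @ k[j:j + block].T) * scale # only an [S, block] tile of scores
        m_new = np.maximum(m, s.max(axis=1))
        alpha = np.exp(m - m_new)          # rescales previously accumulated terms
        p = np.exp(s - m_new[:, None])
        l = alpha * l + p.sum(axis=1)
        acc = alpha[:, None] * acc + p @ v[j:j + block]
        m = m_new
    return acc / l[:, None], m + np.log(l) # output O and per-row logsumexp

def flash_backward(q, k, v, o, lse, do, block=4):
    """Tiled backward: recompute probabilities from lse, never store [S, S]."""
    scale = 1.0 / np.sqrt(q.shape[1])
    D = (do * o).sum(axis=1)               # D = sum over head dim of dO * O
    dq, dk, dv = np.zeros_like(q), np.zeros_like(k), np.zeros_like(v)
    for j in range(0, k.shape[0], block):
        kj, vj = k[j:j + block], v[j:j + block]
        p = np.exp((q @ kj.T) * scale - lse[:, None])  # recomputed softmax tile
        dv[j:j + block] = p.T @ do
        ds = p * (do @ vj.T - D[:, None])  # gradient w.r.t. scaled scores
        dq += (ds @ kj) * scale
        dk[j:j + block] = (ds.T @ q) * scale
    return dq, dk, dv

def dense_reference(q, k, v, do):
    """Reference that materializes the full [S, S] matrix, for checking only."""
    scale = 1.0 / np.sqrt(q.shape[1])
    s = (q @ k.T) * scale
    p = np.exp(s - s.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    dp = do @ v.T
    ds = p * (dp - (p * dp).sum(axis=1, keepdims=True))
    return p @ v, (ds @ k) * scale, (ds.T @ q) * scale, p.T @ do

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v, do = (rng.standard_normal((10, 8)) for _ in range(4))
    o, lse = flash_forward(q, k, v)
    dq, dk, dv = flash_backward(q, k, v, o, lse, do)
    o_ref, dq_ref, dk_ref, dv_ref = dense_reference(q, k, v, do)
    print("max |O - O_ref|:", np.abs(o - o_ref).max())
    print("max |dQ - dQ_ref|:", np.abs(dq - dq_ref).max())
```

The dense reference uses the row sum of P∘dP where the tiled backward uses D; the two are equal because O = P·V, which is why saving only the logsumexp and O is enough to run the backward pass.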