Commit Graph

2 Commits

Author SHA1 Message Date
9064ced4c2 docs: T14 flash-attention results + evolution/README rows
Fill in the design doc's measured results (grad-check, flash==composed,
PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to
evolution.md (算法/Infra, i.e. Algorithm/Infra) and to the README build-journey table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:34:10 +08:00
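One rough way to see where a peak-memory drop like this can come from: the design commit describes the flash path as never materializing the [bh,S,S] score matrix, while the composed path does. The sketch below only sizes that tensor. The helper name `score_matrix_bytes`, the fp32 default and the example sizes are hypothetical and are not the T14 benchmark settings.

```python
def score_matrix_bytes(bh, seq_len, bytes_per_elem=4):
    """Size of one materialized [bh, S, S] score (or probability) tensor.

    bh is batch * heads; bytes_per_elem=4 assumes fp32 (hypothetical default).
    """
    return bh * seq_len * seq_len * bytes_per_elem


# Hypothetical sizes, not the ones measured in T14.
print(score_matrix_bytes(32, 1024) / 2**20)  # MiB -> 128.0
```

The size grows with S², so doubling the sequence length quadruples the tensor. The composed path may also keep the probabilities for backward, so the real saving depends on what else is resident at peak.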
65a2264227 docs: Phase T14 — fused flash-attention design
Design doc for the hand-written single fused flash-attention kernel:
online softmax tiled over KV, NEVER materializing the [bh,S,S] score
matrix; flash-style backward (recompute the scores from the saved logsumexp
and D = Σ dO·O per row, then compute dQ/dK/dV). Opt-in via --flash; the
composed T10 path stays the default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:16 +08:00
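The design above (online softmax tiled over KV, saved logsumexp, and a backward that recomputes scores and uses D = Σ dO·O) can be sketched in plain Python for a single head. This is a minimal illustration under stated assumptions, not the T14 kernel, which is a fused GPU kernel whose code is not shown here. The names `flash_forward`, `flash_backward`, `composed_attention` and the `tile` parameter are hypothetical. The flash path handles one query row at a time and never stores a full score row or matrix, while `composed_attention` materializes the whole [S_q, S_k] score matrix for comparison.

```python
import math

# Hypothetical single-head sketch: Q, K, V, dO are lists of rows (seq_len x d).


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composed_attention(Q, K, V):
    """Reference path: materializes the full [S_q, S_k] score matrix."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    S = [[scale * dot(q, k) for k in K] for q in Q]
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        out.append([sum(e[j] * V[j][c] for j in range(len(V))) / z
                    for c in range(len(V[0]))])
    return out


def flash_forward(Q, K, V, tile=2):
    """Online softmax over KV tiles; returns O and the per-row logsumexp."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    O, lse = [], []
    for q in Q:
        m, l, acc = -math.inf, 0.0, [0.0] * d_v  # running max, sum, numerator
        for start in range(0, len(K), tile):
            idx = range(start, min(start + tile, len(K)))
            s = [scale * dot(q, K[j]) for j in idx]  # only this tile's scores
            m_new = max(m, max(s))
            alpha = math.exp(m - m_new)  # rescale old stats; exp(-inf) == 0.0
            p = [math.exp(x - m_new) for x in s]
            l = alpha * l + sum(p)
            acc = [alpha * acc[c] + sum(pj * V[j][c] for pj, j in zip(p, idx))
                   for c in range(d_v)]
            m = m_new
        O.append([a / l for a in acc])
        lse.append(m + math.log(l))  # saved for the backward pass
    return O, lse


def flash_backward(Q, K, V, O, lse, dO):
    """Recompute P from the saved lse and use D_i = sum_c dO[i][c] * O[i][c]."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dQ = [[0.0] * len(Q[0]) for _ in Q]
    dK = [[0.0] * len(K[0]) for _ in K]
    dV = [[0.0] * len(V[0]) for _ in V]
    for i, q in enumerate(Q):
        D_i = dot(dO[i], O[i])  # equals sum_j P_ij * dP_ij
        for j in range(len(K)):
            p = math.exp(scale * dot(q, K[j]) - lse[i])  # recomputed, not stored
            dp = dot(dO[i], V[j])
            ds = p * (dp - D_i)  # softmax backward via D
            for c in range(len(V[0])):
                dV[j][c] += p * dO[i][c]
            for c in range(len(q)):
                dQ[i][c] += scale * ds * K[j][c]
                dK[j][c] += scale * ds * q[c]
    return dQ, dK, dV
```

The identity that makes D work: Σ_j P_ij (dO_i · V_j) = dO_i · (Σ_j P_ij V_j) = dO_i · O_i, so the softmax backward needs only one scalar per row instead of a stored probability row. Running with more keys than `tile` (for example five keys and `tile=2`) exercises the rescaling by exp(m_old - m_new) across tiles, including a partial last tile.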