docs: T14 flash-attention results + evolution/README rows

Fill in the design doc's measured results (grad-check, flash==composed, PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), and add the T14 row to evolution.md (Algorithm/Infra) and to the README build-journey table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:34:10 +08:00
parent d217f4fbd3
commit 9064ced4c2
3 changed files with 36 additions and 5 deletions
--- a/README.md
+++ b/README.md
@@ -50,9 +50,12 @@ Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`]
 | **T11** | **device caching allocator** (fixes KI-5) | single-GPU 2.3×; **8-GPU 461K tok/s** |
 | **T12** | **bf16 mixed precision** (fp32 master, fixes KI-2) | dim768 OOM solved; −29% mem |
 | **T13** | **activation recompute** / checkpointing (fixes KI-3) | dim1024 fits; grads bit-identical |
+| **T14** | **fused flash-attention** kernel (online softmax, no materialized N×N; opt-in `--flash`) | peak mem −16%@1k / −23%@2k seq; flash==composed (grads/PyTorch) |

 The four performance fixes (T10–T13) each removed a real bottleneck — see
-[`docs/known-issues.md`](docs/known-issues.md).
+[`docs/known-issues.md`](docs/known-issues.md). **Phase 2 (systems-stack depth, T14–)**
+revisits hand-writing deferred training-stack features; T14 = the fused
+flash-attention kernel ([`docs/13-flash-attention.md`](docs/13-flash-attention.md)).

 ## The scaling study — v0 → v8