Merge t16-grad-accum into main

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Conflicts:
#	README.md
#	docs/evolution.md
2026-06-18 00:37:11 +08:00
parent 9e958cb0f9 8bd7db16e1
commit f26db882e5
10 changed files with 697 additions and 48 deletions
--- a/README.md
+++ b/README.md
@@ -51,11 +51,17 @@ Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`]
 | **T12** | **bf16 mixed precision** (fp32 master, fixes KI-2) | dim768 OOM solved; −29% mem |
 | **T13** | **activation recompute** / checkpointing (fixes KI-3) | dim1024 fits; grads bit-identical |
 | **T14** | **fused flash-attention** kernel (online softmax, no materialized N×N; opt-in `--flash`) | peak mem −16%@1k / −23%@2k seq; flash==composed (grads/PyTorch) |
+| **T16** | **gradient accumulation** (`--accum-steps`; DDP all-reduces only at the boundary) | equiv to N× big batch (grad 3.8e-5); same effective-64 batch 27.7GB→7.2GB (−74%) |
+| **T18** | **dropout** (hand counter-based device RNG + mask, inverted scaling, train/eval switch) | fixed-seed grad-check; **p=0 bit-identical**; recompute-safe |

 The four performance fixes (T10–T13) each removed a real bottleneck — see
 [`docs/known-issues.md`](docs/known-issues.md). **Phase 2 (systems-stack depth, T14–)**
-revisits hand-writing deferred training-stack features; T14 = the fused
-flash-attention kernel ([`docs/13-flash-attention.md`](docs/13-flash-attention.md)).
+revisits hand-writing deferred training-stack features: T14 = the fused
+flash-attention kernel ([`docs/13-flash-attention.md`](docs/13-flash-attention.md));
+T16 = micro-batch gradient accumulation ([`docs/15-grad-accum.md`](docs/15-grad-accum.md)),
+which decouples the effective batch from activation memory (memory tracks the micro-batch,
+not N×); T18 = dropout ([`docs/17-dropout.md`](docs/17-dropout.md), hand counter-based
+device RNG + mask, inverted scaling, train/eval switch).

 ## The scaling study — v0 → v8