perf: KI-3 fixed, dim1024 batch32 fits (dim768: mem 31144→14562 MiB, tok/s 39.7K→31.5K)

Per-block activation recompute (T13) measured on dash5 (1× RTX 5090 32GB, bf16,
batch32 seq256, steady-state):
- Correctness (exact, hard gate): recompute on vs. off is BIT-IDENTICAL in both
fp32 and bf16: loss, logits, and every param grad have max rel diff = 0.00e0
(exactly equal, not "within tol"). Full suite green with recompute on and off;
DDP loss-match 5.67e-7; DDP+recompute 2-rank descends 11.079→6.010.
- dim768 (18L/24h ffn2048, core 127M): peak mem 31144→14562 MiB (−53%), tok/s
39.7K→31.5K (−21%, the cost of the extra forward pass, within the predicted
20–35% band).
- dim1024 (18L/32h ffn2730, core 226M): recompute OFF OOMs (hits 32100/32607
MiB → OutOfMemory); recompute ON fits at 16596 MiB, ~23K tok/s, converges.
→ KI-3 payoff achieved: dim1024 batch32 unblocked, v8 can proceed.
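
The percentages above can be re-derived from the raw figures. A minimal sketch,
assuming only the numbers quoted in this message; `rel_change` is a hypothetical
helper for illustration, not part of the repo:

```python
def rel_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# dim768 figures quoted in this message (recompute off -> on).
mem_delta = rel_change(31144, 14562)      # peak memory, MiB
tps_delta = rel_change(39.7e3, 31.5e3)    # throughput, tok/s

# dim1024 with recompute on: memory left under the 32607 MiB reported at OOM.
headroom_mib = 32607 - 16596

print(f"dim768 peak mem: {mem_delta:+.1f}%")
print(f"dim768 tok/s:    {tps_delta:+.1f}%")
print(f"dim1024 headroom with recompute on: {headroom_mib} MiB")
```

The throughput loss lands at the low end of the predicted 20–35% band, while
peak memory roughly halves.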
Fill docs/12 bench table; mark KI-3 FIXED in docs/known-issues.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>