docs: KI-2 trigger — dim768 fp32 batch-32 OOM

v4 surfaced the concrete bf16 trigger: dim768 fp32 OOMs at per-rank batch 32
(global 256) in 32GB, forcing per-rank 16 (global 128). bf16 (which halves
activation memory) would restore the batch-256 sweet spot. Record it on KI-2;
KI-2 remains a deferred backlog item.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:42 +08:00
parent ff79fee3c5
commit 511ceebbb3


@@ -94,6 +94,7 @@ _(KI-1 fixed in T10. KI-5 **FIXED** in T11——device caching/pool allocator
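### KI-2 · bf16 mixed precision (fp32 master) - `deferred`
- Why T7 deferred it: at tiny scale the bottleneck is latency, and bf16 changes the numerics, which would threaten the fp32 correctness gate.
- **Reopen condition**: once the model is scaled up (v2+, `dim≥384`), GEMMs become increasingly compute-bound and the tensor-core gains start to show. Requires fp32 master weights + a separate looser-tol test + a convergence comparison.
- **Concrete trigger (surfaced in v4)**: dim768 fp32 OOMs at per-rank batch 32 (global 256) within a single card's 32GB of memory, forcing training down to per-rank 16 (global 128). bf16 (which halves activation memory) would bring back the batch-256 sweet spot. This is the first hard constraint where fp32 does not fit since bf16 was deferred at the tiny scale of v0 to v3, so it is the first lever v5 should pull.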
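
The batch arithmetic behind the KI-2 trigger can be sketched as below. This is a rough illustration, not a measurement. The rank count of 8 is inferred from global 256 / per-rank 32. The function names are hypothetical. The model assumes activation memory scales linearly with per-rank batch size and bytes per element, and it ignores weights, optimizer state, the fp32 master copy, and allocator overhead.

```python
# Hypothetical helpers illustrating the KI-2 batch/memory reasoning.

FP32_BYTES = 4
BF16_BYTES = 2

# Inferred from the note: global 256 at per-rank 32 implies 8 ranks.
WORLD_SIZE = 256 // 32


def global_batch(per_rank_batch: int, world_size: int = WORLD_SIZE) -> int:
    """Global batch size for data-parallel training."""
    return per_rank_batch * world_size


def relative_activation_memory(per_rank_batch: int, bytes_per_elem: int) -> float:
    """Activation footprint relative to the fp32, per-rank-32 baseline.

    Assumes activations scale linearly with batch size and element width.
    """
    baseline = 32 * FP32_BYTES
    return (per_rank_batch * bytes_per_elem) / baseline


if __name__ == "__main__":
    for label, batch, width in [
        ("fp32, per-rank 32 (OOM)", 32, FP32_BYTES),
        ("fp32, per-rank 16 (current fallback)", 16, FP32_BYTES),
        ("bf16, per-rank 32 (proposed)", 32, BF16_BYTES),
    ]:
        print(label, global_batch(batch), relative_activation_memory(batch, width))
```

Under these assumptions, bf16 at per-rank 32 has roughly the same activation footprint as the fp32 per-rank 16 fallback that already fits, while keeping the global batch at 256.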
### KI-3 · Activation recomputation (gradient checkpointing) - `deferred`
- Why T7 deferred it: single sequence, memory is not tight.