From 511ceebbb322fc28be55ef3fd1c7b3060998873b Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Tue, 16 Jun 2026 13:14:42 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20KI-2=20trigger=20=E2=80=94=20dim768=20f?=
 =?UTF-8?q?p32=20batch-32=20OOM?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

v4 surfaced the concrete bf16 trigger: dim768 fp32 OOMs at per-rank batch 32
(global 256) in 32GB, forcing per-rank 16 (global 128). bf16, by roughly
halving activation memory, should restore the batch-256 sweet spot. Record
this on KI-2; KI-2 itself stays a deferred backlog item.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/known-issues.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/known-issues.md b/docs/known-issues.md
index b5058e5..e2a9f23 100644
--- a/docs/known-issues.md
+++ b/docs/known-issues.md
@@ -94,6 +94,7 @@ _(KI-1 fixed in T10. KI-5 **FIXED** in T11——device caching/pool allocator 
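 ### KI-2 · bf16 mixed precision (fp32 master) — `deferred`
 - T7 deferral rationale: at tiny scale the bottleneck is latency, and bf16 changes the numerics, which would threaten the fp32 correctness gate.
 - **Restart condition**: once the model is scaled up (v2+, `dim≥384`), GEMMs gradually become compute-bound and tensor-core gains start to show. Requires fp32 master weights, a separate looser-tolerance test, and a convergence comparison.
+- **Concrete trigger (surfaced in v4)**: dim768 fp32 OOMs at per-rank batch 32 (global 256) within 32GB of memory per GPU, forcing training down to per-rank batch 16 (global 128). bf16 (halving activation memory) could recover the batch-256 sweet spot. This is the first hard constraint where fp32 does not fit since bf16 was deferred at the v0–v3 tiny scale, which makes it the first lever to pull in v5.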
 
 ### KI-3 · activation recomputation (gradient checkpointing) — `deferred`
 - T7 deferral rationale: single sequence, memory is not tight.