perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x)

Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling") FIXED by the T10 batched forward.

dim384/12L, batch 16, seq 256, 1 GPU, back-to-back A/B:
  before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
  after (batched): 25627 tok/s (batch16) / 40263 (batch32), util 37% mean / 54% peak, ~10 GB
  → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.

A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x. The v3 falsification history (larger batch doesn't help a single-seq design) is kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching exposes (eager all-reduce of all params each step) → recorded as KI-5 (bucketed/overlapped all-reduce), out of T10 scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:43 +08:00
parent 25b032445d
commit 4ccab0fb42
1 changed files with 29 additions and 1 deletions
--- a/docs/known-issues.md
+++ b/docs/known-issues.md
@@ -7,7 +7,35 @@

 ## Open

-### KI-1 · DDP weak scaling (throughput limited by single-sequence launch-bound) · `P1` · exposed by v2, re-diagnosed in v3
+_(none — KI-1 fixed in T10; see Fixed below. KI-5 is the DDP-scaling follow-up batching newly exposed.)_
+
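+### KI-5 · DDP weak scaling (all-reduce covers all parameters every step; not bucketed, not overlapped with backward) · `P2` · exposed by T10
+- **Symptom**: with the single-GPU launch-bound fixed by the batched forward, at dim384 / per-rank batch 32: 1 GPU 40.3K → 4 GPUs 47.2K tok/s (global), only ~1.17×.
+- **Root cause**: now that single-GPU compute is 15–24× faster, the per-step **eager all-reduce over all ~67M parameters plus host-side optimizer/clip synchronization** dominates DDP cost and does not shrink as GPUs are added. Note that **a single GPU at batch 32 (40K tok/s) is already ~12× the KI-1-era 4-GPU figure (3163)**: the root cause is fixed, and this is a new, higher-level bottleneck.
+- **Proposed fix**: gradient **bucketed all-reduce, overlapped with backward compute** (i.e. KI-1 fix item 2; all-reduce was not the bottleneck earlier, so doing it then brought no gain, and it only pays off after batching). Optional: remove further host synchronization from optimizer/clip.
+- **Reopen condition**: pick this up only if v3 training gets blocked on DDP scaling; single-GPU throughput is already sufficient, so v3 can start on a single GPU or a small world size.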
+
+---
+
+## Fixed
+
+### KI-1 · Single-sequence launch-bound (root cause of "DDP weak scaling") · `FIXED` (T10, batched forward)
+- **Fix**: T10 adds a batch dimension to the model and autograd. Linears are flattened to `[B*S, dim]` so that one large GEMM fills the GPU; attention uses a fused batched SDPA (`cublasSgemmStridedBatched` ×2 plus one causal-softmax kernel), with RoPE positions reset per sequence (`row % S`); the training loop replaces "loop B times + SUM" with a single forward/backward over a real batch. See [docs/09-batched-forward.md](09-batched-forward.md) for details.
+- **before → after** (dim384/12L/12h, batch 16, seq 256, 1 GPU, back-to-back A/B):
+  | | tok/s | GPU util | VRAM |
+  |---|---|---|---|
+  | before (single-sequence, launch-bound) | ~1653 | 0–15% | ~3 GB |
+  | after (batched) | **25627** (batch16) / **40263** (batch32) | **37% mean / 54% peak** | ~10 GB |
+
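+  → single GPU **~15.5× (batch16) / ~24× (batch32)**, util 0–15% → 37–54%.
+- **Correctness (all green, no regressions)**: grad-check on 15 operators (new: batched-rope / transpose_4d12 / batched-attention dQ/dK/dV), batched == looped single-sequence (logits 0.0, grad 6.4e-4), **PyTorch cross-check at B>1** (loss 5e-8 / logits 6.9e-6 / all-parameter grads within rtol 2e-2), overfit 27/27, checkpoint bit-exact, AdamW matches torch, DDP loss vs. single GPU 5.7e-7 plus bit-identical parameters across ranks (0.0), **greedy decoding in xserv from the exported weights still matches xtrain token for token** (same top-token order, BF16 drift ~0.03).
+- **commit**: see the T10 commit chain (the `perf: KI-1 fixed — GPU util / tok/s` commit carries the before/after).
+- **Residual DDP weak scaling → KI-5** (an all-reduce bottleneck newly exposed by batching, not KI-1's single-sequence root cause).
+- **Historical diagnosis kept below** (the v2 exposure → v3 re-diagnosis process, which shows the root cause was not all-reduce):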
+
+---
+
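+### KI-1 historical diagnosis · DDP weak scaling (throughput limited by single-sequence launch-bound) · exposed by v2, re-diagnosed in v3
 - **Symptom**: 4-GPU DDP reaches only ~3.2K tok/s, barely faster than a single GPU (≈2× over a single GPU, far below near-linear; T8 measured 3.0×@4 on a tiny micro-bench).
 - **Repro**: `dim384/12L, world=4, seq 256`.
 - **v3 measurements (dash5, 4× RTX 5090, dim384, isolated back-to-back A/B)**: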