perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x)

Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:

  before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
  after  (batched):    25627 tok/s (batch16) / 40263 (batch32),
                       util 37% mean / 54% peak, ~10 GB
  → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.

A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163 tok/s) by ~12x.
The v3 falsification history (larger batch doesn't help a single-seq design) is
kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching
exposes (eager all-reduce of all params each step) → recorded as KI-5
(bucketed/overlapped all-reduce), out of T10 scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:43 +08:00
parent 25b032445d
commit 4ccab0fb42


@@ -7,7 +7,35 @@
## Open
### KI-1 · DDP weak scaling (throughput limited by single-sequence launch-bound) · `P1` · exposed by v2, re-diagnosed in v3
_(none — KI-1 fixed in T10; see Fixed below. KI-5 is the DDP-scaling follow-up batching newly exposed.)_
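### KI-5 · DDP weak scaling (all-reduce over all parameters every step, not bucketed / not overlapped with backward) · `P2` · exposed by T10
- **Symptom**: once batched forward removed the single-GPU launch-bound, at dim384 / per-rank batch 32 throughput goes from 40.3K tok/s on 1 GPU to 47.2K tok/s on 4 GPUs (global), only ~1.17×.
- **Root cause**: with single-GPU compute now 15-24× faster, the per-step **eager all-reduce over all ~67M parameters plus host-side optimizer/clip synchronization** has become the dominant DDP cost, and it does not shrink as GPUs are added. Note that **single-GPU batch 32 = 40K tok/s is already ~12× the KI-1-era 4-GPU figure (3163)**: the root cause is fixed, and this is a new, higher-level bottleneck.
- **Proposed fix**: **bucketed all-reduce of gradients, overlapped with backward compute** (this is item 2 of the KI-1 fix list; while all-reduce was not the bottleneck it brought no gain, and it only becomes worthwhile after batching). Optional: further remove host synchronization from optimizer/clip.
- **Reopen condition**: do this only if v3 training gets blocked on DDP scaling. Single-GPU throughput is already sufficient, so v3 can run on a single GPU / small world first.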
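
The proposed KI-5 fix can be illustrated with a minimal, CPU-only sketch. It is not the project's implementation: `build_buckets`, `bucketed_all_reduce`, and the `bucket_cap` parameter are hypothetical names, gradients are plain Python lists, and the "all-reduce" is simulated as a mean across ranks. The sketch only shows how parameters might be grouped in reverse order (the order in which backward tends to produce gradients) and that reducing bucket by bucket gives the same result as one reduction over everything. The actual overlap with backward compute is not modeled here.

```python
from typing import List

def build_buckets(param_sizes: List[int], bucket_cap: int) -> List[List[int]]:
    """Group parameter indices into buckets, walking from the last parameter
    to the first, so the earliest-ready gradients share the first bucket."""
    buckets: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for idx in reversed(range(len(param_sizes))):
        size = param_sizes[idx]
        # Close the current bucket if this parameter would push it over the cap.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

def bucketed_all_reduce(grads_per_rank: List[List[List[float]]],
                        buckets: List[List[int]]) -> List[List[float]]:
    """Simulated all-reduce (mean over ranks), one flat buffer per bucket.

    grads_per_rank[r][p] is the gradient (a list of floats) of parameter p
    on rank r. A real implementation would launch each bucket's reduction
    as soon as its gradients are ready; here buckets run one after another.
    """
    world = len(grads_per_rank)
    reduced: List[List[float]] = [[] for _ in grads_per_rank[0]]
    for bucket in buckets:
        # Flatten this bucket's gradients into one contiguous buffer per rank.
        flat = [[g for idx in bucket for g in rank[idx]] for rank in grads_per_rank]
        mean = [sum(vals) / world for vals in zip(*flat)]
        # Scatter the reduced buffer back into per-parameter gradients.
        offset = 0
        for idx in bucket:
            n = len(grads_per_rank[0][idx])
            reduced[idx] = mean[offset:offset + n]
            offset += n
    return reduced

# Two ranks, two parameters (sizes 2 and 1), bucket cap of 2 elements.
grads = [[[1.0, 2.0], [3.0]], [[3.0, 4.0], [5.0]]]
buckets = build_buckets([2, 1], bucket_cap=2)
averaged = bucketed_all_reduce(grads, buckets)
```

The bucket cap trades launch count against latency to the first reduction: larger buckets mean fewer collective calls, smaller ones let communication start earlier in backward.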
---
## Fixed
### KI-1 · Single-sequence launch-bound (the root cause of "DDP weak scaling") · `FIXED` (T10, batched forward)
- **Fix**: T10 adds a batch dimension to model + autograd. Linears are flattened to `[B*S, dim]` so that one large GEMM fills the GPU; attention uses fused batched SDPA (`cublasSgemmStridedBatched` ×2 + one causal-softmax kernel); RoPE positions reset per sequence (`row % S`); the training loop runs one forward/backward on a real batch instead of "loop B times + SUM". See [docs/09-batched-forward.md](09-batched-forward.md).
- **before → after** (dim384/12L/12h, batch 16, seq 256, 1 GPU, back-to-back A/B):
| | tok/s | GPU util | Memory |
|---|---|---|---|
| before (single-sequence launch-bound) | ~1653 | 0-15% | ~3 GB |
| after (batched) | **25627** (batch16) / **40263** (batch32) | **37% mean / 54% peak** | ~10 GB |
→ single GPU **~15.5× (batch16) / ~24× (batch32)**; util 0-15% → 37-54%.
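- **Correctness (all green, no regressions)**: grad-check on 15 operators (new: batched-rope / transpose_4d12 / batched-attention dQ/dK/dV), batched == looped single-sequence (logits 0.0, grad 6.4e-4), **PyTorch cross-check at B>1** (loss 5e-8 / logits 6.9e-6 / all-parameter grads within rtol 2e-2), overfit 27/27, checkpoint bit-exact, AdamW vs torch, DDP loss vs single GPU 5.7e-7 + cross-rank parameters bit-identical (0.0), **xserv loading the exported weights still matches xtrain greedy decoding token by token** (top tokens in the same order, BF16 drift ~0.03).
- **commit**: see the T10 commit chain (the `perf: KI-1 fixed — GPU util / tok/s` commit carries the before/after).
- **Residual DDP weak scaling → KI-5** (this is the all-reduce bottleneck newly exposed by batching, not KI-1's single-sequence root cause).
- **Historical diagnosis kept below** (the path from v2 exposing the issue to v3 re-diagnosing it, which shows the root cause was not all-reduce):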
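
The flattening and per-sequence RoPE position described in the fix, along with the batched == looped check, can be sketched in pure Python. This is an illustration under stated assumptions, not the T10 code: `matmul`, `looped_linear`, `batched_linear`, and `rope_positions` are hypothetical helper names, tensors are nested lists, and the GEMM is a naive triple loop standing in for the real GPU kernel.

```python
from typing import List

Matrix = List[List[float]]

def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Naive [N, d_in] x [d_in, d_out] product (stand-in for one GEMM launch)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def looped_linear(batch: List[Matrix], w: Matrix) -> List[Matrix]:
    """Old shape of the computation: one small GEMM per sequence (B launches)."""
    return [matmul(seq, w) for seq in batch]

def batched_linear(batch: List[Matrix], w: Matrix) -> List[Matrix]:
    """Flatten [B, S, dim] to [B*S, dim], run one GEMM, then split back."""
    num_seqs, seq_len = len(batch), len(batch[0])
    flat = [row for seq in batch for row in seq]  # [B*S, dim]
    out = matmul(flat, w)                         # a single large product
    return [out[b * seq_len:(b + 1) * seq_len] for b in range(num_seqs)]

def rope_positions(num_seqs: int, seq_len: int) -> List[int]:
    """Position of each flattened row: it restarts at 0 for every sequence."""
    return [row % seq_len for row in range(num_seqs * seq_len)]

# B = 2 sequences, S = 3 tokens, dim = 2, projecting to 2 output features.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]],
]
weight = [[1.0, -1.0], [2.0, 0.5]]

same = batched_linear(batch, weight) == looped_linear(batch, weight)
positions = rope_positions(2, 3)
```

Because each output row depends only on its own input row, flattening changes how many products are launched but not the values. In this toy version the two paths agree exactly; on a GPU the reduction order can differ, which is consistent with the small nonzero grad difference reported above.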
---
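### KI-1 historical diagnosis · DDP weak scaling (throughput limited by single-sequence launch-bound) · exposed by v2, re-diagnosed in v3
- **Symptom**: 4-GPU DDP reaches only ~3.2K tok/s, barely faster than a single GPU (≈2× over single GPU, far below near-linear; T8 measured 3.0×@4 on the tiny micro-bench).
- **Reproduce**: `dim384/12L, world=4, seq 256`
- **v3 measurements (dash5, 4× RTX 5090, dim384, isolated back-to-back A/B)**: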