perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x)
Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:
before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
after (batched): 25627 tok/s (batch16) / 40263 (batch32),
util 37% mean / 54% peak, ~10 GB
→ single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.
A single GPU at batch 32 (~40K tok/s) now beats the old 4-GPU setup (3163 tok/s) by ~12x.
The v3 falsification history (larger batch doesn't help a single-seq design) is
kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching
exposes (eager all-reduce of all params each step) → recorded as KI-5
(bucketed/overlapped all-reduce), out of T10 scope.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -7,7 +7,35 @@
## Open

### KI-1 · DDP weak scaling (throughput limited by single-sequence launch-bound) · `P1` · exposed by v2, re-diagnosed in v3
_(none: KI-1 fixed in T10; see Fixed below. KI-5 is the DDP-scaling follow-up that batching newly exposed.)_

### KI-5 · DDP weak scaling (all-reduce over all parameters every step, neither bucketed nor overlapped with backward) · `P2` · exposed by T10
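- **Symptom**: with the batched forward removing the single-GPU launch-bound, at dim384 / per-rank batch 32, throughput goes from 40.3K tok/s on 1 GPU to 47.2K tok/s (global) on 4 GPUs, only ~1.17×.
- **Root cause**: now that single-GPU compute is 15–24× faster, the per-step **eager all-reduce over all ~67M parameters plus host-side optimizer/clip synchronization** dominates DDP cost, and it does not shrink as GPUs are added. Note that **a single GPU at batch 32 (40K tok/s) is already ~12× the KI-1-era 4-GPU throughput (3163 tok/s)**: the root cause is fixed, and this is a new, higher-level bottleneck.
- **Proposed fix**: **bucketed gradient all-reduce, overlapped with backward compute** (this is fix item 2 from KI-1; it brought no gain while all-reduce was not the bottleneck and only becomes worthwhile after batching). Optional: remove more host synchronization from optimizer/clip.
- **Reopen condition**: pick this up only if v3 training is blocked by DDP scaling; single-GPU throughput is already sufficient, so v3 can start on a single GPU or a small world size.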
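
The bucketing half of the proposed fix can be pictured with a small, framework-free sketch. It only plans the buckets: gradients are grouped into size-capped buckets in reverse parameter order (roughly the order in which backward produces them), so that each bucket could be all-reduced as soon as it is full while backward keeps running on earlier layers. The function name `plan_buckets`, the 16 MB cap, and the layer sizes are illustrative assumptions, not taken from the project's code, and the actual communication and overlap are not modeled here.

```python
from typing import List


def plan_buckets(param_sizes: List[int], cap_bytes: int,
                 bytes_per_elem: int = 4) -> List[List[int]]:
    """Group parameter indices into buckets of at most cap_bytes each.

    Parameters are visited in reverse order, approximating the order in
    which backward finishes their gradients. A parameter larger than the
    cap ends up in a bucket of its own.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx in reversed(range(len(param_sizes))):
        nbytes = param_sizes[idx] * bytes_per_elem
        # Close the current bucket if adding this gradient would exceed the cap.
        if current and current_bytes + nbytes > cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets


# Hypothetical per-layer element counts (FP32), not the real model's layout.
sizes = [1_000_000, 2_000_000, 3_000_000, 500_000]
plan = plan_buckets(sizes, cap_bytes=16_000_000)
print(plan)  # each inner list is one all-reduce call instead of one per parameter
```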

---

## Fixed

### KI-1 · Single-sequence launch-bound (root cause of "DDP weak scaling") · `FIXED` (T10, batched forward)
- **Fix**: T10 adds a batch dimension to the model and autograd. Linear layers are flattened to `[B*S, dim]` so that one large GEMM fills the GPU; attention uses a fused batched SDPA (`cublasSgemmStridedBatched` ×2 plus one causal-softmax kernel); RoPE positions reset per sequence (`row % S`); and the training loop runs a single forward/backward over a real batch instead of "loop B times + SUM". See [docs/09-batched-forward.md](09-batched-forward.md) for details.
- **before → after** (dim384/12L/12h, batch 16, seq 256, 1 GPU, back-to-back A/B):

  | | tok/s | GPU util | GPU memory |
  |---|---|---|---|
  | before (single-sequence, launch-bound) | ~1653 | 0–15% | ~3 GB |
  | after (batched) | **25627** (batch 16) / **40263** (batch 32) | **37% mean / 54% peak** | ~10 GB |

  → single GPU: **~15.5× (batch 16) / ~24× (batch 32)**; util 0–15% → 37–54%.
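- **Correctness (all green, no regressions)**: grad-check on 15 operators (newly added: batched-rope / transpose_4d12 / batched-attention dQ/dK/dV); batched == looped single-sequence (logits 0.0, grad 6.4e-4); **PyTorch cross-check with B>1** (loss 5e-8 / logits 6.9e-6 / all-parameter grads within rtol 2e-2); overfit 27/27; checkpoint bit-exact; AdamW matches torch; DDP loss vs. single-GPU 5.7e-7, with parameters bit-identical across ranks (0.0); **xserv loading the exported weights still matches xtrain greedy decoding token for token** (top tokens in the same order, BF16 drift ~0.03).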
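- **commit**: see the T10 commit chain (the `perf: KI-1 fixed — GPU util / tok/s` commit carries the before/after numbers).
- **Residual DDP weak scaling → KI-5** (this is the all-reduce bottleneck newly exposed by batching, not KI-1's single-sequence root cause).
- **Historical diagnosis is kept in the section below** (the path from v2 exposure to v3 re-diagnosis, which showed that the root cause was not all-reduce).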
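
To make the flattening described in the fix concrete, here is a minimal pure-Python sketch (not the project's CUDA code). It assumes row-major `[B, S, dim]` activations flattened to `[B*S, dim]` and illustrates two points: a single matrix multiply over all `B*S` rows gives the same result as B separate per-sequence multiplies, and `row % S` recovers each row's position inside its own sequence, which is what RoPE needs. The helper names `matmul` and `rope_positions` and all values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain row-major multiply: [rows, dim] x [dim, out] -> [rows, out]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def rope_positions(batch: int, seq: int) -> List[int]:
    """Position of each flattened row within its own sequence (resets every seq rows)."""
    return [row % seq for row in range(batch * seq)]


B, S, DIM = 2, 3, 2
# Hypothetical activations: one [S, DIM] matrix per sequence.
seqs = [[[float(b * 10 + s + d) for d in range(DIM)] for s in range(S)]
        for b in range(B)]
w = [[1.0, 2.0], [3.0, 4.0]]  # [DIM, out]

# Looped path: one small multiply per sequence (the old single-sequence design).
looped = [row for x in seqs for row in matmul(x, w)]

# Batched path: flatten to [B*S, DIM], then do a single multiply.
flat = [row for x in seqs for row in x]
batched = matmul(flat, w)

positions = rope_positions(B, S)
print(batched == looped, positions)
```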

---

### KI-1 historical diagnosis · DDP weak scaling (throughput limited by single-sequence launch-bound) · exposed by v2, re-diagnosed in v3
- **Symptom**: 4-GPU DDP reaches only ~3.2K tok/s, barely faster than a single GPU (≈2× over single-GPU, far below near-linear; T8 measured 3.0× at 4 GPUs on the tiny micro-bench).
- **Repro**: `dim384/12L, world=4, seq 256`.
- **v3 measurements (dash5, 4× RTX 5090, dim384, isolated back-to-back A/B)**: