docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11)
T11 set out to coalesce/overlap the gradient all-reduce per the original
KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch
32, seq 256) falsifies that hypothesis:
- grad all-reduce is only ~6-7% of each step;
- per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the
SAME per-rank workload) and dominates;
- coalescing the ~150 per-tensor all-reduces into one grouped/flat
launch gives ~0 scaling gain AND breaks cross-rank bit-identity
(max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so
the coalescing commit (b8b5821) was reverted.
Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy
at a time; CPU not starved; per-thread default stream doesn't help):
single-process thread-per-GPU ranks serialize on the single CUDA
context's per-op cudaMalloc / driver calls. Fix direction (out of T11
scope): a caching/pool allocator, or process-per-GPU. Recorded in
docs/known-issues.md with the measured table; KI-5 stays Open.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -7,13 +7,26 @@
## Open
_(none — KI-1 fixed in T10; see Fixed below. KI-5 is the DDP-scaling follow-up batching newly exposed.)_
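_(KI-1 fixed in T10. KI-5 remains Open, but T11 measurements **re-diagnose** the root cause from "all-reduce is not bucketed" to "the per-rank compute of multiple ranks in a single process serializes against each other"; see below. The originally proposed fix (bucketed all-reduce) was falsified by measurement.)_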
### KI-5 · Weak DDP scaling (all-reduce over all parameters every step, not bucketed / not overlapped with backward) · `P2` · exposed by T10
### KI-5 · Weak DDP scaling · `P2` · exposed by T10, re-diagnosed in T11 (all-reduce is **not** the bottleneck)
- **Symptom**: after batched forward removed the single-GPU launch-bound bottleneck, at dim384 / per-rank batch 32: 1 GPU 40.3K → 4 GPUs 47.2K tok/s (global), only ~1.17×.
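- **Root cause**: with single-GPU compute now 15–24× faster, the per-step **eager all-reduce over all ~67M parameters plus host-side optimizer/clip synchronization** became the dominant DDP overhead, and it does not shrink as GPUs are added. Note that **single-GPU batch 32 at 40K tok/s is already ~12× the KI-1-era 4-GPU figure (3163)**: the original root cause is fixed, and this is a new, higher-level bottleneck.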
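- **Proposed fix**: **bucketed gradient all-reduce, overlapped with backward compute** (i.e. KI-1 fix item 2; it was tried earlier without benefit because all-reduce was not the bottleneck then, and only makes sense after batching). Optional: remove further host synchronization from optimizer/clip.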
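- **Reopen condition**: do this only if v3 training gets blocked on DDP scaling; single-GPU throughput is already sufficient, so v3 can run on a single GPU / small world first.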
- **T11 measurements (dash5, 8× RTX 5090, dim384/12L, per-rank batch 32, seq 256, original ungrouped all-reduce, mean of 50 steps, ms/step)**:

| world | fwd+bwd | grad all-reduce | clip+opt+zero | TOTAL | tok/s(global) | speedup |
|---|---|---|---|---|---|---|
| 1 | 136 | 0 | 8.6 | 145 | 36582 | 1.00× |
| 2 | 202 | 21 | 15 | 238 | 47267 | 1.29× |
| 4 | 342 | 29 | 21 | 392 | 51466 | 1.41× |
| 8 | 780 | 54 | 47 | 882 | 47719 | 1.30× |

→ grad all-reduce is only **~6–7%** of each step; what actually blows up is **per-rank fwd+bwd time, which grows roughly linearly with world** (same per-rank workload, 136→780 ms, ~6×).
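
The arithmetic behind this reading of the table can be checked with a short script. The sketch below is not part of the repository; it copies the measured ms/step figures from the table and computes, per world size, the all-reduce share of a step, the best-case gain if the all-reduce cost dropped to zero (an idealized bound that assumes nothing else changes), and the fwd+bwd inflation relative to one GPU.

```python
# Measured ms/step copied from the T11 table:
# world -> (fwd+bwd, grad all-reduce, clip+opt+zero)
MEASURED_MS = {
    1: (136.0, 0.0, 8.6),
    2: (202.0, 21.0, 15.0),
    4: (342.0, 29.0, 21.0),
    8: (780.0, 54.0, 47.0),
}


def step_total(world):
    """Sum of the three measured phases (may differ from the table's TOTAL by rounding)."""
    return sum(MEASURED_MS[world])


def allreduce_share(world):
    """Fraction of one step spent in the gradient all-reduce."""
    return MEASURED_MS[world][1] / step_total(world)


def max_gain_without_allreduce(world):
    """Best-case step-time ratio if the all-reduce cost dropped to zero."""
    fwd_bwd, _allreduce, opt = MEASURED_MS[world]
    return step_total(world) / (fwd_bwd + opt)


def fwd_bwd_inflation(world):
    """Per-rank fwd+bwd time relative to the single-GPU run."""
    return MEASURED_MS[world][0] / MEASURED_MS[1][0]


if __name__ == "__main__":
    for world in sorted(MEASURED_MS):
        print(
            f"world={world}: all-reduce share={allreduce_share(world):.1%}, "
            f"max gain without all-reduce={max_gain_without_allreduce(world):.3f}x, "
            f"fwd+bwd inflation={fwd_bwd_inflation(world):.2f}x"
        )
```

By these figures the all-reduce stays under 10% of the step at every measured world size, so even removing it entirely could buy at most about 1.1×, while fwd+bwd alone inflates by well over 5× at 8 GPUs.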
- **The proposed "bucketed all-reduce" fix is falsified by T11 measurements**:
- ① Fusing the ~150 per-tensor `ncclAllReduce` calls into a single launch with `ncclGroupStart/End` → 1/2/4/8 GPUs = 1.00/1.30/1.42/1.34×, **almost identical to the ungrouped path** (because all-reduce is only ~7% of the step to begin with).
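- ② Bucketed/grouped/flat variants also **break cross-rank bit-identity**: in the correctness test, `max|p0−p1|` goes from `0.0` to `1.49e-8` (1 ULP, accumulated through per-step AdamW). NCCL guarantees cross-rank bitwise-identical results only for a **single ungrouped collective**; grouped or large messages switch algorithm/chunking and perturb the result. This **violates the T8 hard gate**, so the original ungrouped path is kept.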
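
A cross-rank bit-identity gate of this kind can be sketched as follows. This is a hypothetical illustration, not the project's T8 test: the helper names (`bit_identical`, `max_abs_diff`, `f32_next_up`) are invented here, and each rank's parameters are modeled as a plain Python list rounded to float32. The source does not say how large the affected parameters were; the example only shows that a one-ULP float32 difference at a value of 0.125 is 2^-26 ≈ 1.49e-8.

```python
import struct


def to_f32_bytes(values):
    """Pack values as little-endian float32, like a raw float32 parameter buffer."""
    return struct.pack(f"<{len(values)}f", *values)


def bit_identical(p0, p1):
    """True only if both parameter lists have the same float32 bit patterns.

    Byte comparison is stricter than numeric equality: it distinguishes
    +0.0 from -0.0 and treats identical NaN payloads as equal.
    """
    return to_f32_bytes(p0) == to_f32_bytes(p1)


def max_abs_diff(p0, p1):
    """max|p0 - p1| after rounding both sides to float32."""
    a = struct.unpack(f"<{len(p0)}f", to_f32_bytes(p0))
    b = struct.unpack(f"<{len(p1)}f", to_f32_bytes(p1))
    return max(abs(x - y) for x, y in zip(a, b))


def f32_next_up(x):
    """Next representable float32 above a positive finite x (one ULP up)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


if __name__ == "__main__":
    rank0 = [0.125, -0.5, 0.2]
    rank1 = [f32_next_up(0.125), -0.5, 0.2]  # one parameter off by one ULP
    print(bit_identical(rank0, rank0), bit_identical(rank0, rank1))
    print(max_abs_diff(rank0, rank1))
```

A gate that requires `max_abs_diff` to be exactly `0.0` rejects even this single-ULP drift, which is the strictness the T8 gate is described as enforcing.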
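- **Relocated root cause**: **under the single-process thread-per-GPU model, the N rank threads each run an independent training loop yet serialize against one another**. With `NOCOMM=1` (no cross-rank communication or barrier at all), fwd+bwd still inflates 136→378→800 ms; `nvidia-smi` sampling shows only 1–2 of the 8 GPUs busy at any moment, taking turns. Ruled out: the CPU is not starved (187 cores, load 2.5), and `nvcc --default-stream per-thread` does not help. **Remaining suspicion: every op output goes through `Tensor::zeros` → `cudaMalloc` + `cudaMemset`, and `cudaMalloc` is a synchronous driver call that is serialized process-wide; under a single CUDA context, the hundreds of allocations per step from N ranks queue behind one another.** In other words, the real DDP bottleneck is **per-op device-memory allocation / driver calls serializing within a single process**, not gradient communication.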
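- **Real fix direction (TBD, outside T11 scope)**: ① a **caching/pool allocator** (op outputs reuse device memory, eliminating the hundreds of `cudaMalloc` calls per step; single-GPU runs benefit too); or ② **process-per-GPU** (each rank gets its own CUDA context, torchrun-style, which removes the serialization entirely but requires changing the launcher and distributing the UniqueId across processes). Do ① first, then measure whether it resolves the DDP serialization.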
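- **Reopen condition**: do this when multi-GPU v4 needs scaling. **Single-GPU batched throughput is already 40K tok/s (v3 is trained entirely on a single GPU)**; multi-GPU currently tops out at ~1.4×, so if v4 is to go multi-GPU, the real bottleneck above must be fixed first.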
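
Fix direction ① can be illustrated with a toy model. The sketch below is a Python model of the bookkeeping only, not the project's allocator: `PoolAllocator`, `backend_alloc`, `run_steps`, and the exact-size free lists are all hypothetical, and a counter stands in for the `cudaMalloc` calls that the diagnosis suspects of serializing. It shows that once buffers freed in one step are reused in the next, the number of backend allocations stops growing with the step count. Reused buffers are not re-zeroed here, so a path that relies on `Tensor::zeros` semantics would presumably still need an explicit clear.

```python
from collections import defaultdict


class PoolAllocator:
    """Toy caching allocator: freed buffers go to per-size free lists and are reused."""

    def __init__(self, backend_alloc):
        self._backend_alloc = backend_alloc  # stand-in for a cudaMalloc-like call
        self._free = defaultdict(list)       # size in bytes -> cached buffers
        self.backend_calls = 0               # how often the slow path was taken

    def alloc(self, nbytes):
        cached = self._free[nbytes]
        if cached:
            return cached.pop()              # fast path: no backend call
        self.backend_calls += 1
        return self._backend_alloc(nbytes)

    def free(self, buf):
        # Keep the buffer for reuse instead of returning it to the backend.
        self._free[len(buf)].append(buf)


def run_steps(pool, steps, op_output_sizes):
    """Model a training loop: each step allocates every op output, then frees them all."""
    for _ in range(steps):
        live = [pool.alloc(n) for n in op_output_sizes]
        for buf in live:
            pool.free(buf)


if __name__ == "__main__":
    pool = PoolAllocator(bytearray)          # host bytearray stands in for device memory
    run_steps(pool, steps=50, op_output_sizes=[1024, 4096, 1024])
    print("backend allocations after 50 steps:", pool.backend_calls)
```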
---