docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky)

The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the original ungrouped all-reduce is itself run-to-run nondeterministic on this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7, 1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here (passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP, numerically benign; loss-match stays ~6e-7). Noted as a follow-up to loosen the assertion to a tight tolerance; coalescing was reverted purely because it gives ~0 scaling benefit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:42:13 +08:00
parent 84092fb28d
commit d422c68704
1 changed files with 2 additions and 3 deletions
--- a/docs/known-issues.md
+++ b/docs/known-issues.md
@@ -21,9 +21,8 @@ _(KI-1 fixed in T10. KI-5 is still Open, but T11 measurements moved the root cause from "all-reduce not
  | 8 | 780 | 54 | 47 | 882 | 47719 | 1.30× |

  → grad all-reduce accounts for only **~6–7%** of each step; what actually blows up is the **per-rank fwd+bwd time, which grows linearly with world size** (same per-rank workload, 136→780 ms, ~6×).
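- **The proposed "bucketed all-reduce" fix was falsified by T11 measurements**:
-  - ① Fusing the ~150 per-tensor `ncclAllReduce` calls into a single launch with `ncclGroupStart/End` → 1/2/4/8 GPUs = 1.00/1.30/1.42/1.34×, **almost no different from the unbucketed path** (because all-reduce only accounts for 7% to begin with).
-  - ② Bucketed/grouped/flat variants also **break cross-rank bit-identical results**: in the correctness test `max|p0−p1|` goes from `0.0` to `1.49e-8` (1 ULP, accumulated step by step through AdamW). NCCL guarantees bitwise cross-rank agreement only for a **single ungrouped collective**; grouped or large messages switch algorithm/chunking and perturb the result. This **violates the T8 hard gate**, so the original ungrouped path is kept.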
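+- **The proposed "bucketed all-reduce" fix was falsified by T11 measurements (no benefit)**: fusing the ~150 per-tensor `ncclAllReduce` calls into a single launch with `ncclGroupStart/End` → 1/2/4/8 GPUs = 1.00/1.30/1.42/1.34×, **almost no different from the unbucketed path** (all-reduce only accounts for 7% to begin with). Flat-buffer bucketing behaves the same. It was therefore reverted (revert b8b5821), keeping the original ungrouped path.
+- **Side finding: the T8 correctness test's `max|p0−p1| == 0.0` is flaky on this machine** (unrelated to T11). Rerunning the original ungrouped code 6 times on the same GPUs gave cross-rank diff = {0.0, 0.0, 5.96e-8, 5.96e-8, 1.19e-7, 1.19e-7}, hitting `0.0` only ~1/3 of the time. That is, with this machine and this NCCL version, all-reduce is **not bitwise reproducible across ranks from run to run** (algorithm/chunk selection is unstable under the PCIe-only topology). All diffs are ≤1.19e-7 (a few ULP, numerically benign; loss-match stays ~6e-7), but the `== 0.0` assertion is too strict → recommend loosening it to a tight `< 1e-6` tolerance (**left as a follow-up; the test is not changed in this commit**).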
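 - **Relocated root cause**: **under the single-process thread-per-GPU model, the N rank threads each run independent training yet serialize against one another**. With `NOCOMM=1` (no cross-rank communication or barrier at all), fwd+bwd still inflates 136→378→800 ms; `nvidia-smi` sampling shows that of the 8 GPUs only 1–2 are busy at any given moment, taking turns. Ruled out: the CPU is not the constraint (187 cores, load 2.5); `nvcc --default-stream per-thread` does not help. **Remaining suspicion: every op output goes through `Tensor::zeros` → `cudaMalloc` + `cudaMemset`, and `cudaMalloc` is a synchronous, process-wide serialized driver call; under a single CUDA context, the hundreds of allocations that each of the N ranks makes per step queue behind one another.** In other words, the real DDP bottleneck is **per-op device-memory allocation / driver calls serializing within a single process**, not gradient communication.
 - **The real fix direction (TBD, outside T11 scope)**: ① a **caching/pool allocator** (op outputs reuse device memory, eliminating the hundreds of `cudaMalloc` calls per step; single-GPU runs benefit too); or ② **process-per-GPU** (each rank gets its own CUDA context, torchrun-style, which removes the serialization entirely but requires changing the launcher and distributing the UniqueId across processes). Do ① first, then measure whether it resolves the DDP serialization.
 - **Restart condition**: revisit when multi-GPU v4 needs scalability. **Single-GPU batched already reaches 40K tok/s (v3 is trained entirely on a single GPU)**; multi-GPU currently tops out at ~1.4×, so if v4 wants multi-GPU, the real bottleneck above must be fixed first.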
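
The follow-up on the T8 assertion (replace `max|p0−p1| == 0.0` with a tight `< 1e-6` tolerance) can be illustrated with a minimal Python sketch. This is not the project's actual test code: the helper names `max_abs_diff` and `cross_rank_close` are hypothetical, and the only data used are the six cross-rank diffs quoted in the commit message.

```python
# Sketch only: hypothetical helpers, not the project's T8 test.

def max_abs_diff(p0, p1):
    """Largest element-wise |p0[i] - p1[i]| between two ranks' parameters."""
    return max(abs(a - b) for a, b in zip(p0, p1))

def cross_rank_close(p0, p1, tol=1e-6):
    """Tolerance-based replacement for the exact `== 0.0` check."""
    return max_abs_diff(p0, p1) < tol

# Cross-rank diffs reported for 6 reruns of the original ungrouped path.
observed = [0.0, 0.0, 5.96e-8, 5.96e-8, 1.19e-7, 1.19e-7]

exact_pass = sum(d == 0.0 for d in observed)  # runs passing the strict check
tol_pass = sum(d < 1e-6 for d in observed)    # runs passing the loosened check

print(f"exact == 0.0: {exact_pass}/6, tol < 1e-6: {tol_pass}/6")
```

With the reported numbers, the strict check passes on 2 of 6 runs (the ~1/3 rate noted above), while the `< 1e-6` check passes on all 6 and still sits well below the ~6e-7 loss-match scale only by a small margin, so the exact threshold remains a judgment call for the follow-up.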