docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)

- known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix + regression test), with the meta-lesson that op/single-GPU unit tests can miss launcher-level integration gaps; only the V9-PILOT end-to-end run on the real launcher path exposed it.
- 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap and its T21 fix.
- evolution.md: T21 row (Infra) recording the fix + meta-lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:22:49 +08:00
parent 980605474b
commit a1370446fe
3 changed files with 15 additions and 0 deletions
--- a/docs/known-issues.md
+++ b/docs/known-issues.md
@@ -13,6 +13,15 @@ _(KI-1 fixed in T10. KI-5 fixed in T11. KI-2 fixed in T12. **KI-3（激活重计

 ## Fixed

+### DDP-dropout wiring (launcher never wired dropout) — `FIXED` (T21)
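+- **Background (exposed by V9-PILOT)**: T18 wired dropout completely into the **single-GPU** `train.rs` (`--dropout` flag → `cfg.dropout`, `model.train()` every step, `model.eval()` for eval), and both the op level and the single-GPU path were tested. But the V9-PILOT full-stack end-to-end run (8-GPU DDP + dropout 0.1 + flash + GQA + accum + bf16) revealed that **the DDP path never wired dropout at all**: the `train_ddp` bin **had no `--dropout` flag and never set `cfg.dropout`**, and `ddp.rs::train_rank` **never called `model.train()`** → the model stayed in its default eval mode, so `ops::dropout` was the identity → under DDP, dropout was **silently ignored regardless of the config**. The model and autodiff fully support dropout (T18); the only thing missing was the **wiring in the DDP launcher / training loop**.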
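+- **Why the op/single-GPU tests missed it**: the dropout tests covered only the **single-GPU training loop + op-level grad-check**; dropout was never run **on the DDP path**. `train_rank` is a separate loop from the single-GPU `train()`; the two share the model and autodiff but **each wires its own train/eval discipline**, so the single-GPU loop being correct says nothing about the DDP loop. **Meta-lesson: op-level + single-GPU unit tests can miss launcher-level integration gaps**; only running the feature end to end on the **real launcher path** (what the pilot does) exposes them. The fix also adds a regression test on the DDP path to close this gap.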
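+- **Fix ([docs/17-dropout.md](17-dropout.md))**: ① `train_ddp.rs` gains a `--dropout <p>` flag (default 0 = off, matching the old behavior) and sets `cfg.dropout`; ② `ddp.rs::train_rank` calls `model.train()` every step before the micro-batch loop, mirroring the single-GPU loop's train/eval discipline. **Key point**: `eval_loss()` calls `model.eval()` internally, which switches to eval mode and **does not restore it**, so `model.train()` is re-asserted every step to keep dropout live across eval boundaries.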
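+- **Correctness (new DDP-dropout regression test `ddp_dropout_is_live_and_p0_bit_identical`, which runs the real `train_rank` launcher path)**: ① **GATE A**: at `p=0`, the DDP loss trajectory and final parameters are **bit-identical** to the no-dropout path (`ops::dropout(p=0)` is a clone no-op; regression guard); ② **GATE B**: the `p=0.2` loss trajectory **differs measurably** (>1e-3) from `p=0`, proving the dropout mask really is applied in the training forward pass (pre-T21 code stayed in eval mode → the two would be bit-identical and this gate would FAIL); ③ **GATE C**: after the run, `model.is_training()==true` (direct proof that `model.train()` was called and survives the final-step eval); ④ the p>0 run produces no NaN/Inf. The test deliberately sets `eval_every < steps` so that eval flips the model to eval mode mid-run, which exercises the per-step `model.train()` restore discipline. With the default `--dropout 0`, the existing DDP loss-match and cross-rank tests are **unchanged** (regression guard).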
+- **commit**: see the T21 commit chain (`distributed: --dropout flag + model.train() per step in train_rank` / `test: DDP-dropout regression (live under DDP + p=0 bit-identical)` / docs update).
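To make the train/eval restore discipline in the T21 entry concrete, here is a minimal, self-contained sketch. It is not the project's code: `Model`, `dropout`, `eval_loss`, `TrainCall`, and `live_steps` are toy stand-ins invented for illustration, and the placement of eval at the end of a step is an assumption. It only models the mode flag and an inverted-dropout op that is the identity in eval mode, then compares three ways a loop might call `train()`.

```rust
// Toy stand-ins for illustration only; these are NOT the real xtrain types.
struct Model {
    training: bool,
}

impl Model {
    fn new() -> Self {
        Model { training: false } // default is eval mode, as the entry describes
    }
    fn train(&mut self) { self.training = true; }
    fn eval(&mut self) { self.training = false; }
    fn is_training(&self) -> bool { self.training }
}

/// Inverted dropout: identity in eval mode or when p == 0.
fn dropout(x: &[f32], p: f32, training: bool, seed: &mut u64) -> Vec<f32> {
    if !training || p == 0.0 {
        return x.to_vec();
    }
    let scale = 1.0 / (1.0 - p);
    x.iter()
        .map(|&v| {
            // xorshift64 step; the top 24 bits give a uniform value in [0, 1).
            *seed ^= *seed << 13;
            *seed ^= *seed >> 7;
            *seed ^= *seed << 17;
            let u = (*seed >> 40) as f32 / (1u64 << 24) as f32;
            if u < p { 0.0 } else { v * scale }
        })
        .collect()
}

/// Mirrors the described eval_loss(): switches to eval mode and does not restore.
fn eval_loss(model: &mut Model) {
    model.eval();
}

#[derive(Clone, Copy)]
enum TrainCall {
    Never,          // the pre-fix DDP loop as described
    OnceBeforeLoop, // a tempting but insufficient fix
    EveryStep,      // the discipline the entry describes
}

/// Returns how many steps ran with dropout actually altering the activations.
fn live_steps(policy: TrainCall, steps: usize, eval_every: usize, p: f32) -> usize {
    let mut model = Model::new();
    let mut seed = 0x9E37_79B9_7F4A_7C15u64;
    let x = vec![1.0f32; 64];
    let mut live = 0;
    if let TrainCall::OnceBeforeLoop = policy {
        model.train();
    }
    for step in 0..steps {
        if let TrainCall::EveryStep = policy {
            model.train(); // re-assert train mode after any earlier eval
        }
        let y = dropout(&x, p, model.is_training(), &mut seed);
        if y != x {
            live += 1;
        }
        if (step + 1) % eval_every == 0 {
            eval_loss(&mut model); // assumed placement: periodic eval after the step
        }
    }
    live
}

fn main() {
    let policies = [
        ("never", TrainCall::Never),
        ("once before loop", TrainCall::OnceBeforeLoop),
        ("every step", TrainCall::EveryStep),
    ];
    for (name, policy) in policies {
        println!("{name}: {} of 6 steps had live dropout", live_steps(policy, 6, 2, 0.2));
    }
}
```

With 6 steps, `eval_every = 2`, and `p = 0.2`, the three policies report 0, 2, and 6 live steps respectively. The middle case is the reason a single `train()` call before the loop would not be enough: the first eval turns dropout off for the rest of the run.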
+
+---
+
 ### process-per-GPU (torchrun-style independent CUDA contexts) — `CLOSED / measured negative result` (T17)
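 - **Background**: after KI-5 (T11) removed the per-op `cudaMalloc` serialization, 8-GPU scaling recovered from ~1.3× to **~5×@8**, but that residual ~5×@8 is still short of perfectly linear. The T11 doc / KI-5 "residual" note conjectured that the next step was **process-per-GPU** (one process per rank with its own CUDA context, torchrun-style), on the grounds that "N rank threads share a single CUDA primary context, so kernel launches / cuBLAS still serialize at the context level". **T17 implemented this torchrun-style path, measured it, and falsified that conjecture.**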
 - **Implementation ([docs/16-process-per-gpu.md](16-process-per-gpu.md))**: `xtrain-distributed` gains `proc.rs`: `launch_processes` spawns one worker process per GPU (re-exec of current_exe + `XTRAIN_{RANK,WORLD,LOCAL_RANK,NCCL_ID}` env vars); **the launcher mints the `ncclUniqueId` once and injects it hex-encoded into the child processes' env** (no shared FS/TCP, no polling, no races: the id is atomically ready before any child is born); the worker reads the env → binds its device (independent CUDA context) → `DdpContext::init` + `build_model` + `train_rank`, **all reused from T8 with zero changes**. New `train_ddp_mp` bin + `ddp_proc` test; **the old thread-per-GPU path is kept** (regression baseline). scope = process-per-GPU only (ZeRO-1 dropped by the user).
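The env-based id handoff described in the implementation bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `proc.rs`: `to_hex`, `from_hex`, and `worker_env` are hypothetical helper names, the id is a placeholder byte pattern rather than a value from `ncclGetUniqueId`, and `XTRAIN_LOCAL_RANK` is set equal to the rank on the assumption of a single node. Only the `XTRAIN_*` variable names come from the entry itself.

```rust
use std::process::Command;

// Size of the opaque ncclUniqueId buffer (NCCL_UNIQUE_ID_BYTES in nccl.h).
const ID_BYTES: usize = 128;

/// Lowercase hex encoding, two characters per byte.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Inverse of `to_hex`; returns None on odd length or non-hex characters.
fn from_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

/// Environment a launcher could hand to one worker (hypothetical helper).
fn worker_env(rank: usize, world: usize, id_hex: &str) -> Vec<(&'static str, String)> {
    vec![
        ("XTRAIN_RANK", rank.to_string()),
        ("XTRAIN_WORLD", world.to_string()),
        ("XTRAIN_LOCAL_RANK", rank.to_string()), // assumption: single node
        ("XTRAIN_NCCL_ID", id_hex.to_string()),
    ]
}

fn main() {
    // Placeholder id; a real launcher would obtain these bytes from NCCL once.
    let id: Vec<u8> = (0..ID_BYTES).map(|i| i as u8).collect();
    let id_hex = to_hex(&id);

    for rank in 0..2 {
        // "worker-placeholder" stands in for re-executing the current binary.
        let mut cmd = Command::new("worker-placeholder");
        cmd.envs(worker_env(rank, 2, &id_hex));
        // A real launcher would call cmd.spawn() here; this sketch only builds it.
    }

    // Worker side: decoding the env value recovers the original bytes.
    assert_eq!(from_hex(&id_hex).as_deref(), Some(&id[..]));
    println!("id encoded as {} hex characters", id_hex.len());
}
```

Because the id is fixed before any child is spawned, each worker can decode it from its own environment at startup without coordinating with the others, which is the property the bullet highlights.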