docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)

- known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix + regression test), with the meta-lesson that op/single-GPU unit tests can miss launcher-level integration gaps; only the V9-PILOT end-to-end run on the real launcher path exposed it.
- 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap and its T21 fix.
- evolution.md: T21 row (Infra) recording the fix + meta-lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:22:49 +08:00
parent 980605474b
commit a1370446fe
3 changed files with 15 additions and 0 deletions
--- a/docs/17-dropout.md
+++ b/docs/17-dropout.md
@@ -131,6 +131,11 @@ stays unchanged within a forward pass**: this design satisfies that naturally, since the mask depends only on `(seed, i)`
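 - **DDP (T8)**: each rank runs its own forward/backward independently, and each rank's mask is determined by that rank's `base_seed`.
  This task's DDP gate is "loss matches single-GPU / parameters match across ranks", run under the regression config with **dropout off (default p=0)**,
  so it introduces no cross-rank mask synchronization requirement (with p>0 each rank's mask should differ anyway, which is normal DDP semantics).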
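+  - **⚠️ T18 launcher wiring gap → FIXED in T21**: T18 wired dropout only into the **single-GPU** `train.rs`;
+    the `train_ddp` bin / `train_rank` loop was **not wired** (no `--dropout` flag, `model.train()` never called),
+    so dropout was silently ignored on the DDP path. Only the V9-PILOT full-stack run exposed it (op + single-GPU tests do not cover the launcher level).
+    **T21** closes the gap: `train_ddp` gains `--dropout`, `train_rank` calls `model.train()` every step (restored after eval),
+    and a DDP-dropout regression test is added (dropout live at p>0 + bitwise-identical at p=0). See known-issues "DDP-dropout wiring".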
 - **Gradient accumulation (T16) / flash (T14)**: this branch is independent of both and does not depend on their unmerged changes.

 ## Verification method