docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)
- known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix + regression test), with the meta-lesson that op/single-GPU unit tests can miss launcher-level integration gaps; only the V9-PILOT end-to-end run on the real launcher path exposed it.
- 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap and its T21 fix.
- evolution.md: T21 row (Infra) recording the fix + meta-lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -131,6 +131,11 @@ stays unchanged within forward**: this design satisfies that by construction, since the mask is determined only by `(seed, i)`
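- **DDP (T8)**: each rank runs its own forward/backward independently, and each rank's mask is determined by that rank's `base_seed`.
  The DDP gate for this task is "loss matches single-GPU / parameters match across ranks", and it runs under the regression configuration with **dropout off (default p=0)**,
  so it introduces no need for cross-rank mask synchronization (with p>0 the masks are supposed to differ across ranks, which is normal DDP semantics).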
- **⚠️ T18 launcher wiring gap → FIXED in T21**: T18 wired dropout only into the **single-GPU** `train.rs`;
  the `train_ddp` bin / `train_rank` loop was **not wired** (no `--dropout` flag, and `model.train()` was never called),
  so dropout was silently ignored on the DDP path. Only the V9-PILOT full-stack run exposed this (op and single-GPU tests do not cover the launcher level).
  **T21** closes the gap: `train_ddp` gains `--dropout`, `train_rank` calls `model.train()` every step (restored after eval),
  and a DDP-dropout regression test is added (dropout live at p>0, bitwise identical at p=0). See the known-issues entry "DDP-dropout wiring".
- **Gradient accumulation (T16) / flash (T14)**: this branch is independent of both and does not depend on their unmerged changes.
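
The wiring described in the T21 note can be sketched roughly as follows. This is a minimal illustration, not the project's code: `ToyModel`, its fields, `mix`, and the `eval_every` parameter are hypothetical names, and the real `train_rank` signature and mask generator are not shown in this commit. The sketch assumes the mask for element `i` depends only on `(base_seed, i)`, and that the loop re-enters training mode every step and restores it after an eval pass.

```rust
/// Hypothetical stand-in for the real model: a single dropout "layer".
pub struct ToyModel {
    pub training: bool,
    pub p: f32,         // drop probability, assumed 0.0 <= p < 1.0
    pub base_seed: u64, // per-rank seed; different ranks get different masks
}

/// SplitMix64-style mixer, used only to derive a deterministic value from (seed, i).
/// The real mask generator is an assumption here.
fn mix(seed: u64, i: u64) -> u64 {
    let mut z = seed.wrapping_add(i.wrapping_mul(0x9E37_79B9_7F4A_7C15));
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

impl ToyModel {
    pub fn train(&mut self) {
        self.training = true;
    }

    pub fn eval(&mut self) {
        self.training = false;
    }

    /// Inverted dropout: active only in training mode with p > 0.
    /// With p == 0 the input is copied through untouched (bitwise identical).
    pub fn forward(&self, x: &[f32]) -> Vec<f32> {
        if !self.training || self.p == 0.0 {
            return x.to_vec();
        }
        let scale = 1.0 / (1.0 - self.p);
        x.iter()
            .enumerate()
            .map(|(i, &v)| {
                // Uniform value in [0, 1) built from the top 24 bits of the mixed hash.
                let u = (mix(self.base_seed, i as u64) >> 40) as f32 / (1u64 << 24) as f32;
                if u < self.p { 0.0 } else { v * scale }
            })
            .collect()
    }
}

/// Sketch of a per-rank loop: training mode is set every step, and restored
/// after each eval pass so the next training forward still applies dropout.
pub fn train_rank(
    model: &mut ToyModel,
    x: &[f32],
    steps: usize,
    eval_every: usize,
) -> Vec<Vec<f32>> {
    let mut train_outputs = Vec::with_capacity(steps);
    for step in 0..steps {
        model.train();
        train_outputs.push(model.forward(x));
        if eval_every > 0 && (step + 1) % eval_every == 0 {
            model.eval();
            let _val = model.forward(x); // eval pass: dropout must be off
            model.train(); // restore training mode after eval
        }
    }
    train_outputs
}
```

A regression check in the spirit of the one described above would run this loop once with `p = 0.0` and compare outputs bit for bit against the input, then run it with `p > 0` and confirm that some, but not all, elements are zeroed during training steps.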
## Verification method