docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc)

- known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix +
  regression test), with the meta-lesson that op/single-GPU unit tests can
  miss launcher-level integration gaps — only the V9-PILOT end-to-end run on
  the real launcher path exposed it.
- 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap
  and its T21 fix.
- evolution.md: T21 row (Infra) recording the fix + meta-lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:22:49 +08:00
parent 980605474b
commit a1370446fe
3 changed files with 15 additions and 0 deletions

@@ -131,6 +131,11 @@ stays unchanged within forward**; this design satisfies that by construction: the mask depends only on `(seed, i)`
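- **DDP (T8)**: each rank runs its own forward/backward independently, and each rank's mask is determined by that rank's `base_seed`.
The DDP gate for this task is "loss matches single-GPU / parameters consistent across ranks", run under the regression config with **dropout off (default p=0)**,
so it introduces no cross-rank mask synchronization requirement (with p>0 the per-rank masks are supposed to differ anyway, which is normal DDP semantics).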
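- **⚠️ T18 launcher wiring gap → FIXED in T21**: T18 wired dropout only into the **single-GPU** `train.rs`;
the `train_ddp` bin / `train_rank` loop was **not wired** (no `--dropout` flag, never calls `model.train()`),
so dropout was silently ignored on the DDP path. Only the V9-PILOT full-stack real run exposed this (op + single-GPU tests do not reach the launcher level).
**T21** closes the gap: `train_ddp` gains `--dropout`, `train_rank` calls `model.train()` every step (restoring train mode after eval),
and a DDP-dropout regression test is added (dropout live at p>0 + bit-exact match at p=0). See known-issues "DDP-dropout wiring".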
- **Gradient accumulation (T16) / flash (T14)**: this branch is independent of both and does not depend on their unmerged changes.
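
The DDP notes above can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in rather than the project's code: the `Model` struct, the `keep` hash (a SplitMix64-style mixer chosen only so the mask depends on `(seed, i)` alone), and the `train_rank` signature are all assumptions. It shows only the shape of the fix: the per-rank loop re-enters training mode on every step, so an eval pass cannot leave dropout silently disabled, and p=0 leaves the values bit-identical.

```rust
// Hypothetical stand-ins for illustration; not the real train_ddp / train_rank code.

/// Keep/drop decision that depends only on `(seed, i)` (SplitMix64-style mixer).
fn keep(seed: u64, i: u64, p: f64) -> bool {
    let mut z = seed.wrapping_add(i.wrapping_mul(0x9E37_79B9_7F4A_7C15));
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^= z >> 31;
    // Map the top 53 bits to a value in [0, 1).
    let u = (z >> 11) as f64 / (1u64 << 53) as f64;
    u >= p
}

struct Model {
    training: bool,
    p: f64,
}

impl Model {
    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    /// Inverted dropout: active only in training mode with p > 0.
    fn forward(&self, x: &[f64], seed: u64) -> Vec<f64> {
        if !self.training || self.p == 0.0 {
            return x.to_vec();
        }
        let scale = 1.0 / (1.0 - self.p);
        x.iter()
            .enumerate()
            .map(|(i, &v)| if keep(seed, i as u64, self.p) { v * scale } else { 0.0 })
            .collect()
    }
}

/// One rank's loop (assumed shape): `model.train()` is called on every step,
/// so training mode is restored after any eval pass.
fn train_rank(model: &mut Model, base_seed: u64, steps: u64, eval_every: u64) -> Vec<Vec<f64>> {
    let x = vec![1.0; 64];
    let mut outs = Vec::new();
    for step in 0..steps {
        model.train();
        outs.push(model.forward(&x, base_seed.wrapping_add(step)));
        if eval_every > 0 && (step + 1) % eval_every == 0 {
            model.eval();
            // Eval passes must be dropout-free regardless of p.
            assert_eq!(model.forward(&x, base_seed), x);
        }
    }
    outs
}
```

A regression test in the spirit of the one described above would run this loop once with p=0 and check that the outputs are unchanged, then once with p>0 and check that some activations are actually zeroed on steps that follow an eval pass.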
## Verification method