Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher
path (DdpContext::init + train_rank). It would have caught the original bug:
- GATE A (world=1, ONE step, the deterministic scope): the p=0 FORWARD is
byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op), so the
step loss is BIT-IDENTICAL (difference == 0.0). At world=1 the NCCL
all-reduce short-circuits and a single step has no optimizer-state
compounding; the only residual non-determinism is the engine's atomicAdd
backward-reduction order (the documented fresh-train md5 caveat, which is
dropout-independent), so the post-step params are checked against that
tight ULP floor (< 1e-7).
- GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's
run-to-run ULP noise (< 1e-6, KI-5 — the all-reduce is not bit-reproducible
on this PCIe box). So enabling dropout=0 does not perturb the DDP path
beyond that noise.
- GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0 —
orders of magnitude above every noise floor here (~3e-2 observed). On the
pre-T21 code the model stays in eval mode, so p=0.2 would be an identity,
the trace would match p=0 at the noise floor, and this gate would fail.
(Verified by simulating the bug: with model.train() removed, the GATE B
difference drops to 2.4e-7.)
- GATE C: a dedicated no-eval run ends with model.is_training() == true,
direct proof that train_rank called model.train().
- p>0 run is finite (no NaN/Inf).
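
The comparisons behind the gates above can be sketched as follows. This is
an illustrative sketch only, not the test's actual code: the helper names
(bit_identical, max_abs_diff, all_finite) are hypothetical, and it assumes
losses and params are exposed as f32 values and slices.

```rust
/// True only when two f32 values share the same bit pattern.
/// Stricter than `==`, which treats 0.0 and -0.0 as equal.
/// (Hypothetical helper for a GATE A style "bit-identical" check.)
fn bit_identical(a: f32, b: f32) -> bool {
    a.to_bits() == b.to_bits()
}

/// Largest absolute elementwise difference between two equal-length traces.
/// Usable for "< floor" checks (GATE A params, GATE A2) and for a
/// "> threshold" check (GATE B).
/// Note: f32::max ignores NaN operands, so a NaN difference is not
/// reported here; finiteness is checked separately by `all_finite`.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "traces must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0_f32, f32::max)
}

/// True when every value in the trace is neither NaN nor +/-Inf.
fn all_finite(trace: &[f32]) -> bool {
    trace.iter().all(|v| v.is_finite())
}
```

Under these assumptions a GATE B style check would read
max_abs_diff(&p02_trace, &p0_trace) > 1e-3, and the noise-floor gates would
use the same helper with < 1e-6 or < 1e-7.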
The test sets eval_every < steps so that a periodic eval fires mid-run
(flipping to eval mode), exercising the per-step model.train() restore
discipline the pilot called out.
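
A toy model of that restore discipline, for illustration only. ToyModel and
run_loop are hypothetical stand-ins, not the real model or train_rank; the
only assumption carried over from the text is the train() / eval() /
is_training() trio and a periodic eval controlled by eval_every.

```rust
/// Hypothetical stand-in for the real model: it only tracks the mode flag.
struct ToyModel {
    training: bool,
}

impl ToyModel {
    fn train(&mut self) {
        self.training = true;
    }
    fn eval(&mut self) {
        self.training = false;
    }
    fn is_training(&self) -> bool {
        self.training
    }
}

/// Runs `steps` steps and returns how many of them executed in training
/// mode. A periodic eval fires after every `eval_every`-th step
/// (0 disables it) and leaves the model in eval mode. With `restore`
/// set, train() is called again at the top of every step.
fn run_loop(model: &mut ToyModel, steps: usize, eval_every: usize, restore: bool) -> usize {
    let mut trained_steps = 0;
    model.train(); // enter training mode once before the loop
    for step in 1..=steps {
        if restore {
            model.train(); // per-step restore after any earlier eval
        }
        if model.is_training() {
            trained_steps += 1; // dropout would be live on this step
        }
        if eval_every > 0 && step % eval_every == 0 {
            model.eval(); // periodic eval flips the mode
        }
    }
    trained_steps
}
```

In this sketch, dropping the per-step restore leaves every step after the
first eval in eval mode, which is the failure a mid-run eval is meant to
expose.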
Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>