test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical)

Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher path
(DdpContext::init + train_rank). It would have caught the original bug:

- GATE A (world=1, ONE step, the deterministic scope): the p=0 FORWARD is
  byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op) so the step
  loss is BIT-IDENTICAL (== 0.0). At world=1 the NCCL all-reduce short-circuits
  and one step has no optimizer-state compounding; the only residual
  non-determinism is the engine's atomicAdd backward-reduction order (the
  documented fresh-train md5 caveat, dropout-independent), so the post-step
  params are checked against that tight ULP floor (< 1e-7).
- GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's
  run-to-run ULP noise (< 1e-6, KI-5: the all-reduce is not bit-reproducible on
  this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond it.
- GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0, orders
  of magnitude above every noise floor here (~3e-2 observed). On the pre-T21
  code the model stays in eval mode, so p=0.2 would be an identity and the trace
  would match p=0 at the noise floor, so this gate fails. (Verified by
  simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.)
- GATE C: a dedicated no-eval run ends with model.is_training() == true, direct
  proof that train_rank called model.train().
- p>0 run is finite (no NaN/Inf).

eval_every < steps so a periodic eval fires mid-run (flipping to eval mode),
exercising the per-step model.train() restore discipline the pilot called out.

Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:22:49 +08:00
parent 81f3cf59e5
commit 980605474b
1 changed files with 225 additions and 0 deletions
--- a/crates/xtrain-distributed/tests/ddp_correctness.rs
+++ b/crates/xtrain-distributed/tests/ddp_correctness.rs
@@ -38,6 +38,53 @@ fn test_config(vocab: usize) -> Config {
    cfg
 }

+/// Run `cfg`/`dcfg` as a DDP job over `devices` (the same launcher path as
+/// production — `DdpContext::init` + `train_rank` per rank) and return rank 0's
+/// (loss trace, final params on host, final `is_training()` flag). `cfg` carries
+/// the dropout prob; `dcfg` carries the loop knobs. Caller asserts.
+///
+/// `world == 1` is the deterministic path: `all_reduce_average_grads` short-circuits
+/// (no NCCL collective), so the run is bit-reproducible — used for the bit-identity
+/// gate. `world >= 2` exercises the real cross-rank NCCL all-reduce, which is not
+/// bit-reproducible run-to-run on this PCIe box (KI-5), so those gates use the same
+/// ULP/relative tolerances as the rest of this file.
+fn run_ddp(
+    devices: &[u32],
+    cfg: Config,
+    corpus: &Corpus,
+    valid: Option<&Corpus>,
+    dcfg: &DdpConfig,
+) -> (Vec<f32>, Vec<Vec<f32>>, bool) {
+    let world = devices.len();
+    let id = get_unique_id();
+    let results: Vec<(Vec<f32>, Vec<Vec<f32>>, bool)> = std::thread::scope(|s| {
+        let handles: Vec<_> = devices
+            .iter()
+            .enumerate()
+            .map(|(rank, &dev)| {
+                let dcfg = dcfg.clone();
+                let corpus = &corpus;
+                s.spawn(move || {
+                    let ctx = DdpContext::init(rank, world, id, dev);
+                    let device = Device::Cuda(dev);
+                    let model = build_model(cfg, device);
+                    // Only rank 0 holds the val corpus (mirrors launch()).
+                    let v = if rank == 0 { valid } else { None };
+                    let res = train_rank(&ctx, &model, device, corpus, v, &dcfg);
+                    let host = model
+                        .params()
+                        .iter()
+                        .map(|p| p.value().to_device(Device::Cpu).as_slice::<f32>().to_vec())
+                        .collect::<Vec<_>>();
+                    (res.losses, host, model.is_training())
+                })
+            })
+            .collect();
+        handles.into_iter().map(|h| h.join().unwrap()).collect()
+    });
+    results.into_iter().next().unwrap()
+}
+
 // Single-GPU baseline: the SAME loop as the DDP rank but world=1, so the global
 // batch is processed on one device. Returns (loss trace, final params on host).
 fn run_single_gpu(cfg: Config, corpus: &Corpus, dcfg: &DdpConfig) -> (Vec<f32>, Vec<Vec<f32>>) {
@@ -386,3 +433,181 @@ fn ddp_throughput_scaling() {
        );
    }
 }
+
+/// T21 regression: prove dropout is actually LIVE under DDP (with `p>0`), and that
+/// `p=0` is bit-identical to the no-dropout path. Guards the V9-PILOT launcher-
+/// wiring gap — `train_ddp` had no `--dropout` flag and `train_rank` never called
+/// `model.train()`, so under DDP every forward ran in the default eval mode and
+/// dropout was a silent identity regardless of config. Op/single-GPU tests never
+/// exercised dropout-under-DDP, so it slipped through; this test runs the REAL
+/// launcher path (`DdpContext::init` + `train_rank`).
+///
+/// On the pre-T21 code, both load-bearing gates FAIL: GATE B (p>0 trace would match
+/// p=0 to within the NCCL noise floor: model stuck in eval mode → dropout is identity)
+/// and GATE C (`is_training()` would be false after the run).
+///
+/// p=0 regression (GATE A) is checked at `world=1`, ONE step, where the NCCL
+/// all-reduce short-circuits: the p=0 FORWARD is byte-identical to no-dropout so the
+/// loss is BIT-IDENTICAL (== 0.0), and the post-step params match within the engine's
+/// atomicAdd backward-reduction ULP floor (< 1e-7, dropout-independent — the
+/// fresh-train md5 caveat). The cross-rank NCCL all-reduce (`world>=2`) is not
+/// bit-reproducible run-to-run on this PCIe box (KI-5, observed ≤~2.4e-7), so the
+/// `world=2` p=0-vs-no-dropout check (GATE A2) uses the same KI-5 ULP tolerance as the
+/// rest of this file. GATE B's live-dropout signal (>1e-3) sits ~4 orders of magnitude
+/// above every noise floor here, so it carries the load.
+#[test]
+fn ddp_dropout_is_live_and_p0_bit_identical() {
+    if device::device_count().unwrap_or(0) < 2 {
+        eprintln!("skip: need >= 2 GPUs");
+        return;
+    }
+
+    let vocab = 64usize;
+    let corpus = synth_corpus(vocab, 4096);
+    let steps = 20usize;
+    // eval_every < steps so a periodic eval fires MID-run (flipping the model to
+    // eval mode via eval_loss → model.eval()). The per-step model.train() must
+    // restore training mode so dropout stays live across the eval boundary — this is
+    // exactly the train/eval discipline the pilot called out. A held-out slice gives
+    // rank 0 something to eval on.
+    let valid = synth_corpus(vocab, 512);
+    let base_dcfg = DdpConfig {
+        seq_len: 32,
+        batch_size: 8, // global; 4 per rank with world=2
+        accum_steps: 1,
+        steps,
+        schedule: LrSchedule {
+            max_lr: 3e-3,
+            min_lr: 3e-4,
+            warmup: 3,
+            total: steps,
+        },
+        weight_decay: 0.1,
+        max_grad_norm: 1.0,
+        log_every: 1_000_000, // silence per-step logging
+        seed: 7,
+        eval_every: 7, // fires at steps 6, 13, 19 — flips to eval mode mid-run
+        eval_batches: 4,
+        ckpt_path: None,
+    };
+
+    // --- GATE A: p=0 == no-dropout at world=1, ONE step (the deterministic scope). ---
+    // The regression guard for `--dropout 0`. ops::dropout(p=0) returns x.clone() (a
+    // graph no-op) regardless of training mode, so the p=0 FORWARD graph is byte-for-
+    // byte the no-dropout forward → loss[0] must be BIT-IDENTICAL (the load-bearing
+    // claim, asserted == 0.0). At world=1 the NCCL all-reduce short-circuits, and one
+    // step has no optimizer-state compounding; the only residual non-determinism is
+    // the engine's atomicAdd backward-reduction ORDER (the documented fresh-train md5
+    // caveat — dropout-INDEPENDENT, present with or without the dropout op), which
+    // moves the post-step params by a single grad ULP. So params are checked against
+    // that tight reduction floor (< 1e-7). It is the same kind of tolerance as the
+    // cross-rank KI-5 one used elsewhere in this file, not a dropout signal. GATE B
+    // (live) has a >1e-3 signal, ~4 orders of magnitude above this floor, and carries the load.
+    let d1 = [0u32];
+    let dcfg_1step = DdpConfig {
+        steps: 1,
+        eval_every: 0,
+        ..base_dcfg.clone()
+    };
+    let cfg_nodrop = test_config(vocab); // cfg.dropout defaults to 0.0
+    assert_eq!(cfg_nodrop.dropout, 0.0, "baseline cfg must have dropout 0");
+    let mut cfg_p0 = test_config(vocab);
+    cfg_p0.dropout = 0.0; // explicitly set p=0 — must not perturb anything
+    let (loss_nd1, params_nd1, _) = run_ddp(&d1, cfg_nodrop, &corpus, None, &dcfg_1step);
+    let (loss_p01, params_p01, _) = run_ddp(&d1, cfg_p0, &corpus, None, &dcfg_1step);
+    let max_loss_diff_1 = (loss_nd1[0] - loss_p01[0]).abs();
+    let max_param_diff_1 = params_nd1
+        .iter()
+        .zip(&params_p01)
+        .flat_map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).abs()))
+        .fold(0.0f32, f32::max);
+    println!(
+        "T21 GATE A (world=1, 1 step, p=0 vs no-dropout): |loss diff| = {max_loss_diff_1:.3e} \
+         (bit-identical forward), max |param diff| = {max_param_diff_1:.3e} (atomicAdd floor)"
+    );
+    assert_eq!(
+        max_loss_diff_1, 0.0,
+        "world=1 p=0 forward loss not bit-identical to no-dropout path"
+    );
+    assert!(
+        max_param_diff_1 < 1e-7,
+        "world=1 p=0 post-step params diverged from no-dropout beyond the atomicAdd \
+         reduction floor: {max_param_diff_1:.3e}"
+    );
+
+    // --- world=2 runs: real cross-rank NCCL all-reduce (the production path). ---
+    let d2 = [0u32, 1u32];
+    let mut cfg_p0_w2 = test_config(vocab);
+    cfg_p0_w2.dropout = 0.0;
+    let mut cfg_p_w2 = test_config(vocab);
+    cfg_p_w2.dropout = 0.2;
+    let (loss_p0_2, _params_p0_2, _) = run_ddp(&d2, cfg_p0_w2, &corpus, Some(&valid), &base_dcfg);
+    let (loss_p_2, _params_p_2, _) = run_ddp(&d2, cfg_p_w2, &corpus, Some(&valid), &base_dcfg);
+
+    // GATE A2 — under DDP (world=2), p=0 matches a separate no-dropout baseline within
+    // NCCL's run-to-run ULP noise (KI-5; the all-reduce is not bit-reproducible). This
+    // confirms enabling dropout=0 doesn't perturb the DDP path beyond that noise floor.
+    let (loss_nd_2, _, _) = run_ddp(&d2, test_config(vocab), &corpus, Some(&valid), &base_dcfg);
+    let max_loss_diff_2 = loss_nd_2
+        .iter()
+        .zip(&loss_p0_2)
+        .map(|(a, b)| (a - b).abs())
+        .fold(0.0f32, f32::max);
+    println!(
+        "T21 GATE A2 (world=2 p=0 vs no-dropout, KI-5 noise): max |loss diff| = {max_loss_diff_2:.3e}"
+    );
+    assert!(
+        max_loss_diff_2 < 1e-6,
+        "world=2 p=0 diverged from no-dropout beyond NCCL noise: {max_loss_diff_2:.3e}"
+    );
+
+    // GATE B — dropout is LIVE with p>0 under DDP. If model.train() were not wired
+    // (the pre-T21 bug), the model would stay in eval mode and the p=0.2 forward would
+    // be IDENTITY → loss trace equal to p=0 up to the ~1e-7 NCCL noise floor (KI-5).
+    // A difference orders of magnitude above that floor proves dropout masks are
+    // actually applied during the training forward — and that they survive the mid-run
+    // eval flips (model.train() is re-asserted each step). Inverted scaling + masking
+    // perturbs every step, so the gap is large (>1e-3 ≫ KI-5 noise ~2.4e-7).
+    let max_live_diff = loss_p0_2
+        .iter()
+        .zip(&loss_p_2)
+        .map(|(a, b)| (a - b).abs())
+        .fold(0.0f32, f32::max);
+    println!(
+        "T21 GATE B (dropout live, world=2): p0[last]={:.6} p0.2[last]={:.6} max |loss diff| = {max_live_diff:.3e}",
+        loss_p0_2.last().unwrap(),
+        loss_p_2.last().unwrap()
+    );
+    assert!(
+        max_live_diff > 1e-3,
+        "p=0.2 DDP loss trace matches p=0 — dropout is NOT live under DDP \
+         (model.train() not wired): max |loss diff| {max_live_diff:.3e}"
+    );
+
+    // No NaN/Inf in the p>0 run (dropout does not destabilize training under DDP).
+    assert!(
+        loss_p_2.iter().all(|l| l.is_finite()),
+        "p=0.2 DDP loss has non-finite values"
+    );
+
+    // GATE C — train_rank actually sets TRAINING mode (direct, complementary proof of
+    // model.train() being wired). Use a dedicated short run with eval_every=0 so no
+    // eval fires: a model that finishes a training step in training mode proves
+    // train_rank called model.train(). (With eval enabled, eval_loss → model.eval()
+    // runs LAST on the final step and legitimately leaves the model in eval mode —
+    // same as the single-GPU loop — so is_training() after an eval-enabled run reflects
+    // the final eval, not the training-mode wiring. GATE B already proves dropout
+    // survives the mid-run eval flips via the per-step model.train() restore.) On the
+    // pre-T21 code is_training() stays false (model never left the default eval mode).
+    let dcfg_noeval = DdpConfig {
+        steps: 2,
+        eval_every: 0,
+        ..base_dcfg.clone()
+    };
+    let (_, _, train_flag) = run_ddp(&d1, cfg_p_w2, &corpus, None, &dcfg_noeval);
+    assert!(
+        train_flag,
+        "model not in training mode after a no-eval DDP run — model.train() not wired in train_rank"
+    );
+    println!("T21 GATE C (train_rank sets training mode): is_training() == true ✅");
+}
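
The gate logic above can be illustrated without GPUs or NCCL. The following is a minimal CPU-only sketch, not the xtrain implementation: `ToyModel`, `Lcg`, and `gate_diffs` are hypothetical names introduced here, and the dropout is a plain inverted-dropout stand-in assumed to behave like `ops::dropout` (identity at p=0 and in eval mode). It shows why a missing `train()` call makes p=0.2 indistinguishable from p=0, and why the live signal is far above any ULP-level noise.

```rust
/// Toy stand-in for a model with a train/eval flag (hypothetical, not the xtrain API).
struct ToyModel {
    p: f32,
    training: bool,
}

/// Tiny deterministic LCG so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_f32(&mut self) -> f32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Top 24 bits mapped to a uniform value in [0, 1).
        ((self.0 >> 40) as f32) / ((1u64 << 24) as f32)
    }
}

impl ToyModel {
    /// Starts in eval mode, mirroring the default described in the test comments.
    fn new(p: f32) -> Self {
        ToyModel { p, training: false }
    }
    fn train(&mut self) {
        self.training = true;
    }
    fn eval(&mut self) {
        self.training = false;
    }
    fn is_training(&self) -> bool {
        self.training
    }

    /// Inverted dropout: identity when p == 0 or when in eval mode.
    fn forward(&self, x: &[f32], rng: &mut Lcg) -> Vec<f32> {
        if self.p == 0.0 || !self.training {
            return x.to_vec();
        }
        let keep = 1.0 - self.p;
        x.iter()
            .map(|&v| if rng.next_f32() < keep { v / keep } else { 0.0 })
            .collect()
    }
}

fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0f32, f32::max)
}

/// Returns (p=0 diff, p=0.2 diff with train() missing, p=0.2 diff with train(),
/// final is_training flag of the live model).
fn gate_diffs() -> (f32, f32, f32, bool) {
    // Inputs are all >= 1.0, so any masked or rescaled element moves by >= 0.25.
    let x: Vec<f32> = (0..64).map(|i| 1.0 + i as f32 * 0.01).collect();

    // GATE A analogue: p=0 in training mode is an exact identity.
    let mut m0 = ToyModel::new(0.0);
    m0.train();
    let a = max_abs_diff(&m0.forward(&x, &mut Lcg(7)), &x);

    // Simulated bug: train() is never called, so p=0.2 is silently an identity.
    let bug = ToyModel::new(0.2);
    let b = max_abs_diff(&bug.forward(&x, &mut Lcg(7)), &x);

    // GATE B/C analogue: an eval flip followed by the per-step train() restore.
    let mut live = ToyModel::new(0.2);
    live.eval();
    live.train();
    let c = max_abs_diff(&live.forward(&x, &mut Lcg(7)), &x);

    (a, b, c, live.is_training())
}
```

Calling `gate_diffs()` yields zero for the first two entries and a large value for the third, which is the same separation GATE B relies on: the buggy path sits at (or near) the noise floor while live dropout perturbs the output by orders of magnitude more.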