test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical)

Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher path (DdpContext::init + train_rank). It would have caught the original bug:

- GATE A (world=1, ONE step, the deterministic scope): the p=0 FORWARD is byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op) so the step loss is BIT-IDENTICAL (== 0.0). At world=1 the NCCL all-reduce short-circuits and one step has no optimizer-state compounding; the only residual non-determinism is the engine's atomicAdd backward-reduction order (the documented fresh-train md5 caveat, dropout-independent), so the post-step params are checked against that tight ULP floor (< 1e-7).
- GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's run-to-run ULP noise (< 1e-6, KI-5: the all-reduce is not bit-reproducible on this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond it.
- GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0, orders of magnitude above every noise floor here (~3e-2 observed). On the pre-T21 code the model stays in eval mode, so p=0.2 would be an identity and the trace would match p=0 at the noise floor, so this gate fails. (Verified by simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.)
- GATE C: a dedicated no-eval run ends with model.is_training() == true, direct proof that train_rank called model.train().
- p>0 run is finite (no NaN/Inf).

eval_every < steps so a periodic eval fires mid-run (flipping to eval mode), exercising the per-step model.train() restore discipline the pilot called out.

Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
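
The reasoning behind GATE A and GATE B can be illustrated with a small, dependency-free sketch of inverted dropout. It is not the engine's `ops::dropout`; the `Lcg`, `dropout`, and `max_abs_diff` names below are hypothetical stand-ins, and the only behavior assumed from the message above is that p=0 is a no-op and that eval mode makes dropout an identity.

```rust
/// Minimal 64-bit LCG so the sketch needs no external crates (hypothetical helper).
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random f32 in [0, 1).
    fn next_f32(&mut self) -> f32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Keep the top 24 bits so the value is exactly representable in f32.
        ((self.0 >> 40) as f32) / ((1u64 << 24) as f32)
    }
}

/// Inverted dropout: in training mode with p > 0, zero each element with
/// probability p and scale the survivors by 1 / (1 - p). With p == 0 or in eval
/// mode the input is returned unchanged (the assumed no-op / identity paths).
fn dropout(x: &[f32], p: f32, training: bool, rng: &mut Lcg) -> Vec<f32> {
    if p == 0.0 || !training {
        return x.to_vec();
    }
    let scale = 1.0 / (1.0 - p);
    x.iter()
        .map(|&v| if rng.next_f32() < p { 0.0 } else { v * scale })
        .collect()
}

/// Largest element-wise absolute difference, the metric each gate compares
/// against its threshold.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}
```

In this sketch a model stuck in eval mode yields a zero difference for any p, which is why only a gap well above the noise floor (GATE B) shows that masks are actually applied.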
2026-06-18 21:22:49 +08:00
parent 81f3cf59e5
commit 980605474b
1 changed files with 225 additions and 0 deletions
--- a/crates/xtrain-distributed/tests/ddp_correctness.rs
+++ b/crates/xtrain-distributed/tests/ddp_correctness.rs
@@ -38,6 +38,53 @@ fn test_config(vocab: usize) -> Config {
    cfg
 }
 /// Run `cfg`/`dcfg` as a DDP job over `devices` (the same launcher path as
 /// production — `DdpContext::init` + `train_rank` per rank) and return rank 0's
 /// (loss trace, final params on host, final `is_training()` flag). `cfg` carries
 /// the dropout prob; `dcfg` carries the loop knobs. Caller asserts.
 ///
 /// `world == 1` is the deterministic path: `all_reduce_average_grads` short-circuits
 /// (no NCCL collective), so the run is bit-reproducible — used for the bit-identity
 /// gate. `world >= 2` exercises the real cross-rank NCCL all-reduce, which is not
 /// bit-reproducible run-to-run on this PCIe box (KI-5), so those gates use the same
 /// ULP/relative tolerances as the rest of this file.
 fn run_ddp(
    devices: &[u32],
    cfg: Config,
    corpus: &Corpus,
    valid: Option<&Corpus>,
    dcfg: &DdpConfig,
 ) -> (Vec<f32>, Vec<Vec<f32>>, bool) {
    let world = devices.len();
    let id = get_unique_id();
    let results: Vec<(Vec<f32>, Vec<Vec<f32>>, bool)> = std::thread::scope(|s| {
        let handles: Vec<_> = devices
            .iter()
            .enumerate()
            .map(|(rank, &dev)| {
                let dcfg = dcfg.clone();
                let corpus = &corpus;
                s.spawn(move || {
                    let ctx = DdpContext::init(rank, world, id, dev);
                    let device = Device::Cuda(dev);
                    let model = build_model(cfg, device);
                    // Only rank 0 holds the val corpus (mirrors launch()).
                    let v = if rank == 0 { valid } else { None };
                    let res = train_rank(&ctx, &model, device, corpus, v, &dcfg);
                    let host = model
                        .params()
                        .iter()
                        .map(|p| p.value().to_device(Device::Cpu).as_slice::<f32>().to_vec())
                        .collect::<Vec<_>>();
                    (res.losses, host, model.is_training())
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    results.into_iter().next().unwrap()
 }
 // Single-GPU baseline: the SAME loop as the DDP rank but world=1, so the global
 // batch is processed on one device. Returns (loss trace, final params on host).
 fn run_single_gpu(cfg: Config, corpus: &Corpus, dcfg: &DdpConfig) -> (Vec<f32>, Vec<Vec<f32>>) {
@@ -386,3 +433,181 @@ fn ddp_throughput_scaling() {
        );
    }
 }
 /// T21 regression: prove dropout is actually LIVE under DDP (with `p>0`), and that
 /// `p=0` is bit-identical to the no-dropout path. Guards the V9-PILOT launcher-
 /// wiring gap — `train_ddp` had no `--dropout` flag and `train_rank` never called
 /// `model.train()`, so under DDP every forward ran in the default eval mode and
 /// dropout was a silent identity regardless of config. Op/single-GPU tests never
 /// exercised dropout-under-DDP, so it slipped through; this test runs the REAL
 /// launcher path (`DdpContext::init` + `train_rank`).
 ///
 /// On the pre-T21 code, both load-bearing gates FAIL: GATE B (the p>0 trace would
 /// match p=0 down to the NCCL noise floor, because a model stuck in eval mode makes
 /// dropout an identity) and GATE C (`is_training()` would be false after the run).
 ///
 /// p=0 regression (GATE A) is checked at `world=1`, ONE step, where the NCCL
 /// all-reduce short-circuits: the p=0 FORWARD is byte-identical to no-dropout so the
 /// loss is BIT-IDENTICAL (== 0.0), and the post-step params match within the engine's
 /// atomicAdd backward-reduction ULP floor (< 1e-7, dropout-independent — the
 /// fresh-train md5 caveat). The cross-rank NCCL all-reduce (`world>=2`) is not
 /// bit-reproducible run-to-run on this PCIe box (KI-5, observed ≤~2.4e-7), so the
 /// `world=2` p=0-vs-no-dropout check (GATE A2) uses the same KI-5 ULP tolerance as the
 /// rest of this file. GATE B's live-dropout signal (>1e-3) sits ~4 orders of magnitude
 /// above every noise floor here, so it carries the load.
 #[test]
 fn ddp_dropout_is_live_and_p0_bit_identical() {
    if device::device_count().unwrap_or(0) < 2 {
        eprintln!("skip: need >= 2 GPUs");
        return;
    }
    let vocab = 64usize;
    let corpus = synth_corpus(vocab, 4096);
    let steps = 20usize;
    // eval_every < steps so a periodic eval fires MID-run (flipping the model to
    // eval mode via eval_loss → model.eval()). The per-step model.train() must
    // restore training mode so dropout stays live across the eval boundary — this is
    // exactly the train/eval discipline the pilot called out. A held-out slice gives
    // rank 0 something to eval on.
    let valid = synth_corpus(vocab, 512);
    let base_dcfg = DdpConfig {
        seq_len: 32,
        batch_size: 8, // global; 4 per rank with world=2
        accum_steps: 1,
        steps,
        schedule: LrSchedule {
            max_lr: 3e-3,
            min_lr: 3e-4,
            warmup: 3,
            total: steps,
        },
        weight_decay: 0.1,
        max_grad_norm: 1.0,
        log_every: 1_000_000, // silence per-step logging
        seed: 7,
        eval_every: 7, // fires at steps 6, 13, 19 — flips to eval mode mid-run
        eval_batches: 4,
        ckpt_path: None,
    };
    // --- GATE A: p=0 == no-dropout at world=1, ONE step (the deterministic scope). ---
    // The regression guard for `--dropout 0`. ops::dropout(p=0) returns x.clone() (a
    // graph no-op) regardless of training mode, so the p=0 FORWARD graph is byte-for-
    // byte the no-dropout forward → loss[0] must be BIT-IDENTICAL (the load-bearing
    // claim, asserted == 0.0). At world=1 the NCCL all-reduce short-circuits, and one
    // step has no optimizer-state compounding; the only residual non-determinism is
    // the engine's atomicAdd backward-reduction ORDER (the documented fresh-train md5
    // caveat — dropout-INDEPENDENT, present with or without the dropout op), which
    // moves the post-step params by a single grad ULP. So params are checked against
    // that tight reduction floor (< 1e-7). Like the cross-rank KI-5 tolerance used
    // elsewhere in this file, it bounds reduction-order noise and is not a dropout
    // signal. GATE B (live) has a >1e-3 signal, ~4 orders of magnitude above this
    // floor, so it carries the load.
    let d1 = [0u32];
    let dcfg_1step = DdpConfig {
        steps: 1,
        eval_every: 0,
        ..base_dcfg.clone()
    };
    let cfg_nodrop = test_config(vocab); // cfg.dropout defaults to 0.0
    assert_eq!(cfg_nodrop.dropout, 0.0, "baseline cfg must have dropout 0");
    let mut cfg_p0 = test_config(vocab);
    cfg_p0.dropout = 0.0; // explicitly set p=0 — must not perturb anything
    let (loss_nd1, params_nd1, _) = run_ddp(&d1, cfg_nodrop, &corpus, None, &dcfg_1step);
    let (loss_p01, params_p01, _) = run_ddp(&d1, cfg_p0, &corpus, None, &dcfg_1step);
    let max_loss_diff_1 = (loss_nd1[0] - loss_p01[0]).abs();
    let max_param_diff_1 = params_nd1
        .iter()
        .zip(&params_p01)
        .flat_map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).abs()))
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE A (world=1, 1 step, p=0 vs no-dropout): |loss diff| = {max_loss_diff_1:.3e} \
         (bit-identical forward), max |param diff| = {max_param_diff_1:.3e} (atomicAdd floor)"
    );
    assert_eq!(
        max_loss_diff_1, 0.0,
        "world=1 p=0 forward loss not bit-identical to no-dropout path"
    );
    assert!(
        max_param_diff_1 < 1e-7,
        "world=1 p=0 post-step params diverged from no-dropout beyond the atomicAdd \
         reduction floor: {max_param_diff_1:.3e}"
    );
    // --- world=2 runs: real cross-rank NCCL all-reduce (the production path). ---
    let d2 = [0u32, 1u32];
    let mut cfg_p0_w2 = test_config(vocab);
    cfg_p0_w2.dropout = 0.0;
    let mut cfg_p_w2 = test_config(vocab);
    cfg_p_w2.dropout = 0.2;
    let (loss_p0_2, _params_p0_2, _) = run_ddp(&d2, cfg_p0_w2, &corpus, Some(&valid), &base_dcfg);
    let (loss_p_2, _params_p_2, _) = run_ddp(&d2, cfg_p_w2, &corpus, Some(&valid), &base_dcfg);
    // GATE A2 — under DDP (world=2), p=0 matches a separate no-dropout baseline within
    // NCCL's run-to-run ULP noise (KI-5; the all-reduce is not bit-reproducible). This
    // confirms enabling dropout=0 doesn't perturb the DDP path beyond that noise floor.
    let (loss_nd_2, _, _) = run_ddp(&d2, test_config(vocab), &corpus, Some(&valid), &base_dcfg);
    let max_loss_diff_2 = loss_nd_2
        .iter()
        .zip(&loss_p0_2)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE A2 (world=2 p=0 vs no-dropout, KI-5 noise): max |loss diff| = {max_loss_diff_2:.3e}"
    );
    assert!(
        max_loss_diff_2 < 1e-6,
        "world=2 p=0 diverged from no-dropout beyond NCCL noise: {max_loss_diff_2:.3e}"
    );
    // GATE B — dropout is LIVE with p>0 under DDP. If model.train() were not wired
    // (the pre-T21 bug), the model would stay in eval mode and the p=0.2 forward would
    // be IDENTITY → the loss trace would match p=0 down to the ~1e-7 NCCL noise floor
    // (KI-5 rules out strict bit-identity at world=2). A difference orders of
    // magnitude above that floor proves dropout masks are
    // actually applied during the training forward — and that they survive the mid-run
    // eval flips (model.train() is re-asserted each step). Inverted scaling + masking
    // perturbs every step, so the gap is large (>1e-3 ≫ KI-5 noise ~2.4e-7).
    let max_live_diff = loss_p0_2
        .iter()
        .zip(&loss_p_2)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE B (dropout live, world=2): p0[last]={:.6} p0.2[last]={:.6} max |loss diff| = {max_live_diff:.3e}",
        loss_p0_2.last().unwrap(),
        loss_p_2.last().unwrap()
    );
    assert!(
        max_live_diff > 1e-3,
        "p=0.2 DDP loss trace matches p=0 — dropout is NOT live under DDP \
         (model.train() not wired): max |loss diff| {max_live_diff:.3e}"
    );
    // Sanity check: the p>0 run stays finite (no NaN/Inf) under DDP.
    assert!(
        loss_p_2.iter().all(|l| l.is_finite()),
        "p=0.2 DDP loss has non-finite values"
    );
    // GATE C — train_rank actually sets TRAINING mode (direct, complementary proof of
    // model.train() being wired). Use a dedicated short run with eval_every=0 so no
    // eval fires: a model that finishes a training step in training mode proves
    // train_rank called model.train(). (With eval enabled, eval_loss → model.eval()
    // runs LAST on the final step and legitimately leaves the model in eval mode —
    // same as the single-GPU loop — so is_training() after an eval-enabled run reflects
    // the final eval, not the training-mode wiring. GATE B already proves dropout
    // survives the mid-run eval flips via the per-step model.train() restore.) On the
    // pre-T21 code is_training() stays false (model never left the default eval mode).
    let dcfg_noeval = DdpConfig {
        steps: 2,
        eval_every: 0,
        ..base_dcfg.clone()
    };
    // Runs at world=1 (d1) and reuses the p=0.2 config (`cfg_p_w2`, despite its name).
    let (_, _, train_flag) = run_ddp(&d1, cfg_p_w2, &corpus, None, &dcfg_noeval);
    assert!(
        train_flag,
        "model not in training mode after a no-eval DDP run — model.train() not wired in train_rank"
    );
    println!("T21 GATE C (train_rank sets training mode): is_training() == true ✅");
 }
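
The train/eval discipline that GATE B and GATE C rely on can be sketched with a toy model that tracks nothing but its mode flag. This is an illustration under stated assumptions, not the real `train_rank`: `ToyModel` and `run_loop` are hypothetical, and the eval schedule (every `eval_every` steps plus the final step) is assumed only because it reproduces the "fires at steps 6, 13, 19" comment in the test config.

```rust
use std::cell::Cell;

/// Toy stand-in for a model that only tracks its train/eval flag (hypothetical type).
struct ToyModel {
    training: Cell<bool>,
}

impl ToyModel {
    /// New models start in eval mode, the default described in the test docs.
    fn new() -> Self {
        ToyModel { training: Cell::new(false) }
    }
    fn train(&self) {
        self.training.set(true);
    }
    fn eval(&self) {
        self.training.set(false);
    }
    fn is_training(&self) -> bool {
        self.training.get()
    }
}

/// Runs `steps` toy steps. Returns the mode seen by each step's forward pass and
/// the 0-based steps at which an eval fired. `restore = false` simulates a loop
/// that never calls `train()`.
fn run_loop(
    model: &ToyModel,
    steps: usize,
    eval_every: usize,
    restore: bool,
) -> (Vec<bool>, Vec<usize>) {
    let mut forward_modes = Vec::with_capacity(steps);
    let mut eval_steps = Vec::new();
    for step in 0..steps {
        if restore {
            model.train(); // per-step restore: undoes any earlier eval flip
        }
        forward_modes.push(model.is_training()); // mode seen by this forward
        // Assumed schedule: every `eval_every` steps, plus the final step.
        let due = eval_every > 0 && ((step + 1) % eval_every == 0 || step + 1 == steps);
        if due {
            model.eval(); // a periodic eval leaves the model in eval mode
            eval_steps.push(step);
        }
    }
    (forward_modes, eval_steps)
}
```

With `restore = true`, every forward runs in training mode even though evals keep flipping the flag, and a run with `eval_every = 0` ends in training mode, which is what GATE C asserts. With `restore = false`, every forward runs in eval mode, the situation the commit describes for the pre-T21 code.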