test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical)

Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher
path (DdpContext::init + train_rank). It would have caught the original bug:

- GATE A (world=1, ONE step, the deterministic scope): the p=0 FORWARD is
  byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op), so the
  step loss is BIT-IDENTICAL (|loss diff| == 0.0). At world=1 the NCCL
  all-reduce short-circuits and one step has no optimizer-state compounding;
  the only residual non-determinism is the engine's atomicAdd
  backward-reduction order (the documented fresh-train md5 caveat, which is
  dropout-independent), so the post-step params are checked against that
  tight ULP floor (max |param diff| < 1e-7).
- GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's
  run-to-run ULP noise (< 1e-6, KI-5 — the all-reduce is not bit-reproducible
  on this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond it.
- GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0 —
  orders of magnitude above every noise floor here (~3e-2 observed). On the
  pre-T21 code the model stays in eval mode, so p=0.2 would be an identity and
  the trace would match p=0 at the noise floor — this gate fails. (Verified by
  simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.)
- GATE C: a dedicated no-eval run ends with model.is_training() == true,
  direct proof that train_rank called model.train().
- p>0 run is finite (no NaN/Inf).
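
To make the reasoning behind GATE A and GATE B concrete, here is a minimal,
self-contained sketch of inverted dropout. It is not the engine's
ops::dropout: the `dropout` function, its `training` parameter, and the `Lcg`
generator are hypothetical stand-ins, and the sketch simply assumes that p=0
and eval mode both return the input unchanged, as the gates above describe.

```rust
/// Tiny deterministic LCG so the sketch needs no external crates (hypothetical helper).
struct Lcg(u64);

impl Lcg {
    /// Next pseudo-random value in [0, 1).
    fn next_f32(&mut self) -> f32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Keep the top 24 bits: every such integer is exactly representable in f32.
        ((self.0 >> 40) as f32) / ((1u64 << 24) as f32)
    }
}

/// Hypothetical stand-in for a dropout op (inverted dropout), NOT the engine's code.
/// Assumption: p == 0 and eval mode are both an identity on the input.
fn dropout(x: &[f32], p: f32, training: bool, rng: &mut Lcg) -> Vec<f32> {
    if p == 0.0 || !training {
        // Identity path: this is what makes a p=0 run match the no-dropout run,
        // and what made p>0 a silent no-op while the model stayed in eval mode.
        return x.to_vec();
    }
    // Inverted dropout: zero an element with probability p, scale survivors by 1/(1-p).
    let scale = 1.0 / (1.0 - p);
    x.iter()
        .map(|&v| if rng.next_f32() < p { 0.0 } else { v * scale })
        .collect()
}
```

Under these assumptions, `dropout(&x, 0.0, true, &mut rng)` and
`dropout(&x, 0.2, false, &mut rng)` both return `x` unchanged, while
`dropout(&x, 0.2, true, &mut rng)` changes every non-zero element (each one is
either zeroed or scaled by 1/(1-p)). That per-element change is the kind of
signal GATE B looks for in the loss trace.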

eval_every < steps so a periodic eval fires mid-run (flipping to eval mode),
exercising the per-step model.train() restore discipline the pilot called out.
Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock).
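
The train/eval restore discipline described above can be sketched with a toy
loop. Everything here is hypothetical: `Model` only tracks a flag, and the
eval rule in `eval_fires` is an assumption chosen to reproduce the "fires at
steps 6, 13, 19" schedule noted in the test for steps=20, eval_every=7. The
real rule lives in train_rank and is not shown in this commit.

```rust
/// Hypothetical model stub: only tracks the train/eval flag.
struct Model {
    training: bool,
}

impl Model {
    fn train(&mut self) {
        self.training = true;
    }
    fn eval(&mut self) {
        self.training = false;
    }
}

/// ASSUMED eval rule (not taken from train_rank): fire when (step + 1) is a
/// multiple of `eval_every`, and on the last step; `eval_every == 0` disables eval.
fn eval_fires(step: usize, steps: usize, eval_every: usize) -> bool {
    eval_every > 0 && ((step + 1) % eval_every == 0 || step + 1 == steps)
}

/// Returns (steps where eval fired, training flag seen by each step's forward,
/// final training flag). `restore_each_step` toggles the per-step model.train().
fn run_loop(steps: usize, eval_every: usize, restore_each_step: bool) -> (Vec<usize>, Vec<bool>, bool) {
    let mut model = Model { training: false }; // default eval mode
    if !restore_each_step {
        model.train(); // set once up front, never restored
    }
    let mut fired = Vec::new();
    let mut flags = Vec::new();
    for step in 0..steps {
        if restore_each_step {
            model.train(); // re-assert training mode before every forward
        }
        flags.push(model.training);
        if eval_fires(step, steps, eval_every) {
            model.eval(); // the eval pass leaves the model in eval mode
            fired.push(step);
        }
    }
    (fired, flags, model.training)
}
```

With `run_loop(20, 7, true)` every forward sees training mode even though eval
fires three times, and the final flag is false because eval runs last. With
`run_loop(20, 7, false)` the forwards after the first eval run in eval mode,
which is the failure the per-step restore guards against.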

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 21:22:49 +08:00
parent 81f3cf59e5
commit 980605474b


@@ -38,6 +38,53 @@ fn test_config(vocab: usize) -> Config {
    cfg
}
/// Run `cfg`/`dcfg` as a DDP job over `devices` (the same launcher path as
/// production — `DdpContext::init` + `train_rank` per rank) and return rank 0's
/// (loss trace, final params on host, final `is_training()` flag). `cfg` carries
/// the dropout prob; `dcfg` carries the loop knobs. Caller asserts.
///
/// `world == 1` is the deterministic path: `all_reduce_average_grads` short-circuits
/// (no NCCL collective), so the run is bit-reproducible — used for the bit-identity
/// gate. `world >= 2` exercises the real cross-rank NCCL all-reduce, which is not
/// bit-reproducible run-to-run on this PCIe box (KI-5), so those gates use the same
/// ULP/relative tolerances as the rest of this file.
fn run_ddp(
    devices: &[u32],
    cfg: Config,
    corpus: &Corpus,
    valid: Option<&Corpus>,
    dcfg: &DdpConfig,
) -> (Vec<f32>, Vec<Vec<f32>>, bool) {
    let world = devices.len();
    let id = get_unique_id();
    let results: Vec<(Vec<f32>, Vec<Vec<f32>>, bool)> = std::thread::scope(|s| {
        let handles: Vec<_> = devices
            .iter()
            .enumerate()
            .map(|(rank, &dev)| {
                let dcfg = dcfg.clone();
                let corpus = &corpus;
                s.spawn(move || {
                    let ctx = DdpContext::init(rank, world, id, dev);
                    let device = Device::Cuda(dev);
                    let model = build_model(cfg, device);
                    // Only rank 0 holds the val corpus (mirrors launch()).
                    let v = if rank == 0 { valid } else { None };
                    let res = train_rank(&ctx, &model, device, corpus, v, &dcfg);
                    let host = model
                        .params()
                        .iter()
                        .map(|p| p.value().to_device(Device::Cpu).as_slice::<f32>().to_vec())
                        .collect::<Vec<_>>();
                    (res.losses, host, model.is_training())
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    results.into_iter().next().unwrap()
}
// Single-GPU baseline: the SAME loop as the DDP rank but world=1, so the global
// batch is processed on one device. Returns (loss trace, final params on host).
fn run_single_gpu(cfg: Config, corpus: &Corpus, dcfg: &DdpConfig) -> (Vec<f32>, Vec<Vec<f32>>) {
@@ -386,3 +433,181 @@ fn ddp_throughput_scaling() {
        );
    }
}
/// T21 regression: prove dropout is actually LIVE under DDP (with `p>0`), and that
/// `p=0` is bit-identical to the no-dropout path. Guards the V9-PILOT launcher-
/// wiring gap — `train_ddp` had no `--dropout` flag and `train_rank` never called
/// `model.train()`, so under DDP every forward ran in the default eval mode and
/// dropout was a silent identity regardless of config. Op/single-GPU tests never
/// exercised dropout-under-DDP, so it slipped through; this test runs the REAL
/// launcher path (`DdpContext::init` + `train_rank`).
///
/// On the pre-T21 code, both load-bearing gates FAIL: GATE B (the p>0 trace would
/// match p=0 to within the NCCL noise floor, because the model is stuck in eval mode
/// and dropout is an identity) and GATE C (`is_training()` would be false after the
/// run).
///
/// p=0 regression (GATE A) is checked at `world=1`, ONE step, where the NCCL
/// all-reduce short-circuits: the p=0 FORWARD is byte-identical to no-dropout so the
/// loss is BIT-IDENTICAL (== 0.0), and the post-step params match within the engine's
/// atomicAdd backward-reduction ULP floor (< 1e-7, dropout-independent — the
/// fresh-train md5 caveat). The cross-rank NCCL all-reduce (`world>=2`) is not
/// bit-reproducible run-to-run on this PCIe box (KI-5, observed ≤~2.4e-7), so the
/// `world=2` p=0-vs-no-dropout check (GATE A2) uses the same KI-5 ULP tolerance as the
/// rest of this file. GATE B's live-dropout signal (>1e-3) sits ~4 orders of magnitude
/// above every noise floor here, so it carries the load.
#[test]
fn ddp_dropout_is_live_and_p0_bit_identical() {
    if device::device_count().unwrap_or(0) < 2 {
        eprintln!("skip: need >= 2 GPUs");
        return;
    }
    let vocab = 64usize;
    let corpus = synth_corpus(vocab, 4096);
    let steps = 20usize;
    // eval_every < steps so a periodic eval fires MID-run (flipping the model to
    // eval mode via eval_loss → model.eval()). The per-step model.train() must
    // restore training mode so dropout stays live across the eval boundary. This is
    // exactly the train/eval discipline the pilot called out. A held-out slice gives
    // rank 0 something to eval on.
    let valid = synth_corpus(vocab, 512);
    let base_dcfg = DdpConfig {
        seq_len: 32,
        batch_size: 8, // global; 4 per rank with world=2
        accum_steps: 1,
        steps,
        schedule: LrSchedule {
            max_lr: 3e-3,
            min_lr: 3e-4,
            warmup: 3,
            total: steps,
        },
        weight_decay: 0.1,
        max_grad_norm: 1.0,
        log_every: 1_000_000, // silence per-step logging
        seed: 7,
        eval_every: 7, // fires at steps 6, 13, 19; flips to eval mode mid-run
        eval_batches: 4,
        ckpt_path: None,
    };
    // --- GATE A: p=0 == no-dropout at world=1, ONE step (the deterministic scope). ---
    // The regression guard for `--dropout 0`. ops::dropout(p=0) returns x.clone() (a
    // graph no-op) regardless of training mode, so the p=0 FORWARD graph is
    // byte-for-byte the no-dropout forward → loss[0] must be BIT-IDENTICAL (the
    // load-bearing claim, asserted as |loss diff| == 0.0). At world=1 the NCCL
    // all-reduce short-circuits, and one step has no optimizer-state compounding; the
    // only residual non-determinism is the engine's atomicAdd backward-reduction ORDER
    // (the documented fresh-train md5 caveat, which is dropout-INDEPENDENT: present
    // with or without the dropout op), which moves the post-step params by a single
    // grad ULP. So params are checked against that tight reduction floor (< 1e-7), the
    // same kind of tolerance as the cross-rank KI-5 one used elsewhere in this file,
    // and not a dropout signal. GATE B (live) has a >1e-3 signal, ~4 orders of
    // magnitude above this floor, so it carries the load.
    let d1 = [0u32];
    let dcfg_1step = DdpConfig {
        steps: 1,
        eval_every: 0,
        ..base_dcfg.clone()
    };
    let cfg_nodrop = test_config(vocab); // cfg.dropout defaults to 0.0
    assert_eq!(cfg_nodrop.dropout, 0.0, "baseline cfg must have dropout 0");
    let mut cfg_p0 = test_config(vocab);
    cfg_p0.dropout = 0.0; // explicitly set p=0; must not perturb anything
    let (loss_nd1, params_nd1, _) = run_ddp(&d1, cfg_nodrop, &corpus, None, &dcfg_1step);
    let (loss_p01, params_p01, _) = run_ddp(&d1, cfg_p0, &corpus, None, &dcfg_1step);
    let max_loss_diff_1 = (loss_nd1[0] - loss_p01[0]).abs();
    let max_param_diff_1 = params_nd1
        .iter()
        .zip(&params_p01)
        .flat_map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).abs()))
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE A (world=1, 1 step, p=0 vs no-dropout): |loss diff| = {max_loss_diff_1:.3e} \
         (bit-identical forward), max |param diff| = {max_param_diff_1:.3e} (atomicAdd floor)"
    );
    assert_eq!(
        max_loss_diff_1, 0.0,
        "world=1 p=0 forward loss not bit-identical to no-dropout path"
    );
    assert!(
        max_param_diff_1 < 1e-7,
        "world=1 p=0 post-step params diverged from no-dropout beyond the atomicAdd \
         reduction floor: {max_param_diff_1:.3e}"
    );
    // --- world=2 runs: real cross-rank NCCL all-reduce (the production path). ---
    let d2 = [0u32, 1u32];
    let mut cfg_p0_w2 = test_config(vocab);
    cfg_p0_w2.dropout = 0.0;
    let mut cfg_p_w2 = test_config(vocab);
    cfg_p_w2.dropout = 0.2;
    let (loss_p0_2, _params_p0_2, _) = run_ddp(&d2, cfg_p0_w2, &corpus, Some(&valid), &base_dcfg);
    let (loss_p_2, _params_p_2, _) = run_ddp(&d2, cfg_p_w2, &corpus, Some(&valid), &base_dcfg);
    // GATE A2: under DDP (world=2), p=0 matches a separate no-dropout baseline within
    // NCCL's run-to-run ULP noise (KI-5; the all-reduce is not bit-reproducible). This
    // confirms enabling dropout=0 doesn't perturb the DDP path beyond that noise floor.
    let (loss_nd_2, _, _) = run_ddp(&d2, test_config(vocab), &corpus, Some(&valid), &base_dcfg);
    let max_loss_diff_2 = loss_nd_2
        .iter()
        .zip(&loss_p0_2)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE A2 (world=2 p=0 vs no-dropout, KI-5 noise): max |loss diff| = {max_loss_diff_2:.3e}"
    );
    assert!(
        max_loss_diff_2 < 1e-6,
        "world=2 p=0 diverged from no-dropout beyond NCCL noise: {max_loss_diff_2:.3e}"
    );
    // GATE B: dropout is LIVE with p>0 under DDP. If model.train() were not wired
    // (the pre-T21 bug), the model would stay in eval mode and the p=0.2 forward would
    // be IDENTITY → loss trace matching p=0 at the ~1e-7 NCCL noise floor (not
    // bit-identical, since the world=2 all-reduce is not bit-reproducible). A
    // difference orders of magnitude above that proves dropout masks are actually
    // applied during the training forward, and that they survive the mid-run eval
    // flips (model.train() is re-asserted each step). Inverted scaling + masking
    // perturbs every step, so the gap is large (>1e-3 ≫ KI-5 noise ~2.4e-7).
    let max_live_diff = loss_p0_2
        .iter()
        .zip(&loss_p_2)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE B (dropout live, world=2): p0[last]={:.6} p0.2[last]={:.6} max |loss diff| = {max_live_diff:.3e}",
        loss_p0_2.last().unwrap(),
        loss_p_2.last().unwrap()
    );
    assert!(
        max_live_diff > 1e-3,
        "p=0.2 DDP loss trace matches p=0 — dropout is NOT live under DDP \
         (model.train() not wired): max |loss diff| {max_live_diff:.3e}"
    );
    // No NaN/Inf in the p>0 run (the loss stays finite with live dropout under DDP).
    assert!(
        loss_p_2.iter().all(|l| l.is_finite()),
        "p=0.2 DDP loss has non-finite values"
    );
    // GATE C: train_rank actually sets TRAINING mode (direct, complementary proof of
    // model.train() being wired). Use a dedicated short run with eval_every=0 so no
    // eval fires: a model that finishes a training step in training mode proves
    // train_rank called model.train(). (With eval enabled, eval_loss → model.eval()
    // runs LAST on the final step and legitimately leaves the model in eval mode, the
    // same as the single-GPU loop, so is_training() after an eval-enabled run reflects
    // the final eval, not the training-mode wiring. GATE B already proves dropout
    // survives the mid-run eval flips via the per-step model.train() restore.) On the
    // pre-T21 code is_training() stays false (model never left the default eval mode).
    let dcfg_noeval = DdpConfig {
        steps: 2,
        eval_every: 0,
        ..base_dcfg.clone()
    };
    let (_, _, train_flag) = run_ddp(&d1, cfg_p_w2, &corpus, None, &dcfg_noeval);
    assert!(
        train_flag,
        "model not in training mode after a no-eval DDP run — model.train() not wired in train_rank"
    );
    println!("T21 GATE C (train_rank sets training mode): is_training() == true ✅");
}
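
The gates in this test repeat the same max-absolute-difference fold over loss
traces and parameter tensors. A possible way to factor it out is sketched
below; `max_abs_diff` and `max_abs_diff_params` are hypothetical helper names
that are not part of this commit, and the length checks are an added
assumption (the test as written lets `zip` truncate silently).

```rust
/// Largest element-wise |a[i] - b[i]| between two equally long traces.
/// Hypothetical helper, not part of the commit.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "traces must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        // f32::max returns the non-NaN operand, so a NaN difference is dropped
        // here rather than propagated.
        .fold(0.0f32, f32::max)
}

/// Same reduction across a list of per-parameter tensors copied to the host.
fn max_abs_diff_params(a: &[Vec<f32>], b: &[Vec<f32>]) -> f32 {
    assert_eq!(a.len(), b.len(), "parameter lists must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| max_abs_diff(x, y))
        .fold(0.0f32, f32::max)
}
```

Because the fold ignores NaN, a comparison like `max_abs_diff(&loss_a, &loss_b) < 1e-6`
can pass even when one trace contains a NaN, which is one reason a separate
`is_finite()` check on the traces remains useful alongside the tolerance gates.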