test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical)
Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher path
(DdpContext::init + train_rank). It would have caught the original bug:

- GATE A (world=1, ONE step, the deterministic scope): the p=0 FORWARD is
  byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op), so the step
  loss is BIT-IDENTICAL (diff == 0.0). At world=1 the NCCL all-reduce
  short-circuits and one step has no optimizer-state compounding; the only
  residual non-determinism is the engine's atomicAdd backward-reduction order
  (the documented fresh-train md5 caveat, dropout-independent), so the post-step
  params are checked against that tight ULP floor (< 1e-7).
- GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's
  run-to-run ULP noise (< 1e-6, KI-5: the all-reduce is not bit-reproducible on
  this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond it.
- GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0, orders
  of magnitude above every noise floor here (~3e-2 observed). On the pre-T21
  code the model stays in eval mode, so p=0.2 would be an identity, the trace
  would match p=0 at the noise floor, and this gate fails. (Verified by
  simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.)
- GATE C: a dedicated no-eval run ends with model.is_training() == true, direct
  proof that train_rank called model.train().
- The p>0 run is finite (no NaN/Inf).

eval_every < steps, so a periodic eval fires mid-run (flipping to eval mode),
exercising the per-step model.train() restore discipline the pilot called out.

Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
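To make the gates concrete, here is a minimal sketch of inverted dropout on a plain slice. It is not the engine's `ops::dropout`: the name `toy_dropout`, the xorshift mask generator, and the explicit `training` argument are all assumptions made for illustration. It shows the two properties the test relies on: `p = 0` and eval mode both return the input unchanged, while `p > 0` in training mode perturbs the (nonzero) inputs.

```rust
/// Toy inverted dropout (hypothetical stand-in, not the engine's op).
/// Kept elements are scaled by 1 / (1 - p); dropped elements become 0.
fn toy_dropout(x: &[f32], p: f32, training: bool, seed: u64) -> Vec<f32> {
    // p == 0 or eval mode: identity, so the output is bit-identical to the input.
    if p == 0.0 || !training {
        return x.to_vec();
    }
    let scale = 1.0 / (1.0 - p);
    let mut state = seed | 1; // xorshift64 needs a nonzero state
    x.iter()
        .map(|&v| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            // Top 24 bits mapped to a uniform value in [0, 1).
            let u = (state >> 40) as f32 / (1u64 << 24) as f32;
            if u < p { 0.0 } else { v * scale }
        })
        .collect()
}

fn main() {
    let x: Vec<f32> = (0..16).map(|i| 1.0 + i as f32).collect();

    // GATE A analogue: p = 0 changes nothing, even in training mode.
    assert_eq!(toy_dropout(&x, 0.0, true, 7), x);

    // The pre-fix situation: p = 0.2 but the model is in eval mode, so
    // dropout is a silent identity.
    assert_eq!(toy_dropout(&x, 0.2, false, 7), x);

    // GATE B analogue: p = 0.2 in training mode changes every nonzero input,
    // either by zeroing it or by scaling it by 1.25.
    let y = toy_dropout(&x, 0.2, true, 7);
    assert!(x.iter().zip(&y).all(|(a, b)| a != b));
}
```

The second assertion is the failure mode the commit guards against: the configuration asks for dropout, yet nothing downstream can tell unless a test compares a `p > 0` run against a `p = 0` run.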
@@ -38,6 +38,53 @@ fn test_config(vocab: usize) -> Config {
    cfg
}

/// Run `cfg`/`dcfg` as a DDP job over `devices` (the same launcher path as
/// production: `DdpContext::init` + `train_rank` per rank) and return rank 0's
/// (loss trace, final params on host, final `is_training()` flag). `cfg` carries
/// the dropout prob; `dcfg` carries the loop knobs. Caller asserts.
///
/// `world == 1` is the deterministic path: `all_reduce_average_grads` short-circuits
/// (no NCCL collective), so the first-step loss is bit-reproducible (post-step params
/// still carry the atomicAdd reduction-order floor); it is used for the bit-identity
/// gate. `world >= 2` exercises the real cross-rank NCCL all-reduce, which is not
/// bit-reproducible run-to-run on this PCIe box (KI-5), so those gates use the same
/// ULP/relative tolerances as the rest of this file.
fn run_ddp(
    devices: &[u32],
    cfg: Config,
    corpus: &Corpus,
    valid: Option<&Corpus>,
    dcfg: &DdpConfig,
) -> (Vec<f32>, Vec<Vec<f32>>, bool) {
    let world = devices.len();
    let id = get_unique_id();
    let results: Vec<(Vec<f32>, Vec<Vec<f32>>, bool)> = std::thread::scope(|s| {
        let handles: Vec<_> = devices
            .iter()
            .enumerate()
            .map(|(rank, &dev)| {
                let dcfg = dcfg.clone();
                let corpus = &corpus;
                s.spawn(move || {
                    let ctx = DdpContext::init(rank, world, id, dev);
                    let device = Device::Cuda(dev);
                    let model = build_model(cfg, device);
                    // Only rank 0 holds the val corpus (mirrors launch()).
                    let v = if rank == 0 { valid } else { None };
                    let res = train_rank(&ctx, &model, device, corpus, v, &dcfg);
                    let host = model
                        .params()
                        .iter()
                        .map(|p| p.value().to_device(Device::Cpu).as_slice::<f32>().to_vec())
                        .collect::<Vec<_>>();
                    (res.losses, host, model.is_training())
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    results.into_iter().next().unwrap()
}

// Single-GPU baseline: the SAME loop as the DDP rank but world=1, so the global
// batch is processed on one device. Returns (loss trace, final params on host).
fn run_single_gpu(cfg: Config, corpus: &Corpus, dcfg: &DdpConfig) -> (Vec<f32>, Vec<Vec<f32>>) {
@@ -386,3 +433,181 @@ fn ddp_throughput_scaling() {
        );
    }
}

/// T21 regression: prove dropout is actually LIVE under DDP (with `p>0`), and that
/// `p=0` is bit-identical to the no-dropout path. Guards the V9-PILOT
/// launcher-wiring gap: `train_ddp` had no `--dropout` flag and `train_rank` never
/// called `model.train()`, so under DDP every forward ran in the default eval mode
/// and dropout was a silent identity regardless of config. Op/single-GPU tests never
/// exercised dropout-under-DDP, so it slipped through; this test runs the REAL
/// launcher path (`DdpContext::init` + `train_rank`).
///
/// On the pre-T21 code, both load-bearing gates FAIL: GATE B (the p>0 trace would
/// match p=0 to within NCCL noise, because the model is stuck in eval mode → dropout
/// is identity) and GATE C (`is_training()` would be false after the run).
///
/// p=0 regression (GATE A) is checked at `world=1`, ONE step, where the NCCL
/// all-reduce short-circuits: the p=0 FORWARD is byte-identical to no-dropout so the
/// loss is BIT-IDENTICAL (diff == 0.0), and the post-step params match within the
/// engine's atomicAdd backward-reduction ULP floor (< 1e-7, dropout-independent: the
/// fresh-train md5 caveat). The cross-rank NCCL all-reduce (`world>=2`) is not
/// bit-reproducible run-to-run on this PCIe box (KI-5, observed ≤~2.4e-7), so the
/// `world=2` p=0-vs-no-dropout check (GATE A2) uses the same KI-5 ULP tolerance as the
/// rest of this file. GATE B's live-dropout signal (>1e-3) sits ~4 orders of magnitude
/// above every noise floor here, so it carries the load.
#[test]
fn ddp_dropout_is_live_and_p0_bit_identical() {
    if device::device_count().unwrap_or(0) < 2 {
        eprintln!("skip: need >= 2 GPUs");
        return;
    }

    let vocab = 64usize;
    let corpus = synth_corpus(vocab, 4096);
    let steps = 20usize;
    // eval_every < steps so a periodic eval fires MID-run (flipping the model to
    // eval mode via eval_loss → model.eval()). The per-step model.train() must
    // restore training mode so dropout stays live across the eval boundary; this is
    // exactly the train/eval discipline the pilot called out. A held-out slice gives
    // rank 0 something to eval on.
    let valid = synth_corpus(vocab, 512);
    let base_dcfg = DdpConfig {
        seq_len: 32,
        batch_size: 8, // global; 4 per rank with world=2
        accum_steps: 1,
        steps,
        schedule: LrSchedule {
            max_lr: 3e-3,
            min_lr: 3e-4,
            warmup: 3,
            total: steps,
        },
        weight_decay: 0.1,
        max_grad_norm: 1.0,
        log_every: 1_000_000, // silence per-step logging
        seed: 7,
        eval_every: 7, // fires at steps 6, 13, 19; flips to eval mode mid-run
        eval_batches: 4,
        ckpt_path: None,
    };

    // --- GATE A: p=0 == no-dropout at world=1, ONE step (the deterministic scope). ---
    // The regression guard for `--dropout 0`. ops::dropout(p=0) returns x.clone() (a
    // graph no-op) regardless of training mode, so the p=0 FORWARD graph is
    // byte-for-byte the no-dropout forward → loss[0] must be BIT-IDENTICAL (the
    // load-bearing claim, asserted == 0.0). At world=1 the NCCL all-reduce
    // short-circuits, and one step has no optimizer-state compounding; the only
    // residual non-determinism is the engine's atomicAdd backward-reduction ORDER (the
    // documented fresh-train md5 caveat: dropout-INDEPENDENT, present with or without
    // the dropout op), which moves the post-step params by a single grad ULP. So params
    // are checked against that tight reduction floor (< 1e-7), the same nature as the
    // cross-rank KI-5 tolerance used elsewhere in this file, not a dropout signal.
    // GATE B (live) has a >1e-3 signal, ~4 orders of magnitude above this floor, so it
    // carries the load.
    let d1 = [0u32];
    let dcfg_1step = DdpConfig {
        steps: 1,
        eval_every: 0,
        ..base_dcfg.clone()
    };
    let cfg_nodrop = test_config(vocab); // cfg.dropout defaults to 0.0
    assert_eq!(cfg_nodrop.dropout, 0.0, "baseline cfg must have dropout 0");
    let mut cfg_p0 = test_config(vocab);
    cfg_p0.dropout = 0.0; // explicitly set p=0; must not perturb anything
    let (loss_nd1, params_nd1, _) = run_ddp(&d1, cfg_nodrop, &corpus, None, &dcfg_1step);
    let (loss_p01, params_p01, _) = run_ddp(&d1, cfg_p0, &corpus, None, &dcfg_1step);
    let max_loss_diff_1 = (loss_nd1[0] - loss_p01[0]).abs();
    let max_param_diff_1 = params_nd1
        .iter()
        .zip(&params_p01)
        .flat_map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).abs()))
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE A (world=1, 1 step, p=0 vs no-dropout): |loss diff| = {max_loss_diff_1:.3e} \
         (bit-identical forward), max |param diff| = {max_param_diff_1:.3e} (atomicAdd floor)"
    );
    assert_eq!(
        max_loss_diff_1, 0.0,
        "world=1 p=0 forward loss not bit-identical to no-dropout path"
    );
    assert!(
        max_param_diff_1 < 1e-7,
        "world=1 p=0 post-step params diverged from no-dropout beyond the atomicAdd \
         reduction floor: {max_param_diff_1:.3e}"
    );

    // --- world=2 runs: real cross-rank NCCL all-reduce (the production path). ---
    let d2 = [0u32, 1u32];
    let mut cfg_p0_w2 = test_config(vocab);
    cfg_p0_w2.dropout = 0.0;
    let mut cfg_p_w2 = test_config(vocab);
    cfg_p_w2.dropout = 0.2;
    let (loss_p0_2, _params_p0_2, _) = run_ddp(&d2, cfg_p0_w2, &corpus, Some(&valid), &base_dcfg);
    let (loss_p_2, _params_p_2, _) = run_ddp(&d2, cfg_p_w2, &corpus, Some(&valid), &base_dcfg);

    // GATE A2: under DDP (world=2), p=0 matches a separate no-dropout baseline within
    // NCCL's run-to-run ULP noise (KI-5; the all-reduce is not bit-reproducible). This
    // confirms enabling dropout=0 doesn't perturb the DDP path beyond that noise floor.
    let (loss_nd_2, _, _) = run_ddp(&d2, test_config(vocab), &corpus, Some(&valid), &base_dcfg);
    let max_loss_diff_2 = loss_nd_2
        .iter()
        .zip(&loss_p0_2)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE A2 (world=2 p=0 vs no-dropout, KI-5 noise): max |loss diff| = {max_loss_diff_2:.3e}"
    );
    assert!(
        max_loss_diff_2 < 1e-6,
        "world=2 p=0 diverged from no-dropout beyond NCCL noise: {max_loss_diff_2:.3e}"
    );

    // GATE B: dropout is LIVE with p>0 under DDP. If model.train() were not wired
    // (the pre-T21 bug), the model would stay in eval mode and the p=0.2 forward would
    // be IDENTITY → loss trace matching p=0 up to the ~1e-7 NCCL noise floor. A
    // difference orders of magnitude above that proves dropout masks are actually
    // applied during the training forward, and that they survive the mid-run eval
    // flips (model.train() is re-asserted each step). Inverted scaling + masking
    // perturbs every step, so the gap is large (>1e-3 ≫ KI-5 noise ~2.4e-7).
    let max_live_diff = loss_p0_2
        .iter()
        .zip(&loss_p_2)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    println!(
        "T21 GATE B (dropout live, world=2): p0[last]={:.6} p0.2[last]={:.6} max |loss diff| = {max_live_diff:.3e}",
        loss_p0_2.last().unwrap(),
        loss_p_2.last().unwrap()
    );
    assert!(
        max_live_diff > 1e-3,
        "p=0.2 DDP loss trace matches p=0: dropout is NOT live under DDP \
         (model.train() not wired): max |loss diff| {max_live_diff:.3e}"
    );

    // No NaN/Inf in the p>0 run (dropout converges normally under DDP).
    assert!(
        loss_p_2.iter().all(|l| l.is_finite()),
        "p=0.2 DDP loss has non-finite values"
    );

    // GATE C: train_rank actually sets TRAINING mode (direct, complementary proof of
    // model.train() being wired). Use a dedicated short run with eval_every=0 so no
    // eval fires: a model that finishes a training step in training mode proves
    // train_rank called model.train(). (With eval enabled, eval_loss → model.eval()
    // runs LAST on the final step and legitimately leaves the model in eval mode,
    // same as the single-GPU loop, so is_training() after an eval-enabled run reflects
    // the final eval, not the training-mode wiring. GATE B already proves dropout
    // survives the mid-run eval flips via the per-step model.train() restore.) On the
    // pre-T21 code is_training() stays false (model never left the default eval mode).
    let dcfg_noeval = DdpConfig {
        steps: 2,
        eval_every: 0,
        ..base_dcfg.clone()
    };
    let (_, _, train_flag) = run_ddp(&d1, cfg_p_w2, &corpus, None, &dcfg_noeval);
    assert!(
        train_flag,
        "model not in training mode after a no-eval DDP run: model.train() not wired in train_rank"
    );
    println!("T21 GATE C (train_rank sets training mode): is_training() == true ✅");
}
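
The trace comparisons in the test above (GATE A2 and GATE B) use the same zip/map/fold. A hedged sketch of that comparison as a standalone helper follows; `max_abs_diff` and the sample traces are hypothetical (the numbers are made up, not measured), and only the two thresholds (1e-6 for GATE A2, 1e-3 for GATE B) come from the test.

```rust
/// Largest element-wise |a - b| over two equal-length traces (hypothetical helper).
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "traces must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}

fn main() {
    // Made-up loss traces for illustration only.
    let p0 = [2.5f32, 2.1, 1.8];
    let p0_rerun = [2.5f32, 2.1000001, 1.8]; // one f32 ULP away at index 1
    let p02 = [2.53f32, 2.14, 1.83]; // dropout live: visibly different

    let noise = max_abs_diff(&p0, &p0_rerun);
    let live = max_abs_diff(&p0, &p02);
    println!("noise = {noise:.3e}, live = {live:.3e}");

    assert!(noise < 1e-6); // GATE A2-style tolerance
    assert!(live > 1e-3); // GATE B-style signal

    // Caveat: f32::max returns the non-NaN argument, so a NaN difference
    // never raises the fold above its starting value.
    assert_eq!(max_abs_diff(&[f32::NAN], &[1.0]), 0.0);
}
```

The last assertion shows why a separate `is_finite()` check is worth having next to a fold of this shape: an upper-bound gate such as `max diff < 1e-6` cannot detect a NaN trace by itself.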