test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical)
Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher path (DdpContext::init + train_rank). It would have caught the original bug:

- GATE A (world=1, ONE step: the deterministic scope): the p=0 FORWARD is byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op), so the step loss is BIT-IDENTICAL (|loss diff| == 0.0). At world=1 the NCCL all-reduce short-circuits and one step has no optimizer-state compounding; the only residual non-determinism is the engine's atomicAdd backward-reduction order (the documented fresh-train md5 caveat, dropout-independent), so the post-step params are checked against that tight ULP floor (< 1e-7).
- GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's run-to-run ULP noise (< 1e-6, KI-5: the all-reduce is not bit-reproducible on this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond that noise.
- GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0, orders of magnitude above every noise floor here (~3e-2 observed). On the pre-T21 code the model stays in eval mode, so p=0.2 would be an identity and the trace would match p=0 at the noise floor, so this gate fails. (Verified by simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.)
- GATE C: a dedicated no-eval run ends with model.is_training() == true, direct proof that train_rank called model.train().
- p>0 run is finite (no NaN/Inf).

eval_every < steps so a periodic eval fires mid-run (flipping to eval mode), exercising the per-step model.train() restore discipline the pilot called out.

Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
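The gates above rest on three properties of inverted dropout: it is an identity at p=0, an identity in eval mode, and a mask-plus-rescale in training mode with p>0. The following is a minimal, dependency-free sketch of those semantics. It is not the repository's `ops::dropout`; the `XorShift32` generator and the slice-based `dropout` signature are hypothetical stand-ins chosen only for illustration.

```rust
/// Tiny deterministic xorshift PRNG so the sketch needs no external crates.
/// (Hypothetical helper; the seed must be non-zero.)
struct XorShift32(u32);

impl XorShift32 {
    /// Uniform sample in [0, 1).
    fn next_f32(&mut self) -> f32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.0 = x;
        // Keep the top 24 bits so the quotient is exactly representable in f32.
        (x >> 8) as f32 / (1u32 << 24) as f32
    }
}

/// Inverted dropout: an identity when `p == 0` or when not training;
/// otherwise zero each element with probability `p` and scale survivors
/// by 1 / (1 - p) so the expected value is unchanged.
fn dropout(x: &[f32], p: f32, training: bool, rng: &mut XorShift32) -> Vec<f32> {
    if p == 0.0 || !training {
        return x.to_vec();
    }
    let scale = 1.0 / (1.0 - p);
    x.iter()
        .map(|&v| if rng.next_f32() < p { 0.0 } else { v * scale })
        .collect()
}
```

Under this sketch, a model stuck in eval mode produces the same activations for p=0.2 as for p=0, which is the silent-identity failure the p>0 gate is designed to expose.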
@@ -38,6 +38,53 @@ fn test_config(vocab: usize) -> Config {
     cfg
 }
 
+/// Run `cfg`/`dcfg` as a DDP job over `devices` (the same launcher path as
+/// production — `DdpContext::init` + `train_rank` per rank) and return rank 0's
+/// (loss trace, final params on host, final `is_training()` flag). `cfg` carries
+/// the dropout prob; `dcfg` carries the loop knobs. Caller asserts.
+///
+/// `world == 1` is the deterministic path: `all_reduce_average_grads` short-circuits
+/// (no NCCL collective), so the run is bit-reproducible — used for the bit-identity
+/// gate. `world >= 2` exercises the real cross-rank NCCL all-reduce, which is not
+/// bit-reproducible run-to-run on this PCIe box (KI-5), so those gates use the same
+/// ULP/relative tolerances as the rest of this file.
+fn run_ddp(
+    devices: &[u32],
+    cfg: Config,
+    corpus: &Corpus,
+    valid: Option<&Corpus>,
+    dcfg: &DdpConfig,
+) -> (Vec<f32>, Vec<Vec<f32>>, bool) {
+    let world = devices.len();
+    let id = get_unique_id();
+    let results: Vec<(Vec<f32>, Vec<Vec<f32>>, bool)> = std::thread::scope(|s| {
+        let handles: Vec<_> = devices
+            .iter()
+            .enumerate()
+            .map(|(rank, &dev)| {
+                let dcfg = dcfg.clone();
+                let corpus = &corpus;
+                s.spawn(move || {
+                    let ctx = DdpContext::init(rank, world, id, dev);
+                    let device = Device::Cuda(dev);
+                    let model = build_model(cfg, device);
+                    // Only rank 0 holds the val corpus (mirrors launch()).
+                    let v = if rank == 0 { valid } else { None };
+                    let res = train_rank(&ctx, &model, device, corpus, v, &dcfg);
+                    let host = model
+                        .params()
+                        .iter()
+                        .map(|p| p.value().to_device(Device::Cpu).as_slice::<f32>().to_vec())
+                        .collect::<Vec<_>>();
+                    (res.losses, host, model.is_training())
+                })
+            })
+            .collect();
+        handles.into_iter().map(|h| h.join().unwrap()).collect()
+    });
+    results.into_iter().next().unwrap()
+}
+
 // Single-GPU baseline: the SAME loop as the DDP rank but world=1, so the global
 // batch is processed on one device. Returns (loss trace, final params on host).
 fn run_single_gpu(cfg: Config, corpus: &Corpus, dcfg: &DdpConfig) -> (Vec<f32>, Vec<Vec<f32>>) {
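The fan-out in `run_ddp` (one scoped thread per rank, join all, keep rank 0's result) can be sketched in isolation with only the standard library. `run_ranks` and its closure signature are hypothetical names for illustration; the real helper additionally initializes `DdpContext` and calls `train_rank` on each thread.

```rust
use std::thread;

/// Run `work(rank, world, dev)` on one scoped thread per device and return
/// rank 0's result. Scoped threads may borrow from the caller's stack, which
/// is why no `Arc` or `'static` bound is needed here.
fn run_ranks<T, F>(devices: &[u32], work: F) -> T
where
    T: Send,
    F: Fn(usize, usize, u32) -> T + Sync,
{
    let world = devices.len();
    let results: Vec<T> = thread::scope(|s| {
        // Spawn every rank first so they all run concurrently...
        let handles: Vec<_> = devices
            .iter()
            .enumerate()
            .map(|(rank, &dev)| {
                let work = &work;
                s.spawn(move || work(rank, world, dev))
            })
            .collect();
        // ...then join in rank order, so index 0 is rank 0.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    results.into_iter().next().expect("at least one device")
}
```

Collecting the handles before joining matters: joining inside the `map` would serialize the ranks, and a collective that waits on every rank would then never complete.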
@@ -386,3 +433,181 @@ fn ddp_throughput_scaling() {
         );
     }
 }
+
+/// T21 regression: prove dropout is actually LIVE under DDP (with `p>0`), and that
+/// `p=0` is bit-identical to the no-dropout path. Guards the V9-PILOT launcher-
+/// wiring gap — `train_ddp` had no `--dropout` flag and `train_rank` never called
+/// `model.train()`, so under DDP every forward ran in the default eval mode and
+/// dropout was a silent identity regardless of config. Op/single-GPU tests never
+/// exercised dropout-under-DDP, so it slipped through; this test runs the REAL
+/// launcher path (`DdpContext::init` + `train_rank`).
+///
+/// On the pre-T21 code, both load-bearing gates FAIL: GATE B (p>0 trace would match
+/// p=0 at the NCCL noise floor: model stuck in eval mode → dropout is identity) and
+/// GATE C (`is_training()` would be false after the run).
+///
+/// p=0 regression (GATE A) is checked at `world=1`, ONE step, where the NCCL
+/// all-reduce short-circuits: the p=0 FORWARD is byte-identical to no-dropout so the
+/// loss is BIT-IDENTICAL (|loss diff| == 0.0), and the post-step params match within
+/// the engine's atomicAdd backward-reduction ULP floor (< 1e-7, dropout-independent —
+/// the fresh-train md5 caveat). The cross-rank NCCL all-reduce (`world>=2`) is not
+/// bit-reproducible run-to-run on this PCIe box (KI-5, observed ≤~2.4e-7), so the
+/// `world=2` p=0-vs-no-dropout check (GATE A2) uses the same KI-5 ULP tolerance as the
+/// rest of this file. GATE B's live-dropout signal (>1e-3) sits ~4 orders of magnitude
+/// above every noise floor here, so it carries the load.
+#[test]
+fn ddp_dropout_is_live_and_p0_bit_identical() {
+    if device::device_count().unwrap_or(0) < 2 {
+        eprintln!("skip: need >= 2 GPUs");
+        return;
+    }
+
+    let vocab = 64usize;
+    let corpus = synth_corpus(vocab, 4096);
+    let steps = 20usize;
+    // eval_every < steps so a periodic eval fires MID-run (flipping the model to
+    // eval mode via eval_loss → model.eval()). The per-step model.train() must
+    // restore training mode so dropout stays live across the eval boundary — this is
+    // exactly the train/eval discipline the pilot called out. A held-out slice gives
+    // rank 0 something to eval on.
+    let valid = synth_corpus(vocab, 512);
+    let base_dcfg = DdpConfig {
+        seq_len: 32,
+        batch_size: 8, // global; 4 per rank with world=2
+        accum_steps: 1,
+        steps,
+        schedule: LrSchedule {
+            max_lr: 3e-3,
+            min_lr: 3e-4,
+            warmup: 3,
+            total: steps,
+        },
+        weight_decay: 0.1,
+        max_grad_norm: 1.0,
+        log_every: 1_000_000, // silence per-step logging
+        seed: 7,
+        eval_every: 7, // fires at steps 6, 13, 19 — flips to eval mode mid-run
+        eval_batches: 4,
+        ckpt_path: None,
+    };
+
+    // --- GATE A: p=0 == no-dropout at world=1, ONE step (the deterministic scope). ---
+    // The regression guard for `--dropout 0`. ops::dropout(p=0) returns x.clone() (a
+    // graph no-op) regardless of training mode, so the p=0 FORWARD graph is byte-for-
+    // byte the no-dropout forward → loss[0] must be BIT-IDENTICAL (the load-bearing
+    // claim, asserted == 0.0). At world=1 the NCCL all-reduce short-circuits, and one
+    // step has no optimizer-state compounding; the only residual non-determinism is
+    // the engine's atomicAdd backward-reduction ORDER (the documented fresh-train md5
+    // caveat — dropout-INDEPENDENT, present with or without the dropout op), which
+    // moves the post-step params by a single grad ULP. So params are checked against
+    // that tight reduction floor (< 1e-7), the same nature as the cross-rank KI-5
+    // tolerance used elsewhere in this file — not a dropout signal. GATE B (live) has
+    // a >1e-3 signal, ~4 orders of magnitude above this floor, so it carries the load.
+    let d1 = [0u32];
+    let dcfg_1step = DdpConfig {
+        steps: 1,
+        eval_every: 0,
+        ..base_dcfg.clone()
+    };
+    let cfg_nodrop = test_config(vocab); // cfg.dropout defaults to 0.0
+    assert_eq!(cfg_nodrop.dropout, 0.0, "baseline cfg must have dropout 0");
+    let mut cfg_p0 = test_config(vocab);
+    cfg_p0.dropout = 0.0; // explicitly set p=0 — must not perturb anything
+    let (loss_nd1, params_nd1, _) = run_ddp(&d1, cfg_nodrop, &corpus, None, &dcfg_1step);
+    let (loss_p01, params_p01, _) = run_ddp(&d1, cfg_p0, &corpus, None, &dcfg_1step);
+    let max_loss_diff_1 = (loss_nd1[0] - loss_p01[0]).abs();
+    let max_param_diff_1 = params_nd1
+        .iter()
+        .zip(&params_p01)
+        .flat_map(|(a, b)| a.iter().zip(b).map(|(x, y)| (x - y).abs()))
+        .fold(0.0f32, f32::max);
+    println!(
+        "T21 GATE A (world=1, 1 step, p=0 vs no-dropout): |loss diff| = {max_loss_diff_1:.3e} \
+         (bit-identical forward), max |param diff| = {max_param_diff_1:.3e} (atomicAdd floor)"
+    );
+    assert_eq!(
+        max_loss_diff_1, 0.0,
+        "world=1 p=0 forward loss not bit-identical to no-dropout path"
+    );
+    assert!(
+        max_param_diff_1 < 1e-7,
+        "world=1 p=0 post-step params diverged from no-dropout beyond the atomicAdd \
+         reduction floor: {max_param_diff_1:.3e}"
+    );
+
+    // --- world=2 runs: real cross-rank NCCL all-reduce (the production path). ---
+    let d2 = [0u32, 1u32];
+    let mut cfg_p0_w2 = test_config(vocab);
+    cfg_p0_w2.dropout = 0.0;
+    let mut cfg_p_w2 = test_config(vocab);
+    cfg_p_w2.dropout = 0.2;
+    let (loss_p0_2, _params_p0_2, _) = run_ddp(&d2, cfg_p0_w2, &corpus, Some(&valid), &base_dcfg);
+    let (loss_p_2, _params_p_2, _) = run_ddp(&d2, cfg_p_w2, &corpus, Some(&valid), &base_dcfg);
+
+    // GATE A2 — under DDP (world=2), p=0 matches a separate no-dropout baseline within
+    // NCCL's run-to-run ULP noise (KI-5; the all-reduce is not bit-reproducible). This
+    // confirms enabling dropout=0 doesn't perturb the DDP path beyond that noise floor.
+    let (loss_nd_2, _, _) = run_ddp(&d2, test_config(vocab), &corpus, Some(&valid), &base_dcfg);
+    let max_loss_diff_2 = loss_nd_2
+        .iter()
+        .zip(&loss_p0_2)
+        .map(|(a, b)| (a - b).abs())
+        .fold(0.0f32, f32::max);
+    println!(
+        "T21 GATE A2 (world=2 p=0 vs no-dropout, KI-5 noise): max |loss diff| = {max_loss_diff_2:.3e}"
+    );
+    assert!(
+        max_loss_diff_2 < 1e-6,
+        "world=2 p=0 diverged from no-dropout beyond NCCL noise: {max_loss_diff_2:.3e}"
+    );
+
+    // GATE B — dropout is LIVE with p>0 under DDP. If model.train() were not wired
+    // (the pre-T21 bug), the model would stay in eval mode and the p=0.2 forward would
+    // be IDENTITY → loss trace matching p=0 to within the ~1e-7 NCCL noise floor.
+    // A difference orders of magnitude above that proves dropout masks are
+    // actually applied during the training forward — and that they survive the mid-run
+    // eval flips (model.train() is re-asserted each step). Inverted scaling + masking
+    // perturbs every step, so the gap is large (>1e-3 ≫ KI-5 noise ~2.4e-7).
+    let max_live_diff = loss_p0_2
+        .iter()
+        .zip(&loss_p_2)
+        .map(|(a, b)| (a - b).abs())
+        .fold(0.0f32, f32::max);
+    println!(
+        "T21 GATE B (dropout live, world=2): p0[last]={:.6} p0.2[last]={:.6} max |loss diff| = {max_live_diff:.3e}",
+        loss_p0_2.last().unwrap(),
+        loss_p_2.last().unwrap()
+    );
+    assert!(
+        max_live_diff > 1e-3,
+        "p=0.2 DDP loss trace matches p=0 — dropout is NOT live under DDP \
+         (model.train() not wired): max |loss diff| {max_live_diff:.3e}"
+    );
+
+    // No NaN/Inf in the p>0 run (dropout converges normally under DDP).
+    assert!(
+        loss_p_2.iter().all(|l| l.is_finite()),
+        "p=0.2 DDP loss has non-finite values"
+    );
+
+    // GATE C — train_rank actually sets TRAINING mode (direct, complementary proof of
+    // model.train() being wired). Use a dedicated short run with eval_every=0 so no
+    // eval fires: a model that finishes a training step in training mode proves
+    // train_rank called model.train(). (With eval enabled, eval_loss → model.eval()
+    // runs LAST on the final step and legitimately leaves the model in eval mode —
+    // same as the single-GPU loop — so is_training() after an eval-enabled run reflects
+    // the final eval, not the training-mode wiring. GATE B already proves dropout
+    // survives the mid-run eval flips via the per-step model.train() restore.) On the
+    // pre-T21 code is_training() stays false (model never left the default eval mode).
+    let dcfg_noeval = DdpConfig {
+        steps: 2,
+        eval_every: 0,
+        ..base_dcfg.clone()
+    };
+    let (_, _, train_flag) = run_ddp(&d1, cfg_p_w2, &corpus, None, &dcfg_noeval);
+    assert!(
+        train_flag,
+        "model not in training mode after a no-eval DDP run — model.train() not wired in train_rank"
+    );
+    println!("T21 GATE C (train_rank sets training mode): is_training() == true ✅");
+}
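The tolerance gates in this test all reduce to a maximum absolute difference, and they need a noise floor because f32 addition is not associative: accumulating the same gradients in a different order can change the low bits. A small sketch of both ideas follows. `max_abs_diff` and `sum_in_order` are hypothetical helper names, not functions from this repository.

```rust
/// Largest element-wise absolute difference between two traces.
/// Unlike a bare `zip`, this rejects traces of unequal length instead of
/// silently truncating to the shorter one.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "traces must have equal length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}

/// Left-to-right f32 sum. The result depends on the order of `xs`
/// because each partial sum is rounded to the nearest f32.
fn sum_in_order(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}
```

For example, `sum_in_order(&[1.0e8, -1.0e8, 1.0])` is `1.0`, while `sum_in_order(&[1.0, 1.0e8, -1.0e8])` is `0.0`, because `1.0` is below half an ULP of `1.0e8` and is lost when added first. Note also that `f32::max` returns the non-NaN operand, so a NaN difference would pass through a fold like this unnoticed; a separate `is_finite` check is what catches it.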
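The train/eval discipline that GATE B and GATE C probe can be modeled with a toy loop that tracks only the mode flag. This is a sketch under stated assumptions: `ToyModel` and `run_loop` are hypothetical, the toy takes `&mut self` where the real model is used through a shared reference, and the eval schedule (after every `eval_every`-th step and after the last step) is inferred from the "fires at steps 6, 13, 19" comment rather than taken from `train_rank`.

```rust
/// Minimal stand-in for a model that only tracks its train/eval flag.
#[derive(Default)]
struct ToyModel {
    training: bool, // defaults to false, i.e. eval mode
}

impl ToyModel {
    fn train(&mut self) {
        self.training = true;
    }
    fn eval(&mut self) {
        self.training = false;
    }
    fn is_training(&self) -> bool {
        self.training
    }
}

/// Returns, per step, whether the forward pass ran in training mode, plus the
/// flag left on the model after the loop. With `restore_each_step == false`
/// the mode is set once up front and never restored after an eval.
fn run_loop(steps: usize, eval_every: usize, restore_each_step: bool) -> (Vec<bool>, bool) {
    let mut model = ToyModel::default();
    if !restore_each_step {
        model.train();
    }
    let mut live = Vec::with_capacity(steps);
    for step in 0..steps {
        if restore_each_step {
            model.train(); // re-assert training mode before every forward
        }
        live.push(model.is_training()); // stands in for the training forward
        let is_last = step + 1 == steps;
        if eval_every > 0 && ((step + 1) % eval_every == 0 || is_last) {
            model.eval(); // periodic eval flips the mode
        }
    }
    (live, model.is_training())
}
```

With `run_loop(20, 7, false)` only the first 7 forwards run in training mode, since the first eval leaves the model in eval mode for the rest of the run. With per-step restore every forward is live, yet the final flag is still `false` because the last eval runs after the last step, which is why the mode check needs a separate run with `eval_every = 0`.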