libtest with --test-threads=1 (the documented invariant for this file's DDP
tests) runs tests alphabetically. The new
proc_per_gpu_dropout_is_live_and_p0_matches_no_dropout ('d') runs BEFORE
proc_per_gpu_matches_single_gpu_and_thread_path ('m'). It sets ENV_DROPOUT=0.2
via std::env::set_var; if left in place, the correctness test's spawned workers
would inherit it (Command inherits parent env by default) and build with
cfg.dropout=0.2 while its single-GPU baseline (run_single_gpu → test_config →
dropout=0) stays at 0 — GATE (a) `max_rel_single < 1e-3` would blow up by
orders of magnitude.
Two defenses:
- correctness test remove_var(ENV_DROPOUT) before spawn (belt): even if the
dropout test forgot to clean up, this test starts from a clean env.
- dropout test remove_var(ENV_DROPOUT, ENV_DUMP_DIR) at exit (suspenders):
keep the invariant "each test leaves the env as it found it" so any future
test added after these two starts clean too.
Same --test-threads=1 SAFETY comment applies (no concurrent env access).
Analogue of the ddp_dropout_is_live_and_p0_bit_identical test (T21, thread-per-
GPU) for the process-per-GPU launcher. Runs launch_processes twice on the same
corpus / init / config with the ONLY difference being cfg.dropout (passed
launcher→worker via a new XTRAIN_TEST_DROPOUT env — worker re-execs cannot
inherit argv changes), reads rank 0's loss trajectory from both runs, and
asserts GATE B: max |loss diff| > 1e-3.
The threshold sits ~4 orders of magnitude above this box's KI-5 cross-rank NCCL
noise floor (~1e-7), so it is an unambiguous "dropout mask is applied" signal,
not a noise measurement. Pre-fix (missing cfg.dropout = ... in the worker /
launcher, exactly the gap the paired launcher commit closes) both traces are
bit-identical and this test FAILs.
Also wires ENV_DROPOUT into the shared worker entry so the existing correctness
test's contract is unchanged (absent env → 0.0 → same synth run as before).
p0/ and p02/ subdirs isolate the two invocations' dumps.
Enable assistant-only supervised fine-tuning and a fixed chat-prompt eval path
used by the v12 SFT runs:
- cross_entropy ignores negative targets (-100 ignore-index), normalizing by
valid rows instead of all rows; CUDA fwd/bwd skip t<0 (ops.rs, nn.cu).
- Corpus gains optional labels + load_sft_tsv_cached: two-column TSV is
formatted as 'User: .. \nAssistant:' + answer + <|endoftext|>, prompt tokens
masked to -100 while answer+EOS are supervised; i32 label cache alongside the
u16 token cache; sample() retries windows that are fully masked; eval uses
target_window so masking applies to val loss too (data.rs, train_loop.rs).
- train + train_ddp: --sft-tsv selects the TSV loader, --init-ckpt continues
training from a base checkpoint.
- greedy_sample: --prompts-file/--prompt/--temperature for fixed chat-prompt
generation eval.
Test fixtures updated for the new Corpus.labels field; dropout.rs carries
incidental rustfmt. Not rebuilt locally (no CUDA toolchain on this checkout);
correctness rests on the documented v12 base+SFT runs on the GPU box.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher
path (DdpContext::init + train_rank). It would have caught the original bug:
- GATE A (world=1, ONE step — the deterministic scope): the p=0 FORWARD is
byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op) so the
step loss is BIT-IDENTICAL (== 0.0). At world=1 the NCCL all-reduce
short-circuits and one step has no optimizer-state compounding; the only
residual non-determinism is the engine's atomicAdd backward-reduction
order (the documented fresh-train md5 caveat — dropout-independent), so the
post-step params are checked against that tight ULP floor (< 1e-7).
- GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's
run-to-run ULP noise (< 1e-6, KI-5 — the all-reduce is not bit-reproducible
on this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond it.
- GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0 —
orders of magnitude above every noise floor here (~3e-2 observed). On the
pre-T21 code the model stays in eval mode, so p=0.2 would be an identity and
the trace would match p=0 at the noise floor — this gate fails. (Verified by
simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.)
- GATE C: a dedicated no-eval run ends with model.is_training() == true,
direct proof that train_rank called model.train().
- p>0 run is finite (no NaN/Inf).
eval_every < steps so a periodic eval fires mid-run (flipping to eval mode),
exercising the per-step model.train() restore discipline the pilot called out.
Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- grad_accum.rs: accum=N×B grads bit-close to a single N·B big batch;
accum_steps=1 bit-identical (max|Δ|==0) to no-accum; real train() loop
with accum tracks a big-batch baseline over 20 AdamW steps.
- ddp_correctness.rs: world=2 + accum=2 matches a single-GPU big batch of
the same effective size (loss + cross-rank + vs-baseline).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The cross-rank `max|p0-p1| == 0.0` check is flaky on this PCIe-only box: NCCL's
all-reduce is not bit-reproducible run-to-run across ranks (algorithm/chunk
choice is unstable), so cross-rank params can differ by a few ULP (observed
<=1.2e-7) even with identical init + averaged grads. The load-bearing gate is the
loss-trajectory match (~5.7e-7); a tight <1e-6 tolerance is the honest invariant.
Also extend ddp_throughput_scaling to include world=8 for the KI-5 before/after
scaling table.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Feed a real batch of B sequences as ONE batched forward/backward, replacing the
"loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows
is already the batch-mean loss, so backward yields the batch-mean gradient
directly → clip pre-scale = 1.0.
DDP stays equivalent: each rank runs one batched forward over its b_local =
B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average
(sum across ranks /world) = Σ_global/B_global = global batch-mean → clip
pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way.
DDP loss matches single-GPU 5.7e-7, cross-rank params bit-identical (0.0).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch
(scaling-ladder rung), the cached token-id stream (`load_cached`), held-out
val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 owns the
val corpus and runs the no-grad eval / writes the best checkpoint (params are
bit-identical across ranks). The eval/checkpoint logic is reused from
`xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated.
- DdpConfig gains eval_every / eval_batches / ckpt_path.
- train_rank takes `valid: Option<&Corpus>` and returns DdpResult
(losses + evals + best_val); launch threads the val corpus to rank 0 only.
- bin/train_ddp reworked to the bin/train CLI (positional tokenizer/corpus +
--dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/
--val-tokens/--eval-every/--ckpt), reusing the u16 cache.
- DDP correctness test updated to the new signatures (semantics unchanged).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
30-step bench charged the one-time NCCL init + 4 model builds (present at world=4,
absent at world=1) against the wall clock, understating steady-state scaling
(in-loop tok/s already showed ~53k at 4 GPUs). Bump to 150 steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dash5 verify: loss trajectory matches single-GPU to max_rel 1.16e-7 and
cross-rank params are bit-identical (0.0), but DDP-vs-single-GPU per-param rel
diff is ~2.8e-3 after 20 AdamW steps — expected, since the two differ only in
gradient summation order (fp add isn't associative) and that rounding compounds.
Bump check (c) 1e-3 -> 1e-2 (a/b stay tight). Also remove an unused DType import.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bin/train_ddp: spawn one thread per visible GPU (CUDA_VISIBLE_DEVICES selects
the set), NCCL all-reduce gradients each step, train the tiny transformer on
TinyStories; doubles as the throughput driver (prints global tok/s). no_cuda
build keeps a stub main.
tests/ddp_correctness: (1) 2-rank DDP vs single-GPU over the same synthetic data
-> loss trajectory max_rel < 1e-3, cross-rank params bit-identical (==0.0), DDP
vs single-GPU params rel < 1e-3; (2) 1/2/4-GPU throughput table on a fixed
per-GPU workload. Gated #[cfg(not(no_cuda))], auto-skips with < 2 GPUs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>