xtrain

Author	SHA1	Message	Date
Gahow Wang	6465a2d5ce	test: T21-for-proc — clear ENV_DROPOUT across tests to sever ordering coupling libtest with --test-threads=1 (the documented invariant for this file's DDP tests) runs tests alphabetically. The new proc_per_gpu_dropout_is_live_and_p0_matches_no_dropout ('d') runs BEFORE proc_per_gpu_matches_single_gpu_and_thread_path ('m'). It sets ENV_DROPOUT=0.2 via std::env::set_var; if left in place, the correctness test's spawned workers would inherit it (Command inherits parent env by default) and build with cfg.dropout=0.2 while its single-GPU baseline (run_single_gpu → test_config → dropout=0) stays at 0 — GATE (a) `max_rel_single < 1e-3` would blow up by orders of magnitude. Two defenses: - correctness test remove_var(ENV_DROPOUT) before spawn (belt): even if the dropout test forgot to clean up, this test starts from a clean env. - dropout test remove_var(ENV_DROPOUT, ENV_DUMP_DIR) at exit (suspenders): keep the invariant "each test leaves the env as it found it" so any future test added after these two starts clean too. Same --test-threads=1 SAFETY comment applies (no concurrent env access).	2026-07-01 14:09:42 +08:00
Gahow Wang	33a1aee9ec	test: T21-for-proc — dropout-live regression under process-per-GPU Analogue of the ddp_dropout_is_live_and_p0_bit_identical test (T21, thread-per-GPU) for the process-per-GPU launcher. Runs launch_processes twice on the same corpus / init / config with the ONLY difference being cfg.dropout (passed launcher→worker via a new XTRAIN_TEST_DROPOUT env — worker re-execs cannot inherit argv changes), reads rank 0's loss trajectory from both runs, and asserts GATE B: max |loss diff| > 1e-3. The threshold sits ~4 orders of magnitude above this box's KI-5 cross-rank NCCL noise floor (~1e-7), so it is an unambiguous "dropout mask is applied" signal, not a noise measurement. Pre-fix (missing cfg.dropout = ... in the worker / launcher, exactly the gap the paired launcher commit closes) both traces are bit-identical and this test FAILs. Also wires ENV_DROPOUT into the shared worker entry so the existing correctness test's contract is unchanged (absent env → 0.0 → same synth run as before). p0/ and p02/ subdirs isolate the two invocations' dumps.	2026-07-01 13:51:31 +08:00
Gahow Wang	86de6bfb51	distributed: T21-for-proc — wire --dropout into the process-per-GPU launcher T21 fixed --dropout under thread-per-GPU (train_ddp): added the flag, set cfg.dropout, and made train_rank re-assert model.train() each step so the training forward stays live across periodic eval flips. The process-per-GPU launcher (train_ddp_mp) was left out: it never parsed --dropout, so cfg.dropout stayed at Config::from_arch's 0.0 default, and the worker's model built with dropout permanently disabled — silently, regardless of what the user passed. The gap is the exact same launcher-wiring class the V9-PILOT caught: op-level + single-GPU tests pass, the DDP-thread T21 regression test passes, but the proc-per-GPU launcher path was never exercised end-to-end with dropout>0. Mirror bin/train_ddp exactly: parse --dropout (default 0, bit-identical default), set cfg.dropout before build_model, print an ON banner on rank 0. train_rank's per-step model.train() from T21 is reused unchanged (proc-per-GPU uses the same train_rank). Follow-up test that exercises this wiring end-to-end (GATE B loss-trace divergence between p=0 and p=0.2 under process-per-GPU) lands in the next commit.	2026-07-01 13:51:17 +08:00
Gahow Wang	fbf4ac2917	sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval Enable assistant-only supervised fine-tuning and a fixed chat-prompt eval path used by the v12 SFT runs: - cross_entropy ignores negative targets (-100 ignore-index), normalizing by valid rows instead of all rows; CUDA fwd/bwd skip t<0 (ops.rs, nn.cu). - Corpus gains optional labels + load_sft_tsv_cached: two-column TSV is formatted as 'User: .. \nAssistant:' + answer + <|endoftext|>, prompt tokens masked to -100 while answer+EOS are supervised; i32 label cache alongside the u16 token cache; sample() retries windows that are fully masked; eval uses target_window so masking applies to val loss too (data.rs, train_loop.rs). - train + train_ddp: --sft-tsv selects the TSV loader, --init-ckpt continues training from a base checkpoint. - greedy_sample: --prompts-file/--prompt/--temperature for fixed chat-prompt generation eval. Test fixtures updated for the new Corpus.labels field; dropout.rs carries incidental rustfmt. Not rebuilt locally (no CUDA toolchain on this checkout); correctness rests on the documented v12 base+SFT runs on the GPU box. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:19:02 +08:00
Gahow Wang	980605474b	test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical) Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher path (DdpContext::init + train_rank). It would have caught the original bug: - GATE A (world=1, ONE step — the deterministic scope): the p=0 FORWARD is byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op) so the step loss is BIT-IDENTICAL (== 0.0). At world=1 the NCCL all-reduce short-circuits and one step has no optimizer-state compounding; the only residual non-determinism is the engine's atomicAdd backward-reduction order (the documented fresh-train md5 caveat — dropout-independent), so the post-step params are checked against that tight ULP floor (< 1e-7). - GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's run-to-run ULP noise (< 1e-6, KI-5 — the all-reduce is not bit-reproducible on this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond it. - GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0 — orders of magnitude above every noise floor here (~3e-2 observed). On the pre-T21 code the model stays in eval mode, so p=0.2 would be an identity and the trace would match p=0 at the noise floor — this gate fails. (Verified by simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.) - GATE C: a dedicated no-eval run ends with model.is_training() == true, direct proof that train_rank called model.train(). - p>0 run is finite (no NaN/Inf). eval_every < steps so a periodic eval fires mid-run (flipping to eval mode), exercising the per-step model.train() restore discipline the pilot called out. Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 21:22:49 +08:00
Gahow Wang	81f3cf59e5	distributed: T21 — wire dropout into the DDP path (--dropout + model.train()) V9-PILOT caught a launcher-level integration gap: T18 wired dropout into the single-GPU bin/train, but the DDP path never did. train_ddp had no --dropout flag and never set cfg.dropout, and ddp.rs::train_rank never called model.train() — so under DDP every forward ran in the default eval mode and dropout was a silent identity, regardless of config. Fix, mirroring the single-GPU train/eval discipline: - train_ddp.rs: add a --dropout <p> flag (default 0 = off, matching the prior behavior) and set cfg.dropout from it; log it when on. - ddp.rs::train_rank: call model.train() at the start of each step (before the micro-batch loop). eval_loss() flips the model to eval mode and does not restore it, so re-asserting train() each step keeps dropout live across eval boundaries. --dropout 0 (default) is bit-identical to the prior DDP path: cfg.dropout stays 0 and ops::dropout(p=0) is a clone no-op regardless of training mode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 21:08:17 +08:00
Gahow Wang	4abb17383a	test: process-per-GPU DDP correctness (ddp_proc.rs) Self-launching test: worker mode (XTRAIN_RANK set) trains on synthetic corpus and dumps loss+params; launcher mode runs single-GPU baseline + thread-per-GPU launch + spawns 2 worker processes, then asserts (a) proc loss == single-GPU <1e-3, (b) cross-rank params <1e-6 (KI-5 ULP), (c) proc loss == thread-per-GPU <1e-3. Run with --test-threads=1 (distributed harness property). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 17:48:52 +08:00
Gahow Wang	a188c8a277	distributed: train_ddp_mp bin (process-per-GPU launcher/worker) Dual-mode binary self-detecting via XTRAIN_RANK: launcher spawns one worker per visible GPU forwarding full argv; worker rebuilds config from argv and runs run_worker. CLI flags identical to train_ddp (thread-per-GPU, kept), so it doubles as the before->after throughput driver. thread-per-GPU path untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 17:48:52 +08:00
Gahow Wang	ffd548b80b	distributed: process-per-GPU launcher + worker (proc.rs) torchrun-style process-per-GPU: launch_processes spawns one worker process per GPU (re-exec current_exe with XTRAIN_{RANK,WORLD,LOCAL_RANK,NCCL_ID} env), mints the ncclUniqueId once in the launcher and hex-injects it via env (no shared FS/TCP, race-free). worker_env/run_worker read the env, bind the device (own CUDA context), DdpContext::init + build_model + train_rank reused from T8 UNCHANGED. hex_encode/decode_unique_id are host-testable pure fns. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 17:48:43 +08:00
Gahow Wang	830d06ad01	gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests) - repeat_kv CUDA kernel: fwd head-block gather, bwd DETERMINISTIC group-sum (each kv head sums its group of query-head grads; no atomics) + Tensor/ops node. - Config gains num_kv_heads (default = n_heads → MHA); wk/wv project to kv_dim; attention() repeat_kv-broadcasts K/V to nh heads before the UNCHANGED composed & flash SDPA → GQA on both paths. group=1 is identity → MHA bit-identical. - --kv-heads flag on train/train_ddp/export_safetensors/greedy_sample; export writes real num_key_value_heads (xserv repeat_kv grouping aligned). - Tests: repeat_kv grad-check (group>1 grad-sum + group=1 identity); model gqa.rs (GQA flash==composed fp32/bf16, group=1 bit-identical to MHA, kv-proj shape); parity_dump+parity.py GQA path (repeat_interleave) via XTRAIN_PARITY_KV_HEADS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:37:37 +08:00
Gahow Wang	f26db882e5	Merge t16-grad-accum into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> # Conflicts: # README.md # docs/evolution.md	2026-06-18 00:37:11 +08:00
Gahow Wang	abe5ceb913	test: grad-accum equivalence + accum=1 bit-identity + DDP+accum - grad_accum.rs: accum=N×B grads bit-close to a single N·B big batch; accum_steps=1 bit-identical (max|Δ|==0) to no-accum; real train() loop with accum tracks a big-batch baseline over 20 AdamW steps. - ddp_correctness.rs: world=2 + accum=2 matches a single-GPU big batch of the same effective size (loss + cross-rank + vs-baseline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:45:40 +08:00
Gahow Wang	7a03b0054a	train+ddp: micro-batch gradient accumulation (--accum-steps) Accumulate grads over N micro-batches, then one AdamW step + zero_grad, for an effective batch of N×micro at one micro-batch's activation cost. Each micro-loss is scaled by 1/N before backward (the tape SUM-accumulates the scaled grads) so the boundary grad equals a single step over an N× batch. accum==1 skips the scale → bit-identical to the pre-T16 path. DDP: the cross-rank all-reduce fires ONLY at the accumulation boundary (intermediate micro-steps are local-only, no NCCL); the /world average is orthogonal to the per-micro 1/N, so the boundary grad is the effective global-batch mean. New --accum-steps flag in both train binaries; effective batch is printed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:45:33 +08:00
Gahow Wang	5f3b81ac96	test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag autograd: flash_attention_batched_bwd (dQ/dK/dV finite-diff, seq>tile) + flash_matches_composed_fwd. model/tests/flash.rs: flash==composed on-vs-off (logits/loss/every param grad), fp32 + bf16. parity_dump: XTRAIN_PARITY_FLASH dumps the flash path for the same parity.py oracle (PyTorch SDPA parity at B>1). train + train_ddp get the --flash flag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:39 +08:00
Gahow Wang	f202351be5	model: per-block activation recompute (--recompute) Wrap each transformer block's forward in the checkpoint primitive when recompute is enabled (Phase T13 / KI-3). To make the block forward a pure segment fn (no `&self` borrow, so it can re-run in the backward closure), extract the block body + its helpers (linear / norm_gamma / attention / swiglu_mlp) into free functions parameterised by (cfg, compute_dtype) and add `Block::block_params()` (the 11 leaves in the params() per-block order). The non-recompute path calls `block_forward` directly — identical graph to before. - `TinyTransformer::with_recompute(bool)` builder (opt-in; default off keeps the unchanged tape / bit-identical numerics). - `--recompute` flag wired into bin/train and bin/train_ddp (DDP: each rank checkpoints independently). Correctness gate: tests/recompute.rs builds two identical models (recompute on/off), runs the same batched loss+backward, and asserts the forward logits, the loss, and EVERY parameter grad match within tight fp tol — parameterised over fp32 and bf16 (T12 composition). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:42:42 +08:00
Gahow Wang	0a2a4dcaa8	train: --bf16 flag (fp32-master AMP) + bf16 correctness test - TinyTransformer::with_compute_dtype(BF16): embedding stays fp32 master then casts to bf16; each linear casts its fp32 weight to bf16 on the fly; logits cast back to fp32 for cross-entropy. Default F32 reproduces the v0-v4 forward graph bit-for-bit. - --bf16 flag on bin/train and bin/train_ddp (off by default). - tests/bf16.rs: same fp32 master weights run fp32 vs bf16; assert loss/logits/grads within a loose bf16 tol, no NaN, and grads are fp32 (master untouched). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:14:55 +08:00
Gahow Wang	b7104e2cb7	test: loosen flaky DDP cross-rank assertion to <1e-6; scale to world=8 The cross-rank `max|p0-p1| == 0.0` check is flaky on this PCIe-only box: NCCL's all-reduce is not bit-reproducible run-to-run across ranks (algorithm/chunk choice is unstable), so cross-rank params can differ by a few ULP (observed <=1.2e-7) even with identical init + averaged grads. The load-bearing gate is the loss-trajectory match (~5.7e-7); a tight <1e-6 tolerance is the honest invariant. Also extend ddp_throughput_scaling to include world=8 for the KI-5 before/after scaling table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:11 +08:00
Gahow Wang	88c2c15768	Revert "dist: coalesce grads into buckets for all-reduce (KI-5)" This reverts commit `b8b58212dc`.	2026-06-16 09:39:38 +08:00
Gahow Wang	b8b58212dc	dist: coalesce grads into buckets for all-reduce (KI-5) Replace the per-parameter eager all-reduce (~150 tiny serial NCCL calls for dim512, DDP's dominant cost after T10's batched forward) with a coalesced bucketed all-reduce: pack grads into a few large contiguous scratch buffers, all-reduce each bucket once (fused via ncclGroupStart/End), fold the 1/world average into one per-bucket scale, unpack back. The packed buffer is the concatenation of the grad tensors, so NCCL's element-wise sum over a bucket equals the per-tensor sums — bit-identical to the un-bucketed path; only launch/latency overhead is removed. DDP cross-rank param identity + loss-match are preserved. Adds xtrain_cuda::device::copy_d2d (cudaMemcpy D2D) for the pack/unpack. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:09:44 +08:00
Gahow Wang	25b032445d	train: real batched step (drop loop+SUM) Feed a real batch of B sequences as ONE batched forward/backward, replacing the "loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows is already the batch-mean loss, so backward yields the batch-mean gradient directly → clip pre-scale = 1.0. DDP stays equivalent: each rank runs one batched forward over its b_local = B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average (sum across ranks /world) = Σ_global/B_global = global batch-mean → clip pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way. DDP loss matches single-GPU 5.7e-7, cross-rank params bit-identical (0.0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:33 +08:00
Gahow Wang	7090b475fb	train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch) The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch (scaling-ladder rung), the cached token-id stream (`load_cached`), held-out val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 owns the val corpus and runs the no-grad eval / writes the best checkpoint (params are bit-identical across ranks). The eval/checkpoint logic is reused from `xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated. - DdpConfig gains eval_every / eval_batches / ckpt_path. - train_rank takes `valid: Option<&Corpus>` and returns DdpResult (losses + evals + best_val); launch threads the val corpus to rank 0 only. - bin/train_ddp reworked to the bin/train CLI (positional tokenizer/corpus + --dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/--val-tokens/--eval-every/--ckpt), reusing the u16 cache. - DDP correctness test updated to the new signatures (semantics unchanged). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:34:40 +08:00
Gahow Wang	ad82e8bf92	dist: lengthen scaling bench so NCCL init amortizes 30-step bench charged the one-time NCCL init + 4 model builds (present at world=4, absent at world=1) against the wall clock, understating steady-state scaling (in-loop tok/s already showed ~53k at 4 GPUs). Bump to 150 steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:18:23 +08:00
Gahow Wang	818f76a18f	dist: drop unused import; relax DDP-vs-single-GPU param tolerance dash5 verify: loss trajectory matches single-GPU to max_rel 1.16e-7 and cross-rank params are bit-identical (0.0), but DDP-vs-single-GPU per-param rel diff is ~2.8e-3 after 20 AdamW steps — expected, since the two differ only in gradient summation order (fp add isn't associative) and that rounding compounds. Bump check (c) 1e-3 -> 1e-2 (a/b stay tight). Also remove an unused DType import. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:17:31 +08:00
Gahow Wang	cf5e3987df	dist: multi-rank launcher + ddp acceptance test bin/train_ddp: spawn one thread per visible GPU (CUDA_VISIBLE_DEVICES selects the set), NCCL all-reduce gradients each step, train the tiny transformer on TinyStories; doubles as the throughput driver (prints global tok/s). no_cuda build keeps a stub main. tests/ddp_correctness: (1) 2-rank DDP vs single-GPU over the same synthetic data -> loss trajectory max_rel < 1e-3, cross-rank params bit-identical (==0.0), DDP vs single-GPU params rel < 1e-3; (2) 1/2/4-GPU throughput table on a fixed per-GPU workload. Gated #[cfg(not(no_cuda))], auto-skips with < 2 GPUs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:41 +08:00
Gahow Wang	163f567c80	dist: ddp all-reduce + sharded batch DDP training step (train_rank) on top of DdpContext: each rank advances the SAME RNG, draws the whole global batch, and runs forward+backward only on its shard (i % world == rank) so the union over ranks is the single-GPU batch in the same order. After backward, all-reduce-average the device grads, then finish the mean with clip(pre_scale = 1/b_local) -> Sigma_global/B_global, identical to the single-GPU clip(1/B). Each rank then runs its own GpuAdamW.step; same init + same averaged grad + same optimizer state keep params bit-identical across ranks. Adds a deterministic build_model (same LCG init as bin/train) shared by ranks + baseline, a per-step loss all-reduce for the reported global-mean loss, and the thread-per-GPU launch() helper (thread::scope; Var graph is !Send so each rank builds its model thread-locally, only UniqueId/config/&Corpus cross threads). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:29 +08:00
Gahow Wang	e27df50ca9	dist: nccl ffi + comm bootstrap New crate xtrain-distributed (mirrors xserv-distributed): hand-written NCCL FFI (GetUniqueId / CommInitRank / AllReduce / CommDestroy / Group{Start,End}, ncclUniqueId passed by value per the NCCL ABI) and a safe DdpContext wrapper — rank 0 mints the UniqueId, every rank inits its communicator under a group, and all_reduce_average_grads in-place AllReduce(sum)s each param's .grad() device buffer then scales by 1/world (reuses T7's scale_inplace kernel). AllReduce runs on the null stream so it orders with the model's kernels (no extra barrier). build.rs follows the per-crate convention: no nvcc -> no_cuda cfg (crate compiles to empty, cargo check passes host-side); with nvcc, links -lnccl -lcudart like xserv-distributed's build.rs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:14:56 +08:00
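
Commit 7a03b0054a scales each micro-loss by 1/N so that the gradient at the accumulation boundary equals a single step over an N× batch. The arithmetic can be sketched as below. This is an illustration only, not xtrain code: it assumes equal-sized micro-batches, stands in plain f64 per-sample gradients for the tape, and the names `batch_mean` and `accumulated` are hypothetical.

```rust
/// Mean of the per-sample gradients in one batch (what a mean-reduced loss yields).
fn batch_mean(grads: &[f64]) -> f64 {
    grads.iter().sum::<f64>() / grads.len() as f64
}

/// Split `grads` into `n` equal micro-batches, scale each micro-batch mean by 1/n,
/// and sum the scaled values, mirroring a tape that SUM-accumulates scaled grads.
fn accumulated(grads: &[f64], n: usize) -> f64 {
    assert!(n > 0 && grads.len() % n == 0, "micro-batches must be equal-sized");
    let micro = grads.len() / n;
    grads.chunks(micro).map(|c| batch_mean(c) / n as f64).sum()
}

fn main() {
    // Hypothetical per-sample gradients for one scalar parameter.
    let per_sample = [0.5, -1.25, 2.0, 0.75, -0.5, 1.5, 3.0, -2.0];
    let big = batch_mean(&per_sample);
    let acc = accumulated(&per_sample, 4);
    // The accumulated gradient matches the big-batch mean up to rounding.
    assert!((big - acc).abs() < 1e-12);
    println!("big-batch = {big}, accumulated = {acc}");
}
```

The same identity underlies the DDP case in that commit: averaging equal-sized local means across ranks gives the global-batch mean, so the 1/N and /world factors compose independently.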

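Commit fbf4ac2917 makes cross_entropy skip negative targets (-100 ignore-index) and normalize by the number of valid rows rather than all rows. A host-side sketch of that rule follows. It is not the xtrain kernel: the function name `cross_entropy_ignore`, the `Vec<Vec<f32>>` layout, and the choice to return 0.0 when every row is masked are all assumptions made for illustration.

```rust
/// Mean cross-entropy over the rows whose target is non-negative.
/// Rows with a negative target (e.g. -100) add nothing and are not counted.
fn cross_entropy_ignore(logits: &[Vec<f32>], targets: &[i32]) -> f32 {
    assert_eq!(logits.len(), targets.len());
    let mut total = 0.0f32;
    let mut valid = 0usize;
    for (row, &t) in logits.iter().zip(targets) {
        if t < 0 {
            continue; // masked position, e.g. a prompt token
        }
        let t = t as usize;
        assert!(t < row.len(), "target out of range");
        // log-sum-exp with the row max subtracted for numerical stability
        let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let lse = m + row.iter().map(|&x| (x - m).exp()).sum::<f32>().ln();
        total += lse - row[t];
        valid += 1;
    }
    // Assumption: a fully masked batch yields 0.0 instead of dividing by zero.
    if valid == 0 { 0.0 } else { total / valid as f32 }
}

fn main() {
    // Row 0 is supervised, row 1 is masked with the ignore-index.
    let logits = vec![vec![0.0f32, 0.0], vec![5.0, -3.0]];
    let loss = cross_entropy_ignore(&logits, &[1, -100]);
    // Only row 0 counts, so the loss is ln(2) for two equal logits.
    assert!((loss - std::f32::consts::LN_2).abs() < 1e-6);
    println!("loss over valid rows = {loss}");
}
```

Dividing by all rows instead would shrink the loss in proportion to how much of the window is masked prompt text, which is why the valid-row count is the denominator here.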
26 Commits
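
Commit ffd548b80b passes the ncclUniqueId from launcher to workers as a hex string in an environment variable and notes that the encode/decode helpers are host-testable pure functions. A minimal sketch of such a pair is shown below. The names `encode_unique_id` and `decode_unique_id` and their signatures are assumptions, not the actual proc.rs API; the 128-byte length matches NCCL's `NCCL_UNIQUE_ID_BYTES`.

```rust
/// Size of an ncclUniqueId in bytes (NCCL_UNIQUE_ID_BYTES in nccl.h).
const ID_BYTES: usize = 128;

/// Encode the raw ID bytes as a lowercase hex string (2 chars per byte).
fn encode_unique_id(id: &[u8; ID_BYTES]) -> String {
    let mut s = String::with_capacity(ID_BYTES * 2);
    for b in id {
        s.push_str(&format!("{b:02x}"));
    }
    s
}

/// Decode a hex string back into the ID; returns None on bad length or characters.
fn decode_unique_id(hex: &str) -> Option<[u8; ID_BYTES]> {
    if hex.len() != ID_BYTES * 2 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = [0u8; ID_BYTES];
    for (i, slot) in out.iter_mut().enumerate() {
        // Safe to slice by byte index: the string is known to be ASCII here.
        *slot = u8::from_str_radix(&hex[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(out)
}

fn main() {
    // Stand-in bytes; a real launcher would get these from ncclGetUniqueId.
    let id: [u8; ID_BYTES] = core::array::from_fn(|i| (i * 7 % 256) as u8);
    let hex = encode_unique_id(&id);
    assert_eq!(hex.len(), 256);
    assert_eq!(decode_unique_id(&hex), Some(id));
    println!("round-trip ok, {} hex chars", hex.len());
}
```

Because both functions touch no CUDA or NCCL state, they can be unit-tested on a machine without a GPU, which fits the no_cuda build convention described in e27df50ca9.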