xtrain

Author	SHA1	Message	Date
Gahow Wang	e625aa05dd	dropout: wire into model (residual sites) + train/eval switch + flag (T18) Config.dropout (default 0). TinyTransformer gets a Cell<bool> training switch (train()/eval()/with_training, default eval = safe) + a Cell<u64> step_seed bumped once per training forward. forward_batched derives a per-layer block_seed (pure fn of step_seed×layer) and block_forward derives two per-site seeds, inserting ops::dropout at the attn and ffn sub-block outputs (before each residual). The seed is a pure function of (step_seed, layer, site) so the checkpoint (T13) recompute re-derives the same masks → grads stay exact. p=0 or eval → no dropout node → graph bit-identical to pre-T18. train_loop: model.train() per step (restored after eval flips to eval); eval_loss runs model.eval(). bin/train: --dropout flag → cfg.dropout. Export/sampling run in eval (default), so exported weights are dropout-free (xserv closed loop unaffected). Model-level tests (dropout.rs): p=0 bit-identical to no-dropout (logits/loss/grads); eval(p>0) == p=0 identity; train differs from eval + finite; recompute-with-dropout grads match non-recompute (fp32 + bf16). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:05:32 +08:00
Gahow Wang	25b032445d	train: real batched step (drop loop+SUM) Feed a real batch of B sequences as ONE batched forward/backward, replacing the "loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows is already the batch-mean loss, so backward yields the batch-mean gradient directly → clip pre-scale = 1.0. DDP stays equivalent: each rank runs one batched forward over its b_local = B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average (sum across ranks /world) = Σ_global/B_global = global batch-mean → clip pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way. DDP loss matches single-GPU 5.7e-7, cross-rank params bit-identical (0.0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:33 +08:00
Gahow Wang	ec8114ecbc	train: --eval-ckpt eval-only mode (v0-vs-v1 same-set val loss) Expose eval_loss() and add a --eval-ckpt <path> branch to bin/train: load an existing checkpoint into a model of the given arch and score it on the held-out val split, then exit. Lets v0 and v1 be measured on the identical validation set (the acceptance metric) without a separate eval binary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:44:40 +08:00
Gahow Wang	e44e50ef78	data: full TinyStories + tokenized-id cache, val loss, CLI arch - Corpus::load_cached: tokenize the (large) corpus ONCE, cache the id stream to <corpus>.u16.bin (gpt2 vocab 50257 < 65536 → exact u16), read cache on reruns. - Corpus::split_tail: hold out a tail slice as a validation corpus. - train(): take an optional valid corpus + eval_every/eval_batches; periodic deterministic val-loss eval that checkpoints the BEST val model; returns TrainResult{train_losses, evals, best_val}. T6 fixed-cadence path preserved. - bin/train + bin/export_safetensors: read architecture (--heads/--head-dim/--layers/--ffn) + opt knobs (--steps/--batch/--seq/--max-lr/--val-tokens/--eval-every) from CLI flags; defaults reproduce the v0-baseline tiny config. - gitignore the multi-GB corpus + .u16.bin caches + .ckpt (dash5-only). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:34:48 +08:00
Gahow Wang	b0e397ca81	perf: GPU AdamW + grad-norm Eliminate the per-step GPU↔host roundtrip of every parameter/gradient. - optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum (block-reduced global grad sum-of-squares), scale_inplace. - GpuAdamW: device m/v state per param; step launches the kernel reading each param's .grad() and rewriting the param buffer in place — no host roundtrip. Host AdamW kept as the torch-parity reference. - clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm comes back), in-place rescale of grads by pre_scale·clip_factor. - train_loop: use GpuAdamW + clip_grad_norm_gpu. - test: GPU AdamW vs host reference parity (max abs err < 1e-6). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:09 +08:00
Gahow Wang	77a82bfeee	train: loop + checkpoint save/load + sampler + train binary Training loop (train_loop.rs): sample batch_size sequences, forward loss + backward (tape SUMs grads), clip_grad_norm with ×1/batch averaging, AdamW step with scheduled lr, zero_grad; logs loss/lr/gnorm/tok-s and checkpoints periodically; returns the loss trace. Checkpoint (checkpoint.rs): flat little-endian dump of params() in order (magic/version/count + per-param ndim/dims/f32 data); load_into validates and overwrites a matching model's params via set_value (exact f32 round-trip). Sampler (sample.rs): autoregressive greedy / temperature generation — re-runs forward on the growing prefix (model is single-sequence, RoPE pos=row). bin/train.rs: end-to-end entry — load tokenizer+corpus, train a tiny 4-layer model for a bounded budget, checkpoint, print samples. no_cuda stub keeps it buildable on a GPU-less host. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:29:58 +08:00
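
Commit 77a82bfeee describes the checkpoint as a flat little-endian dump of the parameters in order (magic/version/count, then per-param ndim/dims/f32 data), with load_into validating against a matching model. A minimal sketch of that layout follows. The magic value, the version number, the u32 field widths, and the `Param`, `save`, and `load_into` names used here are all assumptions for illustration, not the actual checkpoint.rs code.

```rust
// Hypothetical header constants; the real magic/version are not given in the log.
const MAGIC: u32 = 0x5854_524E;
const VERSION: u32 = 1;

/// One parameter: its shape and its flat f32 data (hypothetical stand-in type).
#[derive(Debug, Clone, PartialEq)]
pub struct Param {
    pub dims: Vec<u32>,
    pub data: Vec<f32>,
}

/// Serialize: magic, version, count, then per-param ndim, dims, raw f32 data.
/// Every field is written little-endian.
pub fn save(params: &[Param]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&MAGIC.to_le_bytes());
    out.extend_from_slice(&VERSION.to_le_bytes());
    out.extend_from_slice(&(params.len() as u32).to_le_bytes());
    for p in params {
        out.extend_from_slice(&(p.dims.len() as u32).to_le_bytes());
        for d in &p.dims {
            out.extend_from_slice(&d.to_le_bytes());
        }
        for x in &p.data {
            out.extend_from_slice(&x.to_le_bytes());
        }
    }
    out
}

/// Read one little-endian u32 at `pos`, advancing `pos`; errors on truncation.
fn read_u32(bytes: &[u8], pos: &mut usize) -> Result<u32, String> {
    let end = pos.checked_add(4).ok_or("offset overflow")?;
    let chunk = bytes.get(*pos..end).ok_or("truncated checkpoint")?;
    let mut buf = [0u8; 4];
    buf.copy_from_slice(chunk);
    *pos = end;
    Ok(u32::from_le_bytes(buf))
}

/// Validate the header and every shape against `params`, then overwrite data.
/// Values are restored bit-for-bit, so the f32 round-trip is exact.
/// Note: on a mid-file error, earlier params have already been overwritten.
pub fn load_into(bytes: &[u8], params: &mut [Param]) -> Result<(), String> {
    let mut pos = 0;
    if read_u32(bytes, &mut pos)? != MAGIC {
        return Err("bad magic".into());
    }
    if read_u32(bytes, &mut pos)? != VERSION {
        return Err("unsupported version".into());
    }
    if read_u32(bytes, &mut pos)? as usize != params.len() {
        return Err("param count mismatch".into());
    }
    for p in params.iter_mut() {
        if read_u32(bytes, &mut pos)? as usize != p.dims.len() {
            return Err("ndim mismatch".into());
        }
        for &d in &p.dims {
            if read_u32(bytes, &mut pos)? != d {
                return Err("shape mismatch".into());
            }
        }
        for x in p.data.iter_mut() {
            *x = f32::from_bits(read_u32(bytes, &mut pos)?);
        }
    }
    if pos != bytes.len() {
        return Err("trailing bytes".into());
    }
    Ok(())
}
```

Calling `save` on a model's parameters and then `load_into` on a freshly constructed model of the same architecture should reproduce every value exactly. A model with a different shape is rejected rather than silently reinterpreted.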

6 Commits
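
Commit e625aa05dd relies on the dropout seed being a pure function of (step_seed, layer, site), so that a checkpoint recompute regenerates the same masks and gradients stay exact. The sketch below illustrates that idea only. The SplitMix64-style mixer, the per-element counter scheme, and the `mix64`, `site_seed`, and `dropout_mask` names are assumptions; the log does not say how ops::dropout derives or consumes its seed.

```rust
/// SplitMix64-style mixing step (an assumed choice of mixer; it is a bijection on u64).
fn mix64(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Per-site seed as a pure function of (step_seed, layer, site).
/// Hypothetical convention: site 0 = attention output, site 1 = FFN output.
pub fn site_seed(step_seed: u64, layer: u64, site: u64) -> u64 {
    let block_seed = mix64(mix64(step_seed) ^ layer);
    mix64(block_seed ^ site)
}

/// Inverted-dropout mask of length `n` for drop probability `p` in [0, 1).
/// Element i depends only on (seed, i), so calling this again with the same
/// seed during recompute yields an identical mask.
pub fn dropout_mask(seed: u64, n: usize, p: f32) -> Vec<f32> {
    let keep_scale = 1.0 / (1.0 - p);
    (0..n as u64)
        .map(|i| {
            let r = mix64(seed ^ mix64(i));
            // Use the top 24 bits as a uniform value in [0, 1).
            let u = (r >> 40) as f32 / (1u64 << 24) as f32;
            if u < p { 0.0 } else { keep_scale }
        })
        .collect()
}
```

Because no state is carried between calls, the forward pass and a later recompute can each call `dropout_mask(site_seed(step, layer, site), n, p)` independently and obtain the same mask. With p = 0 this sketch returns an all-ones mask, whereas the commit goes further and skips the dropout node entirely.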