Config.dropout (default 0). TinyTransformer gets a Cell<bool> training switch
(train()/eval()/with_training, default eval = safe) + a Cell<u64> step_seed bumped
once per training forward. forward_batched derives a per-layer block_seed (pure fn
of step_seed×layer) and block_forward derives two per-site seeds, inserting
ops::dropout at the attn and ffn sub-block outputs (before each residual). The
seed is a pure function of (step_seed, layer, site) so the checkpoint (T13)
recompute re-derives the same masks → grads stay exact. p=0 or eval → no dropout
node → graph bit-identical to pre-T18.
train_loop: model.train() per step (restored after eval flips to eval); eval_loss
runs model.eval(). bin/train: --dropout flag → cfg.dropout. Export/sampling run in
eval (default), so exported weights are dropout-free (xserv closed loop unaffected).
Model-level tests (dropout.rs): p=0 bit-identical to no-dropout (logits/loss/grads);
eval(p>0) == p=0 identity; train differs from eval + finite; recompute-with-dropout
grads match non-recompute (fp32 + bf16).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Feed a real batch of B sequences as ONE batched forward/backward, replacing the
"loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows
is already the batch-mean loss, so backward yields the batch-mean gradient
directly → clip pre-scale = 1.0.
DDP stays equivalent: each rank runs one batched forward over its b_local =
B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average
(sum across ranks /world) = Σ_global/B_global = global batch-mean → clip
pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way.
DDP loss matches single-GPU 5.7e-7, cross-rank params bit-identical (0.0).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Expose eval_loss() and add a --eval-ckpt <path> branch to bin/train: load an
existing checkpoint into a model of the given arch and score it on the held-out
val split, then exit. Lets v0 and v1 be measured on the identical validation set
(the acceptance metric) without a separate eval binary.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Corpus::load_cached: tokenize the (large) corpus ONCE, cache the id stream to
<corpus>.u16.bin (gpt2 vocab 50257 < 65536 → exact u16), read cache on reruns.
- Corpus::split_tail: hold out a tail slice as a validation corpus.
- train(): take an optional valid corpus + eval_every/eval_batches; periodic
deterministic val-loss eval that checkpoints the BEST val model; returns
TrainResult{train_losses, evals, best_val}. T6 fixed-cadence path preserved.
- bin/train + bin/export_safetensors: read architecture (--heads/--head-dim/
--layers/--ffn) + opt knobs (--steps/--batch/--seq/--max-lr/--val-tokens/
--eval-every) from CLI flags; defaults reproduce the v0-baseline tiny config.
- gitignore the multi-GB corpus + *.u16.bin caches + *.ckpt (dash5-only).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Eliminate the per-step GPU↔host roundtrip of every parameter/gradient.
- optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum
(block-reduced global grad sum-of-squares), scale_inplace.
- GpuAdamW: device m/v state per param; step launches the kernel reading
each param's .grad() and rewriting the param buffer in place — no host
roundtrip. Host AdamW kept as the torch-parity reference.
- clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm
comes back), in-place rescale of grads by pre_scale·clip_factor.
- train_loop: use GpuAdamW + clip_grad_norm_gpu.
- test: GPU AdamW vs host reference parity (max abs err < 1e-6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Training loop (train_loop.rs): sample batch_size sequences, forward loss +
backward (tape SUMs grads), clip_grad_norm with ×1/batch averaging, AdamW step
with scheduled lr, zero_grad; logs loss/lr/gnorm/tok-s and checkpoints
periodically; returns the loss trace.
Checkpoint (checkpoint.rs): flat little-endian dump of params() in order
(magic/version/count + per-param ndim/dims/f32 data); load_into validates and
overwrites a matching model's params via set_value (exact f32 round-trip).
Sampler (sample.rs): autoregressive greedy / temperature generation — re-runs
forward on the growing prefix (model is single-sequence, RoPE pos=row).
bin/train.rs: end-to-end entry — load tokenizer+corpus, train a tiny 4-layer
model for a bounded budget, checkpoint, print samples. no_cuda stub keeps it
buildable on a GPU-less host.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>