xtrain

Files

Gahow Wang 7090b475fb train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch)

The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch
(scaling-ladder rung), the cached token-id stream (`load_cached`), held-out
val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 owns the
val corpus, runs the no-grad eval, and writes the best checkpoint; one rank
is enough because params are bit-identical across ranks. The eval/checkpoint
logic is reused from
`xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated.
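
For readers unfamiliar with the "LR warmup→cosine" schedule named above, here
is a minimal sketch of one common form: a linear ramp to the peak rate,
followed by cosine decay. The function name `lr_at` and the parameters
`warmup_steps`, `total_steps`, and `min_lr` are assumptions made for
illustration and are not taken from `bin/train`; only the peak rate
corresponds to the `--max-lr` flag.

```rust
/// Learning rate at `step` (0-based): linear warmup, then cosine decay.
/// All names here are hypothetical; the real trainer may differ in detail.
pub fn lr_at(
    step: usize,
    warmup_steps: usize,
    total_steps: usize,
    max_lr: f32,
    min_lr: f32,
) -> f32 {
    if warmup_steps > 0 && step < warmup_steps {
        // Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) as f32 / warmup_steps as f32;
    }
    if step >= total_steps {
        // Past the end of the schedule: hold at the floor.
        return min_lr;
    }
    // Fraction of the decay phase completed, in [0, 1).
    let decay_steps = (total_steps - warmup_steps).max(1);
    let progress = (step - warmup_steps) as f32 / decay_steps as f32;
    // Cosine factor goes from 1 (start of decay) down to 0 (end of decay).
    let cosine = 0.5 * (1.0 + (std::f32::consts::PI * progress).cos());
    min_lr + (max_lr - min_lr) * cosine
}
```

In a DDP setting every rank would evaluate the same pure function of `step`,
so the ranks apply identical learning rates without any communication.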

- DdpConfig gains eval_every / eval_batches / ckpt_path.
- train_rank takes `valid: Option<&Corpus>` and returns DdpResult
  (losses + evals + best_val); launch threads the val corpus to rank 0 only.
- bin/train_ddp reworked to the bin/train CLI (positional tokenizer/corpus +
  --dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/
  --val-tokens/--eval-every/--ckpt), reusing the u16 cache.
- DDP correctness test updated to the new signatures (semantics unchanged).
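
The eval cadence and best-val bookkeeping described in the list above can be
sketched as follows. This is an illustration under stated assumptions, not
the code in `xtrain-distributed`: the field types, the `BestVal` tracker, and
the `should_eval` helper are all hypothetical, and only the field names
`eval_every`, `eval_batches`, and `ckpt_path` come from the commit message.

```rust
use std::path::PathBuf;

/// Hypothetical shape of the new DdpConfig fields; the real types may differ.
pub struct EvalConfig {
    pub eval_every: usize,          // run eval every N steps (0 disables it)
    pub eval_batches: usize,        // number of val batches averaged per eval
    pub ckpt_path: Option<PathBuf>, // where the best checkpoint would be written
}

/// Hypothetical tracker for the eval history and the best val loss so far.
#[derive(Default)]
pub struct BestVal {
    pub best: Option<f32>,
    pub evals: Vec<(usize, f32)>,
}

impl BestVal {
    /// Records one eval result. Returns true when it is a new best,
    /// which is the caller's cue to write a checkpoint.
    pub fn record(&mut self, step: usize, val_loss: f32) -> bool {
        self.evals.push((step, val_loss));
        let improved = self.best.map_or(true, |b| val_loss < b);
        if improved {
            self.best = Some(val_loss);
        }
        improved
    }
}

/// Only rank 0 holds the val corpus, so only rank 0 ever evaluates.
pub fn should_eval(rank: usize, has_valid: bool, step: usize, eval_every: usize) -> bool {
    rank == 0 && has_valid && eval_every > 0 && (step + 1) % eval_every == 0
}
```

A training loop on each rank would call `should_eval` after the optimizer
step; on rank 0 it would then compute the val loss, pass it to `record`, and
save a checkpoint when `record` returns true. Other ranks skip the branch
entirely because they receive `None` for the val corpus.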

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 19:34:40 +08:00

xtrain-autodiff

ops: grad-check the T5 structural ops

2026-06-15 16:05:20 +08:00

xtrain-cuda

perf: streams / drop per-op sync

2026-06-15 16:56:17 +08:00

xtrain-distributed

train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch)

2026-06-15 19:34:40 +08:00

xtrain-model

train: parameterize model size (scaling ladder)