The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch
(scaling-ladder rung), the cached token-id stream (`load_cached`), held-out
val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 alone
holds the val corpus, runs the no-grad eval, and writes the best checkpoint;
one rank is enough because params are bit-identical across ranks. The
eval/checkpoint logic is reused from
`xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated.
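
For reference, the warmup→cosine shape can be sketched as below. This is a
minimal illustration, not the code in `bin/train`: the function name `lr_at`
and the `warmup_steps` / `min_lr` parameters are assumptions made for the
example (only `--max-lr` and `--steps` appear in the CLI above).

```rust
/// Hypothetical schedule: linear warmup to `max_lr`, then cosine decay to `min_lr`.
/// All names here are illustrative and not taken from the xtrain crates.
fn lr_at(step: usize, total_steps: usize, warmup_steps: usize, max_lr: f32, min_lr: f32) -> f32 {
    if warmup_steps > 0 && step < warmup_steps {
        // Linear ramp; step 0 already gets max_lr / warmup_steps rather than zero.
        return max_lr * (step + 1) as f32 / warmup_steps as f32;
    }
    // Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    let decay_steps = total_steps.saturating_sub(warmup_steps).max(1);
    let progress = ((step - warmup_steps) as f32 / decay_steps as f32).min(1.0);
    // Half-cosine factor: 1.0 at the start of decay, 0.0 at the end.
    let cosine = 0.5 * (1.0 + (std::f32::consts::PI * progress).cos());
    min_lr + (max_lr - min_lr) * cosine
}
```

Every rank would evaluate the same pure function of `step`, so a schedule of
this form needs no cross-rank communication.
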

- `DdpConfig` gains `eval_every` / `eval_batches` / `ckpt_path`.
- `train_rank` takes `valid: Option<&Corpus>` and returns `DdpResult`
  (losses + evals + best_val); `launch` threads the val corpus to rank 0 only.
- `bin/train_ddp` reworked to the `bin/train` CLI (positional
  tokenizer/corpus +
  --dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/
  --val-tokens/--eval-every/--ckpt), reusing the u16 cache.
- DDP correctness test updated to the new signatures (semantics unchanged).

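
The rank-0 eval gate and best-val bookkeeping described above can be sketched
roughly as follows. `BestVal` and `should_eval` are hypothetical names for
illustration only; the real logic lives in `train_rank` and reuses
`eval_loss` / `checkpoint::save`. Evaluating when `step + 1` is a multiple of
`eval_every` is also an assumption.

```rust
/// Hypothetical tracker for the lowest validation loss seen so far.
struct BestVal {
    best: f32,
}

impl BestVal {
    fn new() -> Self {
        Self { best: f32::INFINITY }
    }

    /// Returns true when `val_loss` beats the previous best, i.e. when the
    /// caller should write a checkpoint. A NaN loss never counts as better.
    fn observe(&mut self, val_loss: f32) -> bool {
        if val_loss < self.best {
            self.best = val_loss;
            true
        } else {
            false
        }
    }
}

/// Only rank 0 evaluates, and only when it holds a val corpus and evaluation
/// is enabled (`eval_every > 0`).
fn should_eval(rank: usize, step: usize, eval_every: usize, has_valid: bool) -> bool {
    rank == 0 && has_valid && eval_every > 0 && (step + 1) % eval_every == 0
}
```

Keeping the gate a pure function of `(rank, step)` means the other ranks skip
eval without any extra synchronization point in this sketch.
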
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>