xtrain

Files

Gahow Wang cf5e3987df dist: multi-rank launcher + ddp acceptance test

bin/train_ddp: spawn one thread per visible GPU (CUDA_VISIBLE_DEVICES selects
the set), NCCL all-reduce gradients each step, train the tiny transformer on
TinyStories; doubles as the throughput driver (prints global tok/s). no_cuda
build keeps a stub main.
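
For illustration, the one-thread-per-rank pattern with a per-step gradient all-reduce can be sketched on the CPU as follows. This is not the repository's code: `all_reduce_mean` is a hypothetical name, and a `Mutex` plus `Barrier` stand in for NCCL, assuming the reduction averages gradients across ranks.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// CPU stand-in for an all-reduce(mean). Hypothetical helper, not the
/// NCCL call used by bin/train_ddp: each "rank" is a thread that adds its
/// local gradient into a shared buffer, waits for all ranks, then reads
/// back the averaged result.
fn all_reduce_mean(local_grads: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    assert!(!local_grads.is_empty(), "need at least one rank");
    let world = local_grads.len();
    let len = local_grads[0].len();
    let sum = Arc::new(Mutex::new(vec![0.0f32; len]));
    let barrier = Arc::new(Barrier::new(world));

    let handles: Vec<_> = local_grads
        .into_iter()
        .map(|grad| {
            let sum = Arc::clone(&sum);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                assert_eq!(grad.len(), len, "all ranks must agree on length");
                {
                    // Contribute this rank's gradient to the shared sum.
                    let mut s = sum.lock().unwrap();
                    for (acc, g) in s.iter_mut().zip(&grad) {
                        *acc += *g;
                    }
                }
                // No rank reads the sum until every rank has contributed.
                barrier.wait();
                let s = sum.lock().unwrap();
                s.iter().map(|v| v / world as f32).collect::<Vec<f32>>()
            })
        })
        .collect();

    // One averaged gradient per rank, in rank order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Every rank reads the same accumulated buffer, so the averaged gradients come out identical across ranks regardless of the order in which threads added their contributions.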

tests/ddp_correctness: (1) 2-rank DDP vs single-GPU over the same synthetic data
-> loss trajectory max_rel < 1e-3, cross-rank params bit-identical (==0.0), DDP
vs single-GPU params rel < 1e-3; (2) 1/2/4-GPU throughput table on a fixed
per-GPU workload. Gated #[cfg(not(no_cuda))], auto-skips with < 2 GPUs.
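
The two kinds of check named above (a relative tolerance and exact cross-rank equality) could look roughly like this. Both helpers are hypothetical and the exact `max_rel` definition used by the test is not shown in the commit message; this sketch assumes an element-wise relative difference against the reference, with a small floor on the denominator.

```rust
/// Largest element-wise relative difference of `got` against `reference`.
/// Assumed definition: |got - ref| / max(|ref|, EPSILON).
fn max_rel(got: &[f32], reference: &[f32]) -> f32 {
    assert_eq!(got.len(), reference.len());
    got.iter()
        .zip(reference)
        .map(|(g, r)| (g - r).abs() / r.abs().max(f32::EPSILON))
        .fold(0.0, f32::max)
}

/// Exact equality on the raw bit patterns. This is stricter than a
/// max-difference of 0.0: it also distinguishes 0.0 from -0.0.
fn bit_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```

A test in this style would assert `max_rel(&ddp_losses, &single_losses) < 1e-3` for the loss trajectory and `bit_identical(&rank0_params, &rank1_params)` across ranks, where the slice names are placeholders.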

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 17:15:41 +08:00

xtrain-autodiff

ops: grad-check the T5 structural ops

2026-06-15 16:05:20 +08:00

xtrain-cuda

perf: streams / drop per-op sync

2026-06-15 16:56:17 +08:00

xtrain-distributed

dist: multi-rank launcher + ddp acceptance test

2026-06-15 17:15:41 +08:00

xtrain-model

model: silence torch parity warning (read loss before backward)