Files
xtrain/crates
Gahow Wang 163f567c80 dist: ddp all-reduce + sharded batch
DDP training step (train_rank) on top of DdpContext: each rank advances the
SAME RNG, draws the whole global batch, and runs forward+backward only on its
shard (i % world == rank) so the union over ranks is the single-GPU batch in the
same order. After backward, all-reduce-average the device grads, then finish the
mean with clip(pre_scale = 1/b_local) -> Sigma_global/B_global, identical to the
single-GPU clip(1/B). Each rank then runs its own GpuAdamW.step; same init +
same averaged grad + same optimizer state keep params bit-identical across ranks.

Adds a deterministic build_model (same LCG init as bin/train) shared by ranks +
baseline, a per-step loss all-reduce for the reported global-mean loss, and the
thread-per-GPU launch() helper (thread::scope; Var graph is !Send so each rank
builds its model thread-locally, only UniqueId/config/&Corpus cross threads).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:15:29 +08:00
..
2026-06-15 16:53:09 +08:00