xtrain/crates/xtrain-distributed
Gahow Wang 7a03b0054a train+ddp: micro-batch gradient accumulation (--accum-steps)
Accumulate grads over N micro-batches, then take one AdamW step + zero_grad,
for an effective batch of N × the micro-batch size at one micro-batch's
activation cost. Each micro-loss is scaled by 1/N before backward (the tape
SUM-accumulates the scaled grads), so the boundary grad equals the grad of a
single step over an N× batch. accum==1 skips the scale → bit-identical to the
pre-T16 path.
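
The equivalence above can be checked on a toy problem. The following is a minimal sketch, not the crate's code: `mse_grad` and `accumulated_grad` are hypothetical names, the model is a single scalar weight with a mean-squared-error loss, and it assumes equally sized micro-batches whose loss is a per-micro-batch mean. Scaling each micro-gradient by 1/N stands in for scaling the micro-loss before backward, since the gradient is linear in the loss.

```rust
/// Gradient w.r.t. `w` of the loss 0.5 * mean((w*x - y)^2) over one batch.
/// Hypothetical toy model, used only to illustrate the accumulation rule.
fn mse_grad(w: f64, xs: &[f64], ys: &[f64]) -> f64 {
    let n = xs.len() as f64;
    xs.iter().zip(ys).map(|(x, y)| (w * x - y) * x).sum::<f64>() / n
}

/// Split the batch into `accum_steps` equal micro-batches and SUM the
/// micro-gradients, each scaled by 1/N (equivalent to scaling the micro-loss).
fn accumulated_grad(w: f64, xs: &[f64], ys: &[f64], accum_steps: usize) -> f64 {
    assert!(accum_steps > 0 && !xs.is_empty() && xs.len() % accum_steps == 0);
    let micro = xs.len() / accum_steps;
    let mut grad = 0.0;
    for (mx, my) in xs.chunks(micro).zip(ys.chunks(micro)) {
        let g = mse_grad(w, mx, my);
        // With a single step the scale is skipped, mirroring the accum==1 path.
        grad += if accum_steps == 1 { g } else { g / accum_steps as f64 };
    }
    grad
}
```

Calling `accumulated_grad(w, &xs, &ys, n)` for any `n` that divides the batch length should agree with `mse_grad(w, &xs, &ys)` up to floating-point rounding. If the micro-batches were unequal in size, a uniform 1/N scale would no longer reproduce the full-batch mean.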

DDP: the cross-rank all-reduce fires ONLY at the accumulation boundary
(intermediate micro-steps are local-only, no NCCL). The 1/world average is
independent of the per-micro 1/N scale and composes with it, so the boundary
grad is the mean over the effective global batch. New --accum-steps flag in
both train binaries; the effective batch size is printed.
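
The boundary-only all-reduce can be simulated in a single process. This is a rough sketch under stated assumptions, not the crate's implementation: ranks are plain vectors rather than NCCL peers, `all_reduce_mean` and `run_ddp_accum` are hypothetical names, the optimizer step is replaced by recording the boundary gradient, and the number of micro-steps is assumed to be a multiple of `accum_steps`.

```rust
/// Simulated all-reduce: every rank's buffer is overwritten with the
/// cross-rank mean (sum over ranks divided by world size).
fn all_reduce_mean(per_rank: &mut [Vec<f64>]) {
    let world = per_rank.len() as f64;
    let len = per_rank[0].len();
    for i in 0..len {
        let mean = per_rank.iter().map(|g| g[i]).sum::<f64>() / world;
        for g in per_rank.iter_mut() {
            g[i] = mean;
        }
    }
}

/// `micro_grads[rank][step]` is the gradient of one micro-batch on one rank.
/// Returns the gradient seen at each accumulation boundary and the number of
/// (simulated) collectives that were issued.
fn run_ddp_accum(micro_grads: &[Vec<Vec<f64>>], accum_steps: usize) -> (Vec<Vec<f64>>, usize) {
    assert!(accum_steps > 0 && !micro_grads.is_empty());
    let steps = micro_grads[0].len();
    assert!(steps > 0 && steps % accum_steps == 0);
    let dim = micro_grads[0][0].len();
    let scale = 1.0 / accum_steps as f64;
    let mut bufs = vec![vec![0.0; dim]; micro_grads.len()];
    let mut boundary_grads = Vec::new();
    let mut collectives = 0;
    for step in 0..steps {
        // Local-only micro-step: accumulate the 1/N-scaled gradient per rank.
        for (buf, rank_grads) in bufs.iter_mut().zip(micro_grads) {
            for (b, g) in buf.iter_mut().zip(&rank_grads[step]) {
                *b += scale * g;
            }
        }
        // Communicate only at the accumulation boundary.
        if (step + 1) % accum_steps == 0 {
            all_reduce_mean(&mut bufs);
            collectives += 1;
            boundary_grads.push(bufs[0].clone()); // stand-in for the optimizer step
            for buf in bufs.iter_mut() {
                buf.iter_mut().for_each(|b| *b = 0.0); // zero_grad
            }
        }
    }
    (boundary_grads, collectives)
}
```

With 2 ranks, 4 micro-steps and `accum_steps = 2`, the sketch issues two collectives instead of four, and each boundary gradient is the plain mean of the four micro-gradients (2 ranks × 2 micro-steps) in that window.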

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:45:33 +08:00