xtrain

Files

Gahow Wang 7a03b0054a train+ddp: micro-batch gradient accumulation (--accum-steps)

Accumulate grads over N micro-batches, then one AdamW step + zero_grad,
for an effective batch of N×micro at one micro-batch's activation cost.
Each micro-loss is scaled by 1/N before backward (the tape SUM-accumulates
the scaled grads) so the boundary grad equals the grad of a single step over
an N× batch. accum==1 skips the scale → bit-identical to the pre-T16 path.
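
A minimal sketch of the 1/N scaling argument above, not the repository's code. It assumes a hypothetical one-parameter least-squares model, equal-sized micro-batches, and a mean-reduced loss; `mean_grad` and `accumulated_grad` are illustrative names. Because the gradient is linear in the loss, scaling each micro-gradient by 1/N here stands in for scaling the micro-loss before backward.

```rust
/// Gradient w.r.t. `w` of the batch-mean loss 0.5 * (w*x - y)^2.
/// Hypothetical toy model, used only to make the arithmetic checkable.
fn mean_grad(w: f64, xs: &[f64], ys: &[f64]) -> f64 {
    let n = xs.len() as f64;
    xs.iter().zip(ys).map(|(x, y)| (w * x - y) * x).sum::<f64>() / n
}

/// Accumulate the gradient over `accum_steps` equal-sized micro-batches.
/// `grad` plays the role of a gradient buffer that sums across backward calls.
fn accumulated_grad(w: f64, xs: &[f64], ys: &[f64], accum_steps: usize) -> f64 {
    assert!(!xs.is_empty() && xs.len() == ys.len());
    assert!(accum_steps > 0 && xs.len() % accum_steps == 0);
    let micro = xs.len() / accum_steps;
    let mut grad = 0.0;
    for (mx, my) in xs.chunks(micro).zip(ys.chunks(micro)) {
        let g = mean_grad(w, mx, my);
        // Scale by 1/N; with N == 1 the scale is skipped entirely,
        // so the result matches the unaccumulated path bit for bit.
        grad += if accum_steps == 1 { g } else { g / accum_steps as f64 };
    }
    grad // the optimizer step and zero_grad would follow here
}
```

Under these assumptions, `accumulated_grad(w, xs, ys, n)` agrees with `mean_grad(w, xs, ys)` up to floating-point rounding for any `n` that divides the batch, while each iteration only touches one micro-batch.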

DDP: the cross-rank all-reduce fires ONLY at the accumulation boundary
(intermediate micro-steps are local-only, no NCCL). The division by world
size is independent of the per-micro-batch 1/N scale, so the boundary grad
is the mean over the effective global batch. New --accum-steps flag in both
train binaries; the effective batch size is printed.
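
The boundary-only all-reduce can be sketched in a single process as follows. This is an illustration under stated assumptions, not the repository's NCCL path: ranks are simulated as slots in a slice, `all_reduce_mean` and `ddp_accumulate` are hypothetical names, and every rank is assumed to run the same number of equal-sized micro-batches.

```rust
/// Simulated all-reduce with averaging: every rank ends up holding the
/// mean of all ranks' values (sum across ranks, then divide by world size).
fn all_reduce_mean(per_rank: &mut [f64]) {
    let mean = per_rank.iter().sum::<f64>() / per_rank.len() as f64;
    per_rank.iter_mut().for_each(|g| *g = mean);
}

/// `micro_grads[rank][step]` is the gradient of one micro-batch on one rank.
/// Returns the per-rank gradient after the boundary and the number of
/// all-reduce calls that were issued.
fn ddp_accumulate(micro_grads: &[Vec<f64>]) -> (Vec<f64>, usize) {
    let accum_steps = micro_grads[0].len();
    let mut grads = vec![0.0; micro_grads.len()];
    let mut all_reduce_calls = 0;
    for step in 0..accum_steps {
        // Local-only accumulation: each rank adds its scaled micro-gradient.
        for (rank, g) in grads.iter_mut().enumerate() {
            *g += micro_grads[rank][step] / accum_steps as f64;
        }
        // Communicate only at the accumulation boundary.
        if step + 1 == accum_steps {
            all_reduce_mean(&mut grads);
            all_reduce_calls += 1;
        }
    }
    (grads, all_reduce_calls)
}
```

With two ranks and two micro-steps each, the result on every rank is the mean of all four micro-gradients, and only one all-reduce is issued per optimizer step instead of one per micro-step.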

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-17 23:45:33 +08:00

src

train+ddp: micro-batch gradient accumulation (--accum-steps)

2026-06-17 23:45:33 +08:00

tests

test: loosen flaky DDP cross-rank assertion to <1e-6; scale to world=8

2026-06-16 11:04:11 +08:00

build.rs

dist: nccl ffi + comm bootstrap

2026-06-15 17:14:56 +08:00

Cargo.toml

dist: ddp all-reduce + sharded batch

2026-06-15 17:15:29 +08:00