Files
xtrain/crates
Gahow Wang 25b032445d train: real batched step (drop loop+SUM)
Feed a real batch of B sequences as ONE batched forward/backward, replacing the
"loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows
is already the batch-mean loss, so backward yields the batch-mean gradient
directly → clip pre-scale = 1.0.

DDP stays equivalent: each rank runs one batched forward over its b_local =
B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average
(sum across ranks /world) = Σ_global/B_global = global batch-mean → clip
pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way.
DDP loss matches single-GPU 5.7e-7, cross-rank params bit-identical (0.0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:33 +08:00
..
2026-06-16 00:44:25 +08:00