xtrain

Files

Gahow Wang 25b032445d train: real batched step (drop loop+SUM)

Feed a real batch of B sequences through ONE batched forward/backward pass,
replacing the "loop B times + let the tape SUM grads + clip ×1/B" hack.
Cross-entropy averaged over all B*S rows is already the batch-mean loss, so
backward yields the batch-mean gradient directly and the clip pre-scale
becomes 1.0.
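
A minimal sketch of why the two schemes agree, assuming a toy model whose only
parameter is a logit bias `w` and whose sequences all have the same length S.
The helper names (`ce_grad`, `mean_grad`) are hypothetical and not taken from
the xtrain code; the point is only that "sum B per-sequence mean grads, then
scale by 1/B" equals "one mean over B*S rows".

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_grad(w, x, target):
    """Gradient of -log softmax(w + x)[target] with respect to w."""
    p = softmax([wi + xi for wi, xi in zip(w, x)])
    p[target] -= 1.0
    return p

def mean_grad(w, rows):
    """Gradient of the mean CE over rows, each row being (x, target)."""
    g = [0.0] * len(w)
    for x, t in rows:
        for i, gi in enumerate(ce_grad(w, x, t)):
            g[i] += gi
    return [gi / len(rows) for gi in g]

random.seed(0)
B, S, V = 4, 3, 5  # batch size, sequence length, vocab size (toy values)
w = [random.uniform(-1, 1) for _ in range(V)]
batch = [
    [([random.uniform(-1, 1) for _ in range(V)], random.randrange(V))
     for _ in range(S)]
    for _ in range(B)
]

# Old scheme: loop over sequences, SUM the per-sequence grads, pre-scale by 1/B.
summed = [0.0] * V
for seq in batch:
    summed = [a + b for a, b in zip(summed, mean_grad(w, seq))]
loop_grad = [g / B for g in summed]

# New scheme: one mean over all B*S rows, pre-scale 1.0.
flat_rows = [row for seq in batch for row in seq]
batched_grad = mean_grad(w, flat_rows)

max_diff = max(abs(a - b) for a, b in zip(loop_grad, batched_grad))
print(max_diff)
```

The two gradients differ only by floating-point rounding. The equality relies
on every sequence contributing the same number of rows.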

DDP stays equivalent: each rank runs one batched forward over its b_local =
B_global/world sequences, giving the local-mean grad Σ_local/b_local.
all_reduce_average (sum across ranks, then /world) gives Σ_global/B_global,
the global batch-mean, so the clip pre-scale is 1.0 here too. The
ddp_correctness single-GPU baseline batches the same way.
DDP loss matches single-GPU to 5.7e-7; cross-rank params are bit-identical
(diff 0.0).
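
The DDP arithmetic can be checked with a small sketch, assuming B_global is
divisible by world and using random vectors as stand-ins for per-sequence
gradients. The `mean` helper below only imitates what all_reduce_average is
described as doing (sum across ranks, then divide by world); it is not the
actual collective.

```python
import random

random.seed(0)
B_global, world, dim = 8, 4, 3  # toy sizes; B_global must divide evenly
b_local = B_global // world

# Stand-ins for the per-sequence gradients of the global batch.
seq_grads = [[random.uniform(-1, 1) for _ in range(dim)]
             for _ in range(B_global)]

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Each rank computes the mean gradient over its own b_local sequences.
local_means = [
    mean(seq_grads[r * b_local:(r + 1) * b_local]) for r in range(world)
]

# Averaging the local means across ranks (sum, then / world).
reduced = mean(local_means)

# Single-process reference: mean over the whole global batch.
global_mean = mean(seq_grads)

max_diff = max(abs(a - b) for a, b in zip(reduced, global_mean))
print(max_diff)
```

With unequal shard sizes the average of local means would no longer equal the
global mean, which is why the sketch fixes b_local = B_global/world.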

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-16 00:44:33 +08:00

xtrain-autodiff

autograd: batch dim for ops (flatten linears, batched attention)

2026-06-16 00:44:15 +08:00

xtrain-cuda

autograd: batch dim for ops (flatten linears, batched attention)

2026-06-16 00:44:15 +08:00

xtrain-distributed

train: real batched step (drop loop+SUM)

2026-06-16 00:44:33 +08:00

xtrain-model

model: batched forward [B,S]