DDP training step (train_rank) on top of DdpContext

Each rank advances the same RNG, draws the whole global batch, and runs forward+backward only on its shard (i % world == rank), so the union over ranks is the single-GPU batch in the same order. After backward, all-reduce-average the device grads, then finish the mean with clip(pre_scale = 1/b_local) -> Sigma_global/B_global, identical to the single-GPU clip(1/B). Each rank then runs its own GpuAdamW.step; same init + same averaged grad + same optimizer state keep params bit-identical across ranks.

Also adds a deterministic build_model (same LCG init as bin/train) shared by ranks + baseline, a per-step loss all-reduce for the reported global-mean loss, and the thread-per-GPU launch() helper (thread::scope; the Var graph is !Send, so each rank builds its model thread-locally and only UniqueId/config/&Corpus cross threads).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
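The gradient arithmetic described above can be checked on the CPU with a toy model. The sketch below is not the crate's code: `sample_grad`, `shard_grad_sum`, `ddp_mean_grad`, and `single_gpu_mean_grad` are hypothetical names, the "model" is a single scalar weight, and the all-reduce is simulated by summing per-rank results on the calling thread. It assumes the global batch divides evenly across ranks, so that B_global = world * b_local.

```rust
use std::thread;

/// Toy per-sample gradient (hypothetical model): d/dw of 0.5 * (w*x - y)^2.
fn sample_grad(w: f64, x: f64, y: f64) -> f64 {
    (w * x - y) * x
}

/// Sum (not mean) of the gradients for the samples owned by `rank`,
/// using the round-robin rule i % world == rank.
fn shard_grad_sum(batch: &[(f64, f64)], w: f64, rank: usize, world: usize) -> f64 {
    batch
        .iter()
        .enumerate()
        .filter(|(i, _)| i % world == rank)
        .map(|(_, &(x, y))| sample_grad(w, x, y))
        .sum()
}

/// Simulated DDP step: one scoped thread per rank, then an
/// all-reduce average followed by the 1/b_local pre-scale.
fn ddp_mean_grad(batch: &[(f64, f64)], w: f64, world: usize) -> f64 {
    assert!(world > 0 && batch.len() % world == 0, "equal shards assumed");
    let b_local = batch.len() / world;

    // Each rank sees the whole batch but only works on its own shard.
    let sums: Vec<f64> = thread::scope(|s| {
        let handles: Vec<_> = (0..world)
            .map(|rank| s.spawn(move || shard_grad_sum(batch, w, rank, world)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Stand-in for all-reduce-average: (sum over ranks) / world.
    let averaged = sums.iter().sum::<f64>() / world as f64;
    // pre_scale = 1/b_local turns Sigma_global/world into Sigma_global/B_global.
    averaged * (1.0 / b_local as f64)
}

/// Single-GPU reference: Sigma_global / B.
fn single_gpu_mean_grad(batch: &[(f64, f64)], w: f64) -> f64 {
    let sum: f64 = batch.iter().map(|&(x, y)| sample_grad(w, x, y)).sum();
    sum / batch.len() as f64
}
```

Calling `ddp_mean_grad(&batch, w, world)` for any `world` that divides the batch length should agree with `single_gpu_mean_grad(&batch, w)` up to floating-point rounding. The summation order differs between the two paths, so a tolerance-based comparison is the safer check in general.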
14 lines
399 B
TOML
[package]
name = "xtrain-distributed"
version.workspace = true
edition.workspace = true
license.workspace = true

[dependencies]
xtrain-cuda = { path = "../xtrain-cuda" }
xtrain-tensor = { path = "../xtrain-tensor" }
xtrain-autodiff = { path = "../xtrain-autodiff" }
xtrain-model = { path = "../xtrain-model" }
xtrain-optim = { path = "../xtrain-optim" }
xtrain-train = { path = "../xtrain-train" }