dist: nccl ffi + comm bootstrap
New crate xtrain-distributed (mirrors xserv-distributed): hand-written NCCL
FFI (GetUniqueId / CommInitRank / AllReduce / CommDestroy / Group{Start,End},
ncclUniqueId passed by value per the NCCL ABI) and a safe DdpContext wrapper —
rank 0 mints the UniqueId, every rank inits its communicator under a group, and
all_reduce_average_grads in-place AllReduce(sum)s each param's .grad() device
buffer then scales by 1/world (reuses T7's scale_inplace kernel). AllReduce runs
on the null stream so it orders with the model's kernels (no extra barrier).
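
The sum-then-scale step described above can be modelled on the host. This is only a sketch of the arithmetic, not the crate's code: `all_reduce_average_host` and the `per_rank_grads` layout are hypothetical names, plain `Vec<f32>` buffers stand in for the per-rank device buffers, and the loops stand in for the NCCL AllReduce(sum) and the scale_inplace kernel.

```rust
/// Host-side model of gradient averaging across ranks (illustrative only).
/// `per_rank_grads[r]` stands in for the gradient buffer held by rank `r`;
/// in the real crate these would be device buffers reduced by NCCL.
fn all_reduce_average_host(per_rank_grads: &mut [Vec<f32>]) {
    let world = per_rank_grads.len();
    if world == 0 {
        return;
    }
    let len = per_rank_grads[0].len();
    assert!(
        per_rank_grads.iter().all(|g| g.len() == len),
        "all ranks must hold equal-length buffers"
    );

    // Step 1: model AllReduce(sum). Every rank ends up seeing the same
    // element-wise sum over all ranks.
    let mut sum = vec![0.0f32; len];
    for grads in per_rank_grads.iter() {
        for (acc, g) in sum.iter_mut().zip(grads.iter()) {
            *acc += *g;
        }
    }

    // Step 2: write the sum back into each rank's buffer, scaled by 1/world,
    // so every rank holds the mean gradient.
    let inv_world = 1.0f32 / world as f32;
    for grads in per_rank_grads.iter_mut() {
        for (dst, s) in grads.iter_mut().zip(sum.iter()) {
            *dst = *s * inv_world;
        }
    }
}
```

With two ranks holding [1.0, 2.0] and [3.0, 6.0], both buffers end up as [2.0, 4.0]. Summing first and scaling afterwards means the reduction only needs the sum op and the existing scale kernel.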
build.rs follows the per-crate convention: no nvcc -> no_cuda cfg (crate
compiles to empty, cargo check passes host-side); with nvcc, links -lnccl
-lcudart like xserv-distributed's build.rs.
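
A rough sketch of the build.rs branching described above, under assumptions: the helper names (`nvcc_available`, `build_directives`, `emit_build_directives`) are hypothetical, and probing for nvcc by launching `nvcc --version` is one possible detection method, not necessarily the one the crate uses. Only the emitted directives (the no_cuda cfg, and linking nccl and cudart) come from the commit message.

```rust
use std::process::Command;

/// Returns true if an `nvcc` executable can be launched from PATH.
/// (Assumed detection method; the real build.rs may probe differently.)
fn nvcc_available() -> bool {
    Command::new("nvcc").arg("--version").output().is_ok()
}

/// Cargo directives for the two cases:
/// - no nvcc: set the `no_cuda` cfg so the crate can compile to empty;
/// - nvcc present: link against libnccl and libcudart.
fn build_directives(has_nvcc: bool) -> Vec<String> {
    if has_nvcc {
        vec![
            "cargo:rustc-link-lib=nccl".to_string(),
            "cargo:rustc-link-lib=cudart".to_string(),
        ]
    } else {
        vec!["cargo:rustc-cfg=no_cuda".to_string()]
    }
}

/// What a build script's `main` would do: print each directive on its own
/// line so Cargo picks it up.
fn emit_build_directives() {
    for line in build_directives(nvcc_available()) {
        println!("{}", line);
    }
}
```

In this sketch the crate root would carry something like `#![cfg(not(no_cuda))]` so that a host without nvcc still passes cargo check; that attribute placement is an assumption as well.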
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cargo.lock (generated, 9 lines changed)
@@ -205,6 +205,15 @@ dependencies = [
  "cc",
 ]
 
+[[package]]
+name = "xtrain-distributed"
+version = "0.1.0"
+dependencies = [
+ "xtrain-autodiff",
+ "xtrain-cuda",
+ "xtrain-tensor",
+]
+
 [[package]]
 name = "xtrain-model"
 version = "0.1.0"