Files
xtrain/docs
Gahow Wang c470c627a7 docs: Phase T17 — process-per-GPU DDP design
torchrun-style: launcher spawns N worker processes, each with its own CUDA
context; cross-process ncclUniqueId distributed via launcher-minted hex env
injection (race-free, no shared FS / TCP); train_rank + grad all-reduce reused
unchanged. Keeps thread-per-GPU path as regression baseline. ZeRO-1 dropped
(user scope decision).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 17:44:38 +08:00
..