torchrun-style: launcher spawns N worker processes, each with its own CUDA
context; cross-process ncclUniqueId distributed via launcher-minted hex env
injection (race-free, no shared FS / TCP); train_rank + grad all-reduce reused
unchanged. Keeps thread-per-GPU path as regression baseline. ZeRO-1 dropped
(user scope decision).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>