Replace the per-parameter eager all-reduce (~150 tiny serial NCCL calls
for dim512, DDP's dominant cost after T10's batched forward) with a
coalesced bucketed all-reduce: pack grads into a few large contiguous
scratch buffers, all-reduce each bucket once (fused via ncclGroupStart/
End), fold the 1/world average into one per-bucket scale, unpack back.
The packed buffer is the concatenation of the grad tensors, so NCCL's
element-wise sum over a bucket equals the per-tensor sums — bit-identical
to the un-bucketed path; only launch/latency overhead is removed. DDP
cross-rank param identity + loss-match are preserved.
Adds xtrain_cuda::device::copy_d2d (cudaMemcpy D2D) for the pack/unpack.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>