New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank per
thread, bound to one GPU), and in-place BF16 AllReduce on the null stream so
it orders naturally with the model's kernels. 2-GPU AllReduce test included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>