New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank per thread, bound to one GPU), and in-place BF16 AllReduce on the null stream so it orders naturally with the model's kernels. 2-GPU AllReduce test included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9 lines
165 B
TOML
[package]
name = "xserv-distributed"
version.workspace = true
edition.workspace = true

[dependencies]
xserv-cuda = { path = "../xserv-cuda" }
half.workspace = true
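
The commit message describes an in-place AllReduce with one rank per thread. As a rough illustration of those semantics only, the sketch below simulates a sum AllReduce on the CPU with standard-library threads. It does not call NCCL, does not touch a GPU or CUDA stream, and uses `f32` rather than BF16; the names `all_reduce_sum_in_place` and `run_all_reduce` are hypothetical and are not part of the xserv-distributed crate.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// Hypothetical CPU stand-in for an in-place sum AllReduce.
/// Every rank calls this with its own buffer; on return, each buffer
/// holds the element-wise sum of all ranks' inputs.
fn all_reduce_sum_in_place(buf: &mut [f32], shared: &Mutex<Vec<f32>>, barrier: &Barrier) {
    // Phase 1: add this rank's contribution to the shared accumulator.
    {
        let mut acc = shared.lock().unwrap();
        for (a, x) in acc.iter_mut().zip(buf.iter()) {
            *a += *x;
        }
    }
    // Wait until every rank has contributed.
    barrier.wait();
    // Phase 2: overwrite the local buffer with the reduced result (in place).
    let acc = shared.lock().unwrap();
    buf.copy_from_slice(&acc);
}

/// Spawns one thread per rank; rank r starts with a buffer filled with (r + 1).
/// Returns each rank's buffer after the reduction.
fn run_all_reduce(world_size: usize, len: usize) -> Vec<Vec<f32>> {
    let shared = Arc::new(Mutex::new(vec![0.0f32; len]));
    let barrier = Arc::new(Barrier::new(world_size));
    let handles: Vec<_> = (0..world_size)
        .map(|rank| {
            let shared = Arc::clone(&shared);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let mut buf = vec![(rank + 1) as f32; len];
                all_reduce_sum_in_place(&mut buf, &shared, &barrier);
                buf
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Two ranks, mirroring the 2-GPU test mentioned in the commit message.
    for (rank, buf) in run_all_reduce(2, 4).iter().enumerate() {
        println!("rank {rank}: {buf:?}");
    }
}
```

In the real crate the reduction would be performed by NCCL on device memory; the barrier here only stands in for the synchronization that the collective itself provides.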