distributed: NCCL tensor-parallel primitives (TpContext + AllReduce)
New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank
per thread, bound to one GPU), and in-place BF16 AllReduce on the null
stream so it orders naturally with the model's kernels. 2-GPU AllReduce
test included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
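
To make the semantics of the in-place sum AllReduce concrete, the following is a CPU-only sketch and not the crate's code: it calls neither NCCL nor CUDA. It assumes one thread per rank (mirroring the one-rank-per-thread layout described above), stores BF16 values as raw `u16` bits, and accumulates in `f32`. The names `f32_to_bf16`, `bf16_to_f32`, and `simulate_all_reduce` are hypothetical and exist only for this illustration.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// Convert an f32 to BF16 bits using round-to-nearest-even.
/// Assumption for this sketch: inputs are finite (NaN is not special-cased).
fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding) >> 16) as u16
}

/// Widen BF16 bits back to f32 (exact: BF16 is the top half of an f32).
fn bf16_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Hypothetical CPU stand-in for a sum AllReduce across `per_rank.len()` ranks.
/// Each rank runs on its own thread and gets its buffer overwritten with the
/// element-wise sum, which is what "in-place" means here.
/// All buffers are assumed to have the same length.
fn simulate_all_reduce(per_rank: Vec<Vec<u16>>) -> Vec<Vec<u16>> {
    let world = per_rank.len();
    let len = per_rank[0].len();
    let acc = Arc::new(Mutex::new(vec![0.0f32; len]));
    let barrier = Arc::new(Barrier::new(world));

    let handles: Vec<_> = per_rank
        .into_iter()
        .map(|mut buf| {
            let acc = Arc::clone(&acc);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                {
                    // Phase 1: add this rank's contribution to the shared sum.
                    let mut sum = acc.lock().unwrap();
                    for (s, &b) in sum.iter_mut().zip(buf.iter()) {
                        *s += bf16_to_f32(b);
                    }
                }
                // Wait until every rank has contributed.
                barrier.wait();
                // Phase 2: overwrite the local buffer with the reduced values.
                let sum = acc.lock().unwrap();
                for (b, &s) in buf.iter_mut().zip(sum.iter()) {
                    *b = f32_to_bf16(s);
                }
                buf
            })
        })
        .collect();

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

The barrier plays the role that stream ordering plays on the GPU: no rank reads the reduced values until every rank has finished writing its contribution. On real hardware the reduction order and intermediate precision are up to NCCL, so results for values that are not exactly representable in BF16 may differ from this sketch.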
@@ -7,6 +7,7 @@ members = [
     "crates/xserv-model",
     "crates/xserv-tokenizer",
     "crates/xserv-server",
+    "crates/xserv-distributed",
 ]
 
 [workspace.package]