distributed: NCCL tensor-parallel primitives (TpContext + AllReduce)

New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank per
thread, bound to one GPU), and in-place BF16 AllReduce on the null stream so
it orders naturally with the model's kernels. 2-GPU AllReduce test included.
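
For readers without NCCL at hand, the contract described above (one rank per
thread, in-place sum AllReduce) can be illustrated with a CPU-only sketch.
This is not the crate's code: SimTpContext, all_reduce_sum, and run_sim are
hypothetical names, f32 stands in for BF16, and a shared Mutex plus Barrier
replaces the NCCL communicator and the CUDA stream ordering.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

/// Hypothetical CPU stand-in for a per-rank tensor-parallel context.
/// The real TpContext wraps an NCCL communicator bound to one GPU.
struct SimTpContext {
    rank: usize,
    scratch: Arc<Mutex<Vec<f32>>>, // shared accumulator (assumption, not NCCL)
    barrier: Arc<Barrier>,         // stands in for collective synchronization
}

impl SimTpContext {
    /// In-place sum AllReduce: on return, every rank's `buf` holds the
    /// element-wise sum of all ranks' inputs.
    fn all_reduce_sum(&self, buf: &mut [f32]) {
        // Phase 1: each rank adds its contribution to the shared accumulator.
        {
            let mut acc = self.scratch.lock().unwrap();
            for (a, x) in acc.iter_mut().zip(buf.iter()) {
                *a += *x;
            }
        }
        self.barrier.wait(); // every contribution has been added

        // Phase 2: each rank overwrites its own buffer with the reduced result.
        {
            let acc = self.scratch.lock().unwrap();
            buf.copy_from_slice(acc.as_slice());
        }
        self.barrier.wait(); // every rank has read the result

        // Phase 3: rank 0 clears the accumulator so the context can be reused.
        if self.rank == 0 {
            self.scratch.lock().unwrap().fill(0.0);
        }
        self.barrier.wait();
    }
}

/// Spawns one thread per rank; rank r starts with a buffer filled with r + 1.
/// Returns each rank's buffer after the AllReduce, ordered by rank.
fn run_sim(world_size: usize, len: usize) -> Vec<Vec<f32>> {
    let scratch = Arc::new(Mutex::new(vec![0.0f32; len]));
    let barrier = Arc::new(Barrier::new(world_size));
    let handles: Vec<_> = (0..world_size)
        .map(|rank| {
            let ctx = SimTpContext {
                rank,
                scratch: Arc::clone(&scratch),
                barrier: Arc::clone(&barrier),
            };
            thread::spawn(move || {
                let mut buf = vec![(rank + 1) as f32; len];
                ctx.all_reduce_sum(&mut buf);
                buf
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Two simulated ranks, mirroring the 2-GPU test mentioned above.
    let out = run_sim(2, 4);
    println!("{:?}", out);
}
```

With two ranks, each buffer ends up holding the sum of both inputs, which is
the property a 2-GPU AllReduce test would check on device memory.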

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:14 +08:00
parent 76fffb3b68
commit 453520d622
6 changed files with 234 additions and 0 deletions

@@ -7,6 +7,7 @@ members = [
"crates/xserv-model",
"crates/xserv-tokenizer",
"crates/xserv-server",
+"crates/xserv-distributed",
]
[workspace.package]