Device caching/pool allocator removes the per-op cudaMalloc serialization that
was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX
5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s):
single-GPU: 40226 -> 92638 tok/s (~2.3x)
DDP scaling (global batch 32*world):
world before after
1 39801 1.00x 92385 1.00x
2 47229 1.19x 146821 1.59x
4 52854 1.33x 269867 2.92x
8 48996 1.23x 461270 4.99x
8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs
at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are
bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export
of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it.
Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the
design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the
~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever
if v4 needs higher linearity.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xtrain
A from-scratch Rust + CUDA LLM training engine — the sibling of xserv (the inference side). GPU-first.
The goal is to learn the full training-systems stack by hand: autograd / backward passes / optimizers (AdamW) / the training loop / distributed logic. Heavy lifting is borrowed where it makes sense (GEMM → cuBLAS after a hand-written version, multi-GPU comms → NCCL, tokenizer → reused from xserv), but the core is written from scratch. The target architecture is a tiny modern transformer (RoPE + RMSNorm + SwiGLU, ~1–30M params) whose forward aligns with xserv's Qwen3, so the backward passes map one-to-one onto xserv's existing forward kernels and trained weights can flow back into xserv.
Status
Bootstrapping (P0). This repo currently contains only the project skeleton and a working Rust↔CUDA build chain, verified by a trivial vector-add CUDA kernel.
Layout
xtrain/
├── Cargo.toml # workspace
├── csrc/ # CUDA sources (.cu)
│ └── test/vecadd.cu # trivial element-wise vector-add (smoke test)
└── crates/
└── xtrain-cuda/ # CUDA Runtime FFI + build.rs (nvcc → sm_120)
├── build.rs # compiles csrc/*.cu via the `cc` crate, links cudart
├── src/ # ffi / error / device / memory
└── tests/ # vecadd smoke test
The build mirrors xserv's approach: build.rs invokes nvcc (via the cc crate)
to compile csrc/*.cu targeting sm_120 (RTX 5090) and links them into the Rust
crate over hand-written extern "C" FFI.
Building & testing
CUDA compilation and execution happen on a GPU box (dash5, 8× RTX 5090, sm_120):
export PATH=/usr/local/cuda/bin:$HOME/.cargo/bin:$PATH
cargo build
cargo test -p xtrain-cuda -- --nocapture # runs the vecadd smoke test
On a machine without nvcc/GPU, build.rs detects the missing toolchain, skips
CUDA compilation, and sets a no_cuda cfg — so host-side cargo check still
works (the GPU smoke test is compiled out).