# xtrain

A from-scratch **Rust + CUDA** LLM **training** engine, the sibling of [xserv](https://github.com/) (the inference side). GPU-first.

The goal is to learn the full training-systems stack by hand: autograd / backward passes / optimizers (AdamW) / the training loop / distributed logic. Heavy lifting is borrowed where it makes sense (GEMM → cuBLAS after a hand-written version, multi-GPU comms → NCCL, tokenizer → reused from xserv), but the core is written from scratch.

The target architecture is a tiny modern transformer (RoPE + RMSNorm + SwiGLU, ~1–30M params) whose forward aligns with xserv's Qwen3, so the backward passes map one-to-one onto xserv's existing forward kernels and trained weights can flow back into xserv.

## Status

Bootstrapping (P0). This repo currently contains only the project skeleton and a working Rust↔CUDA build chain, verified by a trivial vector-add CUDA kernel.

## Layout

```
xtrain/
├── Cargo.toml                  # workspace
├── csrc/                       # CUDA sources (.cu)
│   └── test/vecadd.cu          # trivial element-wise vector-add (smoke test)
└── crates/
    └── xtrain-cuda/            # CUDA Runtime FFI + build.rs (nvcc → sm_120)
        ├── build.rs            # compiles csrc/*.cu via the `cc` crate, links cudart
        ├── src/                # ffi / error / device / memory
        └── tests/              # vecadd smoke test
```

The build mirrors xserv's approach: `build.rs` invokes `nvcc` (via the `cc` crate) to compile `csrc/*.cu` targeting `sm_120` (RTX 5090) and links them into the Rust crate over hand-written `extern "C"` FFI.

## Building & testing

CUDA compilation and execution happen on a GPU box (dash5, 8× RTX 5090, sm_120):

```sh
export PATH=/usr/local/cuda/bin:$HOME/.cargo/bin:$PATH
cargo build
cargo test -p xtrain-cuda -- --nocapture   # runs the vecadd smoke test
```

On a machine without `nvcc`/GPU, `build.rs` detects the missing toolchain, skips CUDA compilation, and sets a `no_cuda` cfg, so host-side `cargo check` still works (the GPU smoke test is compiled out).
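
The `no_cuda` fallback described above could look roughly like the following `build.rs` sketch. It is an illustration, not the repo's actual build script: the helper names `nvcc_available` and `cargo_directives` are hypothetical, and probing with `nvcc --version` is an assumed detection method. The real script would also compile `csrc/*.cu` through the `cc` crate at the marked point, which is left out here to keep the sketch dependency-free.

```rust
// build.rs (sketch): probe for nvcc and fall back to a `no_cuda` cfg.
use std::process::Command;

/// Hypothetical helper: true if `nvcc --version` can be spawned and exits successfully.
fn nvcc_available() -> bool {
    Command::new("nvcc")
        .arg("--version")
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// Hypothetical helper: the Cargo build-script directives to print
/// for a given toolchain state.
fn cargo_directives(has_nvcc: bool) -> Vec<String> {
    // Re-run the build script whenever anything under csrc/ changes.
    let mut out = vec!["cargo:rerun-if-changed=csrc".to_string()];
    if has_nvcc {
        // The real build would compile csrc/*.cu here (assumed: via the
        // `cc` crate in CUDA mode) before linking the CUDA runtime.
        out.push("cargo:rustc-link-lib=cudart".to_string());
    } else {
        // No toolchain: skip compilation and let Rust code gate on the cfg.
        out.push("cargo:rustc-cfg=no_cuda".to_string());
    }
    out
}

fn main() {
    for line in cargo_directives(nvcc_available()) {
        println!("{line}");
    }
}
```

With a script along these lines, GPU-only items such as the smoke test can be gated with `#[cfg(not(no_cuda))]`, which is what lets `cargo check` pass on a host without CUDA.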