Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu via the cc crate targeting sm_120 (RTX 5090) and links cudart. A gated integration test runs the vector-add kernel on the GPU and asserts the result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA compilation and sets a `no_cuda` cfg so host-side cargo check still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
51 lines
2.1 KiB
Markdown
51 lines
2.1 KiB
Markdown
# xtrain
|
||
|
||
A from-scratch **Rust + CUDA** LLM **training** engine — the sibling of
|
||
[xserv](https://github.com/) (the inference side). GPU-first.
|
||
|
||
The goal is to learn the full training-systems stack by hand: autograd / backward
|
||
passes / optimizers (AdamW) / the training loop / distributed logic. Heavy lifting
|
||
is borrowed where it makes sense (GEMM → cuBLAS after a hand-written version,
|
||
multi-GPU comms → NCCL, tokenizer → reused from xserv), but the core is written
|
||
from scratch. The target architecture is a tiny modern transformer
|
||
(RoPE + RMSNorm + SwiGLU, ~1–30M params) whose forward aligns with xserv's Qwen3,
|
||
so the backward passes map one-to-one onto xserv's existing forward kernels and
|
||
trained weights can flow back into xserv.
|
||
|
||
## Status
|
||
|
||
Bootstrapping (P0). This repo currently contains only the project skeleton and a
|
||
working Rust↔CUDA build chain, verified by a trivial vector-add CUDA kernel.
|
||
|
||
## Layout
|
||
|
||
```
|
||
xtrain/
|
||
├── Cargo.toml # workspace
|
||
├── csrc/ # CUDA sources (.cu)
|
||
│ └── test/vecadd.cu # trivial element-wise vector-add (smoke test)
|
||
└── crates/
|
||
└── xtrain-cuda/ # CUDA Runtime FFI + build.rs (nvcc → sm_120)
|
||
├── build.rs # compiles csrc/*.cu via the `cc` crate, links cudart
|
||
├── src/ # ffi / error / device / memory
|
||
└── tests/ # vecadd smoke test
|
||
```
|
||
|
||
The build mirrors xserv's approach: `build.rs` invokes `nvcc` (via the `cc` crate)
|
||
to compile `csrc/*.cu` targeting `sm_120` (RTX 5090) and links them into the Rust
|
||
crate over hand-written `extern "C"` FFI.
|
||
|
||
## Building & testing
|
||
|
||
CUDA compilation and execution happen on a GPU box (dash5, 8× RTX 5090, sm_120):
|
||
|
||
```sh
|
||
export PATH=/usr/local/cuda/bin:$HOME/.cargo/bin:$PATH
|
||
cargo build
|
||
cargo test -p xtrain-cuda -- --nocapture # runs the vecadd smoke test
|
||
```
|
||
|
||
On a machine without `nvcc`/GPU, `build.rs` detects the missing toolchain, skips
|
||
CUDA compilation, and sets a `no_cuda` cfg — so host-side `cargo check` still
|
||
works (the GPU smoke test is compiled out).
|