Gahow Wang 92acf9f413 T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)
Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's
csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA
Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu
via the cc crate targeting sm_120 (RTX 5090) and links cudart.

A gated integration test runs the vector-add kernel on the GPU and asserts the
result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA
compilation and sets a `no_cuda` cfg so host-side cargo check still works.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:42:43 +08:00


# xtrain
A from-scratch **Rust + CUDA** LLM **training** engine — the sibling of
[xserv](https://github.com/) (the inference side). GPU-first.
The goal is to learn the full training-systems stack by hand: autograd, backward
passes, optimizers (AdamW), the training loop, and distributed logic. Heavy lifting
is borrowed where it makes sense (GEMM → cuBLAS after a hand-written version,
multi-GPU comms → NCCL, tokenizer → reused from xserv), but the core is written
from scratch. The target architecture is a tiny modern transformer
(RoPE + RMSNorm + SwiGLU, ~130M params) whose forward aligns with xserv's Qwen3,
so the backward passes map one-to-one onto xserv's existing forward kernels and
trained weights can flow back into xserv.
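RMSNorm from the target architecture illustrates what "backward maps one-to-one onto forward" means in practice. The sketch below is a CPU reference in Rust, not code from this repo. The forward is the standard `y_i = x_i · g_i / sqrt(mean(x²) + eps)`, the backward gradients are derived by hand, and both are checked against central finite differences. Function names and test values are illustrative only. A reference like this could serve as the oracle that a CUDA backward kernel is tested against.

```rust
/// RMSNorm forward: y_i = x_i * r * g_i, with r = 1 / sqrt(mean(x^2) + eps).
fn rmsnorm_forward(x: &[f64], g: &[f64], eps: f64) -> Vec<f64> {
    let n = x.len() as f64;
    let ms = x.iter().map(|v| v * v).sum::<f64>() / n;
    let r = 1.0 / (ms + eps).sqrt();
    x.iter().zip(g).map(|(xi, gi)| xi * r * gi).collect()
}

/// RMSNorm backward for upstream gradient dy; returns (dx, dg).
///   dx_i = r * g_i * dy_i - (r^3 * x_i / n) * sum_j(dy_j * g_j * x_j)
///   dg_i = dy_i * x_i * r
fn rmsnorm_backward(x: &[f64], g: &[f64], dy: &[f64], eps: f64) -> (Vec<f64>, Vec<f64>) {
    let n = x.len() as f64;
    let ms = x.iter().map(|v| v * v).sum::<f64>() / n;
    // Recomputed here; a GPU kernel might instead cache r from the forward pass.
    let r = 1.0 / (ms + eps).sqrt();
    // Reduction shared by every dx_i.
    let s: f64 = (0..x.len()).map(|j| dy[j] * g[j] * x[j]).sum();
    let dx: Vec<f64> = (0..x.len())
        .map(|i| r * g[i] * dy[i] - r.powi(3) * x[i] * s / n)
        .collect();
    let dg: Vec<f64> = (0..x.len()).map(|i| dy[i] * x[i] * r).collect();
    (dx, dg)
}

/// Scalar loss L = sum_j dy_j * y_j, chosen so that dL/dy = dy exactly.
fn loss(x: &[f64], g: &[f64], dy: &[f64], eps: f64) -> f64 {
    rmsnorm_forward(x, g, eps).iter().zip(dy).map(|(y, d)| y * d).sum()
}

fn main() {
    let x: [f64; 4] = [0.5, -1.25, 2.0, 0.75];
    let g: [f64; 4] = [1.0, 0.9, 1.1, 0.8];
    let dy: [f64; 4] = [0.3, -0.7, 0.2, 1.0];
    let (eps, h): (f64, f64) = (1e-6, 1e-6);
    let (dx, dg) = rmsnorm_backward(&x, &g, &dy, eps);
    for i in 0..x.len() {
        // Central differences on x_i and g_i.
        let (mut xp, mut xm, mut gp, mut gm) = (x, x, g, g);
        xp[i] += h;
        xm[i] -= h;
        gp[i] += h;
        gm[i] -= h;
        let num_dx = (loss(&xp, &g, &dy, eps) - loss(&xm, &g, &dy, eps)) / (2.0 * h);
        let num_dg = (loss(&x, &gp, &dy, eps) - loss(&x, &gm, &dy, eps)) / (2.0 * h);
        assert!((num_dx - dx[i]).abs() < 1e-6, "dx[{i}]: {} vs {num_dx}", dx[i]);
        assert!((num_dg - dg[i]).abs() < 1e-6, "dg[{i}]: {} vs {num_dg}", dg[i]);
    }
    println!("RMSNorm backward matches finite differences");
}
```

Note that `dx` needs the full-row reduction `s`, so on the GPU the backward has the same per-row reduction shape as the forward's mean-of-squares.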
## Status
Bootstrapping (P0). This repo currently contains only the project skeleton and a
working Rust↔CUDA build chain, verified by a trivial vector-add CUDA kernel.
## Layout
```
xtrain/
├── Cargo.toml              # workspace
├── csrc/                   # CUDA sources (.cu)
│   └── test/vecadd.cu      # trivial element-wise vector-add (smoke test)
└── crates/
    └── xtrain-cuda/        # CUDA Runtime FFI + build.rs (nvcc → sm_120)
        ├── build.rs        # compiles csrc/*.cu via the `cc` crate, links cudart
        ├── src/            # ffi / error / device / memory
        └── tests/          # vecadd smoke test
```
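One plausible shape for the `ffi` and `error` modules, assuming raw `cudaError_t` codes are carried as `c_int`: declare the CUDA Runtime entry points by hand and funnel every return code through a single `check` helper. The declared functions and `cudaMemcpyKind` values are real CUDA Runtime API; `CudaError`, `check`, and the constant names are hypothetical and not necessarily what the crate uses. There is deliberately no `#[link]` attribute, on the assumption that `build.rs` requests `cudart`. `unsafe extern "C"` needs Rust 1.82+ (and is required in the 2024 edition).

```rust
use std::ffi::{c_char, c_int, c_void};
use std::fmt;

// Hand-written CUDA Runtime declarations. cudaError_t and cudaMemcpyKind are
// C enums, represented here as c_int. Declared but unused, so no link is needed.
#[allow(dead_code)]
unsafe extern "C" {
    fn cudaMalloc(dev_ptr: *mut *mut c_void, size: usize) -> c_int;
    fn cudaFree(dev_ptr: *mut c_void) -> c_int;
    fn cudaMemcpy(dst: *mut c_void, src: *const c_void, count: usize, kind: c_int) -> c_int;
    fn cudaDeviceSynchronize() -> c_int;
    fn cudaGetErrorString(error: c_int) -> *const c_char;
}

// cudaMemcpyKind values from the CUDA Runtime headers.
#[allow(dead_code)]
const CUDA_MEMCPY_HOST_TO_DEVICE: c_int = 1;
#[allow(dead_code)]
const CUDA_MEMCPY_DEVICE_TO_HOST: c_int = 2;

/// A non-zero cudaError_t returned by a CUDA Runtime call.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CudaError(pub c_int);

impl fmt::Display for CudaError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // A fuller version could fetch the message via cudaGetErrorString.
        write!(f, "CUDA runtime error {}", self.0)
    }
}

impl std::error::Error for CudaError {}

/// Map a raw cudaError_t to a Result; cudaSuccess is 0.
pub fn check(code: c_int) -> Result<(), CudaError> {
    if code == 0 { Ok(()) } else { Err(CudaError(code)) }
}

// Hypothetical safe wrapper on top (requires cudart at link time):
//   pub fn synchronize() -> Result<(), CudaError> {
//       check(unsafe { cudaDeviceSynchronize() })
//   }

fn main() {
    assert!(check(0).is_ok());
    // 2 is cudaErrorMemoryAllocation.
    match check(2) {
        Err(e) => println!("{e}"), // prints "CUDA runtime error 2"
        Ok(()) => unreachable!(),
    }
}
```

Keeping every `unsafe` FFI call behind `check` means the rest of the crate only ever sees `Result<_, CudaError>`.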
The build mirrors xserv's approach: `build.rs` invokes `nvcc` (via the `cc` crate)
to compile `csrc/*.cu` for `sm_120` (RTX 5090), links the resulting objects and
`cudart` into the Rust crate, and the Rust side calls the kernels through
hand-written `extern "C"` FFI.
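The Cargo side of that build can be sketched with the standard library alone. This is a minimal sketch, not xtrain's actual `build.rs`: the helper name, the source path, and the assumption that `cudart` lives in `/usr/local/cuda/lib64` (matching the `/usr/local/cuda/bin` toolkit used below) are all illustrative. The `cc` invocation that would precede it is shown only as a comment, since `cc` is an external crate.

```rust
/// Cargo directives a build.rs could print after compiling the CUDA sources:
/// rebuild when any .cu file changes, then link the CUDA Runtime (cudart).
/// Uses the double-colon `cargo::` syntax (Cargo 1.77+).
fn cuda_link_directives(cu_sources: &[&str], cuda_lib_dir: &str) -> Vec<String> {
    let mut out: Vec<String> = cu_sources
        .iter()
        .map(|src| format!("cargo::rerun-if-changed={src}"))
        .collect();
    out.push(format!("cargo::rustc-link-search=native={cuda_lib_dir}"));
    out.push("cargo::rustc-link-lib=cudart".to_string());
    out
}

fn main() {
    // Hypothetical source list, relative to crates/xtrain-cuda (Cargo runs
    // build.rs with the package root as its working directory).
    let sources = ["../../csrc/test/vecadd.cu"];
    // Before printing these, a real build.rs would compile the sources with
    // the cc crate, for example:
    //   cc::Build::new().cuda(true).flag("-arch=sm_120")
    //       .files(&sources).compile("xtrain_kernels");
    // compile() emits its own directives for the resulting static library.
    for d in cuda_link_directives(&sources, "/usr/local/cuda/lib64") {
        println!("{d}");
    }
}
```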
## Building & testing
CUDA compilation and execution happen on a GPU box (dash5, 8× RTX 5090, sm_120):
```sh
export PATH=/usr/local/cuda/bin:$HOME/.cargo/bin:$PATH
cargo build
cargo test -p xtrain-cuda -- --nocapture # runs the vecadd smoke test
```
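What the smoke test ultimately asserts is a host-side comparison. A minimal sketch of that check, with the kernel launch and device-to-host copy left out: the `out` vector in `main` is a CPU stand-in, not real GPU output, and `first_mismatch` is a hypothetical helper name.

```rust
/// Compare a device result (already copied back to the host) against a CPU
/// reference. Returns the index of the first mismatch, or None if all match.
fn first_mismatch(a: &[f32], b: &[f32], out: &[f32]) -> Option<usize> {
    assert!(a.len() == b.len() && b.len() == out.len(), "length mismatch");
    // A single f32 add is correctly rounded on both host and device, so an
    // exact comparison is expected to hold (absent flags such as
    // --use_fast_math, which flush denormals to zero).
    (0..a.len()).find(|&i| a[i] + b[i] != out[i])
}

fn main() {
    let n = 1 << 20;
    let a: Vec<f32> = (0..n).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..n).map(|i| 2.0 * i as f32).collect();
    // Stand-in for the buffer the test would copy back from the GPU.
    let out: Vec<f32> = a.iter().zip(&b).map(|(x, y)| x + y).collect();
    assert_eq!(first_mismatch(&a, &b, &out), None);
    println!("vecadd OK ({n} elements)");
}
```

Reporting the first bad index (rather than a bare boolean) makes an indexing or grid-size bug in the kernel easy to spot in the `--nocapture` output.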
On a machine without `nvcc`/GPU, `build.rs` detects the missing toolchain, skips
CUDA compilation, and sets a `no_cuda` cfg — so host-side `cargo check` still
works (the GPU smoke test is compiled out).
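The toolchain detection can be done with nothing but `std::process::Command`. A minimal sketch, assuming nvcc counts as present when `nvcc --version` runs and exits successfully; this is not necessarily the check xtrain's `build.rs` performs, and the helper names are hypothetical. It also declares `no_cuda` via `rustc-check-cfg` so Rust 1.80+ does not flag it as an unexpected cfg. On the consuming side, the smoke test can be compiled out with `#![cfg(not(no_cuda))]` at the top of the test file.

```rust
use std::process::Command;

/// Treat nvcc as available if `nvcc --version` runs and exits successfully.
/// A missing binary makes output() return Err, which maps to false.
fn nvcc_available() -> bool {
    Command::new("nvcc")
        .arg("--version")
        .output()
        .map(|o| o.status.success())
        .unwrap_or(false)
}

/// Cargo directives for toolchain gating (double-colon syntax;
/// rustc-check-cfg needs Cargo 1.80+).
fn gating_directives(has_nvcc: bool) -> Vec<&'static str> {
    // Always declare the cfg so `#[cfg(no_cuda)]` is not an "unexpected cfg".
    let mut d = vec!["cargo::rustc-check-cfg=cfg(no_cuda)"];
    if !has_nvcc {
        d.push("cargo::warning=nvcc not found; skipping CUDA compilation");
        d.push("cargo::rustc-cfg=no_cuda");
    }
    d
}

fn main() {
    let has_nvcc = nvcc_available();
    for line in gating_directives(has_nvcc) {
        println!("{line}");
    }
    if !has_nvcc {
        return; // host-only build: skip the cc/nvcc compile step entirely
    }
    // CUDA branch: compile csrc/*.cu with the cc crate and link cudart.
}
```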