
Gahow Wang 7090b475fb train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch)

The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch
(scaling-ladder rung), the cached token-id stream (`load_cached`), held-out
val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 owns the
val corpus and runs the no-grad eval / writes the best checkpoint (params are
bit-identical across ranks). The eval/checkpoint logic is reused from
`xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated.

- DdpConfig gains eval_every / eval_batches / ckpt_path.
- train_rank takes `valid: Option<&Corpus>` and returns DdpResult
  (losses + evals + best_val); launch threads the val corpus to rank 0 only.
- bin/train_ddp reworked to the bin/train CLI (positional tokenizer/corpus +
  --dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/
  --val-tokens/--eval-every/--ckpt), reusing the u16 cache.
- DDP correctness test updated to the new signatures (semantics unchanged).
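For readers unfamiliar with the "LR warmup→cosine" schedule mentioned above, the following is a minimal sketch of one common formulation: a linear ramp up to the peak rate, then a half-cosine decay. It is not taken from the xtrain source. The function name `lr_at` and the parameters `warmup_steps` and `min_lr` are assumptions made for illustration; only a peak learning rate (`--max-lr`) and a step count (`--steps`) are confirmed by the CLI flags listed in this commit.

```rust
/// Learning rate for a 0-based `step` under linear warmup followed by cosine decay.
/// Hypothetical helper: names and exact shape are illustrative, not xtrain's API.
pub fn lr_at(
    step: usize,
    total_steps: usize,
    warmup_steps: usize,
    max_lr: f32,
    min_lr: f32,
) -> f32 {
    // Warmup: ramp linearly so the last warmup step reaches max_lr.
    if step < warmup_steps {
        return max_lr * (step + 1) as f32 / warmup_steps as f32;
    }
    // Past the end of the schedule, hold at the floor.
    if step >= total_steps {
        return min_lr;
    }
    // Cosine decay from max_lr (progress = 0) toward min_lr (progress = 1).
    let decay_steps = total_steps.saturating_sub(warmup_steps).max(1);
    let progress = (step - warmup_steps) as f32 / decay_steps as f32;
    let cosine = 0.5 * (1.0 + (std::f32::consts::PI * progress).cos());
    min_lr + (max_lr - min_lr) * cosine
}
```

A training loop would call something like `lr_at(step, steps, warmup, max_lr, min_lr)` once per step and pass the result to the optimizer. In a DDP setting every rank would compute the same value, since the schedule depends only on the step index.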

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 19:34:40 +08:00

crates

train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch)

2026-06-15 19:34:40 +08:00

csrc

perf: GPU AdamW + grad-norm

2026-06-15 16:53:09 +08:00

data

data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip

2026-06-15 16:29:32 +08:00

docs

docs: run v1 — TinyStories full, dim256

2026-06-15 19:09:46 +08:00

.gitignore

data: full TinyStories + tokenized-id cache, val loss, CLI arch

2026-06-15 18:34:48 +08:00

Cargo.lock

export: safetensors + config.json for xserv qwen3

2026-06-15 17:33:26 +08:00

Cargo.toml

dist: nccl ffi + comm bootstrap

2026-06-15 17:14:56 +08:00

README.md

T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)

2026-06-15 14:42:43 +08:00

README.md

xtrain

A from-scratch Rust + CUDA LLM training engine — the sibling of xserv (the inference side). GPU-first.

The goal is to learn the full training-systems stack by hand: autograd / backward passes / optimizers (AdamW) / the training loop / distributed logic. Heavy lifting is borrowed where it makes sense (GEMM → cuBLAS after a hand-written version, multi-GPU comms → NCCL, tokenizer → reused from xserv), but the core is written from scratch. The target architecture is a tiny modern transformer (RoPE + RMSNorm + SwiGLU, ~1–30M params) whose forward aligns with xserv's Qwen3, so the backward passes map one-to-one onto xserv's existing forward kernels and trained weights can flow back into xserv.

Status

Bootstrapping (P0). This repo currently contains only the project skeleton and a working Rust↔CUDA build chain, verified by a trivial vector-add CUDA kernel.

Layout

xtrain/
├── Cargo.toml              # workspace
├── csrc/                   # CUDA sources (.cu)
│   └── test/vecadd.cu      # trivial element-wise vector-add (smoke test)
└── crates/
    └── xtrain-cuda/        # CUDA Runtime FFI + build.rs (nvcc → sm_120)
        ├── build.rs        # compiles csrc/*.cu via the `cc` crate, links cudart
        ├── src/            # ffi / error / device / memory
        └── tests/          # vecadd smoke test

The build mirrors xserv's approach: build.rs invokes nvcc (via the cc crate) to compile csrc/*.cu targeting sm_120 (RTX 5090) and links them into the Rust crate over hand-written extern "C" FFI.

Building & testing

CUDA compilation and execution happen on a GPU box (dash5, 8× RTX 5090, sm_120):

export PATH=/usr/local/cuda/bin:$HOME/.cargo/bin:$PATH
cargo build
cargo test -p xtrain-cuda -- --nocapture   # runs the vecadd smoke test

On a machine without nvcc/GPU, build.rs detects the missing toolchain, skips CUDA compilation, and sets a no_cuda cfg — so host-side cargo check still works (the GPU smoke test is compiled out).
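The fallback described above can be sketched with the standard library alone. This is an illustrative approximation, not the actual contents of xtrain's build.rs: the helper names `find_nvcc` and `cargo_directives` are hypothetical, detection by scanning `PATH` is an assumption, and the real script additionally compiles `csrc/*.cu` through the `cc` crate, which is omitted here.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Return the first `nvcc` executable found in the directories of `path_var`
/// (a PATH-style string). Hypothetical helper for illustration.
fn find_nvcc(path_var: &str) -> Option<PathBuf> {
    let exe = if cfg!(windows) { "nvcc.exe" } else { "nvcc" };
    env::split_paths(path_var)
        .map(|dir| dir.join(exe))
        .find(|candidate| candidate.is_file())
}

/// Cargo directives a build script could print for each case:
/// link the CUDA runtime when nvcc exists, otherwise set the `no_cuda` cfg.
fn cargo_directives(nvcc: Option<&Path>) -> Vec<String> {
    match nvcc {
        Some(_) => vec!["cargo:rustc-link-lib=cudart".to_string()],
        None => vec!["cargo:rustc-cfg=no_cuda".to_string()],
    }
}
```

A build script's `main` would read `PATH` with `env::var`, call `find_nvcc`, and `println!` each returned directive. GPU-only code and tests can then be gated with `#[cfg(not(no_cuda))]`, which is how a smoke test ends up compiled out on a host without the toolchain.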
