Gahow Wang 511f35d40c docs: run v8 — dim1024 capacity helps (val 2.98)
v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale
dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute.
8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h.

Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 /
v8 2.9801. Capacity helps — v8 (1.05ep) beats v6 at the same ~1ep by 0.085
AND beats v7 (smaller model, 1.45ep more old data) by 0.035 ⇒ v6/v7 were
partly capacity-limited, scaling capacity > repeating old data. But the gain
is only ~3% (same magnitude as the data-axis single-step lever), and v8's
val was still descending at the end (not saturated).

Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6,
capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress,
scale capacity AND data together (Chinchilla, reproduced at toy scale).

- docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples
- docs/runs/README.md: +v8 row, v9 proposal
- docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 15:12:01 +08:00
2026-06-15 17:14:56 +08:00

xtrain

A from-scratch Rust + CUDA LLM training engine — the sibling of xserv (the inference side). GPU-first.

The goal is to learn the full training-systems stack by hand: autograd / backward passes / optimizers (AdamW) / the training loop / distributed logic. Heavy lifting is borrowed where it makes sense (GEMM → cuBLAS after a hand-written version, multi-GPU comms → NCCL, tokenizer → reused from xserv), but the core is written from scratch. The target architecture is a tiny modern transformer (RoPE + RMSNorm + SwiGLU, ~130M params) whose forward aligns with xserv's Qwen3, so the backward passes map one-to-one onto xserv's existing forward kernels and trained weights can flow back into xserv.

Status

Bootstrapping (P0). This repo currently contains only the project skeleton and a working Rust↔CUDA build chain, verified by a trivial vector-add CUDA kernel.

Layout

xtrain/
├── Cargo.toml              # workspace
├── csrc/                   # CUDA sources (.cu)
│   └── test/vecadd.cu      # trivial element-wise vector-add (smoke test)
└── crates/
    └── xtrain-cuda/        # CUDA Runtime FFI + build.rs (nvcc → sm_120)
        ├── build.rs        # compiles csrc/*.cu via the `cc` crate, links cudart
        ├── src/            # ffi / error / device / memory
        └── tests/          # vecadd smoke test

The build mirrors xserv's approach: build.rs invokes nvcc (via the cc crate) to compile csrc/*.cu targeting sm_120 (RTX 5090) and links them into the Rust crate over hand-written extern "C" FFI.

Building & testing

CUDA compilation and execution happen on a GPU box (dash5, 8× RTX 5090, sm_120):

export PATH=/usr/local/cuda/bin:$HOME/.cargo/bin:$PATH
cargo build
cargo test -p xtrain-cuda -- --nocapture   # runs the vecadd smoke test

On a machine without nvcc/GPU, build.rs detects the missing toolchain, skips CUDA compilation, and sets a no_cuda cfg — so host-side cargo check still works (the GPU smoke test is compiled out).

Description
No description provided
Readme 3.1 MiB
Languages
Rust 87.6%
Cuda 8.7%
Python 2.2%
Shell 1.5%