xtrain

Author	SHA1	Message	Date
Gahow Wang	e27df50ca9	dist: nccl ffi + comm bootstrap New crate xtrain-distributed (mirrors xserv-distributed): hand-written NCCL FFI (GetUniqueId / CommInitRank / AllReduce / CommDestroy / Group{Start,End}, ncclUniqueId passed by value per the NCCL ABI) and a safe DdpContext wrapper — rank 0 mints the UniqueId, every rank inits its communicator under a group, and all_reduce_average_grads in-place AllReduce(sum)s each param's .grad() device buffer then scales by 1/world (reuses T7's scale_inplace kernel). AllReduce runs on the null stream so it orders with the model's kernels (no extra barrier). build.rs follows the per-crate convention: no nvcc -> no_cuda cfg (crate compiles to empty, cargo check passes host-side); with nvcc, links -lnccl -lcudart like xserv-distributed's build.rs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:14:56 +08:00
Gahow Wang	7d84a64f5c	data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip New xtrain-train crate scaffold. Data pipeline reuses xserv's from-scratch GPT-2/Qwen BPE via a path-dep (../../../xserv/crates/xserv-tokenizer, resolves on both ~/projects and dash5 /opt/wjh/projects): Corpus::load tokenizes the corpus into one id stream and samples fixed-length (input, target) next-token windows (LCG-seeded, reproducible). Trims a range-downloaded file to whole stories (<\|endoftext\|> boundaries). Also the host-only training math: LrSchedule (linear warmup + cosine decay) and global L2 grad-norm + clip scale, each with a local unit test. Corpus: data/tinystories-valid-3mb.txt — first ~3MB of TinyStories-valid (fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution noted: a real TinyStories subset, not the full set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:29:32 +08:00
Gahow Wang	f22429f5b8	optim: hand-written AdamW (decoupled weight decay + bias correction) New xtrain-optim crate. AdamW with per-param m/v moments keyed by params() index, global bias correction, and decoupled weight decay (matches torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers, unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr argument leaves room for an LR schedule. Host unit test checks the update against an independent reference recurrence over 20 steps and the pure-decay (g=0) boundary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:28:23 +08:00
Gahow Wang	e3912c2380	model: tiny RoPE+RMSNorm+SwiGLU transformer + overfit test New crate xtrain-model: a from-scratch decoder built entirely from the autodiff op set. - Config (tiny: dim=32, 2 layers, 2 heads, head_dim=16, ffn=64). - TinyTransformer: embedding -> N x {pre-RMSNorm -> multi-head causal attention (RoPE, additive causal mask, per-head SDPA) -> residual; pre-RMSNorm -> SwiGLU MLP -> residual} -> final RMSNorm -> LM head. x@W weight convention (engine GEMM is plain A@B); dim=n_headshead_dim. - params()/zero_grad-able leaves for the optimizer; param_to_host export. - overfit test: char-level bring-up (embedded text -> vocab -> shifted targets), minimal hand-written GD (p -= lrgrad) memorises one fixed batch -> loss ~0 + greedy argmax matches targets. End-to-end fwd+bwd correctness signal. Gated #![cfg(not(no_cuda))]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:20 +08:00
Gahow Wang	9ca98efd98	autodiff: finite-diff gradient-check harness New xtrain-autodiff crate with a reusable central finite-difference gradient check: grad_check(x, shape, f, analytic_grad, cfg) compares an analytic gradient against (f(x+ε)-f(x-ε))/2ε per element with a relative tolerance. Host-only (no CUDA): the loss closure owns any GPU work, so T4's per-op backward checks can reuse it directly. Includes host unit tests (sum(x²) grad 2x passes; a wrong grad is rejected). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:42 +08:00
Gahow Wang	fbd07a578c	tensor: minimal Tensor crate over xtrain-cuda New xtrain-tensor crate: DType (F32), shape/stride helpers, Arc-counted host/device Storage with CPU↔CUDA copy, and a contiguous Tensor with creation, host↔device transfer, and a scale() op driving the elementwise kernel. GPU integration tests (host↔device roundtrip + scale correctness) gated behind not(no_cuda); a thin build.rs emits the no_cuda cfg so the kernel call sites compile out locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:13:06 +08:00
Gahow Wang	92acf9f413	T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test) Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu via the cc crate targeting sm_120 (RTX 5090) and links cudart. A gated integration test runs the vector-add kernel on the GPU and asserts the result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA compilation and sets a `no_cuda` cfg so host-side cargo check still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:42:43 +08:00

7 Commits