Commit Graph

10 Commits

Author SHA1 Message Date
1c76573cb4 export: safetensors + config.json for xserv qwen3
New bin export_safetensors: load an xtrain checkpoint, map every param to its
HF Qwen3 tensor name, transpose 2D projection weights [in,out]->[out,in]
(1D norms + [vocab,dim] embed/lm_head kept), cast to BF16 (xserv's qwen3
forward is BF16-only), and write config.json + model.safetensors + a copy of
the gpt2 tokenizer.json. Sized exactly like bin/train.rs. safetensors 0.5 to
match xserv. GPU body gated behind not(no_cuda).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:33:26 +08:00
163f567c80 dist: ddp all-reduce + sharded batch
DDP training step (train_rank) on top of DdpContext: each rank advances the
SAME RNG, draws the whole global batch, and runs forward+backward only on its
shard (i % world == rank) so the union over ranks is the single-GPU batch in the
same order. After backward, all-reduce-average the device grads, then finish the
mean with clip(pre_scale = 1/b_local) -> Sigma_global/B_global, identical to the
single-GPU clip(1/B). Each rank then runs its own GpuAdamW.step; same init +
same averaged grad + same optimizer state keep params bit-identical across ranks.

Adds a deterministic build_model (same LCG init as bin/train) shared by ranks +
baseline, a per-step loss all-reduce for the reported global-mean loss, and the
thread-per-GPU launch() helper (thread::scope; Var graph is !Send so each rank
builds its model thread-locally, only UniqueId/config/&Corpus cross threads).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:15:29 +08:00
e27df50ca9 dist: nccl ffi + comm bootstrap
New crate xtrain-distributed (mirrors xserv-distributed): hand-written NCCL
FFI (GetUniqueId / CommInitRank / AllReduce / CommDestroy / Group{Start,End},
ncclUniqueId passed by value per the NCCL ABI) and a safe DdpContext wrapper —
rank 0 mints the UniqueId, every rank inits its communicator under a group, and
all_reduce_average_grads in-place AllReduce(sum)s each param's .grad() device
buffer then scales by 1/world (reuses T7's scale_inplace kernel). AllReduce runs
on the null stream so it orders with the model's kernels (no extra barrier).

build.rs follows the per-crate convention: no nvcc -> no_cuda cfg (crate
compiles to empty, cargo check passes host-side); with nvcc, links -lnccl
-lcudart like xserv-distributed's build.rs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:14:56 +08:00
7d84a64f5c data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip
New xtrain-train crate scaffold. Data pipeline reuses xserv's from-scratch
GPT-2/Qwen BPE via a path-dep (../../../xserv/crates/xserv-tokenizer, resolves
on both ~/projects and dash5 /opt/wjh/projects): Corpus::load tokenizes the
corpus into one id stream and samples fixed-length (input, target) next-token
windows (LCG-seeded, reproducible). Trims a range-downloaded file to whole
stories (<|endoftext|> boundaries).

Also the host-only training math: LrSchedule (linear warmup + cosine decay)
and global L2 grad-norm + clip scale, each with a local unit test.

Corpus: data/tinystories-valid-3mb.txt — first ~3MB of TinyStories-valid
(fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution
noted: a real TinyStories subset, not the full set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:29:32 +08:00
f22429f5b8 optim: hand-written AdamW (decoupled weight decay + bias correction)
New xtrain-optim crate. AdamW with per-param m/v moments keyed by params()
index, global bias correction, and decoupled weight decay (matches
torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers,
unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips
each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr
argument leaves room for an LR schedule.

Host unit test checks the update against an independent reference recurrence
over 20 steps and the pure-decay (g=0) boundary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:28:23 +08:00
e3912c2380 model: tiny RoPE+RMSNorm+SwiGLU transformer + overfit test
New crate xtrain-model: a from-scratch decoder built entirely from the
autodiff op set.
- Config (tiny: dim=32, 2 layers, 2 heads, head_dim=16, ffn=64).
- TinyTransformer: embedding -> N x {pre-RMSNorm -> multi-head causal
  attention (RoPE, additive causal mask, per-head SDPA) -> residual;
  pre-RMSNorm -> SwiGLU MLP -> residual} -> final RMSNorm -> LM head.
  x@W weight convention (engine GEMM is plain A@B); dim=n_heads*head_dim.
- params()/zero_grad-able leaves for the optimizer; param_to_host export.
- overfit test: char-level bring-up (embedded text -> vocab -> shifted
  targets), minimal hand-written GD (p -= lr*grad) memorises one fixed
  batch -> loss ~0 + greedy argmax matches targets. End-to-end fwd+bwd
  correctness signal. Gated #![cfg(not(no_cuda))].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:05:20 +08:00
e7ce504b1f ops: differentiable autograd nodes + per-op grad-check tests
ops.rs wraps each Tensor op as a Var node with its backward closure (forward
caches captured by move). swiglu = mul(silu(gate), up); attention is composed
(matmul+scale+softmax+matmul), no fused kernel. tests/autograd.rs grad-checks
every op via the L=sum(W∘out) template, plus a fan-out grad-accumulation test
(dL/dx=4x) and an end-to-end composed-attention grad-check (dQ/dK/dV). Adds
xtrain-cuda dev-dep for device selection in tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:53:55 +08:00
9ca98efd98 autodiff: finite-diff gradient-check harness
New xtrain-autodiff crate with a reusable central finite-difference
gradient check: grad_check(x, shape, f, analytic_grad, cfg) compares an
analytic gradient against (f(x+ε)-f(x-ε))/2ε per element with a relative
tolerance. Host-only (no CUDA): the loss closure owns any GPU work, so
T4's per-op backward checks can reuse it directly. Includes host unit
tests (sum(x²) grad 2x passes; a wrong grad is rejected).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:26:42 +08:00
fbd07a578c tensor: minimal Tensor crate over xtrain-cuda
New xtrain-tensor crate: DType (F32), shape/stride helpers, Arc-counted
host/device Storage with CPU↔CUDA copy, and a contiguous Tensor with
creation, host↔device transfer, and a scale() op driving the elementwise
kernel. GPU integration tests (host↔device roundtrip + scale correctness)
gated behind not(no_cuda); a thin build.rs emits the no_cuda cfg so the
kernel call sites compile out locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:13:06 +08:00
92acf9f413 T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)
Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's
csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA
Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu
via the cc crate targeting sm_120 (RTX 5090) and links cudart.

A gated integration test runs the vector-add kernel on the GPU and asserts the
result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA
compilation and sets a `no_cuda` cfg so host-side cargo check still works.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:42:43 +08:00