xtrain

Author	SHA1	Message	Date
Gahow Wang	cf5e3987df	dist: multi-rank launcher + ddp acceptance test bin/train_ddp: spawn one thread per visible GPU (CUDA_VISIBLE_DEVICES selects the set), NCCL all-reduce gradients each step, train the tiny transformer on TinyStories; doubles as the throughput driver (prints global tok/s). no_cuda build keeps a stub main. tests/ddp_correctness: (1) 2-rank DDP vs single-GPU over the same synthetic data -> loss trajectory max_rel < 1e-3, cross-rank params bit-identical (==0.0), DDP vs single-GPU params rel < 1e-3; (2) 1/2/4-GPU throughput table on a fixed per-GPU workload. Gated #[cfg(not(no_cuda))], auto-skips with < 2 GPUs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:41 +08:00
Gahow Wang	163f567c80	dist: ddp all-reduce + sharded batch DDP training step (train_rank) on top of DdpContext: each rank advances the SAME RNG, draws the whole global batch, and runs forward+backward only on its shard (i % world == rank) so the union over ranks is the single-GPU batch in the same order. After backward, all-reduce-average the device grads, then finish the mean with clip(pre_scale = 1/b_local) -> Sigma_global/B_global, identical to the single-GPU clip(1/B). Each rank then runs its own GpuAdamW.step; same init + same averaged grad + same optimizer state keep params bit-identical across ranks. Adds a deterministic build_model (same LCG init as bin/train) shared by ranks + baseline, a per-step loss all-reduce for the reported global-mean loss, and the thread-per-GPU launch() helper (thread::scope; Var graph is !Send so each rank builds its model thread-locally, only UniqueId/config/&Corpus cross threads). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:29 +08:00
Gahow Wang	e27df50ca9	dist: nccl ffi + comm bootstrap New crate xtrain-distributed (mirrors xserv-distributed): hand-written NCCL FFI (GetUniqueId / CommInitRank / AllReduce / CommDestroy / Group{Start,End}, ncclUniqueId passed by value per the NCCL ABI) and a safe DdpContext wrapper — rank 0 mints the UniqueId, every rank inits its communicator under a group, and all_reduce_average_grads in-place AllReduce(sum)s each param's .grad() device buffer then scales by 1/world (reuses T7's scale_inplace kernel). AllReduce runs on the null stream so it orders with the model's kernels (no extra barrier). build.rs follows the per-crate convention: no nvcc -> no_cuda cfg (crate compiles to empty, cargo check passes host-side); with nvcc, links -lnccl -lcudart like xserv-distributed's build.rs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:14:56 +08:00
Gahow Wang	5e8add2a41	docs: Phase T7 — performance Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd (row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step param/grad roundtrip), drop per-op sync + device memset. Includes the verification table (regression suite green + tok/s 2770→8220 ~3x), the deferred bf16/recompute follow-up rationale, and the T8 all-reduce note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:00:29 +08:00
Gahow Wang	a842e432b5	perf: streams / drop per-op sync Default-stream kernels run in order and every host read goes through a stream-ordered cudaMemcpy (to_device), so the per-op cudaDeviceSynchronize after each kernel was pure overhead — remove all 21 in tensor.rs. Host data is still correctly ordered by the D2H memcpy that reads it. Also zero op-output buffers with cudaMemset (device-side, async) instead of a blocking H2D memcpy of a host zero buffer on every allocation — that copy was itself a hidden per-op sync point. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:56:17 +08:00
Gahow Wang	8070c1949a	perf: make xtrain-cuda a regular dep of xtrain-optim (GPU AdamW) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:52 +08:00
Gahow Wang	b0e397ca81	perf: GPU AdamW + grad-norm Eliminate the per-step GPU↔host roundtrip of every parameter/gradient. - optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum (block-reduced global grad sum-of-squares), scale_inplace. - GpuAdamW: device m/v state per param; step launches the kernel reading each param's .grad() and rewriting the param buffer in place — no host roundtrip. Host AdamW kept as the torch-parity reference. - clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm comes back), in-place rescale of grads by pre_scale·clip_factor. - train_loop: use GpuAdamW + clip_grad_norm_gpu. - test: GPU AdamW vs host reference parity (max abs err < 1e-6). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:09 +08:00
Gahow Wang	0e5c7d22e2	perf: cuBLAS matmul fwd/bwd Route Tensor::matmul and matmul_backward through cuBLAS Sgemm instead of the hand-written tiled kernel. fp32 → same GEMM up to rounding order, so the T3 cuBLAS tolerance and downstream grad-checks are preserved. - cublas.rs: thread-local persistent handle + row-major sgemm helper with transpose flags (col-major⟺row-major as the T3 oracle does). - matmul_backward: dA/dB via cuBLAS OP_T, dropping the two transpose kernels + their allocations the T3 version ran. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:48:35 +08:00
Gahow Wang	5df1d4d57b	test: resolve real_training corpus default via CARGO_MANIFEST_DIR cargo runs tests with cwd = crate dir, so the bare relative default data/tinystories-valid-3mb.txt didn't resolve. Anchor it to the repo root via CARGO_MANIFEST_DIR so the test runs out of the box (still overridable with XTRAIN_CORPUS). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:41:12 +08:00
Gahow Wang	2f8118fda9	test: tighten AdamW parity (f32 reference, 10 steps, allclose tol) The loss trajectory already matched torch.optim.AdamW (worst relerr ~2e-4), but the float64 torch reference diverged per-weight from the f32 GPU training after the model memorised the batch (flat region: weights underdetermined, loss identical). Fixes: run the torch reference in float32 (match engine precision), shorten to 10 steps (weights still well-determined), and compare final params with an allclose-style rtol+atol metric (a pure relative metric is misleading on near-zero weights). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:34:18 +08:00
Gahow Wang	29b4d30b6c	docs: Phase T6 — training loop Design doc for the T6 training stack: Goal / Module Layout / Key Design Decisions (AdamW math + decoupled WD, LR schedule, global-norm grad clip with batch averaging, checkpoint format, data pipeline + xserv tokenizer reuse, sampler) / Verification Method (AdamW parity, checkpoint round-trip, real training, host unit tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:30:14 +08:00
Gahow Wang	22b7434b23	test: AdamW PyTorch parity + checkpoint round-trip + real training Acceptance tests (GPU-gated not(no_cuda), run on dash5): - adamw_parity_dump.rs + adamw_parity.py: build the tiny model with fixed init, run N AdamW steps on a fixed batch, dump the loss trajectory + final params; the Python side rebuilds the identical model and runs torch.optim.AdamW with matched lr/wd/betas/eps, comparing trajectory + final params within rtol. - checkpoint_roundtrip.rs: train a few steps, save, load into a fresh model with a DIFFERENT init, assert identical logits/loss on a fixed input. - real_training.rs (#[ignore], --release): train on TinyStories for a bounded budget; assert loss drops substantially and print greedy samples. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:30:06 +08:00
Gahow Wang	77a82bfeee	train: loop + checkpoint save/load + sampler + train binary Training loop (train_loop.rs): sample batch_size sequences, forward loss + backward (tape SUMs grads), clip_grad_norm with ×1/batch averaging, AdamW step with scheduled lr, zero_grad; logs loss/lr/gnorm/tok-s and checkpoints periodically; returns the loss trace. Checkpoint (checkpoint.rs): flat little-endian dump of params() in order (magic/version/count + per-param ndim/dims/f32 data); load_into validates and overwrites a matching model's params via set_value (exact f32 round-trip). Sampler (sample.rs): autoregressive greedy / temperature generation — re-runs forward on the growing prefix (model is single-sequence, RoPE pos=row). bin/train.rs: end-to-end entry — load tokenizer+corpus, train a tiny 4-layer model for a bounded budget, checkpoint, print samples. no_cuda stub keeps it buildable on a GPU-less host. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:29:58 +08:00
Gahow Wang	7d84a64f5c	data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip New xtrain-train crate scaffold. Data pipeline reuses xserv's from-scratch GPT-2/Qwen BPE via a path-dep (../../../xserv/crates/xserv-tokenizer, resolves on both ~/projects and dash5 /opt/wjh/projects): Corpus::load tokenizes the corpus into one id stream and samples fixed-length (input, target) next-token windows (LCG-seeded, reproducible). Trims a range-downloaded file to whole stories (<|endoftext|> boundaries). Also the host-only training math: LrSchedule (linear warmup + cosine decay) and global L2 grad-norm + clip scale, each with a local unit test. Corpus: data/tinystories-valid-3mb.txt — first ~3MB of TinyStories-valid (fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution noted: a real TinyStories subset, not the full set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:29:32 +08:00
Gahow Wang	f22429f5b8	optim: hand-written AdamW (decoupled weight decay + bias correction) New xtrain-optim crate. AdamW with per-param m/v moments keyed by params() index, global bias correction, and decoupled weight decay (matches torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers, unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr argument leaves room for an LR schedule. Host unit test checks the update against an independent reference recurrence over 20 steps and the pure-decay (g=0) boundary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:28:23 +08:00
Gahow Wang	8565565647	docs: Phase T5 — tiny transformer Goal / Module Layout / Key Design Decisions (multi-head layout via reshape+transpose_3d01+split/merge_heads, embedding gather/scatter-add, x@W convention, causal mask, params API, overfit methodology) / Verification Method with the dash5 results (grad-checks, overfit 2.82->0.004, PyTorch parity). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:09:30 +08:00
Gahow Wang	603c85e1e0	model: silence torch parity warning (read loss before backward) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:09:30 +08:00
Gahow Wang	3366f30c4d	model: PyTorch parity harness (weight dump + equivalent torch model) parity_dump.rs (#[ignore] fixture generator) dumps the model's exact weights, ids, forward logits, loss, and per-param grads after one backward. parity.py rebuilds the IDENTICAL model in PyTorch (same x@W convention, RoPE rotate_half pos=row, RMSNorm, SwiGLU, causal SDPA), runs fwd+bwd, and compares logits + every grad within rtol. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:07:30 +08:00
Gahow Wang	e3912c2380	model: tiny RoPE+RMSNorm+SwiGLU transformer + overfit test New crate xtrain-model: a from-scratch decoder built entirely from the autodiff op set. - Config (tiny: dim=32, 2 layers, 2 heads, head_dim=16, ffn=64). - TinyTransformer: embedding -> N x {pre-RMSNorm -> multi-head causal attention (RoPE, additive causal mask, per-head SDPA) -> residual; pre-RMSNorm -> SwiGLU MLP -> residual} -> final RMSNorm -> LM head. x@W weight convention (engine GEMM is plain A@B); dim=n_heads*head_dim. - params()/zero_grad-able leaves for the optimizer; param_to_host export. - overfit test: char-level bring-up (embedded text -> vocab -> shifted targets), minimal hand-written GD (p -= lr*grad) memorises one fixed batch -> loss ~0 + greedy argmax matches targets. End-to-end fwd+bwd correctness signal. Gated #![cfg(not(no_cuda))]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:20 +08:00
Gahow Wang	0acfa5df11	ops: grad-check the T5 structural ops Finite-diff grad-checks (same L=sum(W∘out) harness as autograd.rs) for embedding (incl. repeated ids), reshape, transpose_3d01, transpose_2d, and split/merge_heads round-trip. Gated #![cfg(not(no_cuda))]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:20 +08:00
Gahow Wang	7fb1a29057	ops: embedding/reshape/transpose/split-merge-heads fwd+bwd Phase T5 structural ops on top of the T4 set, needed to assemble the tiny transformer: - embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward (atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi. - reshape: contiguous metadata-only view (Tensor::reshape), no kernel. - transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel). - autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus split_heads (->Vec<Var>) / merge_heads for per-head attention. - tape: Var::zero_grad + set_value so a hand-written GD step can update params and clear grads between steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:09 +08:00
Gahow Wang	777f3c7949	docs: Phase T4 — autograd engine Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:53:55 +08:00
Gahow Wang	e7ce504b1f	ops: differentiable autograd nodes + per-op grad-check tests ops.rs wraps each Tensor op as a Var node with its backward closure (forward caches captured by move). swiglu = mul(silu(gate), up); attention is composed (matmul+scale+softmax+matmul), no fused kernel. tests/autograd.rs grad-checks every op via the L=sum(W∘out) template, plus a fan-out grad-accumulation test (dL/dx=4x) and an end-to-end composed-attention grad-check (dQ/dK/dV). Adds xtrain-cuda dev-dep for device selection in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:53:55 +08:00
Gahow Wang	224f750ee4	autograd: tape engine + grad accumulation Var = Rc<RefCell<VarNode>> on a define-by-run tape: value + optional grad + parents + backward closure. backward() seeds a scalar loss, walks reverse topo order, and pushes grads to parents. push_grad always SUMs into the grad slot — the fan-out accumulation path T3 lacked. Per-crate build.rs emits the no_cuda cfg (does not propagate); engine gated, grad_check stays host-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:44:17 +08:00
Gahow Wang	5aef3742d6	ops: transformer op fwd/bwd CUDA kernels + Tensor wrappers add/mul/add_bias(+sum_rows)/rms_norm/silu/rope/softmax/cross_entropy, each with its analytic backward, in csrc/ops/nn.cu (inlined warp/block reductions). FFI declarations + nn.cu in build.rs (no_cuda gated). Tensor gains the matching thin wrappers; DType grows I32 for cross-entropy targets. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:44:09 +08:00
Gahow Wang	88fbe0a85d	gemm: realistic f32 tolerances in GEMM acceptance tests Forward: compare via matrix relative error (max abs error / max|ref|) instead of a per-element ratio, so near-zero outputs where two correct f32 GEMMs differ only in rounding order don't inflate the metric. Backward: L = sum(W∘C) is bilinear, so central differences are truncation-free — use eps=1e-2 (sharper f32 resolution of the difference) and atol=1e-3 to floor near-zero-gradient subtraction noise. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:28:57 +08:00
Gahow Wang	dde2fde297	docs: Phase T3 — GEMM fwd/bwd + finite-diff Design doc covering the tiled forward, the dA/dB math + how transpose is handled (materialize + reuse forward), the cuBLAS row-major reference, and the finite-diff harness design + how T4 reuses it per-op. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:27:03 +08:00
Gahow Wang	1384044f27	gemm: GPU acceptance tests vs cuBLAS + finite-diff Forward: hand-written tiled GEMM vs cuBLAS sgemm on random matrices (square / non-tile-aligned rect / 256³), max relative error < 1e-3, using the row-major⟺col-major identity to drive cuBLAS without explicit transposes. Backward: scalar loss L = sum(W∘C) (so dC = W), dA/dB from matmul_backward checked against the finite-diff harness. Gated behind not(no_cuda). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:58 +08:00
Gahow Wang	08c88bf360	gemm: tiled F32 forward + transpose + backward (dA/dB) Hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32 accumulate, boundary-masked) plus an out-of-place transpose kernel. Wire both through xtrain-cuda FFI (no_cuda-gated) and expose at the tensor level: Tensor::matmul, transpose_2d, and matmul_backward computing dA = dC·Bᵀ and dB = Aᵀ·dC by materializing transposes and reusing the forward. Also declare cuBLAS sgemm FFI + link cublas, used only as a correctness reference in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:51 +08:00
Gahow Wang	9ca98efd98	autodiff: finite-diff gradient-check harness New xtrain-autodiff crate with a reusable central finite-difference gradient check: grad_check(x, shape, f, analytic_grad, cfg) compares an analytic gradient against (f(x+ε)-f(x-ε))/2ε per element with a relative tolerance. Host-only (no CUDA): the loss closure owns any GPU work, so T4's per-op backward checks can reuse it directly. Includes host unit tests (sum(x²) grad 2x passes; a wrong grad is rejected). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:42 +08:00
Gahow Wang	fbd07a578c	tensor: minimal Tensor crate over xtrain-cuda New xtrain-tensor crate: DType (F32), shape/stride helpers, Arc-counted host/device Storage with CPU↔CUDA copy, and a contiguous Tensor with creation, host↔device transfer, and a scale() op driving the elementwise kernel. GPU integration tests (host↔device roundtrip + scale correctness) gated behind not(no_cuda); a thin build.rs emits the no_cuda cfg so the kernel call sites compile out locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:13:06 +08:00
Gahow Wang	63dc05fd10	tensor: add scale elementwise CUDA kernel + FFI New csrc/ops/elementwise.cu (out[i]=in[i]*alpha), compiled by xtrain-cuda/build.rs and exposed via launch_scale_f32 FFI, gated behind not(no_cuda) like the existing vecadd smoke test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:13:06 +08:00
Gahow Wang	8557a289a2	docs: Phase T2 — tensor abstraction Design doc for the minimal tensor layer: DType/shape/Storage/Tensor, host↔device copy, and one elementwise kernel (scale) wired end-to-end. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:12:55 +08:00
Gahow Wang	c1b204296b	docs: backfill T1 build-chain T1 shipped without a design doc; capture the Rust↔CUDA build chain (build.rs+nvcc, no_cuda cfg pattern, RAII GpuBuffer, gitea↔dash5 flow). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:12:55 +08:00
Gahow Wang	92acf9f413	T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test) Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu via the cc crate targeting sm_120 (RTX 5090) and links cudart. A gated integration test runs the vector-add kernel on the GPU and asserts the result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA compilation and sets a `no_cuda` cfg so host-side cargo check still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:42:43 +08:00
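
Commit 9ca98efd98 describes a central finite-difference gradient check that compares an analytic gradient against (f(x+ε)-f(x-ε))/2ε per element. The idea can be sketched on the host as below. This is only an illustration: the function name `grad_check_max_rel_err`, its signature, the use of f64, and the relative-error formula are assumptions and do not reproduce the repository's `grad_check(x, shape, f, analytic_grad, cfg)`.

```rust
/// Hypothetical host-only sketch of a central finite-difference check.
/// Returns the worst per-element relative error between `analytic` and
/// the numeric gradient (f(x+eps) - f(x-eps)) / (2*eps).
fn grad_check_max_rel_err<F>(x: &[f64], f: F, analytic: &[f64], eps: f64) -> f64
where
    F: Fn(&[f64]) -> f64,
{
    assert_eq!(x.len(), analytic.len());
    let mut probe = x.to_vec();
    let mut worst = 0.0_f64;
    for i in 0..x.len() {
        // Perturb one coordinate at a time, in both directions.
        probe[i] = x[i] + eps;
        let up = f(&probe);
        probe[i] = x[i] - eps;
        let down = f(&probe);
        probe[i] = x[i]; // restore before moving to the next coordinate

        let numeric = (up - down) / (2.0 * eps);
        // Relative error, floored so near-zero gradients do not divide by zero.
        let denom = numeric.abs().max(analytic[i].abs()).max(1e-12);
        worst = worst.max((numeric - analytic[i]).abs() / denom);
    }
    worst
}
```

Calling it with f(x) = sum(x²) and the analytic gradient 2x should give an error near zero, while a deliberately wrong gradient such as 3x should be rejected. This mirrors the two host unit tests the commit message mentions.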

35 Commits
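
Commit 163f567c80 argues that sharding the global batch by `i % world == rank`, all-reduce-averaging the per-rank summed gradients, and then applying `pre_scale = 1/b_local` yields Sigma_global/B_global, the same mean gradient as single-GPU training. The arithmetic can be checked with a host-only sketch. The function names here are hypothetical, each per-sample gradient is reduced to one scalar for brevity, and the batch size is assumed to divide evenly by the world size.

```rust
/// Hypothetical sketch: mean gradient as computed by the DDP recipe.
/// `per_sample[i]` stands in for the gradient contributed by sample i.
fn ddp_mean_grad(per_sample: &[f64], world: usize) -> f64 {
    assert!(world > 0 && per_sample.len() % world == 0);
    let b_local = per_sample.len() / world;

    // Each rank's tape SUMs gradients over its own shard (i % world == rank).
    let rank_sums: Vec<f64> = (0..world)
        .map(|rank| {
            per_sample
                .iter()
                .enumerate()
                .filter(|(i, _)| i % world == rank)
                .map(|(_, g)| *g)
                .sum::<f64>()
        })
        .collect();

    // All-reduce(sum) across ranks, then scale by 1/world.
    let averaged = rank_sums.iter().sum::<f64>() / world as f64;

    // Finish the mean with pre_scale = 1/b_local.
    averaged / b_local as f64
}

/// Single-GPU reference: sum over the whole batch, scaled by 1/B.
fn single_gpu_mean_grad(per_sample: &[f64]) -> f64 {
    per_sample.iter().sum::<f64>() / per_sample.len() as f64
}
```

Since B_global = world × b_local, the two scalings compose to 1/B_global, so both functions agree up to floating-point rounding order.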