xtrain

Author	SHA1	Message	Date
Gahow Wang	734e119db3	run: v4 archive + export (dim768, 8-GPU DDP, val 1.17) v4 scaling run finished: dim768/18L, core 127.43M (total 204.63M), trained 720.9M tokens (~1.54 epoch) on 8x RTX 5090 DDP fp32, ~145K tok/s, ~84 min, best val 1.1690. Checkpoint archived to registry (~/projects/tiny-models/v4-tinystories-dim768/) and exported to xserv HF Qwen3 safetensors (201 tensors, BF16); xserv serves it and matches xtrain greedy token-for-token on all 3 fixed prompts (40 tok). Add `greedy_sample` bin: load a trained ckpt with its arch flags and print xtrain's own greedy continuations for the fixed run prompts, so they can be diffed against xserv's greedy on the exported weights (the per-run token-match check). Same model/config/init scheme as bin/train.rs + bin/export_safetensors.rs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 13:14:28 +08:00
Gahow Wang	b7104e2cb7	test: loosen flaky DDP cross-rank assertion to <1e-6; scale to world=8 The cross-rank `max\|p0-p1\| == 0.0` check is flaky on this PCIe-only box: NCCL's all-reduce is not bit-reproducible run-to-run across ranks (algorithm/chunk choice is unstable), so cross-rank params can differ by a few ULP (observed <=1.2e-7) even with identical init + averaged grads. The load-bearing gate is the loss-trajectory match (~5.7e-7); a tight <1e-6 tolerance is the honest invariant. Also extend ddp_throughput_scaling to include world=8 for the KI-5 before/after scaling table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:11 +08:00
Gahow Wang	28801fbfe5	cuda: device caching allocator (pool GpuBuffer alloc) Every tape op allocates its output via Tensor::zeros -> GpuBuffer::alloc -> cudaMalloc, a synchronous process-serialized driver call. Under the single- process thread-per-GPU DDP model the rank threads' hundreds of per-step allocs serialize through the driver (KI-5 root cause); it costs single-GPU too. Add a per-device, size-classed caching pool: GpuBuffer::alloc serves from a free-list (request rounded up to a size class so repeating training shapes reuse buffers), only cudaMalloc on a miss; Drop returns the buffer to the pool instead of cudaFree. Thread-safe via a global registry keyed by device id with each device's free-list behind its own Mutex (registry lock held only to clone out the per-device Arc<Mutex<_>>, so rank threads don't contend across devices). The buffer records its alloc-time device so Drop returns to the right pool. Transparent: physical capacity may be rounded up, but len()/memset/copy bounds all use the requested length, so the rounded tail is never read and numerics are unchanged. zeros() still memsets (reused buffers hold stale bytes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:02 +08:00
Gahow Wang	88c2c15768	Revert "dist: coalesce grads into buckets for all-reduce (KI-5)" This reverts commit `b8b58212dc`.	2026-06-16 09:39:38 +08:00
Gahow Wang	b8b58212dc	dist: coalesce grads into buckets for all-reduce (KI-5) Replace the per-parameter eager all-reduce (~150 tiny serial NCCL calls for dim512, DDP's dominant cost after T10's batched forward) with a coalesced bucketed all-reduce: pack grads into a few large contiguous scratch buffers, all-reduce each bucket once (fused via ncclGroupStart/ End), fold the 1/world average into one per-bucket scale, unpack back. The packed buffer is the concatenation of the grad tensors, so NCCL's element-wise sum over a bucket equals the per-tensor sums — bit-identical to the un-bucketed path; only launch/latency overhead is removed. DDP cross-rank param identity + loss-match are preserved. Adds xtrain_cuda::device::copy_d2d (cudaMemcpy D2D) for the pack/unpack. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:09:44 +08:00
Gahow Wang	25b032445d	train: real batched step (drop loop+SUM) Feed a real batch of B sequences as ONE batched forward/backward, replacing the "loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows is already the batch-mean loss, so backward yields the batch-mean gradient directly → clip pre-scale = 1.0. DDP stays equivalent: each rank runs one batched forward over its b_local = B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average (sum across ranks /world) = Σ_global/B_global = global batch-mean → clip pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way. DDP loss matches single-GPU 5.7e-7, cross-rank params bit-identical (0.0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:33 +08:00
Gahow Wang	5353b38402	model: batched forward [B,S] forward_batched(ids[BS], batch)/loss_batched: run B equal-length sequences as ONE forward over flattened [BS] ids, so every linear is one big [BS,dim] GEMM. Attention reshapes to [Bnh,S,hd], runs the fused batched causal SDPA (per-seq mask + RoPE period=S, no cross-sequence attention), writes back [B*S,dim]. The old per-(batch,head) loop + host-round-tripping split/merge_heads + the additive causal_mask leaf are gone. forward(ids[seq]) is now forward_batched(ids,1), so the sampler / inference path (batch=1) is unchanged. +batched_ids_tensor helper. New batched.rs test: batched forward == looped single-sequence (logits identical 0.0, grads 6.4e-4, loss identical). PyTorch parity now exercises B>1 (B=2,S=4): loss 5e-8, logits 6.9e-6, all 25 param grads within rtol — verifying per-seq RoPE position + per-seq causal masking. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:25 +08:00
Gahow Wang	7821bd9c34	autograd: batch dim for ops (flatten linears, batched attention) Add the batched-forward primitives. Linears/norms/elementwise/embedding/CE already act on flat [rows,dim], so they work unchanged on [BS,dim]; only attention + RoPE need sequence awareness: - RoPE: kernel takes a `period` (= seq len) so position = row % period, i.e. per-sequence position on a flattened batch (period == tokens = single seq). - Fused batched causal attention: new `Tensor::attention`/`attention_backward` + ops node, running QKᵀ and PV as cublasSgemmStridedBatched over the Bnh (sequence,head) blocks (new sgemm_strided_batched binding) and a causal softmax kernel (scale + per-row causal mask inline) — the whole attention is 3 launches regardless of B*nh, no per-head/per-seq loop, no host round-trip. - transpose_4d12 ([B,S,nh,hd] <-> [B,nh,S,hd]) to lay out the batched heads. grad-checks: new batched-rope, transpose_4d12, batched-attention dQ/dK/dV all pass finite-diff (attn dK 1.5e-2, dQ 7.5e-3, dV 2.9e-4; rest tighter) alongside the existing 12. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:15 +08:00
Gahow Wang	7090b475fb	train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch) The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch (scaling-ladder rung), the cached token-id stream (`load_cached`), held-out val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 owns the val corpus and runs the no-grad eval / writes the best checkpoint (params are bit-identical across ranks). The eval/checkpoint logic is reused from `xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated. - DdpConfig gains eval_every / eval_batches / ckpt_path. - train_rank takes `valid: Option<&Corpus>` and returns DdpResult (losses + evals + best_val); launch threads the val corpus to rank 0 only. - bin/train_ddp reworked to the bin/train CLI (positional tokenizer/corpus + --dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/ --val-tokens/--eval-every/--ckpt), reusing the u16 cache. - DDP correctness test updated to the new signatures (semantics unchanged). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:34:40 +08:00
Gahow Wang	ec8114ecbc	train: --eval-ckpt eval-only mode (v0-vs-v1 same-set val loss) Expose eval_loss() and add a --eval-ckpt <path> branch to bin/train: load an existing checkpoint into a model of the given arch and score it on the held-out val split, then exit. Lets v0 and v1 be measured on the identical validation set (the acceptance metric) without a separate eval binary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:44:40 +08:00
Gahow Wang	e44e50ef78	data: full TinyStories + tokenized-id cache, val loss, CLI arch - Corpus::load_cached: tokenize the (large) corpus ONCE, cache the id stream to <corpus>.u16.bin (gpt2 vocab 50257 < 65536 → exact u16), read cache on reruns. - Corpus::split_tail: hold out a tail slice as a validation corpus. - train(): take an optional valid corpus + eval_every/eval_batches; periodic deterministic val-loss eval that checkpoints the BEST val model; returns TrainResult{train_losses, evals, best_val}. T6 fixed-cadence path preserved. - bin/train + bin/export_safetensors: read architecture (--heads/--head-dim/ --layers/--ffn) + opt knobs (--steps/--batch/--seq/--max-lr/--val-tokens/ --eval-every) from CLI flags; defaults reproduce the v0-baseline tiny config. - gitignore the multi-GB corpus + .u16.bin caches + .ckpt (dash5-only). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:34:48 +08:00
Gahow Wang	15f1e526c7	train: parameterize model size (scaling ladder) Add Config::from_arch(vocab, n_heads, head_dim, n_layers, ffn) so the model size is a tunable rung instead of a hardcoded tiny config, and Config::core_params() (num_params minus the two vocab×dim tables) — the figure the ladder is sized against (the 50257-vocab embed+lm_head adds a fixed ~25M that is not capacity). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:34:39 +08:00
Gahow Wang	e246c3bec2	export: dump_logits bin for xserv-vs-xtrain comparison xtrain-side top-k next-token logit dump (f32 forward, same model/config/ckpt as the exporter) mirroring xserv's dump-logits, so the closed-loop check can compare both sides numerically for the same prompt + weights. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:36:41 +08:00
Gahow Wang	1c76573cb4	export: safetensors + config.json for xserv qwen3 New bin export_safetensors: load an xtrain checkpoint, map every param to its HF Qwen3 tensor name, transpose 2D projection weights [in,out]->[out,in] (1D norms + [vocab,dim] embed/lm_head kept), cast to BF16 (xserv's qwen3 forward is BF16-only), and write config.json + model.safetensors + a copy of the gpt2 tokenizer.json. Sized exactly like bin/train.rs. safetensors 0.5 to match xserv. GPU body gated behind not(no_cuda). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:33:26 +08:00
Gahow Wang	7a4f69e430	model: add per-head QK-norm (Qwen3-compat) for xserv export xserv's Qwen3 forward unconditionally applies per-head RMSNorm to Q and K (q_norm/k_norm, shape [head_dim]) before RoPE — even gamma=1 is a real RMS divide, not identity. xtrain never had this, so an exact xserv<->xtrain loop was structurally impossible. Add it (reusing the 2D rms_norm op on the [seq*nh, hd] head rows, inserted between reshape and rope to mirror qwen3.rs's order) so the trained model is genuinely Qwen3-compatible. params() inserts q_norm,k_norm after wv; num_params() counts them; the PyTorch parity refs (parity.py / adamw_parity.py) + their name lists add the same step so the dumps stay self-consistent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:33:19 +08:00
Gahow Wang	ad82e8bf92	dist: lengthen scaling bench so NCCL init amortizes 30-step bench charged the one-time NCCL init + 4 model builds (present at world=4, absent at world=1) against the wall clock, understating steady-state scaling (in-loop tok/s already showed ~53k at 4 GPUs). Bump to 150 steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:18:23 +08:00
Gahow Wang	818f76a18f	dist: drop unused import; relax DDP-vs-single-GPU param tolerance dash5 verify: loss trajectory matches single-GPU to max_rel 1.16e-7 and cross-rank params are bit-identical (0.0), but DDP-vs-single-GPU per-param rel diff is ~2.8e-3 after 20 AdamW steps — expected, since the two differ only in gradient summation order (fp add isn't associative) and that rounding compounds. Bump check (c) 1e-3 -> 1e-2 (a/b stay tight). Also remove an unused DType import. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:17:31 +08:00
Gahow Wang	cf5e3987df	dist: multi-rank launcher + ddp acceptance test bin/train_ddp: spawn one thread per visible GPU (CUDA_VISIBLE_DEVICES selects the set), NCCL all-reduce gradients each step, train the tiny transformer on TinyStories; doubles as the throughput driver (prints global tok/s). no_cuda build keeps a stub main. tests/ddp_correctness: (1) 2-rank DDP vs single-GPU over the same synthetic data -> loss trajectory max_rel < 1e-3, cross-rank params bit-identical (==0.0), DDP vs single-GPU params rel < 1e-3; (2) 1/2/4-GPU throughput table on a fixed per-GPU workload. Gated #[cfg(not(no_cuda))], auto-skips with < 2 GPUs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:41 +08:00
Gahow Wang	163f567c80	dist: ddp all-reduce + sharded batch DDP training step (train_rank) on top of DdpContext: each rank advances the SAME RNG, draws the whole global batch, and runs forward+backward only on its shard (i % world == rank) so the union over ranks is the single-GPU batch in the same order. After backward, all-reduce-average the device grads, then finish the mean with clip(pre_scale = 1/b_local) -> Sigma_global/B_global, identical to the single-GPU clip(1/B). Each rank then runs its own GpuAdamW.step; same init + same averaged grad + same optimizer state keep params bit-identical across ranks. Adds a deterministic build_model (same LCG init as bin/train) shared by ranks + baseline, a per-step loss all-reduce for the reported global-mean loss, and the thread-per-GPU launch() helper (thread::scope; Var graph is !Send so each rank builds its model thread-locally, only UniqueId/config/&Corpus cross threads). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:29 +08:00
Gahow Wang	e27df50ca9	dist: nccl ffi + comm bootstrap New crate xtrain-distributed (mirrors xserv-distributed): hand-written NCCL FFI (GetUniqueId / CommInitRank / AllReduce / CommDestroy / Group{Start,End}, ncclUniqueId passed by value per the NCCL ABI) and a safe DdpContext wrapper — rank 0 mints the UniqueId, every rank inits its communicator under a group, and all_reduce_average_grads in-place AllReduce(sum)s each param's .grad() device buffer then scales by 1/world (reuses T7's scale_inplace kernel). AllReduce runs on the null stream so it orders with the model's kernels (no extra barrier). build.rs follows the per-crate convention: no nvcc -> no_cuda cfg (crate compiles to empty, cargo check passes host-side); with nvcc, links -lnccl -lcudart like xserv-distributed's build.rs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:14:56 +08:00
Gahow Wang	a842e432b5	perf: streams / drop per-op sync Default-stream kernels run in order and every host read goes through a stream-ordered cudaMemcpy (to_device), so the per-op cudaDeviceSynchronize after each kernel was pure overhead — remove all 21 in tensor.rs. Host data is still correctly ordered by the D2H memcpy that reads it. Also zero op-output buffers with cudaMemset (device-side, async) instead of a blocking H2D memcpy of a host zero buffer on every allocation — that copy was itself a hidden per-op sync point. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:56:17 +08:00
Gahow Wang	8070c1949a	perf: make xtrain-cuda a regular dep of xtrain-optim (GPU AdamW) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:52 +08:00
Gahow Wang	b0e397ca81	perf: GPU AdamW + grad-norm Eliminate the per-step GPU↔host roundtrip of every parameter/gradient. - optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum (block-reduced global grad sum-of-squares), scale_inplace. - GpuAdamW: device m/v state per param; step launches the kernel reading each param's .grad() and rewriting the param buffer in place — no host roundtrip. Host AdamW kept as the torch-parity reference. - clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm comes back), in-place rescale of grads by pre_scale·clip_factor. - train_loop: use GpuAdamW + clip_grad_norm_gpu. - test: GPU AdamW vs host reference parity (max abs err < 1e-6). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:09 +08:00
Gahow Wang	0e5c7d22e2	perf: cuBLAS matmul fwd/bwd Route Tensor::matmul and matmul_backward through cuBLAS Sgemm instead of the hand-written tiled kernel. fp32 → same GEMM up to rounding order, so the T3 cuBLAS tolerance and downstream grad-checks are preserved. - cublas.rs: thread-local persistent handle + row-major sgemm helper with transpose flags (col-major⟺row-major as the T3 oracle does). - matmul_backward: dA/dB via cuBLAS OP_T, dropping the two transpose kernels + their allocations the T3 version ran. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:48:35 +08:00
Gahow Wang	5df1d4d57b	test: resolve real_training corpus default via CARGO_MANIFEST_DIR cargo runs tests with cwd = crate dir, so the bare relative default data/tinystories-valid-3mb.txt didn't resolve. Anchor it to the repo root via CARGO_MANIFEST_DIR so the test runs out of the box (still overridable with XTRAIN_CORPUS). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:41:12 +08:00
Gahow Wang	2f8118fda9	test: tighten AdamW parity (f32 reference, 10 steps, allclose tol) The loss trajectory already matched torch.optim.AdamW (worst relerr ~2e-4), but the float64 torch reference diverged per-weight from the f32 GPU training after the model memorised the batch (flat region: weights underdetermined, loss identical). Fixes: run the torch reference in float32 (match engine precision), shorten to 10 steps (weights still well-determined), and compare final params with an allclose-style rtol+atol metric (a pure relative metric is misleading on near-zero weights). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:34:18 +08:00
Gahow Wang	22b7434b23	test: AdamW PyTorch parity + checkpoint round-trip + real training Acceptance tests (GPU-gated not(no_cuda), run on dash5): - adamw_parity_dump.rs + adamw_parity.py: build the tiny model with fixed init, run N AdamW steps on a fixed batch, dump the loss trajectory + final params; the Python side rebuilds the identical model and runs torch.optim.AdamW with matched lr/wd/betas/eps, comparing trajectory + final params within rtol. - checkpoint_roundtrip.rs: train a few steps, save, load into a fresh model with a DIFFERENT init, assert identical logits/loss on a fixed input. - real_training.rs (#[ignore], --release): train on TinyStories for a bounded budget; assert loss drops substantially and print greedy samples. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:30:06 +08:00
Gahow Wang	77a82bfeee	train: loop + checkpoint save/load + sampler + train binary Training loop (train_loop.rs): sample batch_size sequences, forward loss + backward (tape SUMs grads), clip_grad_norm with ×1/batch averaging, AdamW step with scheduled lr, zero_grad; logs loss/lr/gnorm/tok-s and checkpoints periodically; returns the loss trace. Checkpoint (checkpoint.rs): flat little-endian dump of params() in order (magic/version/count + per-param ndim/dims/f32 data); load_into validates and overwrites a matching model's params via set_value (exact f32 round-trip). Sampler (sample.rs): autoregressive greedy / temperature generation — re-runs forward on the growing prefix (model is single-sequence, RoPE pos=row). bin/train.rs: end-to-end entry — load tokenizer+corpus, train a tiny 4-layer model for a bounded budget, checkpoint, print samples. no_cuda stub keeps it buildable on a GPU-less host. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:29:58 +08:00
Gahow Wang	7d84a64f5c	data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip New xtrain-train crate scaffold. Data pipeline reuses xserv's from-scratch GPT-2/Qwen BPE via a path-dep (../../../xserv/crates/xserv-tokenizer, resolves on both ~/projects and dash5 /opt/wjh/projects): Corpus::load tokenizes the corpus into one id stream and samples fixed-length (input, target) next-token windows (LCG-seeded, reproducible). Trims a range-downloaded file to whole stories (<\|endoftext\|> boundaries). Also the host-only training math: LrSchedule (linear warmup + cosine decay) and global L2 grad-norm + clip scale, each with a local unit test. Corpus: data/tinystories-valid-3mb.txt — first ~3MB of TinyStories-valid (fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution noted: a real TinyStories subset, not the full set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:29:32 +08:00
Gahow Wang	f22429f5b8	optim: hand-written AdamW (decoupled weight decay + bias correction) New xtrain-optim crate. AdamW with per-param m/v moments keyed by params() index, global bias correction, and decoupled weight decay (matches torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers, unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr argument leaves room for an LR schedule. Host unit test checks the update against an independent reference recurrence over 20 steps and the pure-decay (g=0) boundary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:28:23 +08:00
Gahow Wang	603c85e1e0	model: silence torch parity warning (read loss before backward) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:09:30 +08:00
Gahow Wang	3366f30c4d	model: PyTorch parity harness (weight dump + equivalent torch model) parity_dump.rs (#[ignore] fixture generator) dumps the model's exact weights, ids, forward logits, loss, and per-param grads after one backward. parity.py rebuilds the IDENTICAL model in PyTorch (same x@W convention, RoPE rotate_half pos=row, RMSNorm, SwiGLU, causal SDPA), runs fwd+bwd, and compares logits + every grad within rtol. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:07:30 +08:00
Gahow Wang	e3912c2380	model: tiny RoPE+RMSNorm+SwiGLU transformer + overfit test New crate xtrain-model: a from-scratch decoder built entirely from the autodiff op set. - Config (tiny: dim=32, 2 layers, 2 heads, head_dim=16, ffn=64). - TinyTransformer: embedding -> N x {pre-RMSNorm -> multi-head causal attention (RoPE, additive causal mask, per-head SDPA) -> residual; pre-RMSNorm -> SwiGLU MLP -> residual} -> final RMSNorm -> LM head. x@W weight convention (engine GEMM is plain A@B); dim=n_headshead_dim. - params()/zero_grad-able leaves for the optimizer; param_to_host export. - overfit test: char-level bring-up (embedded text -> vocab -> shifted targets), minimal hand-written GD (p -= lrgrad) memorises one fixed batch -> loss ~0 + greedy argmax matches targets. End-to-end fwd+bwd correctness signal. Gated #![cfg(not(no_cuda))]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:20 +08:00
Gahow Wang	0acfa5df11	ops: grad-check the T5 structural ops Finite-diff grad-checks (same L=sum(W∘out) harness as autograd.rs) for embedding (incl. repeated ids), reshape, transpose_3d01, transpose_2d, and split/merge_heads round-trip. Gated #![cfg(not(no_cuda))]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:20 +08:00
Gahow Wang	7fb1a29057	ops: embedding/reshape/transpose/split-merge-heads fwd+bwd Phase T5 structural ops on top of the T4 set, needed to assemble the tiny transformer: - embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward (atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi. - reshape: contiguous metadata-only view (Tensor::reshape), no kernel. - transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel). - autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus split_heads (->Vec<Var>) / merge_heads for per-head attention. - tape: Var::zero_grad + set_value so a hand-written GD step can update params and clear grads between steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:09 +08:00
Gahow Wang	e7ce504b1f	ops: differentiable autograd nodes + per-op grad-check tests ops.rs wraps each Tensor op as a Var node with its backward closure (forward caches captured by move). swiglu = mul(silu(gate), up); attention is composed (matmul+scale+softmax+matmul), no fused kernel. tests/autograd.rs grad-checks every op via the L=sum(W∘out) template, plus a fan-out grad-accumulation test (dL/dx=4x) and an end-to-end composed-attention grad-check (dQ/dK/dV). Adds xtrain-cuda dev-dep for device selection in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:53:55 +08:00
Gahow Wang	224f750ee4	autograd: tape engine + grad accumulation Var = Rc<RefCell<VarNode>> on a define-by-run tape: value + optional grad + parents + backward closure. backward() seeds a scalar loss, walks reverse topo order, and pushes grads to parents. push_grad always SUMs into the grad slot — the fan-out accumulation path T3 lacked. Per-crate build.rs emits the no_cuda cfg (does not propagate); engine gated, grad_check stays host-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:44:17 +08:00
Gahow Wang	5aef3742d6	ops: transformer op fwd/bwd CUDA kernels + Tensor wrappers add/mul/add_bias(+sum_rows)/rms_norm/silu/rope/softmax/cross_entropy, each with its analytic backward, in csrc/ops/nn.cu (inlined warp/block reductions). FFI declarations + nn.cu in build.rs (no_cuda gated). Tensor gains the matching thin wrappers; DType grows I32 for cross-entropy targets. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:44:09 +08:00
Gahow Wang	88fbe0a85d	gemm: realistic f32 tolerances in GEMM acceptance tests Forward: compare via matrix relative error (max abs error / max\|ref\|) instead of a per-element ratio, so near-zero outputs where two correct f32 GEMMs differ only in rounding order don't inflate the metric. Backward: L = sum(W∘C) is bilinear, so central differences are truncation-free — use eps=1e-2 (sharper f32 resolution of the difference) and atol=1e-3 to floor near-zero-gradient subtraction noise. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:28:57 +08:00
Gahow Wang	1384044f27	gemm: GPU acceptance tests vs cuBLAS + finite-diff Forward: hand-written tiled GEMM vs cuBLAS sgemm on random matrices (square / non-tile-aligned rect / 256³), max relative error < 1e-3, using the row-major⟺col-major identity to drive cuBLAS without explicit transposes. Backward: scalar loss L = sum(W∘C) (so dC = W), dA/dB from matmul_backward checked against the finite-diff harness. Gated behind not(no_cuda). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:58 +08:00
Gahow Wang	08c88bf360	gemm: tiled F32 forward + transpose + backward (dA/dB) Hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32 accumulate, boundary-masked) plus an out-of-place transpose kernel. Wire both through xtrain-cuda FFI (no_cuda-gated) and expose at the tensor level: Tensor::matmul, transpose_2d, and matmul_backward computing dA = dC·Bᵀ and dB = Aᵀ·dC by materializing transposes and reusing the forward. Also declare cuBLAS sgemm FFI + link cublas, used only as a correctness reference in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:51 +08:00
Gahow Wang	9ca98efd98	autodiff: finite-diff gradient-check harness New xtrain-autodiff crate with a reusable central finite-difference gradient check: grad_check(x, shape, f, analytic_grad, cfg) compares an analytic gradient against (f(x+ε)-f(x-ε))/2ε per element with a relative tolerance. Host-only (no CUDA): the loss closure owns any GPU work, so T4's per-op backward checks can reuse it directly. Includes host unit tests (sum(x²) grad 2x passes; a wrong grad is rejected). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:42 +08:00
Gahow Wang	fbd07a578c	tensor: minimal Tensor crate over xtrain-cuda New xtrain-tensor crate: DType (F32), shape/stride helpers, Arc-counted host/device Storage with CPU↔CUDA copy, and a contiguous Tensor with creation, host↔device transfer, and a scale() op driving the elementwise kernel. GPU integration tests (host↔device roundtrip + scale correctness) gated behind not(no_cuda); a thin build.rs emits the no_cuda cfg so the kernel call sites compile out locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:13:06 +08:00
Gahow Wang	63dc05fd10	tensor: add scale elementwise CUDA kernel + FFI New csrc/ops/elementwise.cu (out[i]=in[i]*alpha), compiled by xtrain-cuda/build.rs and exposed via launch_scale_f32 FFI, gated behind not(no_cuda) like the existing vecadd smoke test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:13:06 +08:00
Gahow Wang	92acf9f413	T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test) Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu via the cc crate targeting sm_120 (RTX 5090) and links cudart. A gated integration test runs the vector-add kernel on the GPU and asserts the result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA compilation and sets a `no_cuda` cfg so host-side cargo check still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:42:43 +08:00

1 2

95 Commits