xtrain

Author	SHA1	Message	Date
Gahow Wang	28801fbfe5	cuda: device caching allocator (pool GpuBuffer alloc) Every tape op allocates its output via Tensor::zeros -> GpuBuffer::alloc -> cudaMalloc, a synchronous process-serialized driver call. Under the single-process thread-per-GPU DDP model the rank threads' hundreds of per-step allocs serialize through the driver (KI-5 root cause); it costs single-GPU too. Add a per-device, size-classed caching pool: GpuBuffer::alloc serves from a free-list (request rounded up to a size class so repeating training shapes reuse buffers), only cudaMalloc on a miss; Drop returns the buffer to the pool instead of cudaFree. Thread-safe via a global registry keyed by device id with each device's free-list behind its own Mutex (registry lock held only to clone out the per-device Arc<Mutex<_>>, so rank threads don't contend across devices). The buffer records its alloc-time device so Drop returns to the right pool. Transparent: physical capacity may be rounded up, but len()/memset/copy bounds all use the requested length, so the rounded tail is never read and numerics are unchanged. zeros() still memsets (reused buffers hold stale bytes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:02 +08:00
Gahow Wang	d422c68704	docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky) The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the original ungrouped all-reduce is itself run-to-run nondeterministic on this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7, 1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here (passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP, numerically benign; loss-match stays ~6e-7). Noted as a follow-up to loosen the assertion to a tight tolerance; coalescing was reverted purely because it gives ~0 scaling benefit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:42:13 +08:00
Gahow Wang	84092fb28d	docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11) T11 set out to coalesce/overlap the gradient all-reduce per the original KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch 32, seq 256) falsifies that hypothesis: - grad all-reduce is only ~6-7% of each step; - per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the SAME per-rank workload) and dominates; - coalescing the ~150 per-tensor all-reduces into one grouped/flat launch gives ~0 scaling gain AND breaks cross-rank bit-identity (max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so the coalescing commit (`b8b5821`) was reverted. Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy at a time; CPU not starved; per-thread default stream doesn't help): single-process thread-per-GPU ranks serialize on the single CUDA context's per-op cudaMalloc / driver calls. Fix direction (out of T11 scope): a caching/pool allocator, or process-per-GPU. Recorded in docs/known-issues.md with the measured table; KI-5 stays Open. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:40:45 +08:00
Gahow Wang	88c2c15768	Revert "dist: coalesce grads into buckets for all-reduce (KI-5)" This reverts commit `b8b58212dc`.	2026-06-16 09:39:38 +08:00
Gahow Wang	b8b58212dc	dist: coalesce grads into buckets for all-reduce (KI-5) Replace the per-parameter eager all-reduce (~150 tiny serial NCCL calls for dim512, DDP's dominant cost after T10's batched forward) with a coalesced bucketed all-reduce: pack grads into a few large contiguous scratch buffers, all-reduce each bucket once (fused via ncclGroupStart/End), fold the 1/world average into one per-bucket scale, unpack back. The packed buffer is the concatenation of the grad tensors, so NCCL's element-wise sum over a bucket equals the per-tensor sums — bit-identical to the un-bucketed path; only launch/latency overhead is removed. DDP cross-rank param identity + loss-match are preserved. Adds xtrain_cuda::device::copy_d2d (cudaMemcpy D2D) for the pack/unpack. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:09:44 +08:00
Gahow Wang	a78502e0f0	docs: run v3 — TinyStories, dim512, val 1.30 Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup 1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40 best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 03:37:45 +08:00
Gahow Wang	64b2a8c09e	run: v3 archive + export (dim512, single-GPU batched, val 1.30) v3 trained (30000 steps × batch 32 × seq 256 = 245.8M tok, ~0.53 epoch), single-GPU batched via T10 (~26K tok/s, ~2.65h). Archived to registry ~/projects/tiny-models/v3-tinystories-dim512/ (xtrain.ckpt + config.json + model.safetensors BF16 179 tensors + tokenizer.json + RUN.md) and served in xserv (loads 16L/dim512 qwen3, 2/3 prompts token-match xtrain greedy; 3rd diverges on BF16 drift as in v1/v2). best/final val 1.3027 (beats ~1.4 target). val ladder on the same held-out 1M-token set: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30. T10 (batched forward) validated at scale (KI-1 root cause = launch-bound, not all-reduce); single-GPU avoids KI-5. Update docs/runs/README.md comparison table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 03:37:36 +08:00
Gahow Wang	9a25616a30	docs: Phase T10 — batched forward docs/09-batched-forward.md: the launch-bound diagnosis recap, the [B*S,dim]-flatten + fused batched-attention design (RoPE per-seq position + causal masking inline in softmax), the attention forward/backward via strided-batched GEMM, autograd implications, the looped-split/merge dead-end post-mortem (1127 tok/s, host round-trips), verification methods + before→after throughput, and the v3 recommendation (per-rank batch 16-32, single/small world until KI-5 bucketed all-reduce lands). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:50 +08:00
Gahow Wang	4ccab0fb42	perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x) Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling") FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU, back-to-back A/B: before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB after (batched): 25627 tok/s (batch16) / 40263 (batch32), util 37% mean / 54% peak, ~10 GB → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%. A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x. The v3 falsification history (larger batch doesn't help a single-seq design) is kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching exposes (eager all-reduce of all params each step) → recorded as KI-5 (bucketed/overlapped all-reduce), out of T10 scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:43 +08:00
Gahow Wang	25b032445d	train: real batched step (drop loop+SUM) Feed a real batch of B sequences as ONE batched forward/backward, replacing the "loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows is already the batch-mean loss, so backward yields the batch-mean gradient directly → clip pre-scale = 1.0. DDP stays equivalent: each rank runs one batched forward over its b_local = B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average (sum across ranks /world) = Σ_global/B_global = global batch-mean → clip pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way. DDP loss matches single-GPU 5.7e-7, cross-rank params bit-identical (0.0). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:33 +08:00
Gahow Wang	5353b38402	model: batched forward [B,S] forward_batched(ids[B*S], batch)/loss_batched: run B equal-length sequences as ONE forward over flattened [B*S] ids, so every linear is one big [B*S,dim] GEMM. Attention reshapes to [B*nh,S,hd], runs the fused batched causal SDPA (per-seq mask + RoPE period=S, no cross-sequence attention), writes back [B*S,dim]. The old per-(batch,head) loop + host-round-tripping split/merge_heads + the additive causal_mask leaf are gone. forward(ids[seq]) is now forward_batched(ids,1), so the sampler / inference path (batch=1) is unchanged. +batched_ids_tensor helper. New batched.rs test: batched forward == looped single-sequence (logits identical 0.0, grads 6.4e-4, loss identical). PyTorch parity now exercises B>1 (B=2,S=4): loss 5e-8, logits 6.9e-6, all 25 param grads within rtol — verifying per-seq RoPE position + per-seq causal masking. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:25 +08:00
Gahow Wang	7821bd9c34	autograd: batch dim for ops (flatten linears, batched attention) Add the batched-forward primitives. Linears/norms/elementwise/embedding/CE already act on flat [rows,dim], so they work unchanged on [B*S,dim]; only attention + RoPE need sequence awareness: - RoPE: kernel takes a `period` (= seq len) so position = row % period, i.e. per-sequence position on a flattened batch (period == tokens = single seq). - Fused batched causal attention: new `Tensor::attention`/`attention_backward` + ops node, running QKᵀ and PV as cublasSgemmStridedBatched over the B*nh (sequence,head) blocks (new sgemm_strided_batched binding) and a causal softmax kernel (scale + per-row causal mask inline) — the whole attention is 3 launches regardless of B*nh, no per-head/per-seq loop, no host round-trip. - transpose_4d12 ([B,S,nh,hd] <-> [B,nh,S,hd]) to lay out the batched heads. grad-checks: new batched-rope, transpose_4d12, batched-attention dQ/dK/dV all pass finite-diff (attn dK 1.5e-2, dQ 7.5e-3, dV 2.9e-4; rest tighter) alongside the existing 12. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:15 +08:00
Gahow Wang	d2a585c5cb	docs: KI-1 re-diagnosed in v3 — larger batch does NOT fix DDP weak scaling v3 tested the documented mitigation (raise global_batch to amortize the per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L, seq256: global_batch 32 (8/rank) → 3163 tok/s global_batch 256 (64/rank)→ 3200 tok/s (8× batch, +1.2%, within noise) 8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is launch-bound: the single-sequence model design (each sequence its own tiny forward/backward, per-op kernel launches) starves the GPU, and batching only adds proportionally more serial launches. Real fix is batched (multi-sequence) forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob. Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is fixed). KI-1 kept Open with the corrected root cause. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 23:20:26 +08:00
Gahow Wang	bf679f6f1f	docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71) Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4 RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58 and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts (3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat (global batch too small → all-reduce dominates) → links docs/known-issues KI-1; v3 proposal applies KI-1's fix (much larger global batch). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:38:31 +08:00
Gahow Wang	c87a0bc44e	docs: known-issues / perf backlog — KI-1 DDP weak scaling at small global batch Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1 single-GPU. Root cause + proposed fixes recorded; also consolidates deferred T7 items (bf16, activation recompute) and the large-vocab modeling note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:56:58 +08:00
Gahow Wang	7090b475fb	train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch) The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch (scaling-ladder rung), the cached token-id stream (`load_cached`), held-out val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 owns the val corpus and runs the no-grad eval / writes the best checkpoint (params are bit-identical across ranks). The eval/checkpoint logic is reused from `xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated. - DdpConfig gains eval_every / eval_batches / ckpt_path. - train_rank takes `valid: Option<&Corpus>` and returns DdpResult (losses + evals + best_val); launch threads the val corpus to rank 0 only. - bin/train_ddp reworked to the bin/train CLI (positional tokenizer/corpus + --dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/--val-tokens/--eval-every/--ckpt), reusing the u16 cache. - DDP correctness test updated to the new signatures (semantics unchanged). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:34:40 +08:00
Gahow Wang	264660527f	docs: run v1 — TinyStories full, dim256 docs/runs/01-v1-tinystories-dim256.md + docs/runs/README.md comparison table. v1: full TinyStories train (468.3M tok, u16-cached) + dim256/8L (core 8.39M). Same-held-out-set val loss v0 3.8050 → v1 2.5847 (−1.22); v1 samples coherent stories vs v0's "mommy's mommy's mommy" loop; exports + serves token-identical in xserv. Single RTX 5090, ~25.9 min, ~3310 tok/s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:09:46 +08:00
Gahow Wang	ec8114ecbc	train: --eval-ckpt eval-only mode (v0-vs-v1 same-set val loss) Expose eval_loss() and add a --eval-ckpt <path> branch to bin/train: load an existing checkpoint into a model of the given arch and score it on the held-out val split, then exit. Lets v0 and v1 be measured on the identical validation set (the acceptance metric) without a separate eval binary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:44:40 +08:00
Gahow Wang	e44e50ef78	data: full TinyStories + tokenized-id cache, val loss, CLI arch - Corpus::load_cached: tokenize the (large) corpus ONCE, cache the id stream to <corpus>.u16.bin (gpt2 vocab 50257 < 65536 → exact u16), read cache on reruns. - Corpus::split_tail: hold out a tail slice as a validation corpus. - train(): take an optional valid corpus + eval_every/eval_batches; periodic deterministic val-loss eval that checkpoints the BEST val model; returns TrainResult{train_losses, evals, best_val}. T6 fixed-cadence path preserved. - bin/train + bin/export_safetensors: read architecture (--heads/--head-dim/--layers/--ffn) + opt knobs (--steps/--batch/--seq/--max-lr/--val-tokens/--eval-every) from CLI flags; defaults reproduce the v0-baseline tiny config. - gitignore the multi-GB corpus + .u16.bin caches + .ckpt (dash5-only). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:34:48 +08:00
Gahow Wang	15f1e526c7	train: parameterize model size (scaling ladder) Add Config::from_arch(vocab, n_heads, head_dim, n_layers, ffn) so the model size is a tunable rung instead of a hardcoded tiny config, and Config::core_params() (num_params minus the two vocab×dim tables) — the figure the ladder is sized against (the 50257-vocab embed+lm_head adds a fixed ~25M that is not capacity). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 18:34:39 +08:00
Gahow Wang	8981cf7982	docs: T9 verification results (xserv == xtrain, dash5) Capture the closed-loop run: train (loss 10.84->3.59) -> export (47 tensors, BF16) -> xserv dump-logits + greedy. Top-1 + top-11 token order identical, logits within ~1e-2 (BF16-vs-f32 drift), greedy generation token-for-token identical across two prompts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:37:46 +08:00
Gahow Wang	e246c3bec2	export: dump_logits bin for xserv-vs-xtrain comparison xtrain-side top-k next-token logit dump (f32 forward, same model/config/ckpt as the exporter) mirroring xserv's dump-logits, so the closed-loop check can compare both sides numerically for the same prompt + weights. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:36:41 +08:00
Gahow Wang	18c2229b4b	docs: Phase T9 — export to xserv Architecture diff table (xtrain TinyTransformer vs xserv qwen3.rs), the QK-norm structural decision + BF16 acceptance criterion, the tensor-name + layout mapping table, and the dash5 closed-loop verification recipe. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:33:32 +08:00
Gahow Wang	1c76573cb4	export: safetensors + config.json for xserv qwen3 New bin export_safetensors: load an xtrain checkpoint, map every param to its HF Qwen3 tensor name, transpose 2D projection weights [in,out]->[out,in] (1D norms + [vocab,dim] embed/lm_head kept), cast to BF16 (xserv's qwen3 forward is BF16-only), and write config.json + model.safetensors + a copy of the gpt2 tokenizer.json. Sized exactly like bin/train.rs. safetensors 0.5 to match xserv. GPU body gated behind not(no_cuda). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:33:26 +08:00
Gahow Wang	7a4f69e430	model: add per-head QK-norm (Qwen3-compat) for xserv export xserv's Qwen3 forward unconditionally applies per-head RMSNorm to Q and K (q_norm/k_norm, shape [head_dim]) before RoPE — even gamma=1 is a real RMS divide, not identity. xtrain never had this, so an exact xserv<->xtrain loop was structurally impossible. Add it (reusing the 2D rms_norm op on the [seq*nh, hd] head rows, inserted between reshape and rope to mirror qwen3.rs's order) so the trained model is genuinely Qwen3-compatible. params() inserts q_norm,k_norm after wv; num_params() counts them; the PyTorch parity refs (parity.py / adamw_parity.py) + their name lists add the same step so the dumps stay self-consistent. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:33:19 +08:00
Gahow Wang	ad82e8bf92	dist: lengthen scaling bench so NCCL init amortizes 30-step bench charged the one-time NCCL init + 4 model builds (present at world=4, absent at world=1) against the wall clock, understating steady-state scaling (in-loop tok/s already showed ~53k at 4 GPUs). Bump to 150 steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:18:23 +08:00
Gahow Wang	818f76a18f	dist: drop unused import; relax DDP-vs-single-GPU param tolerance dash5 verify: loss trajectory matches single-GPU to max_rel 1.16e-7 and cross-rank params are bit-identical (0.0), but DDP-vs-single-GPU per-param rel diff is ~2.8e-3 after 20 AdamW steps — expected, since the two differ only in gradient summation order (fp add isn't associative) and that rounding compounds. Bump check (c) 1e-3 -> 1e-2 (a/b stay tight). Also remove an unused DType import. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:17:31 +08:00
Gahow Wang	0131f05b26	docs: Phase T8 — distributed data parallel Design doc for the NCCL DDP path: comm bootstrap (rank-0 UniqueId + grouped CommInitRank), thread-per-GPU launch model (Var is !Send), all-reduce-then-local-step scheme (in-place fp32 AllReduce on .grad() + /world, each rank steps its own GpuAdamW), why params stay consistent (NCCL bit-identical reduce + same init/state), batch sharding math vs single-GPU, verification plan + scaling table. Lists TP/PP/ZeRO/bf16-comm as out-of-scope follow-ups. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:49 +08:00
Gahow Wang	cf5e3987df	dist: multi-rank launcher + ddp acceptance test bin/train_ddp: spawn one thread per visible GPU (CUDA_VISIBLE_DEVICES selects the set), NCCL all-reduce gradients each step, train the tiny transformer on TinyStories; doubles as the throughput driver (prints global tok/s). no_cuda build keeps a stub main. tests/ddp_correctness: (1) 2-rank DDP vs single-GPU over the same synthetic data -> loss trajectory max_rel < 1e-3, cross-rank params bit-identical (==0.0), DDP vs single-GPU params rel < 1e-3; (2) 1/2/4-GPU throughput table on a fixed per-GPU workload. Gated #[cfg(not(no_cuda))], auto-skips with < 2 GPUs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:41 +08:00
Gahow Wang	163f567c80	dist: ddp all-reduce + sharded batch DDP training step (train_rank) on top of DdpContext: each rank advances the SAME RNG, draws the whole global batch, and runs forward+backward only on its shard (i % world == rank) so the union over ranks is the single-GPU batch in the same order. After backward, all-reduce-average the device grads, then finish the mean with clip(pre_scale = 1/b_local) -> Sigma_global/B_global, identical to the single-GPU clip(1/B). Each rank then runs its own GpuAdamW.step; same init + same averaged grad + same optimizer state keep params bit-identical across ranks. Adds a deterministic build_model (same LCG init as bin/train) shared by ranks + baseline, a per-step loss all-reduce for the reported global-mean loss, and the thread-per-GPU launch() helper (thread::scope; Var graph is !Send so each rank builds its model thread-locally, only UniqueId/config/&Corpus cross threads). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:15:29 +08:00
Gahow Wang	e27df50ca9	dist: nccl ffi + comm bootstrap New crate xtrain-distributed (mirrors xserv-distributed): hand-written NCCL FFI (GetUniqueId / CommInitRank / AllReduce / CommDestroy / Group{Start,End}, ncclUniqueId passed by value per the NCCL ABI) and a safe DdpContext wrapper — rank 0 mints the UniqueId, every rank inits its communicator under a group, and all_reduce_average_grads in-place AllReduce(sum)s each param's .grad() device buffer then scales by 1/world (reuses T7's scale_inplace kernel). AllReduce runs on the null stream so it orders with the model's kernels (no extra barrier). build.rs follows the per-crate convention: no nvcc -> no_cuda cfg (crate compiles to empty, cargo check passes host-side); with nvcc, links -lnccl -lcudart like xserv-distributed's build.rs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:14:56 +08:00
Gahow Wang	5e8add2a41	docs: Phase T7 — performance Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd (row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step param/grad roundtrip), drop per-op sync + device memset. Includes the verification table (regression suite green + tok/s 2770→8220 ~3x), the deferred bf16/recompute follow-up rationale, and the T8 all-reduce note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 17:00:29 +08:00
Gahow Wang	a842e432b5	perf: streams / drop per-op sync Default-stream kernels run in order and every host read goes through a stream-ordered cudaMemcpy (to_device), so the per-op cudaDeviceSynchronize after each kernel was pure overhead — remove all 21 in tensor.rs. Host data is still correctly ordered by the D2H memcpy that reads it. Also zero op-output buffers with cudaMemset (device-side, async) instead of a blocking H2D memcpy of a host zero buffer on every allocation — that copy was itself a hidden per-op sync point. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:56:17 +08:00
Gahow Wang	8070c1949a	perf: make xtrain-cuda a regular dep of xtrain-optim (GPU AdamW) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:52 +08:00
Gahow Wang	b0e397ca81	perf: GPU AdamW + grad-norm Eliminate the per-step GPU↔host roundtrip of every parameter/gradient. - optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum (block-reduced global grad sum-of-squares), scale_inplace. - GpuAdamW: device m/v state per param; step launches the kernel reading each param's .grad() and rewriting the param buffer in place — no host roundtrip. Host AdamW kept as the torch-parity reference. - clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm comes back), in-place rescale of grads by pre_scale·clip_factor. - train_loop: use GpuAdamW + clip_grad_norm_gpu. - test: GPU AdamW vs host reference parity (max abs err < 1e-6). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:09 +08:00
Gahow Wang	0e5c7d22e2	perf: cuBLAS matmul fwd/bwd Route Tensor::matmul and matmul_backward through cuBLAS Sgemm instead of the hand-written tiled kernel. fp32 → same GEMM up to rounding order, so the T3 cuBLAS tolerance and downstream grad-checks are preserved. - cublas.rs: thread-local persistent handle + row-major sgemm helper with transpose flags (col-major⟺row-major as the T3 oracle does). - matmul_backward: dA/dB via cuBLAS OP_T, dropping the two transpose kernels + their allocations the T3 version ran. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:48:35 +08:00
Gahow Wang	5df1d4d57b	test: resolve real_training corpus default via CARGO_MANIFEST_DIR cargo runs tests with cwd = crate dir, so the bare relative default data/tinystories-valid-3mb.txt didn't resolve. Anchor it to the repo root via CARGO_MANIFEST_DIR so the test runs out of the box (still overridable with XTRAIN_CORPUS). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:41:12 +08:00
Gahow Wang	2f8118fda9	test: tighten AdamW parity (f32 reference, 10 steps, allclose tol) The loss trajectory already matched torch.optim.AdamW (worst relerr ~2e-4), but the float64 torch reference diverged per-weight from the f32 GPU training after the model memorised the batch (flat region: weights underdetermined, loss identical). Fixes: run the torch reference in float32 (match engine precision), shorten to 10 steps (weights still well-determined), and compare final params with an allclose-style rtol+atol metric (a pure relative metric is misleading on near-zero weights). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:34:18 +08:00
Gahow Wang	29b4d30b6c	docs: Phase T6 — training loop Design doc for the T6 training stack: Goal / Module Layout / Key Design Decisions (AdamW math + decoupled WD, LR schedule, global-norm grad clip with batch averaging, checkpoint format, data pipeline + xserv tokenizer reuse, sampler) / Verification Methods (AdamW parity, checkpoint round-trip, real training, host unit tests). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:30:14 +08:00
Gahow Wang	22b7434b23	test: AdamW PyTorch parity + checkpoint round-trip + real training Acceptance tests (GPU-gated not(no_cuda), run on dash5): - adamw_parity_dump.rs + adamw_parity.py: build the tiny model with fixed init, run N AdamW steps on a fixed batch, dump the loss trajectory + final params; the Python side rebuilds the identical model and runs torch.optim.AdamW with matched lr/wd/betas/eps, comparing trajectory + final params within rtol. - checkpoint_roundtrip.rs: train a few steps, save, load into a fresh model with a DIFFERENT init, assert identical logits/loss on a fixed input. - real_training.rs (#[ignore], --release): train on TinyStories for a bounded budget; assert loss drops substantially and print greedy samples. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:30:06 +08:00
Gahow Wang	77a82bfeee	train: loop + checkpoint save/load + sampler + train binary Training loop (train_loop.rs): sample batch_size sequences, forward loss + backward (tape SUMs grads), clip_grad_norm with ×1/batch averaging, AdamW step with scheduled lr, zero_grad; logs loss/lr/gnorm/tok-s and checkpoints periodically; returns the loss trace. Checkpoint (checkpoint.rs): flat little-endian dump of params() in order (magic/version/count + per-param ndim/dims/f32 data); load_into validates and overwrites a matching model's params via set_value (exact f32 round-trip). Sampler (sample.rs): autoregressive greedy / temperature generation — re-runs forward on the growing prefix (model is single-sequence, RoPE pos=row). bin/train.rs: end-to-end entry — load tokenizer+corpus, train a tiny 4-layer model for a bounded budget, checkpoint, print samples. no_cuda stub keeps it buildable on a GPU-less host. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:29:58 +08:00
Gahow Wang	7d84a64f5c	data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip New xtrain-train crate scaffold. Data pipeline reuses xserv's from-scratch GPT-2/Qwen BPE via a path-dep (../../../xserv/crates/xserv-tokenizer, resolves on both ~/projects and dash5 /opt/wjh/projects): Corpus::load tokenizes the corpus into one id stream and samples fixed-length (input, target) next-token windows (LCG-seeded, reproducible). Trims a range-downloaded file to whole stories (<|endoftext|> boundaries). Also the host-only training math: LrSchedule (linear warmup + cosine decay) and global L2 grad-norm + clip scale, each with a local unit test. Corpus: data/tinystories-valid-3mb.txt — first ~3MB of TinyStories-valid (fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution noted: a real TinyStories subset, not the full set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:29:32 +08:00
Gahow Wang	f22429f5b8	optim: hand-written AdamW (decoupled weight decay + bias correction) New xtrain-optim crate. AdamW with per-param m/v moments keyed by params() index, global bias correction, and decoupled weight decay (matches torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers, unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr argument leaves room for an LR schedule. Host unit test checks the update against an independent reference recurrence over 20 steps and the pure-decay (g=0) boundary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:28:23 +08:00
Gahow Wang	8565565647	docs: Phase T5 — tiny transformer Goal / Module Layout / Key Design Decisions (multi-head layout via reshape+transpose_3d01+split/merge_heads, embedding gather/scatter-add, x@W convention, causal mask, params API, overfit methodology) / Verification Methods with the dash5 results (grad-checks, overfit 2.82->0.004, PyTorch parity). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:09:30 +08:00
Gahow Wang	603c85e1e0	model: silence torch parity warning (read loss before backward) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:09:30 +08:00
Gahow Wang	3366f30c4d	model: PyTorch parity harness (weight dump + equivalent torch model) parity_dump.rs (#[ignore] fixture generator) dumps the model's exact weights, ids, forward logits, loss, and per-param grads after one backward. parity.py rebuilds the IDENTICAL model in PyTorch (same x@W convention, RoPE rotate_half pos=row, RMSNorm, SwiGLU, causal SDPA), runs fwd+bwd, and compares logits + every grad within rtol. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:07:30 +08:00
Gahow Wang	e3912c2380	model: tiny RoPE+RMSNorm+SwiGLU transformer + overfit test New crate xtrain-model: a from-scratch decoder built entirely from the autodiff op set. - Config (tiny: dim=32, 2 layers, 2 heads, head_dim=16, ffn=64). - TinyTransformer: embedding -> N x {pre-RMSNorm -> multi-head causal attention (RoPE, additive causal mask, per-head SDPA) -> residual; pre-RMSNorm -> SwiGLU MLP -> residual} -> final RMSNorm -> LM head. x@W weight convention (engine GEMM is plain A@B); dim=n_heads*head_dim. - params()/zero_grad-able leaves for the optimizer; param_to_host export. - overfit test: char-level bring-up (embedded text -> vocab -> shifted targets), minimal hand-written GD (p -= lr*grad) memorises one fixed batch -> loss ~0 + greedy argmax matches targets. End-to-end fwd+bwd correctness signal. Gated #![cfg(not(no_cuda))]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:20 +08:00
Gahow Wang	0acfa5df11	ops: grad-check the T5 structural ops Finite-diff grad-checks (same L=sum(W∘out) harness as autograd.rs) for embedding (incl. repeated ids), reshape, transpose_3d01, transpose_2d, and split/merge_heads round-trip. Gated #![cfg(not(no_cuda))]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:20 +08:00
Gahow Wang	7fb1a29057	ops: embedding/reshape/transpose/split-merge-heads fwd+bwd Phase T5 structural ops on top of the T4 set, needed to assemble the tiny transformer: - embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward (atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi. - reshape: contiguous metadata-only view (Tensor::reshape), no kernel. - transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel). - autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus split_heads (->Vec<Var>) / merge_heads for per-head attention. - tape: Var::zero_grad + set_value so a hand-written GD step can update params and clear grads between steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:09 +08:00
Gahow Wang	777f3c7949	docs: Phase T4 — autograd engine Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:53:55 +08:00

63 Commits