154 Commits

Author SHA1 Message Date
25b032445d train: real batched step (drop loop+SUM)
Feed a real batch of B sequences as ONE batched forward/backward, replacing the
"loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows
is already the batch-mean loss, so backward yields the batch-mean gradient
directly → clip pre-scale = 1.0.

DDP stays equivalent: each rank runs one batched forward over its b_local =
B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average
(sum across ranks /world) = Σ_global/B_global = global batch-mean → clip
pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way.
DDP loss matches single-GPU to 5.7e-7; cross-rank params are bit-identical (0.0).
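
The mean-of-local-means argument above can be checked on the host with plain
arithmetic. This is a minimal sketch, not the trainer's code: per-sequence
gradients are stood in for by scalars, every rank is assumed to hold the same
number of sequences, and the names `mean` / `ddp_mean` are hypothetical.

```rust
/// Mean of a slice (stand-in for a batch-mean gradient).
pub fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// DDP-style reduction: each rank averages its own shard, then the per-rank
/// means are summed and divided by `world`.
/// Assumption: `per_seq_grads.len()` is a multiple of `world`. Contiguous
/// chunks are used for brevity; the result does not depend on which
/// sequences land on which rank.
pub fn ddp_mean(per_seq_grads: &[f64], world: usize) -> f64 {
    let b_local = per_seq_grads.len() / world;
    let local_means: Vec<f64> = per_seq_grads.chunks(b_local).map(mean).collect();
    local_means.iter().sum::<f64>() / world as f64
}
```

With equal shard sizes, `ddp_mean(&g, world)` agrees with `mean(&g)`, which is
why no extra pre-scale is needed after the all-reduce average.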

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:33 +08:00
5353b38402 model: batched forward [B,S]
forward_batched(ids[B*S], batch)/loss_batched: run B equal-length sequences as
ONE forward over flattened [B*S] ids, so every linear is one big [B*S,dim] GEMM.
Attention reshapes to [B*nh,S,hd], runs the fused batched causal SDPA (per-seq
mask + RoPE period=S, no cross-sequence attention), writes back [B*S,dim]. The
old per-(batch,head) loop + host-round-tripping split/merge_heads + the additive
causal_mask leaf are gone. forward(ids[seq]) is now forward_batched(ids,1), so
the sampler / inference path (batch=1) is unchanged.

+batched_ids_tensor helper. New batched.rs test: batched forward == looped
single-sequence (logits identical 0.0, grads 6.4e-4, loss identical). PyTorch
parity now exercises B>1 (B=2,S=4): loss 5e-8, logits 6.9e-6, all 25 param
grads within rtol — verifying per-seq RoPE position + per-seq causal masking.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:25 +08:00
7821bd9c34 autograd: batch dim for ops (flatten linears, batched attention)
Add the batched-forward primitives. Linears/norms/elementwise/embedding/CE
already act on flat [rows,dim], so they work unchanged on [B*S,dim]; only
attention + RoPE need sequence awareness:

- RoPE: kernel takes a `period` (= seq len) so position = row % period, i.e.
  per-sequence position on a flattened batch (with period == total tokens this
  is the single-sequence case).
- Fused batched causal attention: new `Tensor::attention`/`attention_backward`
  + ops node, running QKᵀ and PV as cublasSgemmStridedBatched over the B*nh
  (sequence,head) blocks (new sgemm_strided_batched binding) and a causal
  softmax kernel (scale + per-row causal mask inline) — the whole attention is
  3 launches regardless of B*nh, no per-head/per-seq loop, no host round-trip.
- transpose_4d12 ([B,S,nh,hd] <-> [B,nh,S,hd]) to lay out the batched heads.
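
The index math in the list above can be sketched on the host. This is an
illustrative reference only, assuming contiguous row-major buffers; the
function names other than `transpose_4d12` are hypothetical, and the real
kernels run on the GPU.

```rust
/// RoPE position of a row in a flattened [B*S, dim] batch: positions restart
/// at 0 every `period` (= sequence length) rows.
pub fn rope_position(row: usize, period: usize) -> usize {
    row % period
}

/// Index of the sequence a flattened row belongs to.
pub fn sequence_of(row: usize, period: usize) -> usize {
    row / period
}

/// Host reference for the layout swap [B, S, nh, hd] -> [B, nh, S, hd].
pub fn transpose_4d12(x: &[f32], b: usize, s: usize, nh: usize, hd: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; x.len()];
    for bi in 0..b {
        for si in 0..s {
            for hi in 0..nh {
                for di in 0..hd {
                    let src = ((bi * s + si) * nh + hi) * hd + di;
                    let dst = ((bi * nh + hi) * s + si) * hd + di;
                    out[dst] = x[src];
                }
            }
        }
    }
    out
}
```

Setting `period` to the total row count makes `rope_position(row, period)`
equal to `row`, which is the single-sequence behaviour.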

grad-checks: new batched-rope, transpose_4d12, batched-attention dQ/dK/dV all
pass finite-diff (attn dK 1.5e-2, dQ 7.5e-3, dV 2.9e-4; rest tighter) alongside
the existing 12.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:15 +08:00
d2a585c5cb docs: KI-1 re-diagnosed in v3 — larger batch does NOT fix DDP weak scaling
v3 tested the documented mitigation (raise global_batch to amortize the
per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L,
seq256:

  global_batch 32 (8/rank)  → 3163 tok/s
  global_batch 256 (64/rank)→ 3200 tok/s   (8× batch, +1.2%, within noise)

8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce
is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is
launch-bound: the single-sequence model design (each sequence its own tiny
forward/backward, per-op kernel launches) starves the GPU, and batching only
adds proportionally more serial launches. Real fix is batched (multi-sequence)
forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob.
Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is
fixed). KI-1 kept Open with the corrected root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 23:20:26 +08:00
bf679f6f1f docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71)
Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h
SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M
tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4
RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58
and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the
dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts
(3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat
(global batch too small → all-reduce dominates) → links docs/known-issues
KI-1; v3 proposal applies KI-1's fix (much larger global batch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:38:31 +08:00
c87a0bc44e docs: known-issues / perf backlog — KI-1 DDP weak scaling at small global batch
Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1
single-GPU. Root cause + proposed fixes recorded; also consolidates deferred
T7 items (bf16, activation recompute) and the large-vocab modeling note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:56:58 +08:00
7090b475fb train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch)
The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch
(scaling-ladder rung), the cached token-id stream (`load_cached`), held-out
val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 owns the
val corpus and runs the no-grad eval / writes the best checkpoint (params are
bit-identical across ranks). The eval/checkpoint logic is reused from
`xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated.

- DdpConfig gains eval_every / eval_batches / ckpt_path.
- train_rank takes `valid: Option<&Corpus>` and returns DdpResult
  (losses + evals + best_val); launch threads the val corpus to rank 0 only.
- bin/train_ddp reworked to the bin/train CLI (positional tokenizer/corpus +
  --dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/
  --val-tokens/--eval-every/--ckpt), reusing the u16 cache.
- DDP correctness test updated to the new signatures (semantics unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:34:40 +08:00
264660527f docs: run v1 — TinyStories full, dim256
docs/runs/01-v1-tinystories-dim256.md + docs/runs/README.md comparison table.
v1: full TinyStories train (468.3M tok, u16-cached) + dim256/8L (core 8.39M).
Same-held-out-set val loss v0 3.8050 → v1 2.5847 (−1.22); v1 samples coherent
stories vs v0's "mommy's mommy's mommy" loop; exports + serves token-identical
in xserv. Single RTX 5090, ~25.9 min, ~3310 tok/s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:09:46 +08:00
ec8114ecbc train: --eval-ckpt eval-only mode (v0-vs-v1 same-set val loss)
Expose eval_loss() and add a --eval-ckpt <path> branch to bin/train: load an
existing checkpoint into a model of the given arch and score it on the held-out
val split, then exit. Lets v0 and v1 be measured on the identical validation set
(the acceptance metric) without a separate eval binary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:44:40 +08:00
e44e50ef78 data: full TinyStories + tokenized-id cache, val loss, CLI arch
- Corpus::load_cached: tokenize the (large) corpus ONCE, cache the id stream to
  <corpus>.u16.bin (gpt2 vocab 50257 < 65536 → exact u16), read cache on reruns.
- Corpus::split_tail: hold out a tail slice as a validation corpus.
- train(): take an optional valid corpus + eval_every/eval_batches; periodic
  deterministic val-loss eval that checkpoints the BEST val model; returns
  TrainResult{train_losses, evals, best_val}. T6 fixed-cadence path preserved.
- bin/train + bin/export_safetensors: read architecture (--heads/--head-dim/
  --layers/--ffn) + opt knobs (--steps/--batch/--seq/--max-lr/--val-tokens/
  --eval-every) from CLI flags; defaults reproduce the v0-baseline tiny config.
- gitignore the multi-GB corpus + *.u16.bin caches + *.ckpt (dash5-only).
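
The u16 id cache described in the first bullet can be sketched as a byte
round-trip. This is a minimal sketch under assumptions: little-endian byte
order and a header-less file layout are guesses, and `encode_ids_u16` /
`decode_ids_u16` are hypothetical names, not the `Corpus::load_cached`
internals.

```rust
/// Pack token ids as 2 bytes each. Returns None if any id does not fit in
/// u16, so the cache is only written when the packing is exact.
/// Assumption: little-endian byte order.
pub fn encode_ids_u16(ids: &[u32]) -> Option<Vec<u8>> {
    let mut bytes = Vec::with_capacity(ids.len() * 2);
    for &id in ids {
        if id > u16::MAX as u32 {
            return None;
        }
        bytes.extend_from_slice(&(id as u16).to_le_bytes());
    }
    Some(bytes)
}

/// Inverse of `encode_ids_u16`; a trailing odd byte is ignored.
pub fn decode_ids_u16(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]) as u32)
        .collect()
}
```

Because the largest gpt2 id (50256) is below 65536, every id survives the
round-trip unchanged at half the size of a u32 stream.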

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:34:48 +08:00
15f1e526c7 train: parameterize model size (scaling ladder)
Add Config::from_arch(vocab, n_heads, head_dim, n_layers, ffn) so the model
size is a tunable rung instead of a hardcoded tiny config, and Config::core_params()
(num_params minus the two vocab×dim tables) — the figure the ladder is sized
against (the 50257-vocab embed+lm_head adds 2×vocab×dim params, about 25.7M at
dim 256, that are not capacity).
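
The split between core and vocab parameters can be sketched as follows. The
per-layer breakdown is an assumption (bias-free wq/wk/wv/wo, per-head
q_norm/k_norm, a three-matrix SwiGLU MLP, two RMSNorms, plus one final
RMSNorm), and `Arch` is a hypothetical stand-in for the real `Config`. Under
that assumption the sketch reproduces the 28.32M core / 66.92M total quoted
elsewhere in this log for the dim384/12L run.

```rust
/// Hypothetical stand-in for the model config.
pub struct Arch {
    pub vocab: usize,
    pub n_heads: usize,
    pub head_dim: usize,
    pub n_layers: usize,
    pub ffn: usize,
}

impl Arch {
    pub fn dim(&self) -> usize {
        self.n_heads * self.head_dim
    }

    /// Parameters excluding the two vocab x dim tables (assumed layout).
    pub fn core_params(&self) -> usize {
        let d = self.dim();
        let attn = 4 * d * d; // wq, wk, wv, wo
        let qk_norm = 2 * self.head_dim; // q_norm, k_norm
        let mlp = 3 * d * self.ffn; // SwiGLU gate, up, down
        let norms = 2 * d; // two pre-RMSNorm gains per layer
        self.n_layers * (attn + qk_norm + mlp + norms) + d // + final RMSNorm
    }

    /// Embedding table plus LM head.
    pub fn vocab_params(&self) -> usize {
        2 * self.vocab * self.dim()
    }

    pub fn num_params(&self) -> usize {
        self.core_params() + self.vocab_params()
    }
}
```

The vocab term grows with `dim` but not with depth or `ffn`, which is why the
ladder is sized against `core_params()` instead of the total.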

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:34:39 +08:00
8981cf7982 docs: T9 verification results (xserv == xtrain, dash5)
Capture the closed-loop run: train (loss 10.84->3.59) -> export (47 tensors,
BF16) -> xserv dump-logits + greedy. Top-1 + top-11 token order identical,
logits within ~1e-2 (BF16-vs-f32 drift), greedy generation token-for-token
identical across two prompts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:37:46 +08:00
e246c3bec2 export: dump_logits bin for xserv-vs-xtrain comparison
xtrain-side top-k next-token logit dump (f32 forward, same model/config/ckpt
as the exporter) mirroring xserv's dump-logits, so the closed-loop check can
compare both sides numerically for the same prompt + weights.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:36:41 +08:00
18c2229b4b docs: Phase T9 — export to xserv
Architecture diff table (xtrain TinyTransformer vs xserv qwen3.rs), the
QK-norm structural decision + BF16 acceptance criterion, the tensor-name +
layout mapping table, and the dash5 closed-loop verification recipe.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:33:32 +08:00
1c76573cb4 export: safetensors + config.json for xserv qwen3
New bin export_safetensors: load an xtrain checkpoint, map every param to its
HF Qwen3 tensor name, transpose 2D projection weights [in,out]->[out,in]
(1D norms + [vocab,dim] embed/lm_head kept), cast to BF16 (xserv's qwen3
forward is BF16-only), and write config.json + model.safetensors + a copy of
the gpt2 tokenizer.json. The model is sized exactly as in bin/train.rs. Uses
safetensors 0.5 to
match xserv. GPU body gated behind not(no_cuda).
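
The two per-tensor transforms named above ([in,out] -> [out,in] transpose and
the BF16 cast) can be sketched on the host. This is a minimal sketch, not the
exporter's code: round-to-nearest-even is an assumed rounding mode (the
message only says "cast to BF16"), and both function names are hypothetical.

```rust
/// Transpose a row-major [rows, cols] matrix into row-major [cols, rows].
pub fn transpose_rows_cols(w: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; w.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = w[r * cols + c];
        }
    }
    out
}

/// Convert an f32 to BF16 bits (the upper 16 bits of the f32 pattern).
/// Assumption: round-to-nearest-even on the dropped 16 bits.
pub fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep NaN a NaN by forcing a mantissa bit that survives truncation.
        return ((bits >> 16) as u16) | 0x0040;
    }
    let round = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(round) >> 16) as u16
}
```

BF16 keeps the f32 exponent but only 7 explicit mantissa bits, which is
consistent with the roughly 1e-2 logit drift reported for the closed-loop
check.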

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:33:26 +08:00
7a4f69e430 model: add per-head QK-norm (Qwen3-compat) for xserv export
xserv's Qwen3 forward unconditionally applies per-head RMSNorm to Q and K
(q_norm/k_norm, shape [head_dim]) before RoPE — even gamma=1 is a real RMS
divide, not identity. xtrain never had this, so an exact xserv<->xtrain loop
was structurally impossible. Add it (reusing the 2D rms_norm op on the
[seq*nh, hd] head rows, inserted between reshape and rope to mirror
qwen3.rs's order) so the trained model is genuinely Qwen3-compatible.
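
Why gamma=1 is still not an identity can be seen from a host sketch of RMSNorm
over one head row. This is illustrative only: the `eps` placement inside the
square root is an assumption, and `rms_norm_row` is a hypothetical name.

```rust
/// RMSNorm of one row: x / sqrt(mean(x^2) + eps) * gamma.
/// Assumption: eps is added inside the square root.
pub fn rms_norm_row(x: &[f32], gamma: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gamma).map(|(v, g)| v * inv_rms * g).collect()
}
```

For the row `[3.0, 4.0]` with gamma all ones, the output is roughly
`[0.8485, 1.1314]`: the row is rescaled to unit RMS even though the gain does
nothing.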

params() inserts q_norm,k_norm after wv; num_params() counts them; the
PyTorch parity refs (parity.py / adamw_parity.py) + their name lists add the
same step so the dumps stay self-consistent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:33:19 +08:00
ad82e8bf92 dist: lengthen scaling bench so NCCL init amortizes
30-step bench charged the one-time NCCL init + 4 model builds (present at world=4,
absent at world=1) against the wall clock, understating steady-state scaling
(in-loop tok/s already showed ~53k at 4 GPUs). Bump to 150 steps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:18:23 +08:00
818f76a18f dist: drop unused import; relax DDP-vs-single-GPU param tolerance
dash5 verify: loss trajectory matches single-GPU to max_rel 1.16e-7 and
cross-rank params are bit-identical (0.0), but DDP-vs-single-GPU per-param rel
diff is ~2.8e-3 after 20 AdamW steps — expected, since the two differ only in
gradient summation order (fp add isn't associative) and that rounding compounds.
Bump check (c) 1e-3 -> 1e-2 (a/b stay tight). Also remove an unused DType import.
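
The non-associativity mentioned above is easy to reproduce in f32. The sketch
below is a host-only illustration with hypothetical function names; it sums
the same three values in two orders.

```rust
/// Sum left to right in f32.
pub fn sum_left_to_right(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Sum the same values right to left in f32.
pub fn sum_right_to_left(xs: &[f32]) -> f32 {
    xs.iter().rev().fold(0.0f32, |acc, &x| acc + x)
}
```

For `[1e8, -1e8, 1.0]` the left-to-right sum is 1.0, while right-to-left first
forms `1.0 + -1e8`, which rounds back to `-1e8` (f32 spacing near 1e8 is 8),
and ends at 0.0. Gradient sums reduced in a different order differ in the same
way, only at a much smaller scale.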

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:17:31 +08:00
0131f05b26 docs: Phase T8 — distributed data parallel
Design doc for the NCCL DDP path: comm bootstrap (rank-0 UniqueId + grouped
CommInitRank), thread-per-GPU launch model (Var is !Send), all-reduce-then-
local-step scheme (in-place fp32 AllReduce on .grad() + /world, each rank steps
its own GpuAdamW), why params stay consistent (NCCL bit-identical reduce + same
init/state), batch sharding math vs single-GPU, verification plan + scaling
table. Lists TP/PP/ZeRO/bf16-comm as out-of-scope follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:15:49 +08:00
cf5e3987df dist: multi-rank launcher + ddp acceptance test
bin/train_ddp: spawn one thread per visible GPU (CUDA_VISIBLE_DEVICES selects
the set), NCCL all-reduce gradients each step, train the tiny transformer on
TinyStories; doubles as the throughput driver (prints global tok/s). no_cuda
build keeps a stub main.

tests/ddp_correctness: (1) 2-rank DDP vs single-GPU over the same synthetic data
-> loss trajectory max_rel < 1e-3, cross-rank params bit-identical (==0.0), DDP
vs single-GPU params rel < 1e-3; (2) 1/2/4-GPU throughput table on a fixed
per-GPU workload. Gated #[cfg(not(no_cuda))], auto-skips with < 2 GPUs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:15:41 +08:00
163f567c80 dist: ddp all-reduce + sharded batch
DDP training step (train_rank) on top of DdpContext: each rank advances the
SAME RNG, draws the whole global batch, and runs forward+backward only on its
shard (i % world == rank) so the union over ranks is the single-GPU batch in the
same order. After backward, all-reduce-average the device grads, then finish the
mean with clip(pre_scale = 1/b_local) -> Σ_global/B_global, identical to the
single-GPU clip(1/B). Each rank then runs its own GpuAdamW.step; same init +
same averaged grad + same optimizer state keep params bit-identical across ranks.
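
The strided sharding rule (i % world == rank) can be sketched as a pure index
computation. `shard_indices` is a hypothetical helper, not part of
`train_rank`.

```rust
/// Indices of the global batch that `rank` processes under the
/// `i % world == rank` rule.
pub fn shard_indices(b_global: usize, world: usize, rank: usize) -> Vec<usize> {
    (0..b_global).filter(|i| i % world == rank).collect()
}
```

Each index lands on exactly one rank, so the union over ranks is the full
global batch and no sequence is processed twice.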

Adds a deterministic build_model (same LCG init as bin/train) shared by ranks +
baseline, a per-step loss all-reduce for the reported global-mean loss, and the
thread-per-GPU launch() helper (thread::scope; Var graph is !Send so each rank
builds its model thread-locally, only UniqueId/config/&Corpus cross threads).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:15:29 +08:00
e27df50ca9 dist: nccl ffi + comm bootstrap
New crate xtrain-distributed (mirrors xserv-distributed): hand-written NCCL
FFI (GetUniqueId / CommInitRank / AllReduce / CommDestroy / Group{Start,End},
ncclUniqueId passed by value per the NCCL ABI) and a safe DdpContext wrapper —
rank 0 mints the UniqueId, every rank inits its communicator under a group, and
all_reduce_average_grads in-place AllReduce(sum)s each param's .grad() device
buffer then scales by 1/world (reuses T7's scale_inplace kernel). AllReduce runs
on the null stream so it orders with the model's kernels (no extra barrier).

build.rs follows the per-crate convention: no nvcc -> no_cuda cfg (crate
compiles to empty, cargo check passes host-side); with nvcc, links -lnccl
-lcudart like xserv-distributed's build.rs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:14:56 +08:00
5e8add2a41 docs: Phase T7 — performance
Design doc for the T7 fp32-preserving speedups: cuBLAS matmul fwd/bwd
(row-major⟺col-major layout), GPU AdamW + GPU grad-norm (no per-step
param/grad roundtrip), drop per-op sync + device memset. Includes the
verification table (regression suite green + tok/s 2770→8220 ~3x), the
deferred bf16/recompute follow-up rationale, and the T8 all-reduce note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:00:29 +08:00
a842e432b5 perf: streams / drop per-op sync
Default-stream kernels run in order and every host read goes through a
stream-ordered cudaMemcpy (to_device), so the per-op cudaDeviceSynchronize
after each kernel was pure overhead — remove all 21 in tensor.rs. Host
data is still correctly ordered by the D2H memcpy that reads it.

Also zero op-output buffers with cudaMemset (device-side, async) instead of
a blocking H2D memcpy of a host zero buffer on every allocation — that
copy was itself a hidden per-op sync point.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:56:17 +08:00
8070c1949a perf: make xtrain-cuda a regular dep of xtrain-optim (GPU AdamW)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:53:52 +08:00
b0e397ca81 perf: GPU AdamW + grad-norm
Eliminate the per-step GPU↔host roundtrip of every parameter/gradient.

- optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum
  (block-reduced global grad sum-of-squares), scale_inplace.
- GpuAdamW: device m/v state per param; step launches the kernel reading
  each param's .grad() and rewriting the param buffer in place — no host
  roundtrip. Host AdamW kept as the torch-parity reference.
- clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm
  comes back), in-place rescale of grads by pre_scale·clip_factor.
- train_loop: use GpuAdamW + clip_grad_norm_gpu.
- test: GPU AdamW vs host reference parity (max abs err < 1e-6).
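
The clip step in the list above reduces to one scalar computation once the
global sum of squares is known. This is a host sketch under assumptions: the
norm is taken after applying `pre_scale`, no epsilon is added, and the names
`sumsq` / `clip_scale` are hypothetical and not the GPU kernels.

```rust
/// Global sum of squares over all parameter gradients.
pub fn sumsq(grads: &[&[f32]]) -> f32 {
    grads.iter().flat_map(|g| g.iter()).map(|v| v * v).sum()
}

/// Returns (norm, scale): the global L2 norm of the pre-scaled gradients and
/// the single factor `pre_scale * clip_factor` to multiply every gradient by.
/// Assumption: the norm is measured after `pre_scale` is applied.
pub fn clip_scale(sumsq: f32, pre_scale: f32, max_norm: f32) -> (f32, f32) {
    let norm = pre_scale * sumsq.sqrt();
    let clip_factor = if norm > max_norm { max_norm / norm } else { 1.0 };
    (norm, pre_scale * clip_factor)
}
```

Only `sumsq` has to cross from device to host; the rescale itself is one
in-place multiply per gradient buffer.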

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:53:09 +08:00
0e5c7d22e2 perf: cuBLAS matmul fwd/bwd
Route Tensor::matmul and matmul_backward through cuBLAS Sgemm instead of
the hand-written tiled kernel. fp32 → same GEMM up to rounding order, so
the T3 cuBLAS tolerance and downstream grad-checks are preserved.

- cublas.rs: thread-local persistent handle + row-major sgemm helper with
  transpose flags (col-major⟺row-major as the T3 oracle does).
- matmul_backward: dA/dB via cuBLAS OP_T, dropping the two transpose
  kernels + their allocations the T3 version ran.
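
The row-major/col-major trick referred to above can be demonstrated on the
host with a plain column-major GEMM standing in for cuBLAS. This is a sketch
of the identity only; `gemm_col_major` and `matmul_row_major` are hypothetical
names and no cuBLAS call is made.

```rust
/// Column-major reference GEMM: C (m x n) = A (m x k) * B (k x n).
/// Stands in for a BLAS-style sgemm without transposes.
pub fn gemm_col_major(m: usize, n: usize, k: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[p * m + i] * b[j * k + p];
            }
            c[j * m + i] = acc;
        }
    }
    c
}

/// Row-major C (m x n) = A (m x k) * B (k x n) using only the col-major GEMM.
/// A row-major matrix read as col-major is its transpose, so we compute
/// C^T (n x m) = B^T (n x k) * A^T (k x m) by swapping the operands.
pub fn matmul_row_major(m: usize, n: usize, k: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    gemm_col_major(n, m, k, b, a)
}
```

No data is moved: swapping the operand order and the m/n dimensions is enough,
which is why the wrapper needs no explicit transpose kernels for the forward
pass.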

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:48:35 +08:00
5df1d4d57b test: resolve real_training corpus default via CARGO_MANIFEST_DIR
cargo runs tests with cwd = crate dir, so the bare relative default
data/tinystories-valid-3mb.txt didn't resolve. Anchor it to the repo root via
CARGO_MANIFEST_DIR so the test runs out of the box (still overridable with
XTRAIN_CORPUS).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:41:12 +08:00
2f8118fda9 test: tighten AdamW parity (f32 reference, 10 steps, allclose tol)
The loss trajectory already matched torch.optim.AdamW (worst relerr ~2e-4),
but the float64 torch reference diverged per-weight from the f32 GPU training
after the model memorised the batch (flat region: weights underdetermined,
loss identical). Fixes: run the torch reference in float32 (match engine
precision), shorten to 10 steps (weights still well-determined), and compare
final params with an allclose-style rtol+atol metric (a pure relative metric is
misleading on near-zero weights).
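
The difference between the two metrics can be sketched as follows. The
`allclose` check uses the usual |a - e| <= atol + rtol * |e| form; both
function names are hypothetical and the tolerances in the note are examples,
not the test's values.

```rust
/// allclose-style check: every element within atol + rtol * |expected|.
pub fn allclose(actual: &[f32], expected: &[f32], rtol: f32, atol: f32) -> bool {
    actual.len() == expected.len()
        && actual
            .iter()
            .zip(expected)
            .all(|(a, e)| (a - e).abs() <= atol + rtol * e.abs())
}

/// Pure per-element relative error, for contrast.
pub fn max_rel_err(actual: &[f32], expected: &[f32]) -> f32 {
    actual
        .iter()
        .zip(expected)
        .map(|(a, e)| (a - e).abs() / e.abs())
        .fold(0.0f32, f32::max)
}
```

A weight of 1e-9 that comes back as 2e-9 has a relative error of about 100%,
yet passes `allclose` with, say, rtol=1e-3 and atol=1e-6, because the absolute
floor absorbs differences that are pure rounding noise.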

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:34:18 +08:00
29b4d30b6c docs: Phase T6 — training loop
Design doc for the T6 training stack: Goal / Module Layout / Key Design
Decisions (AdamW math + decoupled WD, LR schedule, global-norm grad clip with
batch averaging, checkpoint format, data pipeline + xserv tokenizer reuse,
sampler) / Verification Method (AdamW parity, checkpoint round-trip, real training, host
unit tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:30:14 +08:00
22b7434b23 test: AdamW PyTorch parity + checkpoint round-trip + real training
Acceptance tests (GPU-gated not(no_cuda), run on dash5):
- adamw_parity_dump.rs + adamw_parity.py: build the tiny model with fixed init,
  run N AdamW steps on a fixed batch, dump the loss trajectory + final params;
  the Python side rebuilds the identical model and runs torch.optim.AdamW with
  matched lr/wd/betas/eps, comparing trajectory + final params within rtol.
- checkpoint_roundtrip.rs: train a few steps, save, load into a fresh model with
  a DIFFERENT init, assert identical logits/loss on a fixed input.
- real_training.rs (#[ignore], --release): train on TinyStories for a bounded
  budget; assert loss drops substantially and print greedy samples.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:30:06 +08:00
77a82bfeee train: loop + checkpoint save/load + sampler + train binary
Training loop (train_loop.rs): sample batch_size sequences, forward loss +
backward (tape SUMs grads), clip_grad_norm with ×1/batch averaging, AdamW step
with scheduled lr, zero_grad; logs loss/lr/gnorm/tok-s and checkpoints
periodically; returns the loss trace.

Checkpoint (checkpoint.rs): flat little-endian dump of params() in order
(magic/version/count + per-param ndim/dims/f32 data); load_into validates and
overwrites a matching model's params via set_value (exact f32 round-trip).

Sampler (sample.rs): autoregressive greedy / temperature generation — re-runs
forward on the growing prefix (model is single-sequence, RoPE pos=row).

bin/train.rs: end-to-end entry — load tokenizer+corpus, train a tiny 4-layer
model for a bounded budget, checkpoint, print samples. no_cuda stub keeps it
buildable on a GPU-less host.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:29:58 +08:00
7d84a64f5c data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip
New xtrain-train crate scaffold. Data pipeline reuses xserv's from-scratch
GPT-2/Qwen BPE via a path-dep (../../../xserv/crates/xserv-tokenizer, resolves
on both ~/projects and dash5 /opt/wjh/projects): Corpus::load tokenizes the
corpus into one id stream and samples fixed-length (input, target) next-token
windows (LCG-seeded, reproducible). Trims a range-downloaded file to whole
stories (<|endoftext|> boundaries).

Also the host-only training math: LrSchedule (linear warmup + cosine decay)
and global L2 grad-norm + clip scale, each with a local unit test.
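
A linear-warmup plus cosine-decay schedule can be sketched in a few lines.
This is not the `LrSchedule` implementation: the `(step + 1) / warmup` ramp,
the `min_lr` floor, and the function name `lr_at` are all assumptions made for
illustration.

```rust
/// Learning rate at `step` for linear warmup followed by cosine decay.
/// Assumptions: warmup ramps as (step + 1) / warmup, and decay ends at
/// `min_lr` when step reaches `total`.
pub fn lr_at(step: usize, warmup: usize, total: usize, max_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        return max_lr * (step + 1) as f64 / warmup as f64;
    }
    let span = total.saturating_sub(warmup).max(1) as f64;
    let progress = ((step - warmup) as f64 / span).min(1.0);
    min_lr + 0.5 * (max_lr - min_lr) * (1.0 + (std::f64::consts::PI * progress).cos())
}
```

The schedule is a pure function of the step index, which is what lets it be
unit-tested on a GPU-less host.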

Corpus: data/tinystories-valid-3mb.txt — first ~3MB of TinyStories-valid
(fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution
noted: a real TinyStories subset, not the full set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:29:32 +08:00
f22429f5b8 optim: hand-written AdamW (decoupled weight decay + bias correction)
New xtrain-optim crate. AdamW with per-param m/v moments keyed by params()
index, global bias correction, and decoupled weight decay (matches
torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers,
unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips
each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr
argument leaves room for an LR schedule.

Host unit test checks the update against an independent reference recurrence
over 20 steps and the pure-decay (g=0) boundary.
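
The update rule can be sketched on flat host buffers. This follows the
standard decoupled-weight-decay AdamW recurrence as documented for
torch.optim.AdamW; `AdamWConfig` and `adamw_step_host` are hypothetical names
and this is not the crate's `step_host`.

```rust
/// Hyperparameters for the sketch (hypothetical struct).
pub struct AdamWConfig {
    pub lr: f32,
    pub beta1: f32,
    pub beta2: f32,
    pub eps: f32,
    pub weight_decay: f32,
}

/// One AdamW step on flat buffers; `t` is the 1-based step count.
pub fn adamw_step_host(
    cfg: &AdamWConfig,
    t: i32,
    p: &mut [f32],
    g: &[f32],
    m: &mut [f32],
    v: &mut [f32],
) {
    let bc1 = 1.0 - cfg.beta1.powi(t); // bias correction, first moment
    let bc2 = 1.0 - cfg.beta2.powi(t); // bias correction, second moment
    for i in 0..p.len() {
        // Decoupled weight decay: shrink the weight directly, not via g.
        p[i] *= 1.0 - cfg.lr * cfg.weight_decay;
        m[i] = cfg.beta1 * m[i] + (1.0 - cfg.beta1) * g[i];
        v[i] = cfg.beta2 * v[i] + (1.0 - cfg.beta2) * g[i] * g[i];
        let m_hat = m[i] / bc1;
        let v_hat = v[i] / bc2;
        p[i] -= cfg.lr * m_hat / (v_hat.sqrt() + cfg.eps);
    }
}
```

With g = 0 the moments stay at zero and the only change is the multiplicative
decay `p *= 1 - lr * wd`, which is the pure-decay boundary case mentioned
above.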

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:28:23 +08:00
8565565647 docs: Phase T5 — tiny transformer
Goal / Module Layout / Key Design Decisions (multi-head layout via
reshape+transpose_3d01+split/merge_heads, embedding gather/scatter-add,
x@W convention, causal mask, params API, overfit methodology) / Verification Method
with the dash5 results (grad-checks, overfit 2.82->0.004, PyTorch parity).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:09:30 +08:00
603c85e1e0 model: silence torch parity warning (read loss before backward)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:09:30 +08:00
3366f30c4d model: PyTorch parity harness (weight dump + equivalent torch model)
parity_dump.rs (#[ignore] fixture generator) dumps the model's exact
weights, ids, forward logits, loss, and per-param grads after one
backward. parity.py rebuilds the IDENTICAL model in PyTorch (same x@W
convention, RoPE rotate_half pos=row, RMSNorm, SwiGLU, causal SDPA),
runs fwd+bwd, and compares logits + every grad within rtol.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:07:30 +08:00
e3912c2380 model: tiny RoPE+RMSNorm+SwiGLU transformer + overfit test
New crate xtrain-model: a from-scratch decoder built entirely from the
autodiff op set.
- Config (tiny: dim=32, 2 layers, 2 heads, head_dim=16, ffn=64).
- TinyTransformer: embedding -> N x {pre-RMSNorm -> multi-head causal
  attention (RoPE, additive causal mask, per-head SDPA) -> residual;
  pre-RMSNorm -> SwiGLU MLP -> residual} -> final RMSNorm -> LM head.
  x@W weight convention (engine GEMM is plain A@B); dim=n_heads*head_dim.
- params()/zero_grad-able leaves for the optimizer; param_to_host export.
- overfit test: char-level bring-up (embedded text -> vocab -> shifted
  targets), minimal hand-written GD (p -= lr*grad) memorises one fixed
  batch -> loss ~0 + greedy argmax matches targets. End-to-end fwd+bwd
  correctness signal. Gated #![cfg(not(no_cuda))].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:05:20 +08:00
0acfa5df11 ops: grad-check the T5 structural ops
Finite-diff grad-checks (same L=sum(W∘out) harness as autograd.rs) for
embedding (incl. repeated ids), reshape, transpose_3d01, transpose_2d,
and split/merge_heads round-trip. Gated #![cfg(not(no_cuda))].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:05:20 +08:00
7fb1a29057 ops: embedding/reshape/transpose/split-merge-heads fwd+bwd
Phase T5 structural ops on top of the T4 set, needed to assemble the
tiny transformer:
- embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward
  (atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi.
- reshape: contiguous metadata-only view (Tensor::reshape), no kernel.
- transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel).
- autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus
  split_heads (->Vec<Var>) / merge_heads for per-head attention.
- tape: Var::zero_grad + set_value so a hand-written GD step can update
  params and clear grads between steps.
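
The embedding gather and its scatter-add backward can be sketched on the host.
This is a sequential reference with hypothetical function names; the CUDA
version needs atomics for the same accumulation because rows are processed in
parallel.

```rust
/// Forward: gather one `dim`-wide row of `table` per id.
pub fn embedding_forward(table: &[f32], dim: usize, ids: &[i32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(ids.len() * dim);
    for &id in ids {
        let start = id as usize * dim;
        out.extend_from_slice(&table[start..start + dim]);
    }
    out
}

/// Backward: scatter-add each output-row gradient into its table row, so an
/// id that appears several times accumulates all of its contributions.
pub fn embedding_backward(d_out: &[f32], dim: usize, ids: &[i32], vocab: usize) -> Vec<f32> {
    let mut d_table = vec![0.0f32; vocab * dim];
    for (row, &id) in ids.iter().enumerate() {
        let start = id as usize * dim;
        for j in 0..dim {
            d_table[start + j] += d_out[row * dim + j];
        }
    }
    d_table
}
```

With ids `[2, 0, 2]`, row 2 of the table gradient receives the sum of the
first and third output-row gradients, and row 1 stays zero.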

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:05:09 +08:00
777f3c7949 docs: Phase T4 — autograd engine
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:53:55 +08:00
e7ce504b1f ops: differentiable autograd nodes + per-op grad-check tests
ops.rs wraps each Tensor op as a Var node with its backward closure (forward
caches captured by move). swiglu = mul(silu(gate), up); attention is composed
(matmul+scale+softmax+matmul), no fused kernel. tests/autograd.rs grad-checks
every op via the L=sum(W∘out) template, plus a fan-out grad-accumulation test
(dL/dx=4x) and an end-to-end composed-attention grad-check (dQ/dK/dV). Adds
xtrain-cuda dev-dep for device selection in tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:53:55 +08:00
224f750ee4 autograd: tape engine + grad accumulation
Var = Rc<RefCell<VarNode>> on a define-by-run tape: value + optional grad +
parents + backward closure. backward() seeds a scalar loss, walks reverse
topo order, and pushes grads to parents. push_grad always SUMs into the grad
slot — the fan-out accumulation path T3 lacked. Per-crate build.rs emits the
no_cuda cfg (does not propagate); engine gated, grad_check stays host-only.
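
The tape design can be illustrated with a scalar toy. This is a heavily
simplified sketch, not the engine: values are f64 scalars instead of GPU
tensors, each edge stores a precomputed local derivative instead of a backward
closure, and `leaf` / `add` / `mul` are hypothetical helpers.

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::rc::Rc;

pub struct VarNode {
    pub value: f64,
    pub grad: f64,
    parents: Vec<(Var, f64)>, // (parent, d(self)/d(parent))
}
pub type Var = Rc<RefCell<VarNode>>;

fn node(value: f64, parents: Vec<(Var, f64)>) -> Var {
    Rc::new(RefCell::new(VarNode { value, grad: 0.0, parents }))
}
pub fn leaf(value: f64) -> Var {
    node(value, Vec::new())
}
pub fn add(a: &Var, b: &Var) -> Var {
    let (av, bv) = (a.borrow().value, b.borrow().value);
    node(av + bv, vec![(a.clone(), 1.0), (b.clone(), 1.0)])
}
pub fn mul(a: &Var, b: &Var) -> Var {
    let (av, bv) = (a.borrow().value, b.borrow().value);
    node(av * bv, vec![(a.clone(), bv), (b.clone(), av)])
}

// Depth-first post-order: parents are pushed before the nodes that use them.
fn topo(v: &Var, seen: &mut HashSet<*const RefCell<VarNode>>, order: &mut Vec<Var>) {
    if !seen.insert(Rc::as_ptr(v)) {
        return;
    }
    for (p, _) in v.borrow().parents.iter() {
        topo(p, seen, order);
    }
    order.push(v.clone());
}

/// Seed the loss with 1.0, walk reverse topological order, and SUM each
/// contribution into the parent's grad slot (the fan-out path).
pub fn backward(loss: &Var) {
    let mut order = Vec::new();
    topo(loss, &mut HashSet::new(), &mut order);
    loss.borrow_mut().grad = 1.0;
    for v in order.iter().rev() {
        let n = v.borrow();
        for (p, local) in n.parents.iter() {
            p.borrow_mut().grad += n.grad * local;
        }
    }
}
```

For L = x*x + x*x the leaf x is reached along four edges, and summing them
gives dL/dx = 4x. Overwriting instead of summing would keep only one edge's
contribution.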

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:44:17 +08:00
5aef3742d6 ops: transformer op fwd/bwd CUDA kernels + Tensor wrappers
add/mul/add_bias(+sum_rows)/rms_norm/silu/rope/softmax/cross_entropy,
each with its analytic backward, in csrc/ops/nn.cu (inlined warp/block
reductions). FFI declarations + nn.cu in build.rs (no_cuda gated). Tensor
gains the matching thin wrappers; DType grows I32 for cross-entropy targets.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:44:09 +08:00
88fbe0a85d gemm: realistic f32 tolerances in GEMM acceptance tests
Forward: compare via matrix relative error (max abs error / max|ref|)
instead of a per-element ratio, so near-zero outputs where two correct
f32 GEMMs differ only in rounding order don't inflate the metric.
Backward: L = sum(W∘C) is bilinear, so central differences are
truncation-free — use eps=1e-2 (sharper f32 resolution of the
difference) and atol=1e-3 to floor near-zero-gradient subtraction noise.
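
The forward metric described above can be sketched next to the per-element
ratio it replaces. Both function names are hypothetical; the sketch only
illustrates why the matrix-level metric is better behaved near zero.

```rust
/// Matrix relative error: max |actual - reference| / max |reference|.
pub fn matrix_rel_error(actual: &[f32], reference: &[f32]) -> f32 {
    let max_abs_err = actual
        .iter()
        .zip(reference)
        .map(|(a, r)| (a - r).abs())
        .fold(0.0f32, f32::max);
    let max_ref = reference.iter().map(|r| r.abs()).fold(0.0f32, f32::max);
    max_abs_err / max_ref
}

/// Per-element ratio, for contrast: max over elements of |a - r| / |r|.
pub fn max_elementwise_ratio(actual: &[f32], reference: &[f32]) -> f32 {
    actual
        .iter()
        .zip(reference)
        .map(|(a, r)| (a - r).abs() / r.abs())
        .fold(0.0f32, f32::max)
}
```

An output of 2e-6 against a reference of 1e-6, in a matrix whose largest entry
is 100, is off by about 100% per element but only about 1e-8 relative to the
matrix scale.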

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:28:57 +08:00
dde2fde297 docs: Phase T3 — GEMM fwd/bwd + finite-diff
Design doc covering the tiled forward, the dA/dB math + how transpose is
handled (materialize + reuse forward), the cuBLAS row-major reference, and
the finite-diff harness design + how T4 reuses it per-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:27:03 +08:00
1384044f27 gemm: GPU acceptance tests vs cuBLAS + finite-diff
Forward: hand-written tiled GEMM vs cuBLAS sgemm on random matrices
(square / non-tile-aligned rect / 256³), max relative error < 1e-3, using
the row-major⟺col-major identity to drive cuBLAS without explicit
transposes. Backward: scalar loss L = sum(W∘C) (so dC = W), dA/dB from
matmul_backward checked against the finite-diff harness. Gated behind
not(no_cuda).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:26:58 +08:00
08c88bf360 gemm: tiled F32 forward + transpose + backward (dA/dB)
Hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32 accumulate,
boundary-masked) plus an out-of-place transpose kernel. Wire both through
xtrain-cuda FFI (no_cuda-gated) and expose at the tensor level:
Tensor::matmul, transpose_2d, and matmul_backward computing
dA = dC·Bᵀ and dB = Aᵀ·dC by materializing transposes and reusing the
forward. Also declare cuBLAS sgemm FFI + link cublas, used only as a
correctness reference in tests.
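
The backward formulas dA = dC·Bᵀ and dB = Aᵀ·dC, computed by materializing
transposes and reusing the forward, can be sketched with a naive host GEMM.
The helper names are hypothetical and the naive loop stands in for the tiled
kernel.

```rust
/// Naive row-major matmul: (rows x inner) * (inner x cols).
pub fn matmul(a: &[f32], b: &[f32], rows: usize, inner: usize, cols: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; rows * cols];
    for i in 0..rows {
        for j in 0..cols {
            for p in 0..inner {
                c[i * cols + j] += a[i * inner + p] * b[p * cols + j];
            }
        }
    }
    c
}

/// Out-of-place transpose of a row-major (rows x cols) matrix.
pub fn transpose(x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; x.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = x[r * cols + c];
        }
    }
    out
}

/// For C = A (m x k) * B (k x n) and upstream gradient dC (m x n):
/// dA = dC * B^T (m x k), dB = A^T * dC (k x n).
pub fn matmul_backward(
    a: &[f32], b: &[f32], dc: &[f32], m: usize, k: usize, n: usize,
) -> (Vec<f32>, Vec<f32>) {
    let bt = transpose(b, k, n); // n x k
    let at = transpose(a, m, k); // k x m
    let da = matmul(dc, &bt, m, n, k);
    let db = matmul(&at, dc, k, m, n);
    (da, db)
}
```

Each gradient has the shape of the operand it belongs to, which is a quick
sanity check on the operand order.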

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:26:51 +08:00
9ca98efd98 autodiff: finite-diff gradient-check harness
New xtrain-autodiff crate with a reusable central finite-difference
gradient check: grad_check(x, shape, f, analytic_grad, cfg) compares an
analytic gradient against (f(x+ε) − f(x−ε)) / (2ε) per element with a relative
tolerance. Host-only (no CUDA): the loss closure owns any GPU work, so
T4's per-op backward checks can reuse it directly. Includes host unit
tests (sum(x²) grad 2x passes; a wrong grad is rejected).
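
A central finite-difference check of this kind can be sketched as below. It is
a simplified stand-in, not the crate's `grad_check`: it uses f64, drops the
`shape` and `cfg` arguments in favour of explicit `eps` / `rtol`, and the name
`grad_check_sketch` and the relative-error denominator are assumptions.

```rust
/// Compare `analytic` against (f(x+eps) - f(x-eps)) / (2*eps), one element
/// at a time. Returns true if every element is within `rtol`.
pub fn grad_check_sketch<F: Fn(&[f64]) -> f64>(
    x: &[f64],
    f: F,
    analytic: &[f64],
    eps: f64,
    rtol: f64,
) -> bool {
    let mut probe = x.to_vec();
    for i in 0..x.len() {
        probe[i] = x[i] + eps;
        let up = f(&probe);
        probe[i] = x[i] - eps;
        let down = f(&probe);
        probe[i] = x[i]; // restore before moving to the next element
        let numeric = (up - down) / (2.0 * eps);
        // Assumed metric: error relative to the larger magnitude, floored
        // to avoid dividing by zero.
        let denom = numeric.abs().max(analytic[i].abs()).max(1e-12);
        if (numeric - analytic[i]).abs() / denom > rtol {
            return false;
        }
    }
    true
}
```

The loss closure is the only place the function being differentiated appears,
so the same harness works whether the closure does host arithmetic or launches
GPU work.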

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:26:42 +08:00
fbd07a578c tensor: minimal Tensor crate over xtrain-cuda
New xtrain-tensor crate: DType (F32), shape/stride helpers, Arc-counted
host/device Storage with CPU↔CUDA copy, and a contiguous Tensor with
creation, host↔device transfer, and a scale() op driving the elementwise
kernel. GPU integration tests (host↔device roundtrip + scale correctness)
gated behind not(no_cuda); a thin build.rs emits the no_cuda cfg so the
kernel call sites compile out locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:13:06 +08:00