Commit Graph

22 Commits

Author SHA1 Message Date
7d84a64f5c data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip
New xtrain-train crate scaffold. Data pipeline reuses xserv's from-scratch
GPT-2/Qwen BPE via a path-dep (../../../xserv/crates/xserv-tokenizer, resolves
on both ~/projects and dash5 /opt/wjh/projects): Corpus::load tokenizes the
corpus into one id stream and samples fixed-length (input, target) next-token
windows (LCG-seeded, reproducible). Trims a range-downloaded file to whole
stories (<|endoftext|> boundaries).

Also the host-only training math: LrSchedule (linear warmup + cosine decay)
and global L2 grad-norm + clip scale, each with a local unit test.

Corpus: data/tinystories-valid-3mb.txt — first ~3MB of TinyStories-valid
(fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution
noted: a real TinyStories subset, not the full set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:29:32 +08:00
f22429f5b8 optim: hand-written AdamW (decoupled weight decay + bias correction)
New xtrain-optim crate. AdamW with per-param m/v moments keyed by params()
index, global bias correction, and decoupled weight decay (matches
torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers,
unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips
each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr
argument leaves room for an LR schedule.

Host unit test checks the update against an independent reference recurrence
over 20 steps and the pure-decay (g=0) boundary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:28:23 +08:00
8565565647 docs: Phase T5 — tiny transformer
Goal / Module Layout / Key Design Decisions (multi-head layout via
reshape+transpose_3d01+split/merge_heads, embedding gather/scatter-add,
x@W convention, causal mask, params API, overfit methodology) / 验证方法
with the dash5 results (grad-checks, overfit 2.82->0.004, PyTorch parity).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:09:30 +08:00
603c85e1e0 model: silence torch parity warning (read loss before backward)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:09:30 +08:00
3366f30c4d model: PyTorch parity harness (weight dump + equivalent torch model)
parity_dump.rs (#[ignore] fixture generator) dumps the model's exact
weights, ids, forward logits, loss, and per-param grads after one
backward. parity.py rebuilds the IDENTICAL model in PyTorch (same x@W
convention, RoPE rotate_half pos=row, RMSNorm, SwiGLU, causal SDPA),
runs fwd+bwd, and compares logits + every grad within rtol.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:07:30 +08:00
e3912c2380 model: tiny RoPE+RMSNorm+SwiGLU transformer + overfit test
New crate xtrain-model: a from-scratch decoder built entirely from the
autodiff op set.
- Config (tiny: dim=32, 2 layers, 2 heads, head_dim=16, ffn=64).
- TinyTransformer: embedding -> N x {pre-RMSNorm -> multi-head causal
  attention (RoPE, additive causal mask, per-head SDPA) -> residual;
  pre-RMSNorm -> SwiGLU MLP -> residual} -> final RMSNorm -> LM head.
  x@W weight convention (engine GEMM is plain A@B); dim=n_heads*head_dim.
- params()/zero_grad-able leaves for the optimizer; param_to_host export.
- overfit test: char-level bring-up (embedded text -> vocab -> shifted
  targets), minimal hand-written GD (p -= lr*grad) memorises one fixed
  batch -> loss ~0 + greedy argmax matches targets. End-to-end fwd+bwd
  correctness signal. Gated #![cfg(not(no_cuda))].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:05:20 +08:00
0acfa5df11 ops: grad-check the T5 structural ops
Finite-diff grad-checks (same L=sum(W∘out) harness as autograd.rs) for
embedding (incl. repeated ids), reshape, transpose_3d01, transpose_2d,
and split/merge_heads round-trip. Gated #![cfg(not(no_cuda))].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:05:20 +08:00
7fb1a29057 ops: embedding/reshape/transpose/split-merge-heads fwd+bwd
Phase T5 structural ops on top of the T4 set, needed to assemble the
tiny transformer:
- embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward
  (atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi.
- reshape: contiguous metadata-only view (Tensor::reshape), no kernel.
- transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel).
- autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus
  split_heads (->Vec<Var>) / merge_heads for per-head attention.
- tape: Var::zero_grad + set_value so a hand-written GD step can update
  params and clear grads between steps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:05:09 +08:00
777f3c7949 docs: Phase T4 — autograd engine
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:53:55 +08:00
e7ce504b1f ops: differentiable autograd nodes + per-op grad-check tests
ops.rs wraps each Tensor op as a Var node with its backward closure (forward
caches captured by move). swiglu = mul(silu(gate), up); attention is composed
(matmul+scale+softmax+matmul), no fused kernel. tests/autograd.rs grad-checks
every op via the L=sum(W∘out) template, plus a fan-out grad-accumulation test
(dL/dx=4x) and an end-to-end composed-attention grad-check (dQ/dK/dV). Adds
xtrain-cuda dev-dep for device selection in tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:53:55 +08:00
224f750ee4 autograd: tape engine + grad accumulation
Var = Rc<RefCell<VarNode>> on a define-by-run tape: value + optional grad +
parents + backward closure. backward() seeds a scalar loss, walks reverse
topo order, and pushes grads to parents. push_grad always SUMs into the grad
slot — the fan-out accumulation path T3 lacked. Per-crate build.rs emits the
no_cuda cfg (does not propagate); engine gated, grad_check stays host-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:44:17 +08:00
5aef3742d6 ops: transformer op fwd/bwd CUDA kernels + Tensor wrappers
add/mul/add_bias(+sum_rows)/rms_norm/silu/rope/softmax/cross_entropy,
each with its analytic backward, in csrc/ops/nn.cu (inlined warp/block
reductions). FFI declarations + nn.cu in build.rs (no_cuda gated). Tensor
gains the matching thin wrappers; DType grows I32 for cross-entropy targets.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:44:09 +08:00
88fbe0a85d gemm: realistic f32 tolerances in GEMM acceptance tests
Forward: compare via matrix relative error (max abs error / max|ref|)
instead of a per-element ratio, so near-zero outputs where two correct
f32 GEMMs differ only in rounding order don't inflate the metric.
Backward: L = sum(W∘C) is bilinear, so central differences are
truncation-free — use eps=1e-2 (sharper f32 resolution of the
difference) and atol=1e-3 to floor near-zero-gradient subtraction noise.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:28:57 +08:00
dde2fde297 docs: Phase T3 — GEMM fwd/bwd + finite-diff
Design doc covering the tiled forward, the dA/dB math + how transpose is
handled (materialize + reuse forward), the cuBLAS row-major reference, and
the finite-diff harness design + how T4 reuses it per-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:27:03 +08:00
1384044f27 gemm: GPU acceptance tests vs cuBLAS + finite-diff
Forward: hand-written tiled GEMM vs cuBLAS sgemm on random matrices
(square / non-tile-aligned rect / 256³), max relative error < 1e-3, using
the row-major⟺col-major identity to drive cuBLAS without explicit
transposes. Backward: scalar loss L = sum(W∘C) (so dC = W), dA/dB from
matmul_backward checked against the finite-diff harness. Gated behind
not(no_cuda).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:26:58 +08:00
08c88bf360 gemm: tiled F32 forward + transpose + backward (dA/dB)
Hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32 accumulate,
boundary-masked) plus an out-of-place transpose kernel. Wire both through
xtrain-cuda FFI (no_cuda-gated) and expose at the tensor level:
Tensor::matmul, transpose_2d, and matmul_backward computing
dA = dC·Bᵀ and dB = Aᵀ·dC by materializing transposes and reusing the
forward. Also declare cuBLAS sgemm FFI + link cublas, used only as a
correctness reference in tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:26:51 +08:00
9ca98efd98 autodiff: finite-diff gradient-check harness
New xtrain-autodiff crate with a reusable central finite-difference
gradient check: grad_check(x, shape, f, analytic_grad, cfg) compares an
analytic gradient against (f(x+ε)-f(x-ε))/2ε per element with a relative
tolerance. Host-only (no CUDA): the loss closure owns any GPU work, so
T4's per-op backward checks can reuse it directly. Includes host unit
tests (sum(x²) grad 2x passes; a wrong grad is rejected).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:26:42 +08:00
fbd07a578c tensor: minimal Tensor crate over xtrain-cuda
New xtrain-tensor crate: DType (F32), shape/stride helpers, Arc-counted
host/device Storage with CPU↔CUDA copy, and a contiguous Tensor with
creation, host↔device transfer, and a scale() op driving the elementwise
kernel. GPU integration tests (host↔device roundtrip + scale correctness)
gated behind not(no_cuda); a thin build.rs emits the no_cuda cfg so the
kernel call sites compile out locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:13:06 +08:00
63dc05fd10 tensor: add scale elementwise CUDA kernel + FFI
New csrc/ops/elementwise.cu (out[i]=in[i]*alpha), compiled by
xtrain-cuda/build.rs and exposed via launch_scale_f32 FFI, gated behind
not(no_cuda) like the existing vecadd smoke test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:13:06 +08:00
8557a289a2 docs: Phase T2 — tensor abstraction
Design doc for the minimal tensor layer: DType/shape/Storage/Tensor,
host↔device copy, and one elementwise kernel (scale) wired end-to-end.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:12:55 +08:00
c1b204296b docs: backfill T1 build-chain
T1 shipped without a design doc; capture the Rust↔CUDA build chain
(build.rs+nvcc, no_cuda cfg pattern, RAII GpuBuffer, gitea↔dash5 flow).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:12:55 +08:00
92acf9f413 T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test)
Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's
csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA
Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu
via the cc crate targeting sm_120 (RTX 5090) and links cudart.

A gated integration test runs the vector-add kernel on the GPU and asserts the
result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA
compilation and sets a `no_cuda` cfg so host-side cargo check still works.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:42:43 +08:00