New crate xtrain-model: a from-scratch decoder built entirely from the
autodiff op set.
- Config (tiny: dim=32, 2 layers, 2 heads, head_dim=16, ffn=64).
- TinyTransformer: embedding -> N x {pre-RMSNorm -> multi-head causal
attention (RoPE, additive causal mask, per-head SDPA) -> residual;
pre-RMSNorm -> SwiGLU MLP -> residual} -> final RMSNorm -> LM head.
x@W weight convention (engine GEMM is plain A@B); dim=n_heads*head_dim.
- params()/zero_grad-able leaves for the optimizer; param_to_host export.
- overfit test: char-level bring-up (embedded text -> vocab -> shifted
targets), minimal hand-written GD (p -= lr*grad) memorises one fixed
batch -> loss ~0 + greedy argmax matches targets. End-to-end fwd+bwd
correctness signal. Gated #![cfg(not(no_cuda))].
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
13 lines
318 B
TOML
13 lines
318 B
TOML
[package]
|
|
name = "xtrain-model"
|
|
version.workspace = true
|
|
edition.workspace = true
|
|
|
|
[dependencies]
|
|
xtrain-tensor = { path = "../xtrain-tensor" }
|
|
xtrain-autodiff = { path = "../xtrain-autodiff" }
|
|
|
|
[dev-dependencies]
|
|
# Acceptance tests drive the GPU (device selection) directly.
|
|
xtrain-cuda = { path = "../xtrain-cuda" }
|