New xtrain-train crate scaffold.

Data pipeline reuses xserv's from-scratch GPT-2/Qwen BPE via a path-dep (../../../xserv/crates/xserv-tokenizer, which resolves on both ~/projects and dash5's /opt/wjh/projects). Corpus::load tokenizes the corpus into one id stream and samples fixed-length (input, target) next-token windows (LCG-seeded, reproducible). Trims a range-downloaded file to whole stories (<|endoftext|> boundaries).

Also adds the host-only training math: LrSchedule (linear warmup + cosine decay) and global L2 grad-norm + clip scale, each with a local unit test.

Corpus: data/tinystories-valid-3mb.txt, the first ~3MB of TinyStories-valid (fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution noted: a real TinyStories subset, not the full set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
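For readers unfamiliar with the host-only training math named in the commit message, the following is a minimal sketch of a linear-warmup plus cosine-decay schedule and a global L2 grad-norm clip scale. It is not the crate's actual code: the field names (`peak_lr`, `min_lr`, `warmup_steps`, `total_steps`), the method `lr_at`, the free functions `global_grad_norm` and `clip_scale`, and the `1e-6` epsilon are all assumptions made for illustration.

```rust
use std::f32::consts::PI;

/// Hypothetical stand-in for `LrSchedule`; all field names are assumptions.
struct LrSchedule {
    peak_lr: f32,
    min_lr: f32,
    warmup_steps: u32,
    total_steps: u32,
}

impl LrSchedule {
    /// Learning rate for a 0-based optimizer step.
    fn lr_at(&self, step: u32) -> f32 {
        if step < self.warmup_steps {
            // Linear warmup: ramp from peak/warmup_steps up to peak_lr.
            return self.peak_lr * (step + 1) as f32 / self.warmup_steps as f32;
        }
        // Cosine decay from peak_lr down to min_lr over the remaining steps.
        let decay_steps = self.total_steps.saturating_sub(self.warmup_steps).max(1);
        let progress = ((step - self.warmup_steps) as f32 / decay_steps as f32).min(1.0);
        let cosine = 0.5 * (1.0 + (PI * progress).cos());
        self.min_lr + (self.peak_lr - self.min_lr) * cosine
    }
}

/// Global L2 norm across every gradient buffer, accumulated in f64.
fn global_grad_norm(grads: &[Vec<f32>]) -> f32 {
    let sum_sq: f64 = grads
        .iter()
        .flat_map(|g| g.iter())
        .map(|&x| (x as f64) * (x as f64))
        .sum();
    sum_sq.sqrt() as f32
}

/// Factor to multiply all gradients by so the global norm stays within max_norm.
/// The small epsilon (an assumption) guards against division by a tiny norm.
fn clip_scale(norm: f32, max_norm: f32) -> f32 {
    if norm > max_norm {
        max_norm / (norm + 1e-6)
    } else {
        1.0
    }
}
```

In a training loop under these assumptions, each step would query `lr_at(step)`, compute `global_grad_norm` once over all parameter gradients, and multiply every gradient by `clip_scale(norm, max_norm)` before the optimizer update.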
17 lines
675 B
TOML
[package]
name = "xtrain-train"
version.workspace = true
edition.workspace = true
[dependencies]
xtrain-tensor = { path = "../xtrain-tensor" }
xtrain-autodiff = { path = "../xtrain-autodiff" }
xtrain-model = { path = "../xtrain-model" }
xtrain-optim = { path = "../xtrain-optim" }
xtrain-cuda = { path = "../xtrain-cuda" }
# Reuse xserv's from-scratch GPT-2/Qwen BPE (project decision). This relative
# path resolves on both ~/projects (local) and /opt/wjh/projects (dash5). The
# crate inherits xserv's workspace for its own deps (serde/regex) — Cargo reads
# the target package's workspace, not ours.
xserv-tokenizer = { path = "../../../xserv/crates/xserv-tokenizer" }
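The xserv-tokenizer dependency above feeds the data pipeline described in the commit message. As a rough illustration of that pipeline's two host-side steps, here is a sketch that trims text to whole stories and samples a reproducible next-token window from an id stream. It does not call the real tokenizer or `Corpus::load`: the names `trim_to_whole_stories`, `Lcg`, and `sample_window` are hypothetical, the LCG constants are Knuth's MMIX values (the crate's own constants are not shown in the source), and the trim assumes only the tail of the download is a partial story.

```rust
const EOT: &str = "<|endoftext|>";

/// Drop a trailing partial story by keeping everything through the last
/// end-of-text marker. Returns "" if no marker is present.
fn trim_to_whole_stories(text: &str) -> &str {
    match text.rfind(EOT) {
        Some(pos) => &text[..pos + EOT.len()],
        None => "",
    }
}

/// Minimal 64-bit LCG (Knuth MMIX constants; an assumption, not the crate's).
struct Lcg(u64);

impl Lcg {
    /// Advance the state and return its high bits, which are better mixed
    /// than the low bits for a power-of-two modulus.
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Sample one (input, target) pair of length `seq_len` from the id stream.
/// The target is the input shifted right by one token (next-token prediction).
fn sample_window<'a>(
    ids: &'a [u32],
    seq_len: usize,
    rng: &mut Lcg,
) -> Option<(&'a [u32], &'a [u32])> {
    if ids.len() < seq_len + 1 {
        return None; // not enough tokens for an input plus its shifted target
    }
    let max_start = ids.len() - seq_len - 1;
    let start = (rng.next_u64() % (max_start as u64 + 1)) as usize;
    Some((
        &ids[start..start + seq_len],
        &ids[start + 1..start + seq_len + 1],
    ))
}
```

Seeding `Lcg` with the same value yields the same sequence of windows, which is the reproducibility property the commit message mentions.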