xtrain/Cargo.toml at 7d84a64f5cef417dce374b6ee97b52105a7a90fe - xtrain - Local Gitea

gahow/xtrain


Gahow Wang 7d84a64f5c data: gpt2 bpe via xserv-tokenizer + TinyStories corpus + lr schedule + grad clip

New xtrain-train crate scaffold. The data pipeline reuses xserv's from-scratch
GPT-2/Qwen BPE via a path dependency (../../../xserv/crates/xserv-tokenizer,
which resolves on both ~/projects and dash5 /opt/wjh/projects). Corpus::load
tokenizes the corpus into one id stream and samples fixed-length
(input, target) next-token windows (LCG-seeded, so reproducible). A
range-downloaded file is trimmed to whole stories (<|endoftext|> boundaries).
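
A minimal sketch of the window sampling and story trimming described above, not the crate's actual code. The names `Lcg`, `sample_window`, and `trim_to_whole_stories` are hypothetical, the LCG constants (Knuth's MMIX) are an assumption since the commit does not state which generator parameters are used, and the trim assumes the download was cut off only at the tail.

```rust
/// Hypothetical 64-bit LCG. Constants are Knuth's MMIX parameters,
/// chosen for illustration; the commit does not name the ones it uses.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

/// Sample one (input, target) pair of `seq_len` tokens from the id stream.
/// The target is the input shifted by one position (next-token prediction).
fn sample_window<'a>(
    ids: &'a [u32],
    seq_len: usize,
    rng: &mut Lcg,
) -> Option<(&'a [u32], &'a [u32])> {
    // seq_len + 1 tokens are needed so every input position has a target.
    if seq_len == 0 || ids.len() < seq_len + 1 {
        return None;
    }
    let n_starts = ids.len() - seq_len; // valid starts are 0..n_starts
    // Use the high bits, which are better distributed in an LCG.
    let start = ((rng.next_u64() >> 33) as usize) % n_starts;
    Some((
        &ids[start..start + seq_len],
        &ids[start + 1..start + seq_len + 1],
    ))
}

/// Keep the text up to and including the last `<|endoftext|>` marker,
/// dropping a trailing partial story. Returns "" if no marker is present.
fn trim_to_whole_stories(text: &str) -> &str {
    const EOT: &str = "<|endoftext|>";
    match text.rfind(EOT) {
        Some(pos) => &text[..pos + EOT.len()],
        None => "",
    }
}
```

Two generators created with the same seed yield the same sequence of windows, which is what makes a run reproducible under this scheme.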

Also adds the host-only training math: LrSchedule (linear warmup + cosine
decay) and global L2 grad-norm + clip scale, each with a local unit test.
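
One way the host-side math could look, as a sketch under stated assumptions. Only the type name `LrSchedule` comes from the commit; the field names, `lr_at`, `global_grad_norm`, and `clip_scale` are hypothetical, as is the choice to start warmup at `peak_lr / warmup_steps` and to decay to a `min_lr` floor.

```rust
use std::f32::consts::PI;

/// Linear warmup followed by cosine decay.
/// Field names are hypothetical; the commit only names the type.
struct LrSchedule {
    peak_lr: f32,
    min_lr: f32,
    warmup_steps: u32,
    total_steps: u32,
}

impl LrSchedule {
    fn lr_at(&self, step: u32) -> f32 {
        if step < self.warmup_steps {
            // Linear warmup: reaches peak_lr on the last warmup step.
            return self.peak_lr * (step + 1) as f32 / self.warmup_steps as f32;
        }
        let decay_steps = self.total_steps.saturating_sub(self.warmup_steps).max(1);
        // Progress through the decay phase, clamped to 1.0 past total_steps.
        let t = ((step - self.warmup_steps) as f32 / decay_steps as f32).min(1.0);
        self.min_lr + 0.5 * (self.peak_lr - self.min_lr) * (1.0 + (PI * t).cos())
    }
}

/// Global L2 norm over all parameter gradients, accumulated in f64.
fn global_grad_norm(grads: &[Vec<f32>]) -> f32 {
    let sum_sq: f64 = grads
        .iter()
        .flat_map(|g| g.iter())
        .map(|&x| (x as f64) * (x as f64))
        .sum();
    sum_sq.sqrt() as f32
}

/// Factor to multiply every gradient by so the global norm is at most
/// `max_norm`; 1.0 when the norm is already within the limit.
fn clip_scale(norm: f32, max_norm: f32) -> f32 {
    if norm > max_norm {
        max_norm / norm
    } else {
        1.0
    }
}
```

Computing a single scale from the global norm, rather than clipping each tensor separately, keeps the direction of the overall gradient unchanged.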

Corpus: data/tinystories-valid-3mb.txt — first ~3MB of TinyStories-valid
(fetched on dash5 via hf-mirror.com; HF direct unreachable). Substitution
noted: a real TinyStories subset, not the full set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 16:29:32 +08:00

20 lines

330 B

TOML


 [workspace]
 resolver = "2"
 members = [
     "crates/xtrain-cuda",
     "crates/xtrain-tensor",
     "crates/xtrain-autodiff",
     "crates/xtrain-model",
     "crates/xtrain-optim",
     "crates/xtrain-train",
 ]
 [workspace.package]
 version = "0.1.0"
 edition = "2024"
 license = "MIT"
 [workspace.dependencies]
 half = "2"
 smallvec = "1"