data: full TinyStories + tokenized-id cache, val loss, CLI arch

- Corpus::load_cached: tokenize the (large) corpus ONCE, cache the id stream to <corpus>.u16.bin (gpt2 vocab 50257 < 65536 → exact u16), read cache on reruns. - Corpus::split_tail: hold out a tail slice as a validation corpus. - train(): take an optional valid corpus + eval_every/eval_batches; periodic deterministic val-loss eval that checkpoints the BEST val model; returns TrainResult{train_losses, evals, best_val}. T6 fixed-cadence path preserved. - bin/train + bin/export_safetensors: read architecture (--heads/--head-dim/ --layers/--ffn) + opt knobs (--steps/--batch/--seq/--max-lr/--val-tokens/ --eval-every) from CLI flags; defaults reproduce the v0-baseline tiny config. - gitignore the multi-GB corpus + *.u16.bin caches + *.ckpt (dash5-only). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:34:48 +08:00
parent 15f1e526c7
commit e44e50ef78
7 changed files with 336 additions and 68 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -9,3 +9,9 @@

 # Claude Code runtime state
 /.claude/
+
+# Large scaling-run corpora + tokenized id caches live on dash5 only, never in
+# git (the small data/tinystories-valid-3mb.txt is committed as a fixture).
+/data/tinystories-train.txt
+*.u16.bin
+*.ckpt