v6 broadens the training data from TinyStories to FineWeb-edu (HuggingFaceFW/fineweb-edu sample/10BT) while keeping the v4/v5 architecture frozen. scripts/fineweb_to_txt.py streams the parquet text column row group by row group and joins docs with <|endoftext|>, so xtrain's existing Corpus loader (gpt2 BPE, u16 cache) handles the result unchanged. Corpus .txt/.parquet/.u16.bin files stay on dash5 only (gitignored).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
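The conversion step described in the commit message can be sketched roughly as follows. This is not the contents of scripts/fineweb_to_txt.py, only a minimal illustration under stated assumptions: the function names `iter_parquet_text` and `write_joined` are hypothetical, the column is assumed to be named `text`, the separator is assumed to go between documents with no trailing copy, and pyarrow is assumed to be the parquet reader.

```python
import io
from typing import Iterable, Iterator, TextIO

EOT = "<|endoftext|>"  # document separator named in the commit message


def iter_parquet_text(path: str, column: str = "text") -> Iterator[str]:
    """Yield documents one row group at a time (hypothetical helper).

    Reading a single row group per step keeps memory bounded by the
    largest row group rather than by the whole parquet file.
    """
    import pyarrow.parquet as pq  # third-party; imported lazily on first use

    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        table = pf.read_row_group(i, columns=[column])
        for doc in table.column(column).to_pylist():
            if doc:  # assumption: null or empty rows are skipped
                yield doc


def write_joined(docs: Iterable[str], out: TextIO, sep: str = EOT) -> int:
    """Write docs to out with sep between consecutive docs; return the count."""
    n = 0
    for doc in docs:
        if n:
            out.write(sep)  # separator only between documents, never trailing
        out.write(doc)
        n += 1
    return n
```

Under these assumptions, a call such as `write_joined(iter_parquet_text(src), open(dst, "w", encoding="utf-8"))` would produce a single .txt stream; `src` and `dst` here are placeholders, not paths taken from the repository.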
20 lines
329 B
Plaintext
/target
*.o
*.so
*.a
*.ptx
*.cubin
**/*.rs.bk
.env

# Claude Code runtime state
/.claude/

# Large scaling-run corpora + tokenized id caches live on dash5 only, never in
# git (the small data/tinystories-valid-3mb.txt is committed as a fixture).
/data/tinystories-train.txt
/data/fineweb-edu.txt
/data/*.parquet
*.u16.bin
*.ckpt