data: FineWeb-edu parquet->txt prep script (Scaling v6)

v6 broadens data from TinyStories to FineWeb-edu (HuggingFaceFW/fineweb-edu sample/10BT) while freezing the v4/v5 arch.

scripts/fineweb_to_txt.py streams the parquet text column row-group by row-group and joins docs with <|endoftext|> so xtrain's existing Corpus loader (gpt2 BPE, u16 cache) handles it unchanged.

Corpus .txt/.parquet/.u16.bin stay dash5-only (gitignored).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:04:45 +08:00
parent 579365f4a0
commit 7e5ea9976b
2 changed files with 61 additions and 0 deletions
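
For readers without the script at hand, the conversion described in the commit message can be sketched roughly as below. This is a minimal illustration, not the contents of scripts/fineweb_to_txt.py: the function names `iter_row_group_texts` and `write_joined` are hypothetical, the separator is assumed to go between documents rather than after each one, and skipping null rows is an assumed policy. It relies on pyarrow's `ParquetFile.read_row_group`, which is imported lazily so the writer can be exercised without it.

```python
import io
from typing import Iterable, Iterator, List, Optional, TextIO

EOT = "<|endoftext|>"  # document separator named in the commit message


def iter_row_group_texts(parquet_path: str, column: str = "text") -> Iterator[List[Optional[str]]]:
    """Yield one list of document strings per parquet row group (hypothetical helper)."""
    # Lazy import: only this reader needs pyarrow, the writer below does not.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(parquet_path)
    for i in range(pf.num_row_groups):
        # Read a single row group and a single column to keep memory bounded.
        table = pf.read_row_group(i, columns=[column])
        yield table.column(column).to_pylist()


def write_joined(groups: Iterable[List[Optional[str]]], out: TextIO, sep: str = EOT) -> int:
    """Write every document to `out`, separated by `sep`; return the document count."""
    n = 0
    for docs in groups:
        for doc in docs:
            if doc is None:  # assumed policy: skip null rows
                continue
            if n:  # separator goes between documents, not before the first
                out.write(sep)
            out.write(doc)
            n += 1
    return n


# Small in-memory demonstration with two fake "row groups".
buf = io.StringIO()
count = write_joined([["first doc", "second doc"], ["third doc"]], buf)
```

With a real file, the two pieces would be combined as `write_joined(iter_row_group_texts(path), open(out_path, "w", encoding="utf-8"))`; both paths are placeholders here.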
--- a/.gitignore
+++ b/.gitignore
@@ -13,5 +13,7 @@
 # Large scaling-run corpora + tokenized id caches live on dash5 only, never in
 # git (the small data/tinystories-valid-3mb.txt is committed as a fixture).
 /data/tinystories-train.txt
+/data/fineweb-edu.txt
+/data/*.parquet
 *.u16.bin
 *.ckpt