v6 broadens data from TinyStories to FineWeb-edu (HuggingFaceFW/fineweb-edu
sample/10BT) while freezing the v4/v5 arch. scripts/fineweb_to_txt.py streams
the parquet text column row-group by row-group and joins docs with
<|endoftext|> so xtrain's existing Corpus loader (gpt2 BPE, u16 cache) handles
it unchanged. Corpus .txt/.parquet/.u16.bin stay dash5-only (gitignored).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>