# Scaling Run v10: Data-axis follow-up - dim1280/18L true GQA + FineWeb-edu 6.765B tokens - Design Document

## Goal

v9 showed that scaling on both axes (a larger model plus more fresh tokens) works: best val dropped from 2.9801 in v8 to 2.8854. But v9 only had 6.013B tokens, a D/N of about 16.8, which is below the Chinchilla rule of thumb of 20. v10 has a narrow goal:

1. **Data axis only**: add FineWeb-edu shard010, which was cut off in v9, growing the cache from 6.013B to 6.765B tokens.
2. **Architecture unchanged**: reuse v9's dim1280 / 18L / 40q-10kv GQA / ffn4096 exactly.
3. **Measure the marginal gain**: check whether moving D/N from 16.8 to 18.95 still lowers val significantly.

## Data

| item | value |
|----|----|
| source | FineWeb-edu `sample/10BT`, shards 000-010 |
| token cache | `data/fineweb-edu.txt.u16.bin` |
| total tokens | **6,765,333,808** |
| held-out val | final **1,000,000** tokens |
| train corpus | 6,764,333,808 tokens |
| tokens consumed in training | **6,764,298,240** = 103215 steps x effective batch 256 x seq 256 |
| epoch | ~1.00 |

Important caveat: the current xtrain training entry point holds out "the last 1M tokens of the whole cache". After shard010 was appended, the v10 val tail is no longer the same slice as the v9 val tail. The 2.8854 originally reported for v9 and the 2.8816 originally reported for v10 therefore cannot be treated as a comparison on strictly the same validation set.

To address this, this round created a fixed eval set:

```text
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
```

It contains the last 11M tokens of shard010. The first 10M tokens are there only so that the existing `split_tail(val_tokens=1M)` can be reused; the tokens actually evaluated are the final 1M. Fixed eval v1 is unseen data for v6-v9, and it was also held out during v10 training.

## Architecture

v10 is identical to v9:

| item | value |
|----|----|
| dim | 1280 |
| layers | 18 |
| query heads x head_dim | 40 x 32 |
| kv heads | 10 (true GQA, group=4) |
| ffn | 4096 |
| core params | 356.89M |
| total params | 485.55M |
| export tensors | 201 |

## Training

| item | value |
|----|----|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | **103215** |
| effective global batch | **256** (`--batch 128 --accum-steps 2`) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | **23h51m** |
| steady throughput | **~79.0K tok/s** |
| peak observed memory | ~17GB / GPU |

Command:

```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
  /opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
  --heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
  --steps 103215 --batch 128 --accum-steps 2 --seq 256 \
  --max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
  --eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
  --ckpt /dashscope-tmp/wjh/xtrain_v10.ckpt
```

## Results

- train loss: **11.1575 -> 2.9000**
- first val: step 999 = **5.3048**
- best val: step 103214 = **2.8816**
- final val: step 103214 = **2.8816**
- exit code: **0**

FineWeb moving-tail val milestones:

| step | 999 | 9999 | 19999 | 29999 | 39999 | 49999 | 59999 | 69999 | 79999 | 89999 | 99999 | final |
|------|-----|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 5.3048 | 3.5622 | 3.3282 | 3.2450 | 3.1886 | 3.1342 | 3.0714 | 3.0202 | 2.9724 | 2.9236 | 2.8950 | **2.8816** |

The curve was still improving at the final eval, and the best val coincides with the last step. There is no overfit signal in this run.

## Fixed Eval V1

Fixed eval v1 (`shard010 tail 1M`, seq256, 64 eval batches):

| version | fixed eval v1 |
|---------|---------------|
| v6 | 3.2328 |
| v7 | 3.1850 |
| v8 | 3.1515 |
| v9 | 2.9278 |
| **v10** | **2.8814** |

This is the cleanest cross-version result in the v10 round. It says:

- v9's double-axis gain transfers to a shard010 holdout: v8 3.1515 -> v9 2.9278.
- v10 further improves on the new shard010 distribution: v9 2.9278 -> v10 2.8814.
- The apparent v9 moving-tail 2.8854 -> v10 moving-tail 2.8816 delta is tiny and not a strict apples-to-apples comparison.

## Decoding

Greedy decoding still repeats. Fixed prompts from xtrain:

```text
[Once upon a time] there was a king who had a daughter. She was beautiful and beautiful...
[The little] The little boy was a little boy. The little boy was a little boy...
[One day] I was walking down the street and I saw a man with a dog...
```

Temperature 0.8 is more varied and less immediately looped, but coherence remains weak:

```text
[Once upon a time] I was a kid who did not go to the beach to swim...
[The little] ones are not as loud as the adults...
[One day] I was on the edge of the water, and I saw something I had never seen before...
```

xserv loads the exported v10 true-GQA weights and generates FineWeb-like explanatory prose, but repeated sentence frames remain:

```text
[The history of] the city of San Francisco is a story of the growth of the city...
[In science,] the term "observation" is used to describe the act of observing something...
[Water is] the most important element in the human body...
```

Conclusion: decoding remains a separate bottleneck. The current xtrain sampler only supports greedy and temperature sampling; top-p and repetition penalty exist in xserv's chat path, but not in the raw xtrain sampler or in the `xserv-cli` path used for weight validation. A clean next step is to add a raw generation tool with `temperature/top-p/repetition-penalty` so decoding experiments do not depend on chat templates.

## xserv Validation

Registry path:

```text
/opt/wjh/projects/tiny-models/v10-fineweb-edu-dim1280-gqa-data6765
```

Files:

- `config.json`
- `model.safetensors` (BF16, 201 tensors, 927MB)
- `tokenizer.json`
- `xtrain.ckpt` (fp32 master checkpoint, 1.9GB)

xserv loads v10 as true GQA:

```text
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
```

## v11 Feasibility: Bigger Model + Longer Context

A v11 smoke test prioritized the user's chosen direction: larger model plus longer context.
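
Before sizing a larger candidate, it helps to confirm that the parameter accounting is understood. The sketch below is an assumption about the layout, not code from xtrain: it assumes a Qwen3-style block (separate q/k/v/o projections, a three-matrix gated FFN, two norm weights per layer plus per-head q/k norm weights), one final norm, untied input and output embeddings over the 50257-token vocabulary, and head_dim 32 for the dim1536 candidate. The function name `param_counts` is hypothetical. Under these assumptions the calculation lands on v10's reported 356.89M core / 485.55M total and 201 export tensors, and on the core / total figures quoted for the dim1536/20L candidate.

```python
def param_counts(dim, layers, q_heads, kv_heads, head_dim, ffn, vocab=50257):
    """Return (core, total, tensors) under the ASSUMED Qwen3-style layout."""
    q_dim = q_heads * head_dim            # width of the query projection
    kv_dim = kv_heads * head_dim          # width of each of the k and v projections (GQA)
    attn = 2 * dim * q_dim + 2 * dim * kv_dim   # q and o, plus k and v
    mlp = 3 * dim * ffn                   # gate, up, down (assumed gated FFN)
    norms = 2 * dim + 2 * head_dim        # two block norms + q/k norms (assumed)
    core = layers * (attn + mlp + norms) + dim  # "+ dim" is the final norm
    total = core + 2 * vocab * dim        # assumed untied embedding + output head
    tensors = layers * 11 + 3             # 11 tensors per layer + embed, final norm, head
    return core, total, tensors


v10 = param_counts(1280, 18, 40, 10, 32, 4096)
v11 = param_counts(1536, 20, 48, 12, 32, 6144)  # head_dim=32 is an assumption
for name, (core, total, tensors) in (("v10", v10), ("v11 candidate", v11)):
    print(f"{name}: core={core / 1e6:.2f}M total={total / 1e6:.2f}M tensors={tensors}")
```

If the real layout differs (for example tied embeddings), the totals would shift, so a mismatch here is a quick signal that the assumed layout is wrong.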

Candidate:

| item | value |
|------|-------|
| dim / layers | 1536 / 20 |
| heads / kv_heads | 48 / 12 |
| ffn | 6144 |
| core / total params | 684.26M / 838.65M |
| stack | bf16 + recompute + flash + accum + 8 GPU DDP |

Smoke results:

| seq | batch / accum | effective batch | peak mem | tok/s | result |
|-----|---------------|-----------------|----------|-------|--------|
| 512 | 64 / 4 | 256 | **30530 MiB** | **44.7K** | 50 steps OK |
| 1024 | 32 / 8 | 256 | **30530 MiB** | **31.0K** | 20 steps OK |

Both fit, but the memory margin is thin on a 32GB RTX 5090.

Expected one-epoch wall clock on 6.76B tokens:

- seq512: roughly **42h**
- seq1024: roughly **61h**

Recommendation: make v11 a controlled run, not a blind launch. Use fixed eval v1, keep data fixed, and choose either:

1. **v11a practical**: dim1536/20L, seq512, batch64/accum4. Faster, still doubles context over v10.
2. **v11b long-context**: dim1536/20L, seq1024, batch32/accum8. More aligned with "long context", but ~2.5 days and tight memory.

For scientific clarity, v11 should not append more data before training; use the current 6.765B train cache while preserving fixed eval v1.
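
The wall-clock estimates above follow from dividing the token budget by the measured smoke-test throughput. A minimal sketch of that arithmetic, using only figures from this document; the helper name `hours` is hypothetical, and the estimate is an idealization that ignores eval pauses and warm-up, which is presumably part of why v10's measured 23h51m sits slightly above its ideal figure.

```python
def hours(tokens, tok_per_s):
    """Ideal wall clock in hours at a steady throughput (no eval or startup cost)."""
    return tokens / tok_per_s / 3600


# v10: tokens consumed = steps x effective batch x seq_len
consumed = 103215 * 256 * 256
print(f"v10 tokens consumed: {consumed:,}")
print(f"v10 ideal at 79.0K tok/s: {hours(consumed, 79.0e3):.1f} h")

# v11 candidates: one epoch over roughly 6.76B tokens at the smoke-test throughput
budget = 6.76e9
for seq, tok_per_s in ((512, 44.7e3), (1024, 31.0e3)):
    print(f"v11 seq{seq}: {hours(budget, tok_per_s):.1f} h")
```

This gives about 23.8 h for v10 and about 42.0 h and 60.6 h for the seq512 and seq1024 candidates, in line with the rounded 42h and 61h figures. The smoke tests ran only 20 to 50 steps, so the throughput inputs are themselves rough.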