Add per-run design+result docs for the two Chinchilla-axis runs that were done but never committed:

- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale, best moving-tail val 2.8854 (~3.2% below v8). Direction validated, gain still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scaling Run v10: Data-axis follow-up, dim1280/18L true GQA + FineWeb-edu 6.765B tokens (Design Document)
Goal
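v9 showed that double-axis scaling (a larger model plus more fresh tokens) works: best val dropped from v8's 2.9801 to 2.8854. But v9 trained on only 6.013B tokens, a D/N of about 16.8, which is below the Chinchilla rule of thumb of 20. The goal of v10 is narrow:
- Data axis only: add FineWeb-edu shard010, which was cut off in v9, growing the cache from 6.013B to 6.765B tokens.
- Architecture unchanged: reuse v9's dim1280 / 18L / 40q-10kv GQA / ffn4096 as-is.
- Test the marginal gain: see whether raising D/N from 16.8 to 18.95 still lowers val significantly.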
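The D/N figures above are plain arithmetic on numbers given in this document. A minimal sketch, assuming D/N is computed against the core parameter count (356.89M) rather than the total; the constant and function names are illustrative only:

```python
# Token totals and core parameter count are taken from this document.
CORE_PARAMS = 356.89e6
TOKENS_V9 = 6.013e9
TOKENS_V10 = 6.765e9
CHINCHILLA_RULE_OF_THUMB = 20.0  # tokens per parameter


def tokens_per_param(tokens: float, params: float) -> float:
    """Return the data-to-parameter ratio D/N."""
    return tokens / params


dn_v9 = tokens_per_param(TOKENS_V9, CORE_PARAMS)
dn_v10 = tokens_per_param(TOKENS_V10, CORE_PARAMS)

# Tokens that would still be missing to reach D/N = 20 at this model size.
shortfall_v10 = CHINCHILLA_RULE_OF_THUMB * CORE_PARAMS - TOKENS_V10

print(f"v9  D/N = {dn_v9:.2f}")
print(f"v10 D/N = {dn_v10:.2f}")
print(f"tokens short of 20x after v10: {shortfall_v10 / 1e9:.2f}B")
```

Even after the top-up, v10 stays a few hundred million tokens short of 20 tokens per core parameter, so this run measures a marginal step along the data axis, not a full Chinchilla-ratio run.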
Data
| Item | Value |
|---|---|
| Source | FineWeb-edu sample/10BT, shards 000-010 |
| Token cache | data/fineweb-edu.txt.u16.bin |
| Total tokens | 6,765,333,808 |
| Held-out val | last 1,000,000 tokens |
| Train corpus | 6,764,333,808 tokens |
| Tokens consumed in training | 6,764,298,240 = 103215 steps x effective batch 256 x seq 256 |
| Epochs | ~1.00 |
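The step count follows from the token budget in the table. A short sketch of that arithmetic, using only figures from this document (the variable names are illustrative):

```python
TOTAL_TOKENS = 6_765_333_808
VAL_TOKENS = 1_000_000
EFFECTIVE_BATCH = 256  # sequences per optimizer step
SEQ_LEN = 256          # tokens per sequence

train_tokens = TOTAL_TOKENS - VAL_TOKENS
tokens_per_step = EFFECTIVE_BATCH * SEQ_LEN

# Largest whole number of optimizer steps that fits in one pass over the data.
steps = train_tokens // tokens_per_step
consumed = steps * tokens_per_step
unused = train_tokens - consumed
epochs = consumed / train_tokens

print(steps, consumed, unused, f"{epochs:.6f}")
```

With these inputs the floor division gives 103215 steps and 6,764,298,240 consumed tokens, leaving 35,568 training tokens unused, which is why the epoch count is just under 1.00.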
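Important caveat: xtrain's current training entry point holds out "the last 1M tokens of the whole cache". After shard010 was appended, v10's val tail is no longer the same slice as v9's val tail. The 2.8854 originally reported for v9 and the 2.8816 originally reported for v10 therefore cannot be treated as a strict comparison on the same validation set.
To resolve this, this round created a fixed eval set:
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
It contains the last 11M tokens of shard010. The first 10M tokens are there only so that the existing split_tail(val_tokens=1M) can be reused; the actual eval runs on the final 1M tokens. Fixed eval v1 is unseen data for v6-v9, and for v10 it was also held out during training.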
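The moving-tail problem can be shown on a toy token list. This is a sketch, not xtrain code: the `split_tail` below is a hypothetical Python stand-in for the real function of that name, and it assumes the v10 cache is the v9 cache with extra tokens appended at the end.

```python
def split_tail(tokens, val_tokens):
    """Hypothetical stand-in: hold out the last `val_tokens` ids as val."""
    if not 0 < val_tokens < len(tokens):
        raise ValueError("val_tokens must be positive and smaller than the cache")
    return tokens[:-val_tokens], tokens[-val_tokens:]


VAL = 4                                         # toy analogue of 1M tokens
v9_cache = list(range(60))                      # toy analogue of the 6.013B cache
v10_cache = v9_cache + list(range(60, 68))      # extra tokens appended (assumption)

_, v9_val = split_tail(v9_cache, VAL)
v10_train, v10_val = split_tail(v10_cache, VAL)

# The held-out slice moved, and the old v9 tail is now inside v10's train split.
tail_moved = v9_val != v10_val
old_tail_in_v10_train = set(v9_val) <= set(v10_train)

# Fixed eval file: the last 11 * VAL tokens; split_tail keeps only the final VAL.
fixed_eval_file = v10_cache[-11 * VAL:]
_, fixed_eval = split_tail(fixed_eval_file, VAL)
fixed_matches_v10_val = fixed_eval == v10_val

print(tail_moved, old_tail_in_v10_train, fixed_matches_v10_val)
```

In this toy setup the fixed eval slice stays put no matter how the training cache grows later, which is the property the moving tail lacks.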
Architecture
v10 is identical to v9:
| Item | Value |
|---|---|
| dim | 1280 |
| layers | 18 |
| query heads x head_dim | 40 x 32 |
| kv heads | 10 (true GQA, group=4) |
| ffn | 4096 |
| core params | 356.89M |
| total params | 485.55M |
| export tensors | 201 |
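The parameter and tensor counts in the table can be reproduced from the shape alone. The sketch below makes several assumptions that are not stated in this document: bias-free linear layers, a three-matrix gated FFN, RMSNorm-style weight vectors, Qwen3-style per-head q/k norms of size head_dim, untied input embedding and output head, and vocab 50257 as reported by xserv. The function name is illustrative.

```python
def count_params(dim, layers, q_heads, kv_heads, head_dim, ffn, vocab):
    """Return (core_params, total_params, tensor_count) under the stated assumptions."""
    q_proj = dim * q_heads * head_dim
    kv_proj = 2 * dim * kv_heads * head_dim   # K and V share the smaller GQA width
    o_proj = q_heads * head_dim * dim
    mlp = 3 * dim * ffn                       # gate, up, down (assumed gated FFN)
    norms = 2 * dim + 2 * head_dim            # two block norms + q/k norms (assumed)
    per_layer = q_proj + kv_proj + o_proj + mlp + norms

    core = layers * per_layer + dim           # + final norm
    total = core + 2 * vocab * dim            # untied embedding + LM head (assumed)

    # 7 weight matrices + 4 norm vectors per layer, plus embed, final norm, head.
    tensors = layers * 11 + 3
    return core, total, tensors


core, total, tensors = count_params(
    dim=1280, layers=18, q_heads=40, kv_heads=10, head_dim=32, ffn=4096, vocab=50257
)
print(f"core={core / 1e6:.2f}M total={total / 1e6:.2f}M tensors={tensors}")
```

Under these assumptions the result matches the table (356.89M core, 485.55M total, 201 tensors). The same function with dim 1536, 20 layers, 48/12 heads, ffn 6144, and an assumed head_dim of 32 also reproduces the v11 candidate's 684.26M / 838.65M.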
Training
| Item | Value |
|---|---|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | 103215 |
| effective global batch | 256 (--batch 128 --accum-steps 2) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | 23h51m |
| steady throughput | ~79.0K tok/s |
| peak observed memory | ~17GB / GPU |
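For reference, a warmup -> cosine schedule between the max_lr and min_lr in the table can be sketched as follows. The warmup length is not given in this document, so `WARMUP_STEPS` is a hypothetical placeholder, and the exact formula used by xtrain may differ.

```python
import math

MAX_LR = 6e-4
MIN_LR = 6e-5
TOTAL_STEPS = 103_215
WARMUP_STEPS = 1_000  # hypothetical: the real warmup length is not documented here


def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR at the last step."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    decay_steps = max(1, TOTAL_STEPS - 1 - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


for s in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS - 1):
    print(s, f"{lr_at(s):.3e}")
```

The learning rate never drops below min_lr, so the final steps still take non-trivial updates, which is consistent with val still improving at the end of the run.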
Command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
/opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
--heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
--steps 103215 --batch 128 --accum-steps 2 --seq 256 \
--max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
--eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
--ckpt /dashscope-tmp/wjh/xtrain_v10.ckpt
Results
- train loss: 11.1575 -> 2.9000
- first val: step 999 = 5.3048
- best val: step 103214 = 2.8816
- final val: step 103214 = 2.8816
- exit code: 0
FineWeb moving-tail val milestones:
| step | 999 | 9999 | 19999 | 29999 | 39999 | 49999 | 59999 | 69999 | 79999 | 89999 | 99999 | final |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| val | 5.3048 | 3.5622 | 3.3282 | 3.2450 | 3.1886 | 3.1342 | 3.0714 | 3.0202 | 2.9724 | 2.9236 | 2.8950 | 2.8816 |
Val was still falling at the final eval (2.8950 at step 99999 -> 2.8816 at the end), and best val coincides with final val, so this run shows no overfitting signal.
Fixed Eval V1
Fixed eval v1 (shard010 tail 1M, seq256, 64 eval batches):
| version | fixed eval v1 |
|---|---|
| v6 | 3.2328 |
| v7 | 3.1850 |
| v8 | 3.1515 |
| v9 | 2.9278 |
| v10 | 2.8814 |
This is the cleanest cross-version result in the v10 round. It says:
- v9's double-axis gain transfers to a shard010 holdout: v8 3.1515 -> v9 2.9278.
- v10 further improves on the new shard010 distribution: v9 2.9278 -> v10 2.8814.
- The apparent v9 moving-tail 2.8854 -> v10 moving-tail 2.8816 delta is tiny and not strict apples-to-apples.
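To put the bullets above in numbers, the per-version deltas can be computed directly from the table. This sketch only uses values reported in this document; the helper name is illustrative.

```python
FIXED_EVAL_V1 = {
    "v6": 3.2328,
    "v7": 3.1850,
    "v8": 3.1515,
    "v9": 2.9278,
    "v10": 2.8814,
}
MOVING_TAIL = {"v9": 2.8854, "v10": 2.8816}


def step_deltas(scores):
    """Return (prev, cur, absolute_drop, percent_drop) for consecutive versions."""
    names = list(scores)
    rows = []
    for prev, cur in zip(names, names[1:]):
        drop = scores[prev] - scores[cur]
        rows.append((prev, cur, drop, 100.0 * drop / scores[prev]))
    return rows


for prev, cur, drop, pct in step_deltas(FIXED_EVAL_V1):
    print(f"{prev} -> {cur}: -{drop:.4f} ({pct:.2f}%)")

moving_drop = MOVING_TAIL["v9"] - MOVING_TAIL["v10"]
print(f"moving-tail v9 -> v10: -{moving_drop:.4f}")
```

On the fixed set the v9 -> v10 drop is 0.0464, roughly an order of magnitude larger than the 0.0038 seen on the moving tail, which is why the fixed eval is the number to quote.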
Decoding
Greedy decoding still repeats. Fixed prompts from xtrain:
[Once upon a time] there was a king who had a daughter. She was beautiful and beautiful...
[The little] The little boy was a little boy. The little boy was a little boy...
[One day] I was walking down the street and I saw a man with a dog...
Temperature 0.8 is more varied and less immediately looped, but coherence remains weak:
[Once upon a time] I was a kid who did not go to the beach to swim...
[The little] ones are not as loud as the adults...
[One day] I was on the edge of the water, and I saw something I had never seen before...
xserv loads the exported v10 true-GQA weights and generates FineWeb-like explanatory prose, but repeated sentence frames remain:
[The history of] the city of San Francisco is a story of the growth of the city...
[In science,] the term "observation" is used to describe the act of observing something...
[Water is] the most important element in the human body...
Conclusion: decoding remains a separate bottleneck. The current xtrain sampler only supports greedy and temperature sampling; top-p and
repetition penalty exist in xserv's chat path, but not in the raw xtrain sampler or xserv-cli path used for weight validation. A clean
next step is to add a raw generation tool with temperature/top-p/repetition-penalty so decoding experiments do not depend on chat
templates.
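A raw sampler along the lines proposed above could look like the following sketch. It is not xtrain or xserv code: `sample_next` and its parameters are hypothetical, it operates on a plain list of logits, and the repetition penalty follows one common convention (divide positive logits of already-seen tokens, multiply negative ones).

```python
import math
import random


def sample_next(logits, history, temperature=0.8, top_p=0.9,
                repetition_penalty=1.1, rng=random):
    """Pick the next token id from raw logits (hypothetical helper)."""
    adjusted = list(logits)

    # Repetition penalty: make every token already in `history` less likely.
    for tok in set(history):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # temperature <= 0 is treated as greedy decoding.
    if temperature <= 0:
        return max(range(len(adjusted)), key=adjusted.__getitem__)

    # Softmax with temperature, shifted by the max for numerical stability.
    scaled = [x / temperature for x in adjusted]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Top-p (nucleus): keep the smallest high-probability set with mass >= top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for tok in order:
        kept.append(tok)
        mass += probs[tok]
        if mass >= top_p:
            break

    # Sample from the kept set, renormalized by its total mass.
    threshold = rng.random() * mass
    running = 0.0
    for tok in kept:
        running += probs[tok]
        if threshold < running:
            return tok
    return kept[-1]


toy_logits = [1.0, 3.0, 2.0]
print(sample_next(toy_logits, history=[], temperature=0.0))
print(sample_next(toy_logits, history=[1], temperature=0.0, repetition_penalty=10.0))
```

In the toy calls, greedy decoding picks token 1 with an empty history, and a strong penalty on token 1 shifts the greedy choice to token 2. Keeping this logic free of chat templates would let decoding experiments run against the same raw prompts used for weight validation.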
xserv Validation
Registry path:
/opt/wjh/projects/tiny-models/v10-fineweb-edu-dim1280-gqa-data6765
Files:
- config.json
- model.safetensors (BF16, 201 tensors, 927MB)
- tokenizer.json
- xtrain.ckpt (fp32 master checkpoint, 1.9GB)
xserv loads v10 as true GQA:
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
v11 Feasibility: Bigger Model + Longer Context
A v11 smoke test prioritized the user's chosen direction: larger model plus longer context.
Candidate:
| item | value |
|---|---|
| dim / layers | 1536 / 20 |
| heads / kv_heads | 48 / 12 |
| ffn | 6144 |
| core / total params | 684.26M / 838.65M |
| stack | bf16 + recompute + flash + accum + 8 GPU DDP |
Smoke results:
| seq | batch / accum | effective batch | peak mem | tok/s | result |
|---|---|---|---|---|---|
| 512 | 64 / 4 | 256 | 30530 MiB | 44.7K | 50 steps OK |
| 1024 | 32 / 8 | 256 | 30530 MiB | 31.0K | 20 steps OK |
Both fit, but the memory margin is thin on 32GB RTX 5090. Expected one-epoch wall clock on 6.76B tokens:
- seq512: roughly 42h
- seq1024: roughly 61h
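The wall-clock estimates above are tokens divided by the smoke-test throughput. A small sketch of that calculation, assuming throughput stays at the measured value for the whole run and ignoring eval and checkpoint overhead; names are illustrative:

```python
ONE_EPOCH_TOKENS = 6.76e9  # approximate train cache size from this document


def wall_clock_hours(tokens: float, tokens_per_second: float) -> float:
    """Estimate run time in hours at a constant throughput."""
    return tokens / tokens_per_second / 3600.0


hours_seq512 = wall_clock_hours(ONE_EPOCH_TOKENS, 44.7e3)
hours_seq1024 = wall_clock_hours(ONE_EPOCH_TOKENS, 31.0e3)

# Sanity check of the method against v10: 6,764,298,240 tokens at ~79.0K tok/s.
hours_v10 = wall_clock_hours(6_764_298_240, 79.0e3)

print(f"seq512:  {hours_seq512:.1f} h")
print(f"seq1024: {hours_seq1024:.1f} h")
print(f"v10 check: {hours_v10:.2f} h")
```

Applied to v10, the same formula gives about 23.8 hours, close to the observed 23h51m, so the estimate is a reasonable lower bound for planning v11.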
Recommendation: make v11 a controlled run, not a blind launch. Use fixed eval v1, keep data fixed, and choose either:
- v11a practical: dim1536/20L, seq512, batch64/accum4. Faster, still doubles context over v10.
- v11b long-context: dim1536/20L, seq1024, batch32/accum8. More aligned with "long context", but ~2.5 days and tight memory.
For scientific clarity, v11 should not append more data before training; use the current 6.765B train cache while preserving fixed eval v1.