Files
xtrain/docs/runs/09-v9-fineweb-edu-dim1280-gqa.md
Gahow Wang 5c27493a90 docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases
Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
  best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
  still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed
  eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:18:48 +08:00

153 lines
6.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Scaling Run v9: Chinchilla 双轴 — dim1280/18L true GQA(core 356.9M) + FineWeb-edu 6.01B token + Phase-2 stack — Design Document
## Goal
v8 给出的元结论是:单独拨容量轴有用,但只有约 3% 的边际;单独重复旧数据也只有约 1.6% 的边际。要继续明显超过
v8必须把 **模型容量 + 新 token** 一起放大,而不是只拨一根轴。
v9 就是这个双轴点:
1. **模型轴**dim1024/core 226M -> **dim1280/core 356.9M**,同时启用真 GQA40 query heads / 10 kv heads
2. **数据轴**v6-v8 的 2.255B FineWeb 子集 -> **6.013B token**,追加了新 FineWeb-edu shards 003-009。
3. **系统栈**:使用 Phase-2 现代路径:`--flash + --accum-steps + bf16 + recompute + DDP`。dropout 设为 0按标准预训练。
> v9 的 val 仍是 FineWeb-edu 分布,不能和 v0-v5 的 TinyStories val 直接比。注意v9 扩展 cache 后默认
> tail-heldout 已经从 v6-v8 的旧 tail 移到新 shards 末尾;严格横比后续以 fixed eval v1 为准。
## Data
| 项 | 值 |
|----|----|
| 来源 | FineWeb-edu `sample/10BT`,原 shards 000-002 + 新 shards 003-009 |
| token cache | `data/fineweb-edu.txt.u16.bin` |
| 总 token | **6,013,639,492** |
| held-out val | 末尾 **1,000,000** token |
| train corpus | 6,012,639,492 token |
| 训练消费 token | **6,012,600,320** = 91745 steps x effective batch 256 x seq 256 |
| epoch | ~1.00 |
P3-DATA 目标本来是约 7B tokenshard 010 下载 `curl rc=18` 中断,所以最终停在 6.01B。对 core 356.9M 来说,
D/N 约 **16.8 token/param**,低于理想 Chinchilla 20但已经远高于 v8 的约 10.4,是一个干净的双轴 scale 点。
## Architecture
| 项 | v8 | **v9** |
|----|----|----|
| dim | 1024 | **1280** |
| layers | 18 | 18 |
| query heads x head_dim | 32 x 32 | **40 x 32** |
| kv heads | 32 (MHA) | **10 (true GQA, group=4)** |
| ffn | 2730 | **4096** |
| core params | 226.50M | **356.89M** |
| total params | 329.42M | **485.55M** |
| export tensors | 201 | **201** |
`config.json` writes real `num_key_value_heads = 10`, so xserv loads v9 as true GQA rather than MHA.
## Training
| 项 | 值 |
|----|----|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | **91745** |
| effective global batch | **256** (`--batch 128 --accum-steps 2`) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | **21h15m** |
| steady throughput | **~78.6K tok/s** |
| peak observed memory | ~17GB / GPU |
Command:
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
/opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
--heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
--steps 91745 --batch 128 --accum-steps 2 --seq 256 \
--max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
--eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
--ckpt /dashscope-tmp/wjh/xtrain_v9.ckpt
```
## Results
- train loss: **11.1550 -> 2.9340**
- first val: step 1000 = **5.1517**
- best val: step 91000 = **2.8854**
- final val: step 91745 = **2.8873**
- exit code: **0**
FineWeb val curve milestones:
| step | 1000 | 10000 | 20000 | 30000 | 40000 | 50000 | 60000 | 70000 | 80000 | 90000 | 91000 | final |
|------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 5.1517 | 3.4820 | 3.2953 | 3.2026 | 3.1422 | 3.0844 | 3.0148 | 2.9616 | 2.9160 | 2.8915 | **2.8854** | 2.8873 |
The curve kept improving into the last 1K-step window, then the final eval bounced slightly from 2.8854 to 2.8873. This is close to
the floor for this run, but not a clear overfit failure.
## Comparison
| | v6 | v7 | v8 | **v9** |
|---|---|---|---|---|
| model | dim768/core127M | dim768/core127M | dim1024/core226M | **dim1280/core357M + GQA** |
| data | 2.29B | 3.28B same subset | 2.36B same subset | **6.01B expanded shards** |
| best val | 3.0652 | 3.0149 | 2.9801 | **2.8854** |
On the run-local moving tail, v9 beats v8 by **0.0947** val loss (~3.2% relative), essentially the same size as the
v6->v8 capacity gain but now on top of it. A later fixed eval v1 check still supports the same direction
(v8 3.1515 -> v9 2.9278 on shard010-tail holdout), while making the moving-tail caveat explicit. This confirms
the v8 prediction: **双轴 scale 有效**. It is still an incremental gain, not a qualitative jump.
## Samples
xserv greedy samples (`--max-tokens 60`) are more coherent than the v8 examples on some prompts, but repetition remains:
```text
[The history of] the United States is the story of the people, the places, and the events that have shaped the nation...
[In science,] the term "scientific method" is used to describe the process of gathering information and testing it...
[The most important] thing is to be aware of the symptoms and to seek medical attention...
[Water is] a natural resource that is essential for human life...
```
The model writes real explanatory English and the domain mix is FineWeb-like. Greedy decoding still falls into repeated clauses on
some prompts (`scientific method`, symptoms, and earlier fixed prompts), so the val gain is more visible in the metric than in a
dramatic sample-quality leap.
## xserv validation
Registry path:
```text
/opt/wjh/projects/tiny-models/v9-fineweb-edu-dim1280-gqa
```
Files:
- `config.json`
- `model.safetensors` (BF16, 201 tensors, 927MB)
- `tokenizer.json`
- `xtrain.ckpt` (fp32 master checkpoint, 1.9GB)
- `RUN.md`
xserv loads v9 as:
```text
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
```
Token-match check against xtrain greedy (`max-tokens 40`):
- `Once upon a time`: xtrain and xserv matched through the checked continuation.
- `One day`: diverged after "large, dark," (`very tall man` vs `metallic object`) from BF16 greedy tie sensitivity.
- `The little`: same repetitive pattern, with a short BF16 path divergence.
This is the same class of BF16-vs-f32 greedy drift seen in v8; the important integration result is that xserv successfully loads
true GQA (`kv_heads=10 < heads=40`) and generates from the exported weights.