Add per-run design+result docs for the two Chinchilla-axis runs that were done but never committed:

- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale, best moving-tail val 2.8854 (~3.2% below v8). Direction validated, gain still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# Scaling Run v10: Data-axis follow-up, dim1280/18L true GQA + FineWeb-edu 6.765B tokens (Design Document)

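## Goal

v9 showed that double-axis scaling (a larger model plus more fresh tokens) works: best val dropped from 2.9801 in v8 to 2.8854.
However, v9 trained on only 6.013B tokens, a D/N of about 16.8, which is below the Chinchilla rule of thumb of about 20. v10 has a deliberately narrow goal:

1. **Top up the data axis only**: add FineWeb-edu shard010, which was interrupted in v9, growing the cache from 6.013B to 6.765B tokens.
2. **Keep the architecture unchanged**: reuse v9's dim1280 / 18L / 40q-10kv GQA / ffn4096 exactly.
3. **Measure the marginal gain**: check whether raising D/N from 16.8 to 18.95 still lowers val significantly.
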
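
The D/N figures in the goal follow from the token counts and the parameter count. A minimal sketch of that arithmetic, assuming N is the 356.89M core parameter count reported for this architecture (embeddings excluded); the function and constant names are illustrative only:

```python
# Tokens-per-parameter (D/N) for v9 and v10.
# Assumption: N is the core parameter count (356.89M), not the 485.55M total.
CORE_PARAMS = 356.89e6
CHINCHILLA_RULE_OF_THUMB = 20.0  # tokens per parameter, approximate


def tokens_per_param(tokens: float, params: float = CORE_PARAMS) -> float:
    """Return the D/N ratio for a given token budget."""
    return tokens / params


v9_ratio = tokens_per_param(6.013e9)
v10_ratio = tokens_per_param(6.765e9)

for name, ratio in (("v9", v9_ratio), ("v10", v10_ratio)):
    gap = CHINCHILLA_RULE_OF_THUMB - ratio
    print(f"{name}: D/N = {ratio:.2f}, {gap:.2f} below the ~20 rule of thumb")
```

With the total parameter count as N the ratios would be lower, so which N is meant matters when comparing against the rule of thumb.
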
## Data

| Item | Value |
|------|-------|
| Source | FineWeb-edu `sample/10BT`, shards 000-010 |
| token cache | `data/fineweb-edu.txt.u16.bin` |
| Total tokens | **6,765,333,808** |
| held-out val | last **1,000,000** tokens |
| train corpus | 6,764,333,808 tokens |
| Tokens consumed in training | **6,764,298,240** = 103215 steps x effective batch 256 x seq 256 |
| epoch | ~1.00 |

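
The token budget in the table can be checked with a few lines of integer arithmetic. This is only a sketch of the bookkeeping, using the table's numbers; the variable names are not taken from xtrain:

```python
# Token budget check for the v10 run, using the numbers from the Data table.
TOTAL_TOKENS = 6_765_333_808
VAL_TOKENS = 1_000_000
STEPS = 103_215
EFFECTIVE_BATCH = 256  # sequences per optimizer step
SEQ_LEN = 256

train_corpus = TOTAL_TOKENS - VAL_TOKENS
tokens_per_step = EFFECTIVE_BATCH * SEQ_LEN
consumed = STEPS * tokens_per_step
epochs = consumed / train_corpus
# Largest whole number of steps that fits in one pass over the train corpus.
max_full_steps = train_corpus // tokens_per_step

print(f"train corpus:    {train_corpus:,}")
print(f"tokens per step: {tokens_per_step:,}")
print(f"consumed:        {consumed:,}")
print(f"epochs:          {epochs:.6f}")
print(f"max full steps:  {max_full_steps}")
```

The step count equals the largest number of full steps that fit in one pass, which is why the run lands at just under one epoch.
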
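Important caveat: the current xtrain training entry point uses "the last 1M tokens of the whole cache" as held-out data. After shard010 was appended, v10's val tail and v9's val tail are no longer the same slice. The originally reported 2.8854 for v9 and 2.8816 for v10 therefore cannot be treated as a strict comparison on the same validation set.

To resolve this, this round created a fixed eval set:

```text
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
```

It contains the last 11M tokens of shard010. The first 10M tokens are there only so that the existing `split_tail(val_tokens=1M)` can be reused; the tokens actually evaluated are the final 1M. Fixed eval v1 is unseen data for v6-v9, and for v10 it was also held out during training.
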
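
The moving-tail problem is easy to see on a toy cache. The sketch below uses a hypothetical Python stand-in for the tail split (it is not the xtrain implementation) and assumes shard010 was appended to the end of the existing cache:

```python
# Hypothetical stand-in for xtrain's tail split: hold out the last
# `val_tokens` tokens of the cache and train on everything before them.
from typing import List, Sequence, Tuple


def split_tail(tokens: Sequence[int], val_tokens: int) -> Tuple[List[int], List[int]]:
    if not 0 < val_tokens < len(tokens):
        raise ValueError("val_tokens must be positive and smaller than the cache")
    cut = len(tokens) - val_tokens
    return list(tokens[:cut]), list(tokens[cut:])


# Toy cache: shards 000-009 are tokens 0..19, shard010 adds tokens 20..29.
v9_cache = list(range(20))
v10_cache = v9_cache + list(range(20, 30))

_, v9_val = split_tail(v9_cache, val_tokens=4)
v10_train, v10_val = split_tail(v10_cache, val_tokens=4)

# The moving tail differs between runs, and in this toy model v9's old
# val slice ends up inside v10's training data.
moving_tail_differs = v9_val != v10_val
v9_val_in_v10_train = set(v9_val) <= set(v10_train)

# Fixed eval file: the tail of shard010 only. Splitting it with the same
# helper evaluates on its last tokens, which match v10's held-out tail.
fixed_eval_file = v10_cache[-8:]
_, fixed_val = split_tail(fixed_eval_file, val_tokens=4)
print(moving_tail_differs, v9_val_in_v10_train, fixed_val == v10_val)
```

A fixed file sidesteps the issue because its tail does not move when more shards are appended to the training cache.
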
## Architecture

v10 is identical to v9:

| Item | Value |
|------|-------|
| dim | 1280 |
| layers | 18 |
| query heads x head_dim | 40 x 32 |
| kv heads | 10 (true GQA, group=4) |
| ffn | 4096 |
| core params | 356.89M |
| total params | 485.55M |
| export tensors | 201 |

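
The parameter and tensor counts can be reproduced from the shape numbers under a few assumptions that this document does not state explicitly: a SwiGLU FFN with gate/up/down matrices, no biases, per-head q/k RMSNorm, two RMSNorms per layer plus a final norm, and untied input/output embeddings over a 50257-token vocabulary. A sketch under those assumptions (`count_params` is an illustrative helper, not part of xtrain):

```python
# Parameter and tensor count sketch for a decoder-only transformer with GQA.
# Assumptions (not stated in the run doc): SwiGLU FFN (gate/up/down), no
# biases, per-head q/k RMSNorm, two RMSNorms per layer, one final norm, and
# untied input/output embeddings over a 50257-token vocabulary.
def count_params(dim, layers, q_heads, kv_heads, head_dim, ffn, vocab=50257):
    attn = dim * q_heads * head_dim * 2        # W_q and W_o
    attn += dim * kv_heads * head_dim * 2      # W_k and W_v (shared per group)
    mlp = 3 * dim * ffn                        # gate, up, down projections
    norms = 2 * dim + 2 * head_dim             # layer norms + q/k norms
    core = layers * (attn + mlp + norms) + dim  # + final norm
    total = core + 2 * vocab * dim             # + embedding and output head
    tensors = layers * 11 + 3                  # 11 per layer + embed, norm, head
    return core, total, tensors


core, total, tensors = count_params(
    dim=1280, layers=18, q_heads=40, kv_heads=10, head_dim=32, ffn=4096
)
print(f"core={core / 1e6:.2f}M total={total / 1e6:.2f}M tensors={tensors}")
```

Under these assumptions the sketch gives 356.89M core parameters, 485.55M total, and 201 tensors, matching the table. The K/V projections are a quarter the size of the Q projection because 10 kv heads serve 40 query heads.
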
## Training

| Item | Value |
|------|-------|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | **103215** |
| effective global batch | **256** (`--batch 128 --accum-steps 2`) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | **23h51m** |
| steady throughput | **~79.0K tok/s** |
| peak observed memory | ~17GB / GPU |

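
For readers unfamiliar with the "warmup -> cosine" schedule, here is a minimal sketch of one common form: linear warmup followed by cosine decay to `min_lr`. The warmup length is not given in this document, so `warmup_steps=1000` is a placeholder, and the exact formula xtrain uses may differ:

```python
import math


def lr_at(step, total_steps, max_lr=6e-4, min_lr=6e-5, warmup_steps=1000):
    """Linear warmup to max_lr, then cosine decay to min_lr.

    warmup_steps=1000 is a placeholder; the run doc does not state it.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 103_215
for step in (0, 999, 1000, 51_607, TOTAL_STEPS - 1):
    print(f"step {step:>6}: lr = {lr_at(step, TOTAL_STEPS):.3e}")
```

In this form the learning rate peaks at the end of warmup and reaches `min_lr` exactly on the last step.
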
Command:

```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
/opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
--heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
--steps 103215 --batch 128 --accum-steps 2 --seq 256 \
--max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
--eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
--ckpt /dashscope-tmp/wjh/xtrain_v10.ckpt
```

## Results

- train loss: **11.1575 -> 2.9000**
- first val: step 999 = **5.3048**
- best val: step 103214 = **2.8816**
- final val: step 103214 = **2.8816**
- exit code: **0**

FineWeb moving-tail val milestones:

| step | 999 | 9999 | 19999 | 29999 | 39999 | 49999 | 59999 | 69999 | 79999 | 89999 | 99999 | final |
|------|-----|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 5.3048 | 3.5622 | 3.3282 | 3.2450 | 3.1886 | 3.1342 | 3.0714 | 3.0202 | 2.9724 | 2.9236 | 2.8950 | **2.8816** |

Val was still improving at the final eval, and the best val coincides with the final val (step 103214), so this run shows no sign of overfitting.

## Fixed Eval V1

Fixed eval v1 (`shard010 tail 1M`, seq256, 64 eval batches):

| version | fixed eval v1 |
|---------|---------------|
| v6 | 3.2328 |
| v7 | 3.1850 |
| v8 | 3.1515 |
| v9 | 2.9278 |
| **v10** | **2.8814** |

This is the cleanest cross-version result in the v10 round. It says:

- v9's double-axis gain transfers to a shard010 holdout: v8 3.1515 -> v9 2.9278.
- v10 further improves on the new shard010 distribution: v9 2.9278 -> v10 2.8814.
- The apparent moving-tail delta (v9 2.8854 -> v10 2.8816) is tiny and is not a strict apples-to-apples comparison, because the two val tails are different slices.

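
To put the per-version steps side by side, the fixed-eval losses can be turned into absolute and relative drops. A small sketch, with the losses copied from the fixed eval v1 table and an illustrative helper name:

```python
# Fixed eval v1 losses copied from the Fixed Eval V1 table.
FIXED_EVAL_V1 = {"v6": 3.2328, "v7": 3.1850, "v8": 3.1515, "v9": 2.9278, "v10": 2.8814}


def step_deltas(losses):
    """Return (prev, cur, absolute drop, relative drop in %) per version step."""
    names = list(losses)
    rows = []
    for prev, cur in zip(names, names[1:]):
        drop = losses[prev] - losses[cur]
        rows.append((prev, cur, drop, 100.0 * drop / losses[prev]))
    return rows


for prev, cur, drop, pct in step_deltas(FIXED_EVAL_V1):
    print(f"{prev} -> {cur}: -{drop:.4f} ({pct:.2f}%)")
```

On these numbers the v8 -> v9 step is by far the largest drop (0.2237), while v9 -> v10 (0.0464) is of the same order as v6 -> v7 (0.0478) and v7 -> v8 (0.0335).
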
## Decoding

Greedy decoding still repeats. Fixed prompts from xtrain:

```text
[Once upon a time] there was a king who had a daughter. She was beautiful and beautiful...
[The little] The little boy was a little boy. The little boy was a little boy...
[One day] I was walking down the street and I saw a man with a dog...
```

Sampling at temperature 0.8 is more varied and loops less quickly, but coherence remains weak:

```text
[Once upon a time] I was a kid who did not go to the beach to swim...
[The little] ones are not as loud as the adults...
[One day] I was on the edge of the water, and I saw something I had never seen before...
```

xserv loads the exported v10 true-GQA weights and generates FineWeb-like explanatory prose, but repeated sentence frames remain:

```text
[The history of] the city of San Francisco is a story of the growth of the city...
[In science,] the term "observation" is used to describe the act of observing something...
[Water is] the most important element in the human body...
```

Conclusion: decoding remains a separate bottleneck. The current xtrain sampler only supports greedy and temperature sampling; top-p and repetition penalty exist in xserv's chat path, but not in the raw xtrain sampler or the `xserv-cli` path used for weight validation. A clean next step is to add a raw generation tool with `temperature/top-p/repetition-penalty` so decoding experiments do not depend on chat templates.

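
As a rough idea of what such a raw sampler could look like, here is a self-contained sketch combining temperature, top-p (nucleus) sampling, and a repetition penalty. `sample_next` is a hypothetical helper, not the xtrain or xserv API, and the penalty uses one common convention (divide positive logits of already-generated tokens by the penalty, multiply negative ones):

```python
import math
import random


def sample_next(logits, history, temperature=0.8, top_p=0.9,
                repetition_penalty=1.1, rng=random):
    """Pick the next token id from raw logits (hypothetical helper)."""
    adjusted = list(logits)
    for tok in set(history):  # penalize tokens that were already generated
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    if temperature <= 0.0:  # treat non-positive temperature as greedy
        return max(range(len(adjusted)), key=adjusted.__getitem__)
    scaled = [x / temperature for x in adjusted]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    nucleus, mass = [], 0.0
    for prob, idx in ranked:  # smallest prefix whose mass reaches top_p
        nucleus.append((prob, idx))
        mass += prob
        if mass >= top_p:
            break
    draw = rng.random() * mass  # sample within the renormalized nucleus
    for prob, idx in nucleus:
        draw -= prob
        if draw <= 0.0:
            return idx
    return nucleus[-1][1]  # guard against floating-point round-off


logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next(logits, history=[0, 0, 0], rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps decoding experiments reproducible, and setting `temperature=0.0` with `repetition_penalty=1.0` recovers plain greedy decoding for comparison.
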
## xserv Validation

Registry path:

```text
/opt/wjh/projects/tiny-models/v10-fineweb-edu-dim1280-gqa-data6765
```

Files:

- `config.json`
- `model.safetensors` (BF16, 201 tensors, 927MB)
- `tokenizer.json`
- `xtrain.ckpt` (fp32 master checkpoint, 1.9GB)

xserv loads v10 as true GQA:

```text
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
```

## v11 Feasibility: Bigger Model + Longer Context

A v11 smoke test prioritized the user's chosen direction: larger model plus longer context.

Candidate:

| item | value |
|------|-------|
| dim / layers | 1536 / 20 |
| heads / kv_heads | 48 / 12 |
| ffn | 6144 |
| core / total params | 684.26M / 838.65M |
| stack | bf16 + recompute + flash + accum + 8 GPU DDP |

Smoke results:

| seq | batch / accum | effective batch | peak mem | tok/s | result |
|-----|---------------|-----------------|----------|-------|--------|
| 512 | 64 / 4 | 256 | **30530 MiB** | **44.7K** | 50 steps OK |
| 1024 | 32 / 8 | 256 | **30530 MiB** | **31.0K** | 20 steps OK |

Both fit, but the memory margin is thin on 32GB RTX 5090. Expected one-epoch wall clock on 6.76B tokens:

- seq512: roughly **42h**
- seq1024: roughly **61h**

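
The wall-clock estimates follow from dividing the token budget by the measured throughput. A minimal sketch, assuming throughput stays flat over the whole run and ignoring eval and checkpoint overhead (`epoch_hours` is an illustrative name):

```python
# One-epoch wall-clock estimate from steady-state throughput.
# Assumption: throughput stays flat and eval/checkpoint overhead is ignored.
def epoch_hours(tokens: float, tokens_per_second: float) -> float:
    return tokens / tokens_per_second / 3600.0


TOKENS = 6.76e9
for label, tok_s in (("v11 seq512", 44.7e3), ("v11 seq1024", 31.0e3)):
    print(f"{label}: ~{epoch_hours(TOKENS, tok_s):.1f} h")

# Sanity check against v10: 6,764,298,240 tokens at ~79.0K tok/s.
print(f"v10: ~{epoch_hours(6_764_298_240, 79.0e3):.1f} h (reported 23h51m)")
```

The same formula applied to v10 lands within a few minutes of the reported 23h51m, which suggests the smoke-test throughputs are a reasonable basis for the v11 estimates.
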
Recommendation: make v11 a controlled run, not a blind launch. Use fixed eval v1, keep data fixed, and choose either:

1. **v11a practical**: dim1536/20L, seq512, batch64/accum4. Faster, still doubles context over v10.
2. **v11b long-context**: dim1536/20L, seq1024, batch32/accum8. More aligned with "long context", but ~2.5 days and tight memory.

For scientific clarity, v11 should not append more data before training; use the current 6.765B train cache while preserving fixed eval v1.