Files
xtrain/docs/runs/10-v10-fineweb-edu-dim1280-gqa-data6765.md
Gahow Wang 5c27493a90 docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases
Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
  best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
  still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed
  eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:18:48 +08:00

201 lines
7.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Scaling Run v10: Data-axis follow-up — dim1280/18L true GQA + FineWeb-edu 6.765B token — Design Document
## Goal
v9 证明了双轴 scale更大模型 + 更多新 token有效best val 从 v8 的 2.9801 降到 2.8854。
但 v9 的数据量只有 6.013B tokenD/N 约 16.8,低于 Chinchilla 经验里的 20。v10 的目标很窄:
1. **只补数据轴**:补上 v9 中断的 FineWeb-edu shard010把 cache 从 6.013B 推到 6.765B。
2. **架构不变**:完全复用 v9 dim1280 / 18L / 40q-10kv GQA / ffn4096。
3. **验证边际**:看 D/N 从 16.8 到 18.95 是否还能显著降低 val。
## Data
| 项 | 值 |
|----|----|
| 来源 | FineWeb-edu `sample/10BT`shards 000-010 |
| token cache | `data/fineweb-edu.txt.u16.bin` |
| 总 token | **6,765,333,808** |
| held-out val | 末尾 **1,000,000** token |
| train corpus | 6,764,333,808 token |
| 训练消费 token | **6,764,298,240** = 103215 steps x effective batch 256 x seq 256 |
| epoch | ~1.00 |
Important caveat: xtrain 当前训练入口用“全 cache 的末尾 1M token”做 held-out。追加 shard010 后v10 的 val tail
和 v9 的 val tail 不再是同一个切片。因此 v9 原报告的 2.8854 与 v10 原报告的 2.8816 不能被当作严格同一
验证集上的横比。
为了解决这个问题,本轮创建了固定 eval set
```text
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
```
它包含 shard010 末尾 11M token前 10M token 只是为了复用现有 `split_tail(val_tokens=1M)`,真正 eval 的是最后
1M token。该 fixed eval v1 对 v6-v9 都是未见数据;对 v10 也是训练时 held-out。
## Architecture
v10 与 v9 完全相同:
| 项 | 值 |
|----|----|
| dim | 1280 |
| layers | 18 |
| query heads x head_dim | 40 x 32 |
| kv heads | 10 (true GQA, group=4) |
| ffn | 4096 |
| core params | 356.89M |
| total params | 485.55M |
| export tensors | 201 |
## Training
| 项 | 值 |
|----|----|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | **103215** |
| effective global batch | **256** (`--batch 128 --accum-steps 2`) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | **23h51m** |
| steady throughput | **~79.0K tok/s** |
| peak observed memory | ~17GB / GPU |
Command:
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
/opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
--heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
--steps 103215 --batch 128 --accum-steps 2 --seq 256 \
--max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
--eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
--ckpt /dashscope-tmp/wjh/xtrain_v10.ckpt
```
## Results
- train loss: **11.1575 -> 2.9000**
- first val: step 999 = **5.3048**
- best val: step 103214 = **2.8816**
- final val: step 103214 = **2.8816**
- exit code: **0**
FineWeb moving-tail val milestones:
| step | 999 | 9999 | 19999 | 29999 | 39999 | 49999 | 59999 | 69999 | 79999 | 89999 | 99999 | final |
|------|-----|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 5.3048 | 3.5622 | 3.3282 | 3.2450 | 3.1886 | 3.1342 | 3.0714 | 3.0202 | 2.9724 | 2.9236 | 2.8950 | **2.8816** |
The curve still improved at the final eval. There is no overfit signal in this run.
## Fixed Eval V1
Fixed eval v1 (`shard010 tail 1M`, seq256, 64 eval batches):
| version | fixed eval v1 |
|---------|---------------|
| v6 | 3.2328 |
| v7 | 3.1850 |
| v8 | 3.1515 |
| v9 | 2.9278 |
| **v10** | **2.8814** |
This is the cleanest cross-version result in the v10 round. It says:
- v9's double-axis gain transfers to a shard010 holdout: v8 3.1515 -> v9 2.9278.
- v10 further improves on the new shard010 distribution: v9 2.9278 -> v10 2.8814.
- The apparent v9 moving-tail 2.8854 -> v10 moving-tail 2.8816 delta is tiny and not strict apples-to-apples.
## Decoding
Greedy decoding still repeats. Fixed prompts from xtrain:
```text
[Once upon a time] there was a king who had a daughter. She was beautiful and beautiful...
[The little] The little boy was a little boy. The little boy was a little boy...
[One day] I was walking down the street and I saw a man with a dog...
```
Temperature 0.8 is more varied and less immediately looped, but coherence remains weak:
```text
[Once upon a time] I was a kid who did not go to the beach to swim...
[The little] ones are not as loud as the adults...
[One day] I was on the edge of the water, and I saw something I had never seen before...
```
xserv loads the exported v10 true-GQA weights and generates FineWeb-like explanatory prose, but repeated sentence frames remain:
```text
[The history of] the city of San Francisco is a story of the growth of the city...
[In science,] the term "observation" is used to describe the act of observing something...
[Water is] the most important element in the human body...
```
Conclusion: decoding remains a separate bottleneck. The current xtrain sampler only supports greedy and temperature sampling; top-p and
repetition penalty exist in xserv's chat path, but not in the raw xtrain sampler or `xserv-cli` path used for weight validation. A clean
next step is to add a raw generation tool with `temperature/top-p/repetition-penalty` so decoding experiments do not depend on chat
templates.
## xserv Validation
Registry path:
```text
/opt/wjh/projects/tiny-models/v10-fineweb-edu-dim1280-gqa-data6765
```
Files:
- `config.json`
- `model.safetensors` (BF16, 201 tensors, 927MB)
- `tokenizer.json`
- `xtrain.ckpt` (fp32 master checkpoint, 1.9GB)
xserv loads v10 as true GQA:
```text
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
```
## v11 Feasibility: Bigger Model + Longer Context
A v11 smoke test prioritized the user's chosen direction: larger model plus longer context.
Candidate:
| item | value |
|------|-------|
| dim / layers | 1536 / 20 |
| heads / kv_heads | 48 / 12 |
| ffn | 6144 |
| core / total params | 684.26M / 838.65M |
| stack | bf16 + recompute + flash + accum + 8 GPU DDP |
Smoke results:
| seq | batch / accum | effective batch | peak mem | tok/s | result |
|-----|---------------|-----------------|----------|-------|--------|
| 512 | 64 / 4 | 256 | **30530 MiB** | **44.7K** | 50 steps OK |
| 1024 | 32 / 8 | 256 | **30530 MiB** | **31.0K** | 20 steps OK |
Both fit, but the memory margin is thin on 32GB RTX 5090. Expected one-epoch wall clock on 6.76B tokens:
- seq512: roughly **42h**
- seq1024: roughly **61h**
Recommendation: make v11 a controlled run, not a blind launch. Use fixed eval v1, keep data fixed, and choose either:
1. **v11a practical**: dim1536/20L, seq512, batch64/accum4. Faster, still doubles context over v10.
2. **v11b long-context**: dim1536/20L, seq1024, batch32/accum8. More aligned with "long context", but ~2.5 days and tight memory.
For scientific clarity, v11 should not append more data before training; use the current 6.765B train cache while preserving fixed eval v1.