
Gahow Wang 5c27493a90 docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases

Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
  best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
  still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed
  eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-29 16:18:48 +08:00


Scaling Run v10: Data-axis follow-up — dim1280/18L true GQA + FineWeb-edu 6.765B token — Design Document

Goal

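v9 showed that double-axis scaling (a larger model plus more fresh tokens) works: best val dropped from 2.9801 in v8 to 2.8854. But v9 trained on only 6.013B tokens, a D/N of about 16.8, below the Chinchilla rule of thumb of 20. The goal of v10 is narrow: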

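Data axis only: restore the FineWeb-edu shard010 that was interrupted in v9, growing the cache from 6.013B to 6.765B tokens.
Architecture unchanged: reuse v9's dim1280 / 18L / 40q-10kv GQA / ffn4096 exactly.
Measure the margin: check whether raising D/N from 16.8 to 18.95 still lowers val significantly.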
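The D/N figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes N is the core (non-embedding) parameter count of 356.89M from the Architecture table, which is what the quoted ratios imply; the function name is illustrative only.

```python
# Tokens-per-parameter (D/N) for v9 and v10.
# Assumption: N is the core parameter count, not the total including embeddings.
CORE_PARAMS = 356.89e6
CHINCHILLA_RULE_OF_THUMB = 20.0


def tokens_per_param(tokens: float, params: float = CORE_PARAMS) -> float:
    """Return the D/N ratio for a given token budget."""
    return tokens / params


v9_ratio = tokens_per_param(6.013e9)
v10_ratio = tokens_per_param(6.765e9)

for name, ratio in (("v9", v9_ratio), ("v10", v10_ratio)):
    gap = CHINCHILLA_RULE_OF_THUMB - ratio
    print(f"{name}: D/N = {ratio:.2f} (gap to 20: {gap:.2f})")
```

Both runs stay below 20, so v10 narrows the gap rather than closing it.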

Data

item	value
source	FineWeb-edu `sample/10BT`, shards 000-010
token cache	`data/fineweb-edu.txt.u16.bin`
total tokens	6,765,333,808
held-out val	last 1,000,000 tokens
train corpus	6,764,333,808 tokens
tokens consumed in training	6,764,298,240 = 103215 steps x effective batch 256 x seq 256
epoch	~1.00
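The token accounting in the table can be checked directly. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Token accounting for the v10 run, from the Data table.
TOTAL_TOKENS = 6_765_333_808
VAL_TOKENS = 1_000_000
STEPS = 103_215
EFFECTIVE_BATCH = 256  # sequences per optimizer step
SEQ_LEN = 256          # tokens per sequence

train_tokens = TOTAL_TOKENS - VAL_TOKENS
tokens_per_step = EFFECTIVE_BATCH * SEQ_LEN
consumed = STEPS * tokens_per_step

# Largest number of full steps that fits in one pass over the train corpus.
max_full_steps = train_tokens // tokens_per_step
epochs = consumed / train_tokens

print(f"train corpus:   {train_tokens:,}")
print(f"consumed:       {consumed:,}")
print(f"max full steps: {max_full_steps}")
print(f"epochs:         {epochs:.6f}")
```

The step count equals the number of full steps in a single pass, which is why the run lands at roughly one epoch.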

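Important caveat: xtrain's current training entry point uses "the last 1M tokens of the whole cache" as the held-out set. After shard010 was appended, the v10 val tail and the v9 val tail are no longer the same slice. The originally reported v9 2.8854 and v10 2.8816 therefore cannot be treated as a strict comparison on the same validation set.

To address this, a fixed eval set was created in this round: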

/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin

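It contains the last 11M tokens of shard010. The first 10M tokens are there only so that the existing split_tail(val_tokens=1M) can be reused; the actual eval uses the last 1M tokens. Fixed eval v1 is unseen data for v6-v9, and for v10 it was also held out during training.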
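A toy illustration of why the moving tail shifts when data is appended, and why a frozen file does not. The `split_tail` below is a hypothetical stand-in written for this example; it is not xtrain's implementation, and the cache sizes are scaled down by several orders of magnitude.

```python
from typing import List, Tuple


def split_tail(tokens: List[int], val_tokens: int) -> Tuple[List[int], List[int]]:
    """Hypothetical stand-in: hold out the last `val_tokens` tokens."""
    if not 0 < val_tokens < len(tokens):
        raise ValueError("val_tokens must be in (0, len(tokens))")
    return tokens[:-val_tokens], tokens[-val_tokens:]


# Toy caches: v10 is v9 plus an appended shard.
cache_v9 = list(range(100))
cache_v10 = cache_v9 + list(range(100, 120))

_, val_v9 = split_tail(cache_v9, 10)
_, val_v10 = split_tail(cache_v10, 10)
print("moving tails identical:", val_v9 == val_v10)

# A fixed eval file is written once and never changes, so every model
# version is scored on the same final slice of that file.
fixed_eval_file = cache_v10[-30:]
_, fixed_val = split_tail(fixed_eval_file, 10)
print("fixed eval slice:", fixed_val)
```

In this toy, the fixed slice coincides with the v10 tail and is absent from the v9 cache, mirroring the held-out and unseen properties described above.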

Architecture

v10 is identical to v9:

item	value
dim	1280
layers	18
query heads x head_dim	40 x 32
kv heads	10 (true GQA, group=4)
ffn	4096
core params	356.89M
total params	485.55M
export tensors	201
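The parameter and tensor counts in the table can be reproduced under a few assumptions that are not stated in this document: a Qwen3-style block with bias-free projections, a three-matrix gated FFN, two RMSNorms per layer plus per-head q/k norms, untied input and output embeddings, and the vocab of 50257 reported by xserv. The function below is a sketch under those assumptions, not the project's own accounting.

```python
def param_counts(dim, layers, q_heads, kv_heads, head_dim, ffn, vocab):
    """Return (core_params, total_params, tensor_count) for an assumed Qwen3-style stack."""
    q_dim = q_heads * head_dim
    kv_dim = kv_heads * head_dim
    # Attention: Wq, Wk, Wv, Wo (no biases assumed).
    attn = dim * q_dim + 2 * dim * kv_dim + q_dim * dim
    # Gated FFN: gate, up, down projections (assumed).
    mlp = 3 * dim * ffn
    # Two block RMSNorms plus q_norm and k_norm over head_dim (assumed).
    norms = 2 * dim + 2 * head_dim
    core = layers * (attn + mlp + norms) + dim  # + final norm
    total = core + 2 * vocab * dim              # untied embed + lm_head (assumed)
    tensors = layers * 11 + 3                   # 11 per layer + embed, final norm, lm_head
    return core, total, tensors


core, total, tensors = param_counts(
    dim=1280, layers=18, q_heads=40, kv_heads=10, head_dim=32, ffn=4096, vocab=50257
)
print(f"core   = {core / 1e6:.2f}M")
print(f"total  = {total / 1e6:.2f}M")
print(f"tensors = {tensors}")
```

With these assumptions the sketch matches the reported 356.89M core, 485.55M total, and 201 tensors, and the same function applied to dim1536 / 20L / 48q-12kv / ffn6144 gives 684.26M core and 838.65M total.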

Training

item	value
optimizer	hand-written AdamW, wd=0.1
schedule	warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5
grad clip	global norm 1.0
steps	103215
effective global batch	256 (`--batch 128 --accum-steps 2`)
seq_len	256
precision	bf16 mixed precision, fp32 master
memory stack	activation recompute + flash-attention + gradient accumulation
world size	8 x RTX 5090
wall clock	23h51m
steady throughput	~79.0K tok/s
peak observed memory	~17GB / GPU
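For readers unfamiliar with the schedule row, here is a minimal sketch of a linear warmup followed by cosine decay between the reported max_lr and min_lr. The warmup length is not given in this document, so `warmup_steps=1000` is a placeholder, and the linear warmup shape is also an assumption; xtrain's actual schedule may differ in both.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          max_lr: float = 6e-4, min_lr: float = 6e-5) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


TOTAL_STEPS = 103_215
WARMUP_STEPS = 1_000  # placeholder, not from the source

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step, TOTAL_STEPS, WARMUP_STEPS):.2e}")
```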

Command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
  /opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
  --heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
  --steps 103215 --batch 128 --accum-steps 2 --seq 256 \
  --max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
  --eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
  --ckpt /dashscope-tmp/wjh/xtrain_v10.ckpt

Results

train loss: 11.1575 -> 2.9000
first val: step 999 = 5.3048
best val: step 103214 = 2.8816
final val: step 103214 = 2.8816
exit code: 0

FineWeb moving-tail val milestones:

step	999	9999	19999	29999	39999	49999	59999	69999	79999	89999	99999	final
val	5.3048	3.5622	3.3282	3.2450	3.1886	3.1342	3.0714	3.0202	2.9724	2.9236	2.8950	2.8816

The curve still improved at the final eval. There is no overfit signal in this run.

Fixed Eval V1

Fixed eval v1 (shard010 tail 1M, seq256, 64 eval batches):

version	fixed eval v1
v6	3.2328
v7	3.1850
v8	3.1515
v9	2.9278
v10	2.8814

This is the cleanest cross-version result in the v10 round. It says:

v9's double-axis gain transfers to a shard010 holdout: v8 3.1515 -> v9 2.9278.
v10 further improves on the new shard010 distribution: v9 2.9278 -> v10 2.8814.
The apparent v9 moving-tail 2.8854 -> v10 moving-tail 2.8816 delta is tiny and not strict apples-to-apples.
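To put the fixed eval deltas on a common scale, the sketch below computes the relative loss drop between consecutive versions and the implied perplexity ratio. It assumes the reported values are mean cross-entropy in nats per token, which this document does not state explicitly.

```python
import math

# Fixed eval v1 losses from the table above.
fixed_eval = {"v6": 3.2328, "v7": 3.1850, "v8": 3.1515, "v9": 2.9278, "v10": 2.8814}

versions = list(fixed_eval)
deltas = {}
for prev, cur in zip(versions, versions[1:]):
    drop = fixed_eval[prev] - fixed_eval[cur]
    rel_pct = 100.0 * drop / fixed_eval[prev]
    ppl_ratio = math.exp(-drop)  # assumes loss is in nats per token
    deltas[(prev, cur)] = (drop, rel_pct, ppl_ratio)
    print(f"{prev} -> {cur}: drop {drop:.4f} ({rel_pct:.2f}%), perplexity x{ppl_ratio:.3f}")
```

On this set the v8 to v9 step is roughly 7% in loss and the v9 to v10 step roughly 1.6%, so the data-only top-up is a real but much smaller gain than the double-axis change.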

Decoding

Greedy decoding still repeats. Fixed prompts from xtrain:

[Once upon a time] there was a king who had a daughter. She was beautiful and beautiful...
[The little] The little boy was a little boy. The little boy was a little boy...
[One day] I was walking down the street and I saw a man with a dog...

Temperature 0.8 is more varied and less immediately looped, but coherence remains weak:

[Once upon a time] I was a kid who did not go to the beach to swim...
[The little] ones are not as loud as the adults...
[One day] I was on the edge of the water, and I saw something I had never seen before...

xserv loads the exported v10 true-GQA weights and generates FineWeb-like explanatory prose, but repeated sentence frames remain:

[The history of] the city of San Francisco is a story of the growth of the city...
[In science,] the term "observation" is used to describe the act of observing something...
[Water is] the most important element in the human body...

Conclusion: decoding remains a separate bottleneck. The current xtrain sampler only supports greedy and temperature sampling; top-p and repetition penalty exist in xserv's chat path, but not in the raw xtrain sampler or xserv-cli path used for weight validation. A clean next step is to add a raw generation tool with temperature/top-p/repetition-penalty so decoding experiments do not depend on chat templates.
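As a starting point for such a raw generation tool, the following sketch shows one common way to combine repetition penalty, temperature, and top-p (nucleus) sampling over a single logits vector. It is a standalone illustration, not xtrain's or xserv's implementation; the function name, the default values, and the CTRL-style penalty rule are all assumptions.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_next(logits: Sequence[float], history: Sequence[int],
                temperature: float = 0.8, top_p: float = 0.9,
                repetition_penalty: float = 1.1,
                rng: Optional[random.Random] = None) -> int:
    """Pick the next token id from raw logits."""
    rng = rng or random.Random()
    adjusted: List[float] = list(logits)

    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(history):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # temperature <= 0 is treated as greedy decoding.
    if temperature <= 0:
        return max(range(len(adjusted)), key=adjusted.__getitem__)

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in adjusted]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]

    # Top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept: List[int] = []
    mass = 0.0
    for idx in order:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break

    # random.choices renormalizes the kept weights internally.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Calling `sample_next(logits, generated_ids)` once per step in a generation loop keeps the decoding experiment independent of any chat template.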

xserv Validation

Registry path:

/opt/wjh/projects/tiny-models/v10-fineweb-edu-dim1280-gqa-data6765

Files:

config.json
model.safetensors (BF16, 201 tensors, 927MB)
tokenizer.json
xtrain.ckpt (fp32 master checkpoint, 1.9GB)

xserv loads v10 as true GQA:

Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).

v11 Feasibility: Bigger Model + Longer Context

A v11 smoke test prioritized the user's chosen direction: larger model plus longer context.

Candidate:

item	value
dim / layers	1536 / 20
heads / kv_heads	48 / 12
ffn	6144
core / total params	684.26M / 838.65M
stack	bf16 + recompute + flash + accum + 8 GPU DDP

Smoke results:

seq	batch / accum	effective batch	peak mem	tok/s	result
512	64 / 4	256	30530 MiB	44.7K	50 steps OK
1024	32 / 8	256	30530 MiB	31.0K	20 steps OK

Both fit, but the memory margin is thin on 32GB RTX 5090. Expected one-epoch wall clock on 6.76B tokens:

seq512: roughly 42h
seq1024: roughly 61h
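The wall-clock estimates follow from dividing the token budget by the measured smoke-test throughput. The sketch below assumes throughput stays at the short smoke-run level for the whole epoch and ignores eval and checkpoint overhead; the v10 line is included as a sanity check against the observed 23h51m.

```python
def epoch_hours(tokens: float, tok_per_sec: float) -> float:
    """Hours for one pass over `tokens` at a steady `tok_per_sec`."""
    return tokens / tok_per_sec / 3600.0


V11_TOKENS = 6.76e9

seq512_hours = epoch_hours(V11_TOKENS, 44.7e3)
seq1024_hours = epoch_hours(V11_TOKENS, 31.0e3)
v10_hours = epoch_hours(6_764_298_240, 79.0e3)

print(f"v11 seq512:  {seq512_hours:.1f} h")
print(f"v11 seq1024: {seq1024_hours:.1f} h ({seq1024_hours / 24:.1f} days)")
print(f"v10 check:   {v10_hours:.1f} h")
```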

Recommendation: make v11 a controlled run, not a blind launch. Use fixed eval v1, keep data fixed, and choose either:

v11a practical: dim1536/20L, seq512, batch64/accum4. Faster, still doubles context over v10.
v11b long-context: dim1536/20L, seq1024, batch32/accum8. More aligned with "long context", but ~2.5 days and tight memory.

For scientific clarity, v11 should not append more data before training; use the current 6.765B train cache while preserving fixed eval v1.
