xtrain/docs/runs/10-v10-fineweb-edu-dim1280-gqa-data6765.md
Gahow Wang 5c27493a90 docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases
Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
  best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
  still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed
  eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:18:48 +08:00


Scaling Run v10: Data-axis follow-up — dim1280/18L true GQA + FineWeb-edu 6.765B token — Design Document

Goal

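v9 showed that double-axis scaling (a larger model plus more new tokens) works: best val dropped from v8's 2.9801 to 2.8854. But v9 trained on only 6.013B tokens, a D/N of about 16.8, below the Chinchilla rule of thumb of 20. The goal of v10 is narrow:

  1. Top up the data axis only: add FineWeb-edu shard010, which was cut off in v9, growing the cache from 6.013B to 6.765B tokens.
  2. Keep the architecture unchanged: reuse v9's dim1280 / 18L / 40q-10kv GQA / ffn4096 as is.
  3. Measure the marginal gain: check whether raising D/N from 16.8 to 18.95 still lowers val significantly.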
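
The D/N figures above are plain tokens-per-parameter ratios. The sketch below reproduces them under the assumption that N is the core (non-embedding) parameter count of 356.89M reported in the Architecture section; the function and variable names are illustrative, not part of xtrain.

```python
# Tokens-per-parameter (D/N) for v9 and v10.
# Assumption: N is the core parameter count (356.89M), not the total.
CORE_PARAMS = 356.89e6
RULE_OF_THUMB = 20.0  # Chinchilla-style tokens per parameter


def tokens_per_param(tokens: float, params: float = CORE_PARAMS) -> float:
    return tokens / params


runs = {"v9": 6.013e9, "v10": 6.765e9}
ratios = {name: tokens_per_param(tokens) for name, tokens in runs.items()}

for name, ratio in ratios.items():
    shortfall = RULE_OF_THUMB - ratio
    print(f"{name}: D/N = {ratio:.2f}, short of 20 by {shortfall:.2f}")
```

Even after the top-up, v10 stays slightly under 20 tokens per core parameter, so the run tests a partial step toward that ratio rather than the ratio itself.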

Data

source FineWeb-edu sample/10BT, shards 000-010
token cache data/fineweb-edu.txt.u16.bin
total tokens 6,765,333,808
held-out val last 1,000,000 tokens
train corpus 6,764,333,808 tokens
tokens consumed in training 6,764,298,240 = 103215 steps x effective batch 256 x seq 256
epochs ~1.00
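
The token accounting in the table can be checked with a few lines of arithmetic. This is only a consistency check on the reported numbers; it says nothing about how xtrain actually draws training windows from the cache.

```python
# Consistency check for the Data table (arithmetic only).
STEPS = 103_215
EFFECTIVE_BATCH = 256      # sequences per optimizer step
SEQ_LEN = 256              # tokens per sequence
TOTAL_TOKENS = 6_765_333_808
VAL_TOKENS = 1_000_000

train_tokens = TOTAL_TOKENS - VAL_TOKENS
tokens_per_step = EFFECTIVE_BATCH * SEQ_LEN
consumed = STEPS * tokens_per_step
unused = train_tokens - consumed
epochs = consumed / train_tokens

# Largest whole number of steps that fits in one pass over the train corpus.
max_full_steps = train_tokens // tokens_per_step

print(f"consumed={consumed:,} unused={unused:,} epochs={epochs:.6f}")
```

The step count equals the largest number of full steps that fits in one pass over the train corpus, which is why the run lands at roughly one epoch.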

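Important caveat: the current xtrain training entry point uses the last 1M tokens of the whole cache as held-out data. After shard010 was appended, the v10 val tail is no longer the same slice as the v9 val tail. So v9's originally reported 2.8854 and v10's originally reported 2.8816 cannot be treated as a strict comparison on the same validation set.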

To address this, this round created a fixed eval set:

/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin

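It contains the last 11M tokens of shard010. The first 10M tokens are there only so the existing split_tail(val_tokens=1M) can be reused; the actual eval runs on the final 1M tokens. Fixed eval v1 is unseen data for v6-v9, and for v10 it is likewise held out during training.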
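
A toy sketch of why the tail moves and how the fixed file avoids that. The `split_tail` below is a hypothetical stand-in written for this illustration; xtrain's real `split_tail` is not shown in this document and may differ. One list entry stands in for 1M tokens.

```python
def split_tail(tokens, val_tokens):
    """Hold out the last `val_tokens` entries; everything before them is train."""
    if not 0 < val_tokens < len(tokens):
        raise ValueError("val_tokens must be in 1..len(tokens)-1")
    cut = len(tokens) - val_tokens
    return tokens[:cut], tokens[cut:]


# Toy scale: one list entry stands in for 1M tokens.
cache_v9 = list(range(60))        # cache before the top-up
top_up = list(range(60, 75))      # data appended for v10
cache_v10 = cache_v9 + top_up

_, val_v9 = split_tail(cache_v9, 1)
train_v10, val_v10 = split_tail(cache_v10, 1)

# Fixed eval file: the last 11 entries of the appended data. Reusing
# split_tail with val_tokens=1 evaluates only the final entry.
fixed_file = top_up[-11:]
_, fixed_val = split_tail(fixed_file, 1)

print(val_v9, val_v10, fixed_val)  # [59] [74] [74]
```

In the toy, the fixed slice coincides with the v10 held-out tail and never appears in the older cache, which mirrors the property claimed for fixed eval v1.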

Architecture

v10 is identical to v9:

dim 1280
layers 18
query heads x head_dim 40 x 32
kv heads 10 (true GQA, group=4)
ffn 4096
core params 356.89M
total params 485.55M
export tensors 201
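
The parameter and tensor counts in the table can be reproduced from the shape numbers. The layout below is an assumption chosen because it matches the reported figures: bias-free projections, a three-matrix gated FFN, RMSNorm weights, Qwen3-style per-head q/k norms, untied input and output embeddings, and vocab 50257 (the value in the xserv load log). The tensor names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    dim: int
    layers: int
    q_heads: int
    kv_heads: int
    head_dim: int
    ffn: int
    vocab: int


def per_layer_tensors(c: Config) -> dict:
    """Parameter count of each tensor in one block (assumed layout)."""
    q_out = c.q_heads * c.head_dim
    kv_out = c.kv_heads * c.head_dim   # GQA: fewer K/V heads than Q heads
    return {
        "q_proj": c.dim * q_out,
        "k_proj": c.dim * kv_out,
        "v_proj": c.dim * kv_out,
        "o_proj": q_out * c.dim,
        "gate_proj": c.dim * c.ffn,
        "up_proj": c.dim * c.ffn,
        "down_proj": c.ffn * c.dim,
        "attn_norm": c.dim,
        "ffn_norm": c.dim,
        "q_norm": c.head_dim,
        "k_norm": c.head_dim,
    }


def count(c: Config):
    layer = per_layer_tensors(c)
    core = c.layers * sum(layer.values()) + c.dim   # + final norm
    total = core + 2 * c.vocab * c.dim              # + untied embed and lm_head
    tensors = c.layers * len(layer) + 3             # + embed, final norm, lm_head
    return core, total, tensors


v10 = Config(dim=1280, layers=18, q_heads=40, kv_heads=10,
             head_dim=32, ffn=4096, vocab=50257)
core, total, tensors = count(v10)
print(f"core={core / 1e6:.2f}M total={total / 1e6:.2f}M tensors={tensors}")
# core=356.89M total=485.55M tensors=201
```

Under the same assumptions, and additionally assuming head_dim 32, the v11 candidate (dim 1536, 20 layers, 48/12 heads, ffn 6144) comes out at 684.26M core and 838.65M total, matching the feasibility table.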

Training

optimizer hand-written AdamW, wd=0.1
schedule warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5
grad clip global norm 1.0
steps 103215
effective global batch 256 (--batch 128 --accum-steps 2)
seq_len 256
precision bf16 mixed precision, fp32 master
memory stack activation recompute + flash-attention + gradient accumulation
world size 8 x RTX 5090
wall clock 23h51m
steady throughput ~79.0K tok/s
peak observed memory ~17GB / GPU
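
For reference, a minimal sketch of a warmup -> cosine schedule with the max_lr and min_lr from the table. The warmup length is not stated in this document, so `warmup_steps=2000` is a placeholder, and the exact shape xtrain uses (for example how the first and last steps are handled) may differ.

```python
import math


def lr_at(step, total_steps, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000):
    """Linear warmup to max_lr, then cosine decay to min_lr at the last step.

    warmup_steps is a hypothetical value; the run's real setting is not
    given in this document.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


TOTAL_STEPS = 103_215
for step in (0, 1_999, 2_000, 52_607, TOTAL_STEPS - 1):
    print(step, f"{lr_at(step, TOTAL_STEPS):.3e}")
```

With these settings the learning rate peaks at the end of warmup and reaches min_lr exactly at the final step, so the last eval is taken at the lowest learning rate of the run.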

Command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
  /opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
  --heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
  --steps 103215 --batch 128 --accum-steps 2 --seq 256 \
  --max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
  --eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
  --ckpt /dashscope-tmp/wjh/xtrain_v10.ckpt

Results

  • train loss: 11.1575 -> 2.9000
  • first val: step 999 = 5.3048
  • best val: step 103214 = 2.8816
  • final val: step 103214 = 2.8816
  • exit code: 0

FineWeb moving-tail val milestones:

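| step | 999 | 9999 | 19999 | 29999 | 39999 | 49999 | 59999 | 69999 | 79999 | 89999 | 99999 | final |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| val | 5.3048 | 3.5622 | 3.3282 | 3.2450 | 3.1886 | 3.1342 | 3.0714 | 3.0202 | 2.9724 | 2.9236 | 2.8950 | 2.8816 |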

The curve still improved at the final eval. There is no overfit signal in this run.

Fixed Eval V1

Fixed eval v1 (shard010 tail 1M, seq256, 64 eval batches):

| version | fixed eval v1 |
| --- | --- |
| v6 | 3.2328 |
| v7 | 3.1850 |
| v8 | 3.1515 |
| v9 | 2.9278 |
| v10 | 2.8814 |

This is the cleanest cross-version result in the v10 round. It says:

  • v9's double-axis gain transfers to a shard010 holdout: v8 3.1515 -> v9 2.9278.
  • v10 further improves on the new shard010 distribution: v9 2.9278 -> v10 2.8814.
  • The apparent v9 moving-tail 2.8854 -> v10 moving-tail 2.8816 delta is tiny and not strict apples-to-apples.
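
To put the deltas above on one scale, the sketch below computes the relative loss reduction between consecutive versions on fixed eval v1, and the same quantity for the moving-tail pair. It uses only numbers reported in this document.

```python
fixed_eval_v1 = {
    "v6": 3.2328,
    "v7": 3.1850,
    "v8": 3.1515,
    "v9": 2.9278,
    "v10": 2.8814,
}


def rel_drop(before: float, after: float) -> float:
    """Fractional reduction in loss going from `before` to `after`."""
    return (before - after) / before


versions = list(fixed_eval_v1)
steps = {
    f"{prev}->{cur}": rel_drop(fixed_eval_v1[prev], fixed_eval_v1[cur])
    for prev, cur in zip(versions, versions[1:])
}
for name, drop in steps.items():
    print(f"{name}: {drop:.2%}")

# Moving-tail numbers come from different val slices, shown for contrast only.
moving_tail = rel_drop(2.8854, 2.8816)
print(f"moving-tail v9->v10: {moving_tail:.2%}")
```

On the fixed set, v9 -> v10 is a drop of about 1.6%, roughly an order of magnitude larger than the 0.13% suggested by the moving-tail numbers.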

Decoding

Greedy decoding still repeats. Fixed prompts from xtrain:

[Once upon a time] there was a king who had a daughter. She was beautiful and beautiful...
[The little] The little boy was a little boy. The little boy was a little boy...
[One day] I was walking down the street and I saw a man with a dog...

Temperature 0.8 is more varied and less immediately looped, but coherence remains weak:

[Once upon a time] I was a kid who did not go to the beach to swim...
[The little] ones are not as loud as the adults...
[One day] I was on the edge of the water, and I saw something I had never seen before...

xserv loads the exported v10 true-GQA weights and generates FineWeb-like explanatory prose, but repeated sentence frames remain:

[The history of] the city of San Francisco is a story of the growth of the city...
[In science,] the term "observation" is used to describe the act of observing something...
[Water is] the most important element in the human body...

Conclusion: decoding remains a separate bottleneck. The current xtrain sampler only supports greedy and temperature sampling; top-p and repetition penalty exist in xserv's chat path, but not in the raw xtrain sampler or xserv-cli path used for weight validation. A clean next step is to add a raw generation tool with temperature/top-p/repetition-penalty so decoding experiments do not depend on chat templates.
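
As a starting point for such a tool, here is a minimal sketch of one sampling step that combines repetition penalty, temperature, and top-p. It is not xtrain or xserv code; the function name, defaults, and the sign-aware penalty rule (divide positive logits, multiply negative ones) are assumptions made for illustration.

```python
import math
import random


def sample_next(logits, history, temperature=0.8, top_p=0.9,
                repetition_penalty=1.1, rng=random):
    """Pick the next token id from raw logits (illustrative sketch)."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)
    # Temperature-scaled softmax, shifted by the max for stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    norm = sum(weights)
    probs = [w / norm for w in weights]
    # Nucleus: keep the smallest high-probability set with mass >= top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for tok in order:
        kept.append(tok)
        mass += probs[tok]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[t] for t in kept], k=1)[0]


# Usage with toy logits over a 3-token vocabulary.
print(sample_next([1.0, 3.0, 2.0], history=[], temperature=0))
print(sample_next([1.0, 3.0, 2.0], history=[1], temperature=0,
                  repetition_penalty=2.0))
```

Passing a seeded `random.Random` as `rng` keeps decoding experiments reproducible, which matters when comparing settings across checkpoints.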

xserv Validation

Registry path:

/opt/wjh/projects/tiny-models/v10-fineweb-edu-dim1280-gqa-data6765

Files:

  • config.json
  • model.safetensors (BF16, 201 tensors, 927MB)
  • tokenizer.json
  • xtrain.ckpt (fp32 master checkpoint, 1.9GB)

xserv loads v10 as true GQA:

Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).

v11 Feasibility: Bigger Model + Longer Context

A v11 smoke test prioritized the user's chosen direction: larger model plus longer context.

Candidate:

| item | value |
| --- | --- |
| dim / layers | 1536 / 20 |
| heads / kv_heads | 48 / 12 |
| ffn | 6144 |
| core / total params | 684.26M / 838.65M |
| stack | bf16 + recompute + flash + accum + 8 GPU DDP |

Smoke results:

| seq | batch / accum | effective batch | peak mem | tok/s | result |
| --- | --- | --- | --- | --- | --- |
| 512 | 64 / 4 | 256 | 30530 MiB | 44.7K | 50 steps OK |
| 1024 | 32 / 8 | 256 | 30530 MiB | 31.0K | 20 steps OK |

Both fit, but the memory margin is thin on 32GB RTX 5090. Expected one-epoch wall clock on 6.76B tokens:

  • seq512: roughly 42h
  • seq1024: roughly 61h
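
The wall-clock estimates follow from dividing the token budget by the smoke-test throughput. The sketch below makes that explicit and also derives the step counts; it assumes the short smoke-test throughput holds for a full run, which 20 to 50 steps cannot guarantee. The `plan` helper is illustrative.

```python
TRAIN_TOKENS = 6.76e9  # one epoch over the current cache, as used in the text


def plan(seq_len, batch, accum, tok_per_s, tokens=TRAIN_TOKENS):
    """Return (effective batch, optimizer steps, estimated hours)."""
    effective_batch = batch * accum
    tokens_per_step = effective_batch * seq_len
    steps = int(tokens // tokens_per_step)
    hours = tokens / tok_per_s / 3600
    return effective_batch, steps, hours


candidates = {
    "v11a seq512": plan(512, 64, 4, 44.7e3),
    "v11b seq1024": plan(1024, 32, 8, 31.0e3),
}
for name, (eff, steps, hours) in candidates.items():
    print(f"{name}: effective batch {eff}, {steps} steps, ~{hours:.0f}h")
```

Because the effective batch stays at 256 sequences while the sequence length grows, each v11 step consumes 2x or 4x the tokens of a v10 step, so the optimizer takes correspondingly fewer steps per epoch.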

Recommendation: make v11 a controlled run, not a blind launch. Use fixed eval v1, keep data fixed, and choose either:

  1. v11a practical: dim1536/20L, seq512, batch64/accum4. Faster, still doubles context over v10.
  2. v11b long-context: dim1536/20L, seq1024, batch32/accum8. More aligned with "long context", but ~2.5 days and tight memory.

For scientific clarity, v11 should not append more data before training; use the current 6.765B train cache while preserving fixed eval v1.