
Gahow Wang 9c557f0609 docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01)

v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256),
trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's
1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to
registry v7-fineweb-edu-dim768, serves in xserv (coherent expository
English, ~v6 quality).

Key finding: more epochs of the SAME subset gave only ~0.05 val drop and
the curve flattened (~step 44000) with no sampling quality gain → the
2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's
TinyStories data-volume saturation: repeating old data has thin margins;
true further gains need FRESH shards (more diverse tokens), as v6's
corpus-swap (which raised the ceiling) showed.

Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro
saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis
ceiling note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-17 03:55:47 +08:00


Scaling Run v7: more epochs on the same subset (v6's 2.255B-token FineWeb-edu subset × 1.45 epochs, dim768/18L as in v4/v5/v6, 8-GPU DDP bf16): Design Document

Goal

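v6 produced a tempting curve: after only 1.02 epochs on pure FineWeb-edu, the FineWeb val loss was still falling monotonically at the final step, with no plateau, which looked like "this corpus has not been fully used yet". At the end of v6, the preferred lever for v7 was therefore set to "same arch, more FineWeb-edu" (more epochs), because it was the cheapest and most certain next step at the time: no change in model size and no new data.

v7 puts that judgment to the test: the architecture and the corpus subset are fully frozen, and the only variable is the number of epochs (1.02 → 1.45), to see whether the FineWeb val loss keeps falling.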

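Architecture fully frozen = v4/v5/v6 (dim 768 / 24 heads × 32 head_dim / 18 layers / SwiGLU ffn 2048, core 127.43M, total 204.63M). Not a single weight dimension changes, and the exported config.json is byte-identical to v6's.
Data subset fully frozen = v6: the same 2.255B-token FineWeb-edu subset (3 parquet shards of sample/10BT). v7 adds no new shards. That is the core setup of this run: it measures the marginal return of re-feeding the same subset, not of more diverse data.
Only steps changes, from 35000 to 50000 (global 256 × seq 256 unchanged) → tokens consumed in training rise from ~2.29B to ~3.28B, and the epoch count from 1.02 to 1.45. All other hyperparameters (lr schedule / clip / bf16 / 8 GPUs) are untouched.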
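The token and epoch figures above follow directly from the step count. A minimal sketch of that arithmetic, using only numbers quoted in this document (the function names are illustrative, not part of the training code):

```python
# Token budget per run, derived from figures quoted in this document.
GLOBAL_BATCH = 256              # sequences per optimizer step (32 per rank x 8 ranks)
SEQ_LEN = 256                   # tokens per sequence
SUBSET_TOKENS = 2_254_904_418   # size of the frozen FineWeb-edu subset


def tokens_consumed(steps: int) -> int:
    """Tokens seen during training for a given number of optimizer steps."""
    return steps * GLOBAL_BATCH * SEQ_LEN


def epochs(steps: int) -> float:
    """Passes over the frozen subset implied by that many steps."""
    return tokens_consumed(steps) / SUBSET_TOKENS


for name, steps in (("v6", 35_000), ("v7", 50_000)):
    print(f"{name}: {tokens_consumed(steps) / 1e9:.2f}B tokens, {epochs(steps):.2f} epochs")
# v6: 2.29B tokens, 1.02 epochs
# v7: 3.28B tokens, 1.45 epochs
```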

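⚠️ Methodology note (same as v6)

v7's val (FineWeb-edu 3.0149) and v6's (3.0652) use the same yardstick and the same 1M-token held-out set, so they are directly comparable. Neither can be compared with the TinyStories val of v0–v5 (~1.1): real web text has higher entropy, so ~3.0 is the expected value, not a regression.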

Data (the only difference between v7 and v6 = number of epochs)

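| Item | v6 | v7 |
| --- | --- | --- |
| Source | FineWeb-edu `sample/10BT` (real educational web pages) | Same (the exact same subset, not new data) |
| Corpus size | 2,254,904,418 tokens (3 parquet shards) | 2,254,904,418 tokens (same subset) |
| Tokens consumed in training | ~2.29B (35000 steps) | ~3.28B (50000 steps × global 256 × seq 256) |
| Epochs | ~1.02 | ~1.45 |
| tokenizer | gpt2 BPE (vocab 50257) | Same |
| Cache | `data/fineweb-edu.txt.u16.bin` (4.51GB u16) | Same cache |
| held-out val | Last 1M tokens of FineWeb-edu | Same (comparable with v6) |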

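Cache-only loading (an engineering footnote for v7): to free disk space, the redundant fineweb-edu.txt (10.45GB) was deleted after v6, leaving only the 4.51GB .u16.bin cache. Before training, v7 verified that Corpus::load_cached returns early on a cache hit and does not read the .txt. In practice the 2.255B tokens loaded fine from the cache alone, with zero code changes. (Only if the cache is missing does it need to be rebuilt with scripts/fineweb_to_txt.py + parquet.)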
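The cache-hit behaviour described above can be sketched as follows. This is a hypothetical Python stand-in, not the project's Corpus::load_cached: the function name, the `<txt>.u16.bin` naming rule, and the native-byte-order u16 layout are assumptions made for illustration. The 50257-entry gpt2 vocabulary does fit in 16 bits, which is what makes a u16 cache possible.

```python
import os
from array import array


def load_tokens(txt_path: str) -> array:
    """Hypothetical cache-first loader: return u16 token ids from the cache.

    Assumes the cache sits next to the text file as `<txt_path>.u16.bin`
    and stores token ids as native-endian unsigned 16-bit integers.
    """
    cache_path = txt_path + ".u16.bin"
    if os.path.exists(cache_path):
        # Cache hit: return early. txt_path is never opened, so the
        # original text file may be absent from disk.
        tokens = array("H")
        count = os.path.getsize(cache_path) // tokens.itemsize
        with open(cache_path, "rb") as f:
            tokens.fromfile(f, count)
        return tokens
    # Cache miss: this sketch does not re-tokenize; the cache has to be
    # rebuilt from the source data first.
    raise FileNotFoundError(f"no cache at {cache_path}; rebuild it first")
```

With this shape, deleting the text file after the first run is safe as long as the cache file survives, which matches the disk-saving setup described above.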

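Architecture

v7 = a tiny Qwen3 that is byte-for-byte isomorphic to v4/v5/v6 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA). No dimension changes, so that the number of epochs is the only variable under test. Core 127,432,704 / total 204,627,456 parameters, 201 exported tensors, config.json identical to v6's.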
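The parameter counts can be cross-checked from the stated dimensions. The sketch below assumes bias-free linear layers and eleven tensors per layer (q, k, v, o, gate, up, down, two RMSNorm weights, q_norm, k_norm); neither assumption is spelled out in the source, although both reproduce the quoted totals.

```python
DIM, LAYERS, FFN, VOCAB = 768, 18, 2048, 50257

attn = 4 * DIM * DIM               # q, k, v, o projections (MHA, no bias assumed)
mlp = 3 * DIM * FFN                # SwiGLU: gate, up, down
core_linear = LAYERS * (attn + mlp)
embed_and_head = 2 * VOCAB * DIM   # token embedding + separate (untied) lm_head

print(core_linear)                                   # 127401984
print(embed_and_head)                                # 77194752
print(204_627_456 - 127_432_704 == embed_and_head)   # True

# Assumed tensor layout: 11 tensors per layer + embedding, final norm, lm_head.
tensors = LAYERS * 11 + 3
print(tensors)                                       # 201
```

The linear weights account for 127,401,984 of the 127,432,704 core parameters; the remaining 30,720 would then be norm weights, which is an inference rather than something the source states.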

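Trainer: 8-GPU DDP bf16 (same as v5/v6)

Reuses the v5/v6 training stack without changes: fp32 master weights, with AdamW/clip/DDP all in fp32; linears go through cublasGemmEx (16BF/fp32 accum) and activations are stored in bf16; norm/softmax/rope/CE stay in fp32. The 8 GPUs run thread-per-GPU; after a mean all-reduce, each rank runs its local GpuAdamW step, and parameters stay bit-identical across ranks. Steady-state throughput is ~218,000 tok/s for the whole run, and the 3.28B tokens take ~4.2h of wall-clock time.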
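The wall-clock figure is consistent with the quoted throughput. A quick check, assuming the ~218,000 tok/s is the aggregate over all 8 GPUs and ignoring evaluation and checkpoint overhead:

```python
TOTAL_TOKENS = 50_000 * 256 * 256   # steps x global batch x seq_len
THROUGHPUT = 218_000                # tokens/s, assumed aggregate over 8 GPUs
WORLD_SIZE = 8

seconds = TOTAL_TOKENS / THROUGHPUT
hours = seconds / 3600
per_gpu = THROUGHPUT / WORLD_SIZE

print(f"{hours:.1f} h wall-clock, {per_gpu:,.0f} tok/s per GPU")
# 4.2 h wall-clock, 27,250 tok/s per GPU
```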

Hyperparameters

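| Item | Value | Notes |
| --- | --- | --- |
| optimizer | Hand-written AdamW (step runs on the GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | Linear warmup → cosine decay | max_lr 6e-4 → min_lr 6e-5 (same as v1–v6) |
| warmup | ~1750 steps (steps/20, rounded; order of magnitude unchanged) | Peak lr 6e-4, cosine decay to 6e-5 at the final step |
| grad clip | global-norm 1.0 | Steady-state gnorm ~0.26 |
| steps | 50000 (v6: 35000) | ~4.2h on 8 GPUs |
| global batch | 256 (per-rank 32 × world 8) | bf16 sweet spot (same as v5/v6) |
| seq_len | 256 | Same as v2–v6 |
| tokens/step | 256×256 = 65536 | Total training tokens ≈ 3.28B (~1.45 epochs) |
| world size | 8 (RTX 5090, sm_120) | |
| Precision | bf16 mixed precision (fp32 master) | T12/KI-2; the xserv export is also BF16 |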
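The schedule in the table (linear warmup, then cosine decay from 6e-4 to 6e-5) can be sketched as below. The exact warmup formula and the step at which decay ends are assumptions, since the trainer's implementation is not shown here; `lr_at` is an illustrative name, and `warmup=1_750` is taken from the table as-is.

```python
import math


def lr_at(step, total_steps=50_000, warmup=1_750, max_lr=6e-4, min_lr=6e-5):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    if step < warmup:
        # Linear ramp that reaches max_lr on the last warmup step (assumed form).
        return max_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, reaching 1.0 on the final step.
    progress = (step - warmup) / max(1, total_steps - 1 - warmup)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 1_749, 25_000, 49_999):
    print(s, f"{lr_at(s):.2e}")
```

Under these assumptions the rate peaks at 6e-4 when warmup ends and lands on 6e-5 at step 49999, matching the endpoints in the table.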

Results

train loss: start 11.0274 → end 3.0517 (smooth decline throughout)
best val loss 3.0149 (step 48999), final val loss 3.0159 (step 49999, FineWeb-edu held-out 1M)
FineWeb val curve (sampled):

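| step | 499 | 999 | 3999 | 7999 | 11999 | 15999 | 19999 | 25999 | 31999 | 37999 | 43999 | 48999 | 49999 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| val | 5.9047 | 4.9563 | 3.7424 | 3.4982 | 3.3766 | 3.3078 | 3.2494 | 3.1802 | 3.1232 | 3.0741 | 3.0315 | 3.0149 | 3.0159 |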
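One way to make the flattening visible is to convert the sampled curve into a val change per 1000 steps between consecutive samples. The sketch below uses only the sampled points from the table; the helper name is illustrative.

```python
# (step, val) pairs copied from the sampled curve above.
curve = [(499, 5.9047), (999, 4.9563), (3999, 3.7424), (7999, 3.4982),
         (11999, 3.3766), (15999, 3.3078), (19999, 3.2494), (25999, 3.1802),
         (31999, 3.1232), (37999, 3.0741), (43999, 3.0315), (48999, 3.0149),
         (49999, 3.0159)]


def slopes_per_1k(points):
    """Val change per 1000 steps between each pair of consecutive samples."""
    out = []
    for (s0, v0), (s1, v1) in zip(points, points[1:]):
        out.append((s0, s1, (v1 - v0) / (s1 - s0) * 1000))
    return out


for s0, s1, d in slopes_per_1k(curve):
    print(f"{s0:>5} -> {s1:>5}: {d:+.4f} per 1k steps")
```

On these samples the drop shrinks from about 0.007 per 1000 steps (37999 to 43999) to about 0.003 (43999 to 48999), and the last interval is slightly positive.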

⚠️ Key finding: more epochs on the same FineWeb subset → diminishing returns; dim768 is near its ceiling

| | v6 | v7 |
| --- | --- | --- |
| epoch | 1.02 | 1.45 |
| Training tokens | 2.29B | 3.28B |
| best val (FineWeb, comparable) | 3.0652 | 3.0149 |
| Δval | n/a | only ↓0.05 |

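Raising the epoch count from 1.02 to 1.45 (~1B more tokens) lowered the FineWeb val by only ~0.05 (3.0652 → 3.0149), and the curve was essentially flat after ~step 44000 (3.0315 → 3.0149 → a rebound to 3.0159 at the final step).

Conclusion: for this one 2.255B-token FineWeb-edu subset, adding epochs is already near the ceiling at dim768. The optimistic reading at the end of v6 ("val still falling monotonically = not fed enough yet") is corrected by v7: that monotonic decline was mainly because v6 had trained for only one epoch and was still in its first pass over the data. Once training enters epoch 1.x and starts seeing the same tokens again, the gains thin out quickly. Real "more data" has to mean new FineWeb shards (more diverse, non-repeated tokens), not reading the same subset again.

This is the same kind of phenomenon as v5's saturation signal on TinyStories, seen along two axes:

v5 (same subset, ×3.5 data): TinyStories 5.33 epochs vs v4's 1.54 epochs; val fell only 5% and flattened = saturation on the data-volume axis.

v7 (same subset, ×1.4 epochs): FineWeb 1.45 epochs vs v6's 1.02 epochs; val fell only 0.05 and flattened = saturation on the same-subset epoch axis.

v6 (corpus swap) is the axis that actually raised the ceiling: switching to the broader, more realistic FineWeb-edu brought a qualitative change in the kind of language produced (short stories → real expository text).

In one sentence: re-feeding old data (v5/v7) has thin margins either way; feeding broader new data (v6) is the real lever.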

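Sampling comparison: v6 vs v7 (same arch, same xserv, same greedy decoding, same prompts)

The only difference is that v7 trained ~0.43 epochs longer. Both models get the same generic/expository prompts: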

| prompt | v6 (1.02 ep) | v7 (1.45 ep) |
| --- | --- | --- |
| `The history of` | the United States is a fascinating one… shaped by the experiences of its people… | the city of New York is a story of many different people. The first inhabitants… were the Native Americans… the Dutch… |
| `In science,` | the term science is used to refer to the study of the physical world… biology, chemistry, physics, and engineering. | the term "science" is used to describe the study of the natural world… biology, chemistry, physics, and mathematics… |
| `The most important` | thing is to have a good understanding of the different types of data… make decisions about your business. | thing to remember is that you can't just buy a new car and expect to pay for it… understand the basics of insurance… |
| `Water is` | a natural resource that is used to produce energy… a renewable resource that can be used to generate electricity. | a natural substance that is found in the earth's crust… a very important element in the Earth's ecosystem… |

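Sampling quality is on par with v6: both write coherent, realistic expository English (history / scientific disciplines / resources / finance), in sharp contrast to v5, which always fell into short stories. v7's wording differs slightly (the greedy path drifts as the weights shift a little), but there is no perceptible qualitative improvement, which is exactly how a val drop of only 0.05 shows up in samples. The small-model tendency to repeat, and minor historical/factual flaws (v7: "Water…made up of carbon"), appear in both versions. The small marginal val gain did not convert into a visible sampling gain, which further supports "more epochs on the same subset is near the top".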

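xserv validation

The model was exported as HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 docs/08; 201 tensors, config.json identical to v4/v5/v6), stored in the registry, then loaded with xserv-cli for greedy generation: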

$ xserv-cli ~/projects/tiny-models/v7-fineweb-edu-dim768 --max-tokens 50
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
xserv> The history of the city of New York is a story of many different people. The first inhabitants of the
       city were the Native Americans. The first Europeans arrived in the 16th century… the Dutch.
xserv> In science, the term "science" is used to describe the study of the natural world. It is a broad term
       that encompasses a wide range of disciplines, including biology, chemistry, physics, and mathematics.
xserv> Water is a natural substance that is found in the earth's crust… a very important element in the Earth's
       ecosystem.

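token-match: v7 trains in bf16 (fp32 master), so the weights already converge within the bf16 numeric range; after exporting BF16 to xserv, the numeric paths on both sides agree, the same closed loop as v4/v5/v6. xserv loads qwen3 layers=18 hidden=768 with 201 tensors, uses the KV cache, and generates greedily, so the loop closes.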
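The [in,out]→[out,in] transpose in the export step is a plain row-major 2D transpose. A minimal pure-Python sketch on a flat buffer follows; the real exporter (see T9 docs/08) is not shown in this document, and the function name here is hypothetical.

```python
def transpose_in_out(flat, n_in, n_out):
    """Convert a row-major [in, out] matrix into a row-major [out, in] one."""
    assert len(flat) == n_in * n_out
    # Element (i, o) of the source sits at index i * n_out + o; the result
    # is laid out so that element (o, i) sits at index o * n_in + i.
    return [flat[i * n_out + o] for o in range(n_out) for i in range(n_in)]


# 2 inputs x 3 outputs, stored row-major as [in, out].
w = [1, 2, 3,
     4, 5, 6]
print(transpose_in_out(w, 2, 3))  # [1, 4, 2, 5, 3, 6]
```

Applying the function twice with the dimensions swapped returns the original buffer, which is a cheap sanity check for an export/import round trip.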

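Comparison with v6 and the v8 proposal

v7 carried out v6's proposal to "exhaust the data axis first" and reached a corrective conclusion: more epochs on the same 2.255B-token FineWeb subset has a very thin margin at dim768 (1.02 → 1.45 ep gives only val ↓0.05, no qualitative change in samples, and a flat curve) = near the ceiling. So the cheapest lever, "more data", only works if the data really is more (new shards), not the same subset repeated. The v8 levers, re-ranked by expected return: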

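1. New FineWeb shards (genuinely more data, first choice): download more shards beyond sample/10BT (or a 100BT subset) to supply more diverse, non-repeated tokens. This is the gain that v6's monotonically falling curve actually promised. ⚠️ Disk is tight (dash5 has ~18G free), so the parquet files and intermediate .txt need to spill to /dashscope-tmp/wjh and be deleted right after use.
2. Larger model (dim1024+, the capacity axis): v7 shows that the 127M core cannot absorb more from the same subset, but says nothing about whether it has also topped out on more diverse data. Judging whether it is capacity-limited requires testing together with new data. dim1024+ raises activation memory, so KI-3 activation recomputation has to come first.
3. Data mixing (TinyStories + FineWeb): addresses the transfer degradation exposed by v6 (1.11 → 2.75). It serves "coherence + breadth", not "lower general val", so its priority depends on the goal.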

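My judgment: v8 should take option 1 (new FineWeb shards). v7 has shown that re-feeding old data has run its course, so the next step must give the model tokens it has not seen. This also answers option 2 along the way: if val still falls quickly on new data, capacity has not topped out (then move to dim1024); if it also flattens quickly, the model really is capacity-limited.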
