docs: run v5 — TinyStories saturation at dim768 (val 1.11)
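Design doc 05-v5-tinystories-dim768.md (Chinese, xserv style): data 2.49B tok / 5.33 ep, architecture identical to v4 (isolating the data variable), bf16 on 8 GPUs with global batch 256, train 11.07→1.06, best val 1.1102. Key finding, the "data ceiling": v4 (1.54 ep) 1.169 → v5 (5.33 ep) 1.110 is only ↓5% and val flattens in the final stretch ⇒ TinyStories is near saturation at dim768/127M-core, so v6 should switch axes (larger model / broader corpus, not more TinyStories). xserv BF16 serving matches token by token on 3/3 prompts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>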
196
docs/runs/05-v5-tinystories-dim768.md
Normal file
@@ -0,0 +1,196 @@
# Scaling Run v5: Multi-Epoch TinyStories (5.33 ep) + dim768/18L (same as v4) + 8-GPU DDP bf16 - Design Document
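
## Goal

v4 scaled the model up to dim768/18L (core 127.43M) and trained on ~720.9M tokens (~1.54 epochs). Its val loss was still falling at the final step, i.e. the model was still **underfit**. That left one explicit open question: **for a 127M-core model, is the TinyStories corpus itself the data ceiling?**

v5 is a **controlled experiment** designed specifically to answer that question:

1. **Architecture fully frozen = v4** (dim 768 / 24 heads × 32 head_dim / 18 layers / SwiGLU ffn 2048, core 127.43M, total 204.63M). **Not a single weight dimension changes**, so any v4→v5 val difference **can be attributed only to the single variable "more data"**.
2. **Data scaled up to near saturation**: v5 trains on **~2.49B tokens = ~5.33 epochs** (v4 did only 1.54 ep), pushing ~3.5× as many tokens through the same full TinyStories token stream to see whether val can keep dropping.
3. **bf16 (T12/KI-2) recovers the sweet spot**: v4 is what triggered the bf16 work. With dim768 in fp32, per-rank batch 32 (global 256) OOMs in 32GB of GPU memory, which forced a drop to per-rank 16 (global 128). v5 adopts bf16 mixed precision (fp32 master), which halves activation memory → **recovers the per-rank 32 / global 256 sweet spot**, while throughput rises from v4's ~145K to ~217K tok/s.
4. After training, save to the registry and export to xserv to verify the model is servable, then report the **improvement over v4** and the **data-ceiling conclusion**.

> **Engineering significance of this run**: v5 is the first rung on the xtrain scaling ladder that **deliberately does not grow the model**. Its purpose is not "a lower val" but to **measure accurately the TinyStories data ceiling at dim768**. The conclusion (see below) is clear-cut: **3.5× the data buys only a ~5% val improvement, and val flattens in the final stretch**. A model of this size is already near saturation on TinyStories, so v6 should switch axes (larger model / broader corpus) and **not** keep squeezing more epochs out of TinyStories. v5 also cashes in the bf16 lever left over from v4 (KI-2): bf16 recovers the global-256 sweet spot, with a steady ~217K tok/s on 8 GPUs.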

## Data

| Item | v4 | v5 |
|----|----|----|
| Source | TinyStories **full train split** (reusing the v1 cache) | same |
| Token count (corpus) | 468,260,367 | same |
| **Tokens consumed in training** | ~720.9M (22000 steps × 32768) | **~2.49B** (38000 steps × 65536) |
| Epochs | ~1.54 | **~5.33** (×3.5) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | `data/tinystories-train.txt.u16.bin` (u16) | **reused as-is** |
| held-out val | last 1,000,000 tokens of the full corpus | **the same 1M tokens** (exactly the same held-out set as v0–v4, for a fair comparison) |

**Cache reuse**: `Corpus::load_cached` reads `<corpus>.u16.bin` and loads 467.26M train tokens at startup (the last 1M are held out for val). The held-out val set is still the last 1M tokens of the full corpus (`split_tail`), the same held-out set as v0–v4, so **val loss is directly comparable across v0–v5**.

**Core of this run's data design**: v4 saw only 1.54 epochs; v5 feeds the same corpus for **5.33 epochs** (×3.5). If TinyStories still has data headroom for a 127M core, val should keep dropping substantially; if it is near saturation, val will quickly slow down and flatten. **v5 exists to read that signal.**
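
The token and epoch figures in the table above can be rechecked with a few lines of arithmetic. This is a minimal sketch that uses only numbers quoted in this section; the helper name `token_budget` is hypothetical and not part of xtrain, and epochs are counted against the train split (corpus minus the 1M-token held-out tail).

```python
# Token-budget arithmetic for v4 vs v5, using only numbers from the data table.
CORPUS_TOKENS = 468_260_367                  # full TinyStories train token count
VAL_TOKENS = 1_000_000                       # tail held out for validation
TRAIN_TOKENS = CORPUS_TOKENS - VAL_TOKENS    # tokens actually available for training


def token_budget(steps: int, tokens_per_step: int) -> tuple[int, float]:
    """Return (tokens consumed, epochs over the train split). Hypothetical helper."""
    consumed = steps * tokens_per_step
    return consumed, consumed / TRAIN_TOKENS


v4_tokens, v4_epochs = token_budget(22_000, 32_768)
v5_tokens, v5_epochs = token_budget(38_000, 65_536)

print(f"v4: {v4_tokens:,} tokens, {v4_epochs:.2f} epochs")
print(f"v5: {v5_tokens:,} tokens, {v5_epochs:.2f} epochs")
print(f"v5/v4 token ratio: {v5_tokens / v4_tokens:.2f}x")
```

The exact ratio comes out slightly below the rounded ×3.5 quoted in the table, which is worth keeping in mind when comparing marginal returns across runs.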

## Architecture

v5 = a tiny Qwen3 that is **byte-for-byte isomorphic to v4** (RoPE + RMSNorm + per-head QK-norm + SwiGLU + untied lm_head, MHA). **Deliberately, not one dimension changes**, so that "more data" is the only variable under test.

| Dimension | v4 | v5 |
|------|----|----|
| dim (= heads·head_dim) | 768 | **768 (same)** |
| n_layers | 18 | **18 (same)** |
| n_heads | 24 | **24 (same)** |
| head_dim | 32 | 32 (same) |
| ffn_hidden (SwiGLU) | 2048 | 2048 (same) |
| vocab | 50257 | 50257 (same) |
| **core params** | 127,432,704 (≈127.43M) | **127,432,704 (same)** |
| embed + lm_head (2×vocab×dim) | 77,194,752 (≈77.19M) | 77,194,752 (same) |
| **total params** | 204,627,456 (≈204.63M) | **204,627,456 (same)** |

**Why not grow the model**: in a scaling experiment, if "data" and "capacity" both move at once, the change in val cannot be attributed to either. v4→v5 freezes capacity and moves only data, so the resulting val difference is **purely the marginal return of data**. This is the only clean way to judge whether TinyStories has hit its data ceiling. config.json is identical to v4 (the shapes of the 201 exported tensors match exactly).
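
The embedding-side rows of the table follow directly from `vocab` and `dim`. The sketch below recomputes them; the core parameter count is taken from the table as given rather than re-derived, and the function name is hypothetical.

```python
# Parameter accounting for the frozen v4/v5 architecture.
VOCAB, DIM = 50_257, 768
CORE_PARAMS = 127_432_704   # copied from the table, not re-derived here


def embed_and_head_params(vocab: int, dim: int) -> int:
    """Untied embedding + lm_head: two separate vocab x dim matrices."""
    return 2 * vocab * dim


embed_head = embed_and_head_params(VOCAB, DIM)
total = CORE_PARAMS + embed_head
share = embed_head / total

print(f"embed + lm_head: {embed_head:,}")
print(f"total params:    {total:,}")
print(f"embed/lm_head share of total: {share:.1%}")
```

More than a third of the total sits in the two vocabulary matrices, which is why the document tracks core parameters separately from total parameters.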

## Trainer: 8-GPU DDP bf16 (T12 mixed precision, fp32 master)

v4 used 8-GPU DDP in **fp32**, and dim768's memory footprint forced it down to per-rank batch 16 (global 128). v5 switches to **bf16 mixed precision**:

- **fp32 master weights, with AdamW/clip/DDP all kept in fp32** (numerically safe). Only the linears go through `cublasGemmEx` (16BF inputs / fp32 accumulation), and activations are stored in bf16; norm/softmax/rope/CE stay in fp32. A new cast autodiff op bridges the two (downcast in fwd / upcast in bwd) → zero changes to the optimizer. No loss scaling (T12 measured that dim768 does not need it).
- **Activations halved → sweet spot recovered**: bf16 brings dim768's per-rank batch back up to **32 (global 256)**, exactly the sweet spot v4 lost to fp32 OOM. Throughput also goes from **~145K (v4 fp32) → ~217K tok/s (v5 bf16, ×1.5)**.
- **8 GPUs, thread-per-GPU**: device gradients are all-reduced and averaged, then each rank runs a local GpuAdamW step; parameters are bit-identical across ranks (verified in T8/T11).

Steady-state throughput is **~217,000 tok/s** for the whole run, and the 2.49B tokens finish in **~3.2h** wall-clock. bf16 convergence tracks fp32 (T12 compared the two over 150 steps: 3.984 vs 3.988), and v5's train/val curves are smooth with no anomalies.
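
To illustrate why the master weights stay in fp32, here is a CPU-only sketch of bf16 rounding. It is not the xtrain implementation (which runs on the GPU through `cublasGemmEx` and the cast op); it only emulates round-to-nearest-even bf16 with bit operations. The update size and step count are made-up values chosen to make the effect visible.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to bf16 by keeping the top 16 bits of the fp32 pattern
    (round-to-nearest-even). NaN/inf handling is omitted in this sketch."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias; ties go to even
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


UPDATE = 1e-4   # hypothetical per-step weight update (assumption)
STEPS = 100     # hypothetical number of optimizer steps (assumption)

master = 1.0      # weight kept in fp32, as with an fp32 master copy
bf16_only = 1.0   # weight stored only in bf16, for contrast
for _ in range(STEPS):
    master = to_fp32(master + UPDATE)
    bf16_only = to_bf16(bf16_only + UPDATE)

print("bf16-only weight:   ", bf16_only)
print("fp32 master weight: ", master)
print("bf16 view of master:", to_bf16(master))
```

Near 1.0 the spacing between adjacent bf16 values is 2^-7, so each small update is rounded away when the weight lives only in bf16, while the fp32 copy accumulates them until the bf16 view eventually moves. That is the usual motivation for keeping the optimizer state and master weights in fp32 and downcasting only for the matmuls.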

## Hyperparameters

| Item | Value | Notes |
|----|----|----|
| optimizer | hand-written AdamW (step runs on the GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1–v4) |
| warmup | ~1900 steps (lr peaks at 6.00e-4 around step 1900, then cosine-decays to 6e-5 at the final step) | |
| grad clip | global-norm 1.0 | gnorm stable throughout (starts at ~0.4 after warmup and keeps decreasing) |
| steps | **38000** | ~3.2h on 8 GPUs |
| global batch | **256** (per-rank 32 × world 8) | **the sweet spot recovered by bf16** (v4 fp32 could only do 128) |
| seq_len | **256** | same as v2–v4 |
| tokens/step | 256×256 = 65536 | total training tokens ≈ **2.49B** (~5.33 epochs) |
| world size | **8** (RTX 5090, sm_120) | |
| Precision | **bf16 mixed precision** (fp32 master) | T12/KI-2; the xserv export is also BF16 |

**Compute**: dash5, 8× RTX 5090, ~217,000 tok/s for the whole run (steady state from step 50 onward), wall-clock ≈ **3.2 hours**.
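
The schedule, clipping rule, and wall-clock figure in the table can be sketched as follows. This is an illustration under stated assumptions, not xtrain's code: the exact warmup formula and step indexing (here, lr reaches the peak at the last warmup step and hits `MIN_LR` at step 37999) are guesses, and both function names are hypothetical.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, TOTAL_STEPS = 1_900, 38_000


def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR at the final step.
    Step indexing is an assumption of this sketch."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    span = max(1, TOTAL_STEPS - 1 - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / span)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm.
    Returns (clipped gradients, norm before clipping)."""
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads], total


# Wall-clock estimate from the table: total tokens / steady-state throughput.
tokens_per_step = 256 * 256
wall_clock_hours = TOTAL_STEPS * tokens_per_step / 217_000 / 3600

for s in (0, 1_899, 1_900, 20_000, 37_999):
    print(f"step {s:>6}: lr = {lr_at(s):.3e}")
print(f"estimated wall-clock: {wall_clock_hours:.2f} h")
```

Clipping by the global norm rescales every tensor by the same factor, which preserves the direction of the overall update; that is the behaviour implied by the "global-norm 1.0" entry.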

## Results

- **train loss**: start **11.0675** → end **1.0588** (smooth decline throughout)
- **best val loss (held-out 1M tokens, step 34999)**: **1.1102**
- **final val loss (step 37999)**: **1.1131**
- val loss curve (sampled every ~2000 to 4000 steps):

| step | 499 | 1999 | 3999 | 5999 | 7999 | 9999 | 13999 | 17999 | 21999 | 25999 | 29999 | 33999 | **34999** | 37999 |
|------|-----|------|------|------|------|------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 2.6838 | 1.6033 | 1.4132 | 1.3317 | 1.2980 | 1.2596 | 1.2194 | 1.1846 | 1.1575 | 1.1374 | 1.1217 | 1.1151 | **1.1102** | 1.1131 |

### ⚠️ Data ceiling: val flattens in the final stretch

v5's val **clearly flattens in the final stretch**, a new behavior that does not appear on v4's monotonically decreasing curve:

| step | 34999 | 35499 | 35999 | 36499 | 36999 | 37499 | 37999 |
|------|-------|-------|-------|-------|-------|-------|-------|
| val | **1.1102 (best)** | 1.1126 | 1.1131 | 1.1135 | 1.1119 | 1.1143 | 1.1131 |

The best value (1.1102) occurs at step 34999. Over the following 3000 steps, val **bounces around inside a ~0.004 band of 1.1102–1.1143 and no longer decreases monotonically**. Compare v4, whose val was still falling at its final step (underfit): **v5 has learned the TinyStories corpus to a plateau**.
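
The plateau reading above can be reproduced from the tail of the val curve. A minimal sketch, using only the seven evaluation points in the table; the variable names are illustrative.

```python
# Tail of the v5 val curve: step -> held-out val loss (values from the table).
tail = {
    34999: 1.1102, 35499: 1.1126, 35999: 1.1131, 36499: 1.1135,
    36999: 1.1119, 37499: 1.1143, 37999: 1.1131,
}

best_step = min(tail, key=tail.get)                    # step with the lowest val
band = max(tail.values()) - min(tail.values())         # width of the jitter band
steps_since_best = max(tail) - best_step               # steps trained after the best
improved_after_best = any(v < tail[best_step] for s, v in tail.items() if s > best_step)

print(f"best val {tail[best_step]:.4f} at step {best_step}")
print(f"band width over the tail: {band:.4f}")
print(f"steps since best: {steps_since_best}, improved since: {improved_after_best}")
```

No later evaluation beats the step-34999 value, and the whole tail fits inside a band far narrower than the drops seen between earlier evaluation points.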

## Improvement over v4, and data-ceiling analysis

**Full val ladder (each version's own best val, same 1M-token held-out set)**:

| Model | core params | training tokens | epochs | **best val** | vs. previous |
|------|-----------|-----------|-------|--------------|-----------|
| v0-baseline | 41K | ~0.72M | n/a | 3.8050 | n/a |
| v1 | 8.39M | ~5.1M | n/a | 2.5847 | ↓1.22 |
| v2 | 28.32M | ~36.9M | n/a | 1.7055 | ↓0.88 |
| v3 | 67.13M | ~245.8M | ~0.53 | 1.3027 | ↓0.40 |
| v4 | 127.43M | ~720.9M | ~1.54 | 1.1690 | ↓0.13 |
| **v5** | **127.43M (same as v4)** | **~2.49B (×3.5)** | **~5.33** | **1.1102** | **↓0.06 (only ~5%)** |

**This is the core finding of this run.** Compare it against the earlier rungs: v2→v3 had data ×6.7 + model ×2.4 (val ↓0.40), and v3→v4 had data ×2.9 + model ×1.9 (val ↓0.13). But **v4→v5: same model, data ×3.5, and val drops only 0.06 (~5%)**. Combined with the flattening in the final stretch:

> **Conclusion: at dim768 / 127M-core, TinyStories is close to data saturation.**
> Feeding the same corpus from 1.54 to 5.33 epochs (×3.5) improves val by only ~5%, and val flattens in the final stretch. **The "more TinyStories tokens" lever is essentially exhausted.** This is not model underfitting (as with v4, which was still dropping at its final step); rather, **a model of this size has already extracted nearly all the information the corpus has to offer**.

**Levers for the next rung (v6), ranked by payoff**:

1. **Switch axes: a larger model or a broader corpus.** These are the two directions that can actually keep lowering val after v5.
2. **What not to do**: keep adding TinyStories epochs. v5 indicates that the marginal return of going to 6+ epochs is too thin to justify the compute.
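
The rung-to-rung factors quoted in the ladder analysis can be recomputed from the table. The sketch below uses the rounded table values (so the multipliers are approximate), and the row layout is just an illustrative choice.

```python
# (name, core params in M, training tokens in M, best val), from the ladder table.
LADDER = [
    ("v0", 0.041, 0.72, 3.8050),
    ("v1", 8.39, 5.1, 2.5847),
    ("v2", 28.32, 36.9, 1.7055),
    ("v3", 67.13, 245.8, 1.3027),
    ("v4", 127.43, 720.9, 1.1690),
    ("v5", 127.43, 2490.4, 1.1102),
]

rows = []
for (p_name, p_core, p_tok, p_val), (name, core, tok, val) in zip(LADDER, LADDER[1:]):
    rows.append({
        "step": f"{p_name}->{name}",
        "model_x": core / p_core,          # growth factor in core parameters
        "data_x": tok / p_tok,             # growth factor in training tokens
        "drop": p_val - val,               # absolute val improvement
        "rel": (p_val - val) / p_val,      # relative val improvement
    })

for r in rows:
    print(f"{r['step']}: model x{r['model_x']:.1f}, data x{r['data_x']:.1f}, "
          f"val down {r['drop']:.4f} ({r['rel']:.1%})")
```

v4→v5 is the only rung where the model factor is exactly 1, which is what lets its val drop be read as a pure data effect.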

### Side-by-side samples (greedy, 40 tok, served by xserv, same prompts, v4 vs v5)

| prompt | v4 | v5 |
|--------|----|----|
| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, scary dog. She was scared and ran away.** | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, white cloud in the sky. It looked like a fluffy pillow.** |
| `One day` | One day, a little girl named Lily went to the park with her mom. She saw a big tree with a swing. Lily wanted to play on the swing, but she was **too small. She asked her** | One day, a little girl named Lily went to the park with her mom. She saw a big tree with a swing. Lily wanted to play on the swing, but she was **too small to reach it.** |
| `The little` | The little girl was so happy that she had found the perfect **place to hide. She stayed there for a long time, until it was time to go home. She said goodbye to the tree** | The little girl was so happy that she had found the perfect **thing to replace the broken one. She thanked her mom and they both went home with a smile on their faces.** |

**Conclusion**: v4 and v5 both write complete, coherent TinyStories that wrap up properly. The two are **at the same quality level**; the differences are **minor variations in plot direction** ("scary dog → ran away" vs "white cloud → fluffy pillow"), **not a perceptible quality gain**. This is fully consistent with val dropping only 0.06: **a same-size model that sees 3.5× the data shows no visible improvement in sample quality**, which is the data ceiling showing up directly on the generation side.

## xserv validation

Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 `docs/08`; **201 tensors** = 18 layers × 11 + embed + norm + lm_head; config.json identical to v4), store in the registry, then load with `xserv-cli` and generate greedily:

```
$ xserv-cli ~/projects/tiny-models/v5-tinystories-dim768 --max-tokens 40
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the park.
One day, she saw a big, white cloud in the sky. It looked like a fluffy pillow.
xserv> One day, a little girl named Lily went to the park with her mom. She saw a big tree with a
swing. Lily wanted to play on the swing, but she was too small to reach it.
xserv> The little girl was so happy that she had found the perfect thing to replace the broken one.
She thanked her mom and they both went home with a smile on their faces.
```

**token-match**: xserv (BF16) compared against xtrain's own greedy decoding: **all 3 prompts match exactly, token by token** (zero divergence within 40 tok). v5 **trains in bf16** (fp32 master), so the weights already converged within the bf16 numeric range, and after the BF16 export the greedy outputs on the two sides match more tightly (v4 was fp32 training → BF16 export and already matched 3/3; v5 also matches 3/3, with a more consistent numerical path). The loop is closed.
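
A token-match check of the kind described above boils down to finding the first position where two greedy token sequences differ. The sketch below is a generic illustration, not the project's actual test harness; the function name and the token IDs in the demo are made up.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first position where two token-ID sequences
    differ, or None if they are identical (same tokens, same length)."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))   # one sequence is a strict prefix of the other
    return None


# Hypothetical token IDs standing in for trainer-side and server-side greedy output.
trainer_tokens = [7454, 2402, 257, 640, 11]
server_tokens = [7454, 2402, 257, 640, 11]

idx = first_divergence(trainer_tokens, server_tokens)
print("exact match" if idx is None else f"diverges at token {idx}")
```

Reporting the divergence index rather than a plain boolean makes a failure easier to diagnose, since a late split usually points at numerical drift while an early one suggests an export or mapping error.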
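
## v6 proposal

v5 gives a clear data-ceiling conclusion, so v6 should **switch axes**. Two candidates:

- **A. Larger model (dim 1024+)**: v5 shows that TinyStories is saturated for a 127M core, but **a larger model may still squeeze more out of the same corpus** (the capacity limit has not been reached). Note that the larger the dim, the more the embed/lm_head share is diluted (at dim768 it is 77.19M / 204.63M ≈ 38%; at dim1024 it would fall to ~34%) → **the pressure from KI-4 (large-vocab share) actually decreases**. But if only the model grows while it is still fed TinyStories, it will likely soon hit "this corpus's information limit" again, because the content diversity of TinyStories itself is limited.
- **B. Broader corpus (FineWeb-edu or another high-quality general corpus) + possibly a new tokenizer (KI-4)**: v5's ceiling is the **corpus's** ceiling, not the model's. A richer corpus **raises the information ceiling of the data itself**, giving a larger model something to learn. Pairing it with a smaller / better-fitting tokenizer (KI-4) can further cut the embed/lm_head waste.

**My judgment: B unlocks more headroom.** The core lesson of v5 is "the bottleneck is the corpus, not capacity". Growing only the model (A) will most likely hit the same TinyStories information ceiling, with limited payoff; switching to a broader corpus (B) raises the ceiling itself. The ideal v6 = **A+B together** (a larger model on a broader corpus), but if only one can be chosen, do **B (broaden the corpus)** first.

**Is KI-3 (activation recomputation) needed?** If v6 takes route A (dim1024+ larger model), activation memory will rise significantly. **bf16 has already saved half** (verified in v5), but memory can still get tight with a larger batch / longer seq → **at that point KI-3 (activation recomputation) becomes the next memory lever** (the T12 doc already lists it as "the next memory lever after bf16"). If v6 takes only route B (same dim768, new corpus), the existing bf16 + caching allocator is enough, and **KI-3 is not needed for now**. In short: **the trigger condition for KI-3 = v6 scaling to dim1024+**, tied to route A.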