docs: run v4 — TinyStories, dim768, val 1.17
Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep / arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128 per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8 DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU / v3->v4 improvement: val 1.30->1.17 + side-by-side samples).

Notes that this run validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32 batch-32 OOM is the bf16 trigger.

Update docs/runs/README.md comparison table to v0/v1/v2/v3/v4 and the next-rung proposal to v5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
201
docs/runs/04-v4-tinystories-dim768.md
Normal file
@@ -0,0 +1,201 @@
# Scaling Run v4: TinyStories + dim768/18L + 8-GPU DDP fp32 (T11): Design Document
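
## Goal

Building on v3 (dim512/16L, 67.13M core parameters, ~245.8M training tokens, single-GPU batched), v4 scales further along both the **model** and **data** axes and moves training **back to multi-GPU**. This is not the v2-era DDP whose scaling was held back by KI-1; it is the right choice now that the **T11 caching allocator** has landed and 8-GPU scaling is near-linear:

1. **Scale the model**: dim 512→768, layers 16→18, heads 16→24 (head_dim stays 32), bringing the **transformer core to ~127M parameters** (×1.9 capacity). The vocabulary is unchanged, so with the gpt2 50257-token vocab at dim768, embed + lm_head add a fixed ~77.19M, which is listed separately.
2. **Scale the data**: v3 trained on ~245.8M tokens (~0.53 epoch, still underfit, with val falling through the final step). v4 trains on **720.9M tokens** (×2.9), still reusing the full TinyStories token-id stream cached in v1 (468M tokens), for **~1.54 epochs**. This is the first run to pass 1 epoch and enter the multi-pass regime on TinyStories.
3. **8-GPU DDP fp32 (T11, near-linear multi-GPU)**: the presumed root cause of KI-1 (weak DDP scaling), exposed by v2, was disproved layer by layer and fixed in T10/T11. T10 fixed the launch-bound single-sequence forward; T11's per-device size-classed caching allocator then removed the serialized per-op `cudaMalloc`: **8-GPU throughput 49K→461K tok/s (scaling 1.3×→5×, all 8 GPUs at 95-99% util)**. v4 therefore returns to multi-GPU with confidence, holding a steady ~144,650 tok/s and finishing 720.9M tokens in ~84 min.
4. After training, save to the registry (`~/projects/tiny-models/v4-tinystories-dim768/`) and export to xserv format to verify that the model is servable, then report the **concrete improvement over v3** (val loss on the same held-out set plus side-by-side samples from the same prompts).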
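
> **Engineering significance of this run**: it **validated the multi-GPU scaling of the T11 caching allocator at dim768** at a real scaling-run size (127M core / 720.9M tokens / 84 min): ~145K tok/s across 8 GPUs for the whole run at 95-99% util, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s). v4 is also the **concrete trigger for bf16 (KI-2)**: with dim768 in fp32, per-rank batch 32 (global 256) OOMs in 32GB of GPU memory, forcing a drop to per-rank 16 (global 128). After bf16 was deferred throughout the tiny v0–v3 runs, this is the first hard constraint where fp32 no longer fits.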

## Data

| Item | v3 | v4 |
|----|----|----|
| Source | TinyStories **full train** (reusing the v1 cache) | Same |
| Corpus tokens | 468,260,367 | Same |
| **Tokens consumed in training** | ~245.8M (30000 steps × 8192) | **~720.9M** (22000 steps × 32768) |
| Epochs | ~0.53 | **~1.54** (first run past 1 epoch) |
| Tokenizer | gpt2 BPE (vocab 50257) | Same |
| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936MB) | **Reused as-is** |
| Held-out val | Last 1,000,000 tokens of the full corpus | **Same 1M tokens** (the identical held-out set used by v0/v1/v2/v3, for a fair comparison) |

**Cache reuse**: `Corpus::load_cached` reads `<corpus>.u16.bin` and loads the 467.26M train tokens at startup (the last 1M are held out for val). The held-out val set is still the last 1M tokens of the full corpus (`split_tail`), the same set used by v0–v3, so **val loss is directly comparable across v0–v4**.
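
The token and epoch figures in the table follow from simple arithmetic. The following is a minimal Python sketch, not xtrain code: the function names are made up for illustration, and epochs are measured against the 467.26M train tokens left after the held-out tail is removed, which is an assumption about how the ~0.53 and ~1.54 figures were derived.

```python
CORPUS_TOKENS = 468_260_367   # full TinyStories token-id stream (from the table)
VAL_TOKENS = 1_000_000        # held-out tail reserved for validation
TRAIN_TOKENS = CORPUS_TOKENS - VAL_TOKENS


def tokens_consumed(steps: int, tokens_per_step: int) -> int:
    """Total tokens seen by the trainer over a run."""
    return steps * tokens_per_step


def epochs(consumed: int, train_tokens: int = TRAIN_TOKENS) -> float:
    """Number of passes over the train split (assumed denominator)."""
    return consumed / train_tokens


v3_tokens = tokens_consumed(30_000, 8_192)
v4_tokens = tokens_consumed(22_000, 32_768)

print(f"v3: {v3_tokens:,} tokens, {epochs(v3_tokens):.2f} epochs")
print(f"v4: {v4_tokens:,} tokens, {epochs(v4_tokens):.2f} epochs")
print(f"v4 / v3 token ratio: {v4_tokens / v3_tokens:.1f}x")
```
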
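
**Data ladder**: v4 is the **first run past 1 epoch** (~1.54). With the core at 127M and ~1.54 epochs, the model is still underfit (val is still falling at the final step), which suggests that the TinyStories corpus has not yet hit its capacity ceiling for a 127M core. The next rung (v5) is a suitable point either to start **broadening the corpus** (TinyStories plus some general high-quality text) or to **keep squeezing TinyStories over more epochs**, while evaluating the tokenizer (KI-4) in parallel.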

## Architecture

v4 is a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + untied lm_head, MHA). The forward graph is isomorphic to v0/v1/v2/v3; only the dims grow. **No structural changes.**

| Dimension | v3 | v4 |
|------|----|----|
| dim (= heads·head_dim) | 512 | **768** |
| n_layers | 16 | **18** |
| n_heads | 16 | **24** |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 2048 | 2048 |
| vocab | 50257 | 50257 |
| **Core parameters** (excluding embed + lm_head) | 67,127,296 (≈67.13M) | **127,432,704 (≈127.43M, ×1.90)** |
| embed + lm_head (2×vocab×dim) | 51,463,168 (≈51.46M) | 77,194,752 (≈77.19M) |
| **Total parameters** | 118,590,464 (≈118.59M) | **204,627,456 (≈204.63M)** |

**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. With the gpt2 50257-token vocab at dim768, embedding + lm_head take a fixed ~77.19M. These two tables are a function of **vocabulary size**, not model capacity, so the ladder is measured by **core** (v4 core 127.43M). Note that embed/lm_head still make up ~38% (77.19M) of v4's 204.63M total, down slightly from 43% in v3 (the share thins out as dim grows), but this remains the large-vocab overhead of gpt2 (see docs/known-issues.md KI-4).
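
The core/table split above can be checked from the totals in the table. This is a minimal Python sketch of the `num_params() − 2·vocab·dim` accounting, not the Rust `Config::core_params()` itself; the helper names and the `TOTALS` dictionary are illustrative, and the totals are taken as given rather than derived from the layer shapes.

```python
VOCAB = 50_257


def embed_lm_head_params(vocab: int, dim: int) -> int:
    """Untied embedding + lm_head: two vocab x dim tables."""
    return 2 * vocab * dim


def core_params(total: int, vocab: int, dim: int) -> int:
    """Core = everything except the two vocabulary-sized tables."""
    return total - embed_lm_head_params(vocab, dim)


# (total parameters, dim) as reported in the architecture table
TOTALS = {"v3": (118_590_464, 512), "v4": (204_627_456, 768)}

report = {}
for name, (total, dim) in TOTALS.items():
    tables = embed_lm_head_params(VOCAB, dim)
    report[name] = {
        "core": core_params(total, VOCAB, dim),
        "tables": tables,
        "table_share": tables / total,
    }
    print(f"{name}: core={report[name]['core']:,} "
          f"tables={tables:,} share={report[name]['table_share']:.1%}")
```

The share of the vocabulary tables drops as dim grows because the core grows roughly with dim² per layer while the tables grow only linearly in dim.
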
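
**Architecture changes vs v3**: pure scale-up (dim 512→768 / layers 16→18 / heads 16→24, with head_dim and ffn unchanged), no structural changes. The ladder is already parameterized, so v4 only changes the `--dim/--heads/--layers/--ffn/--steps` flags and leaves the model code untouched.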
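
## Trainer: 8-GPU DDP fp32 (backed by the T11 caching allocator)

v2 ran 4-GPU DDP (T8), and because global_batch=32 was too small, its scaling was held back by KI-1 (all-reduce share too high). The T10/T11 investigations disproved the premises of KI-1 one layer at a time and fixed the real causes:

- **T10 (batched forward)**: the real reason a single GPU was slow in the v2 era was not communication but the single-sequence forward launching every op separately (util 0-15%). Flattened linears + fused batched causal SDPA took single-GPU throughput from 1653 to 25627 tok/s.
- **T11 (caching allocator)**: profiling disproved the "bucketed all-reduce" hypothesis (all-reduce is only a 7% share); the real cause was serialized per-op `cudaMalloc`. A per-device size-classed caching allocator (blocks returned on Drop, thread-safe) gave **single-GPU 40K→93K tok/s (2.3×) and 8-GPU 49K→461K tok/s (9.4×, scaling 1.3×→5×, all 8 GPUs at 95-99% util)**.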
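
To make the T11 idea concrete, here is a toy Python model of a size-classed caching allocator. It is a sketch under stated assumptions, not the project's Rust/CUDA implementation: the power-of-two size classes, the 512-byte minimum class, and all names are hypothetical, `raw_allocs` merely stands in for real `cudaMalloc` calls, and locking for thread safety is omitted.

```python
from collections import defaultdict


def size_class(nbytes: int, min_class: int = 512) -> int:
    """Round a request up to the next power-of-two class (assumed policy)."""
    cls = min_class
    while cls < nbytes:
        cls *= 2
    return cls


class CachingAllocator:
    """One pool per device: freed blocks are cached by size class and reused."""

    def __init__(self):
        self.free_lists = defaultdict(list)  # size class -> cached block ids
        self.raw_allocs = 0                  # stand-in for cudaMalloc calls
        self._next_id = 0

    def alloc(self, nbytes: int):
        cls = size_class(nbytes)
        if self.free_lists[cls]:
            # Cache hit: reuse a block without touching the driver.
            return (self.free_lists[cls].pop(), cls)
        self.raw_allocs += 1
        self._next_id += 1
        return (self._next_id, cls)

    def free(self, block) -> None:
        # Return the block to its class instead of releasing it.
        block_id, cls = block
        self.free_lists[cls].append(block_id)


pool = CachingAllocator()
for _ in range(1000):          # 1000 "ops", each needing two temporaries
    a = pool.alloc(3000)
    b = pool.alloc(700)
    pool.free(a)
    pool.free(b)
print("raw allocations:", pool.raw_allocs)
```

In this toy loop only the first iteration reaches the raw allocator; every later request is served from the cache, which is the effect the per-op `cudaMalloc` fix relies on.
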
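
v4 therefore returns to 8-GPU DDP fp32 with confidence: thread-per-GPU, with device gradients all-reduced to their mean, after which each rank runs its local GpuAdamW step, keeping parameters bit-identical across ranks. Steady-state throughput is **~144,650 tok/s** for the whole run, finishing 720.9M tokens in ~84 min, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s).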
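
The reason the replicas stay bit-identical can be shown with a small CPU-only Python sketch. It is illustrative only: plain SGD is swapped in for GpuAdamW to keep the example short, the all-reduce is simulated with lists, and every name here is hypothetical.

```python
def all_reduce_mean(per_rank_grads):
    """Simulated all-reduce: every rank receives the same mean gradient."""
    world = len(per_rank_grads)
    n = len(per_rank_grads[0])
    mean = [sum(g[i] for g in per_rank_grads) / world for i in range(n)]
    return [list(mean) for _ in range(world)]


def sgd_step(params, grads, lr):
    """Stand-in for the local optimizer step (the real run uses AdamW)."""
    return [p - lr * g for p, g in zip(params, grads)]


WORLD = 8
# All ranks start from identical parameters.
params = [[0.5, -1.0, 2.0] for _ in range(WORLD)]
# Each rank computes different gradients from its own micro-batch.
local_grads = [[0.1 * (r + 1), -0.2, 0.05 * r] for r in range(WORLD)]

reduced = all_reduce_mean(local_grads)
params = [sgd_step(p, g, lr=0.1) for p, g in zip(params, reduced)]

replicas_identical = all(p == params[0] for p in params)
print("replicas identical:", replicas_identical)
```

Because every rank applies the same deterministic update to the same starting parameters with the same averaged gradient, no parameter broadcast is needed after the step.
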
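
⚠️ **Batch constraint (the bf16 trigger)**: with dim768 in fp32, **per-rank batch 32 (global 256) OOMs** in 32GB of memory on a single GPU, forcing a drop to **per-rank 16 (global 128)**. After bf16 (KI-2) was deferred throughout the tiny v0–v3 runs, this is the first hard constraint where fp32 no longer fits. bf16 (halving activation memory) would recover the batch-256 sweet spot. The KI-2 trigger in docs/known-issues.md has been backfilled accordingly.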

## Hyperparameters

| Item | Value | Notes |
|----|----|----|
| optimizer | Hand-written AdamW (step runs on the GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | Linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1/v2/v3) |
| warmup | **1100 steps** (steps/20; lr peaks at 6.00e-4 at step 1100) | |
| grad clip | global-norm 1.0 | gnorm stays at ~0.20–0.21 throughout, stable |
| steps | **22000** | ~84 min on 8 GPUs |
| global batch | **128** (per-rank 16 × world 8) | **8-GPU DDP**; per-rank 32 OOMs (see above) |
| seq_len | **256** | Same as v2/v3 |
| tokens/step | 128×256 = 32768 | Total training tokens ≈ **720.9M** (~1.54 epochs) |
| world size | **8** (RTX 5090, sm_120) | Near-linear multi-GPU after T11 fixed KI-5 |
| Precision | f32 (training) | Converted to BF16 when exporting to xserv (see T9) |
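
The LR schedule row can be written out as a short function. This is a minimal Python sketch assuming the common form of linear warmup followed by cosine decay over the remaining steps; the exact step offsets used by xtrain are not given in this document, so `lr_at` and its boundary handling are assumptions.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, TOTAL_STEPS = 1_100, 22_000


def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold at MIN_LR past the last step
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


for step in (0, 550, 1_100, 11_550, 22_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

With these constants the peak lands at step 1100, matching the warmup row, and the schedule reaches min_lr at the final step.
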

**Compute**: dash5, 8× RTX 5090, ~144,650 tok/s for the whole run (~130K right at startup, reaching the ~145K steady state from step 50), wall-clock ≈ **84 minutes**.
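
As a sanity check on the wall-clock figure, the steady-state throughput alone gives a lower bound on the run time. This is a small Python sketch with illustrative names; it ignores startup, validation passes, and checkpointing, which this document does not break down.

```python
def wall_clock_minutes(total_tokens: int, tok_per_s: float) -> float:
    """Pure training time implied by a constant throughput."""
    return total_tokens / tok_per_s / 60.0


TOTAL_TOKENS = 22_000 * 128 * 256   # steps x global batch x seq_len
V4_TOK_PER_S = 144_650              # reported steady state, 8 GPUs
V2_TOK_PER_S = 3_600                # reported v2-era 4-GPU DDP

minutes = wall_clock_minutes(TOTAL_TOKENS, V4_TOK_PER_S)
speedup = V4_TOK_PER_S / V2_TOK_PER_S

print(f"training time at steady state: {minutes:.1f} min")
print(f"throughput vs v2-era DDP: {speedup:.1f}x")
```

The estimate comes out slightly below the reported ~84 minutes, which would be consistent with some non-training overhead.
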

## Results

- **train loss**: start **11.0689** → end **~1.14** (last batch 1.1432; smooth decline throughout)
- **best / final val loss (held-out 1M tokens, step 21999)**: **1.1690** (close to the ~1.0-1.1 target)
- val loss curve (sampled roughly every 2000 steps; monotonically decreasing, still falling at the final step, **no overfitting**):

| step | 499 | 1999 | 3999 | 5999 | 7999 | 9999 | 11999 | 13999 | 15999 | 17999 | 19999 | 21999 |
|------|-----|------|------|------|------|------|-------|-------|-------|-------|-------|-------|
| val | 2.5217 | 1.6493 | 1.4875 | 1.4056 | 1.3571 | 1.3161 | 1.2697 | 1.2414 | 1.2177 | 1.1978 | 1.1762 | **1.1690** |

The val loss falls all the way to the final step with no rebound, so the model is still **underfit**; more steps/data (or a larger model) can push it lower (the v5 levers).

### Samples (greedy, sampled directly from xtrain, same prompts)

```
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the sunshine. One day, she saw a big, scary dog. The dog barked
loudly and Lily got scared. She
[One day] → One day, a little girl named Lily went to the park with her mom. She saw a
big tree with a swing. Lily wanted to play on the swing, but she was too
small. She asked her
[The little] → The little girl was so happy that she had found the perfect place to hide.
She stayed there for a long time, until it was time to go home. She said
goodbye to the tree and ran back home
```

## Improvement over v3

**Best val loss (the best held-out 1M-token value reported by each version's own training run, same held-out set)**:

| Model | Core params | Training tokens | **Best val loss** | Notes |
|------|-----------|-----------|-------------------|------|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | Full dataset + dim256/8L, single GPU |
| v2 | 28.32M | ~36.9M | 1.7055 | dim384/12L + 4-GPU DDP |
| v3 | 67.13M | ~245.8M (~0.53 ep) | 1.3027 | dim512/16L + single-GPU batched; val 0.40 lower than v2 |
| v4 | 127.43M (**×1.90** vs v3) | ~720.9M (**×2.9** vs v3, ~1.54 ep) | **1.1690** | dim768/18L + **8-GPU DDP fp32**; val **0.13** lower than v3 |

**Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30 / v4 1.17**, decreasing monotonically at every rung on the same 1M-token held-out set. Note that the v3→v4 drop (0.13) is smaller than v2→v3 (0.40): diminishing marginal returns are expected (the lower the loss, the harder it is to reduce further), and v4 is still underfit (still falling at the final step), which suggests that a 127M core has not yet hit its capacity ceiling on TinyStories; more tokens or a broader corpus still have room to help.
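
The per-rung deltas quoted above can be recomputed from the best-val column. A minimal Python sketch follows; `BEST_VAL` copies the table values and `rung_deltas` is an illustrative helper name.

```python
BEST_VAL = {"v0": 3.8050, "v1": 2.5847, "v2": 1.7055, "v3": 1.3027, "v4": 1.1690}


def rung_deltas(ladder):
    """Drop in best val loss between consecutive rungs of the ladder."""
    names = list(ladder)
    return {f"{a}->{b}": ladder[a] - ladder[b] for a, b in zip(names, names[1:])}


deltas = rung_deltas(BEST_VAL)
monotonic = all(d > 0 for d in deltas.values())

for rung, drop in deltas.items():
    print(f"{rung}: val drops by {drop:.2f}")
print("monotonically decreasing:", monotonic)
```
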

### Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts)

| prompt | v3 | v4 |
|--------|----|----|
| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran** | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She** |
| `One day` | One day, a little girl named Lily went to the park with her mom. **They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said,** | One day, a little girl named Lily went to the park with her mom. **She saw a big tree with a swing. Lily wanted to play on the swing, but she was too small. She asked her** |
| `The little` | The little girl was so **excited. She ran to the kitchen and grabbed a spoon. She started to stir the soup. She stirred and stirred until it was all mixed together.** | The little girl was so **happy that she had found the perfect place to hide. She stayed there for a long time, until it was time to go home. She said goodbye to the tree and ran back home** |
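
**Conclusion**: v3 (67M core / 245.8M tokens) can already write continuous narratives with motivation and turning points. From the **same openings**, v4 (127M core / 720.9M tokens / ~1.54 epochs) produces more concrete plots and finer-grained motivation ("too small" rather than a generic "scared", and the complete arc of "perfect place to hide → stayed → said goodbye → ran back home") and wraps up more naturally. **Best val 1.30→1.17 (0.13 lower), plus samples moving from "narrative with motivation" to "short stories with more concrete detail and a more complete structure"**: v4 is a clear, quantifiable improvement over v3.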

## xserv verification

Export to HF Qwen3 safetensors (name mapping + transposing 2D weights [in,out]→[out,in] + BF16, see T9 `docs/08`; **201 tensors** = 18 layers × 11 + embed + norm + lm_head), store it in the registry, then load it with `xserv-cli` and generate greedily:

```
$ xserv-cli ~/projects/tiny-models/v4-tinystories-dim768 --max-tokens 40
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She
xserv> One day, a little girl named Lily went to the park with her mom. She saw a big tree with
a swing. Lily wanted to play on the swing, but she was too small. She asked her
xserv> The little girl was so happy that she had found the perfect place to hide. She stayed there
for a long time, until it was time to go home. She said goodbye to the tree and ran back home
```
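
The "Loaded 201 tensors" line matches the 18 × 11 + 3 count from the export step. The Python sketch below enumerates a plausible tensor list and shows the [in,out]→[out,in] transpose on a tiny matrix. The tensor names follow what is believed to be the Hugging Face Qwen3 naming convention and are an assumption, not a copy of the project's mapping table; `tensor_names` and `transpose_2d` are illustrative helpers.

```python
# 11 tensors per layer (names assumed to follow the HF Qwen3 convention)
PER_LAYER = [
    "input_layernorm.weight",
    "self_attn.q_proj.weight",
    "self_attn.k_proj.weight",
    "self_attn.v_proj.weight",
    "self_attn.o_proj.weight",
    "self_attn.q_norm.weight",
    "self_attn.k_norm.weight",
    "post_attention_layernorm.weight",
    "mlp.gate_proj.weight",
    "mlp.up_proj.weight",
    "mlp.down_proj.weight",
]


def tensor_names(n_layers: int):
    """Embed + final norm + lm_head, plus 11 tensors for each layer."""
    names = ["model.embed_tokens.weight", "model.norm.weight", "lm_head.weight"]
    for i in range(n_layers):
        names += [f"model.layers.{i}.{suffix}" for suffix in PER_LAYER]
    return names


def transpose_2d(weight):
    """[in, out] -> [out, in] for a matrix stored as a list of rows."""
    return [list(col) for col in zip(*weight)]


names = tensor_names(18)
print(len(names), "tensors")
print(transpose_2d([[1, 2, 3], [4, 5, 6]]))
```
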
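
**token-match**: comparing xserv (BF16) against xtrain's own greedy decoding (F32), **all 3 prompts match token for token** (zero divergence within 40 tokens), a tighter closed loop than v3 (2/3 matched). At v4 scale (127M core) and a length of 40 tokens, BF16 drift still does not flip any greedy choice, so the closed loop holds.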
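
A token-match check of this kind reduces to finding the first position where two greedy token-id sequences differ. Below is a minimal Python sketch; `first_divergence` is an illustrative helper and the token ids are made-up placeholders, not ids from the actual runs.

```python
def first_divergence(reference, candidate):
    """Index of the first differing token, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if a != b:
            return i
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None


# Placeholder token ids for illustration only.
f32_tokens = [7454, 2402, 257, 640, 11, 612, 373, 257]
bf16_same = list(f32_tokens)
bf16_drifted = f32_tokens[:5] + [340, 373, 257]

print(first_divergence(f32_tokens, bf16_same))
print(first_divergence(f32_tokens, bf16_drifted))
```

A prompt counts as matching when the function returns `None` over the full 40-token continuation.
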
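
## v5 proposal

v4's val curve falls monotonically through the final step (no overfitting), so the model is still **underfit**: a larger model, more tokens, or a broader corpus can all push val lower. Suggested for v5:

- **bf16 (KI-2, now triggered)**: v4 is the explicit trigger for bf16, since dim768 fp32 OOMs at per-rank batch 32. Start with bf16 mixed precision (fp32 master); halving activation memory recovers the batch-256 sweet spot (higher throughput, more stable convergence). This is the first lever v5 should pull.
- **Data**: v4 is only at ~1.54 epochs and still underfit, so **more TinyStories tokens** (a few more epochs) will very likely still lower val. With the core already at 127M, it is also a suitable point on the data ladder to **start broadening the corpus** (TinyStories plus some general high-quality text). Both are worthwhile: first use multi-epoch TinyStories to test whether data is the ceiling, then decide whether to change corpora.
- **Open levers (enable as needed)**:
  - **process-per-GPU (more linear 8-GPU scaling)**: v4's ~145K tok/s on 8 GPUs is already near-linear, but ~7% all-reduce + PCIe overhead remains; if v5 wants to push 8-GPU scaling closer to linear, it can move from single-process thread-per-GPU to process-per-GPU.
  - **KI-4 (large-vocab share)**: at dim768, embed/lm_head still account for 77.19M / 204.63M ≈ 38%; scaling the core further thins out the share, but for better efficiency a smaller or better-fitting tokenizer is worth considering.

The ladder is already parameterized, so v5 only needs to change the `--dim/--heads/--layers/--ffn/--steps` flags. Once bf16 lands, the fp32 and bf16 paths will coexist (the pool is already dtype-agnostic, so they stack cleanly; see the T12 backlog).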

@@ -21,10 +21,12 @@ The val loss column gives each version's **best val reported by its own training run** (held-out
| [v1-tinystories-dim256](01-v1-tinystories-dim256.md) | TinyStories **full train** (468.3M tok, u16 cache) | 256 / 8 / 8·32 / 1024 | 8.39M | 34.13M | **2.5847** | Full dataset + dim256/8L; val 1.22 lower, samples coherent as full passages; ~25.9min/single GPU |
| [v2-tinystories-dim384](02-v2-tinystories-dim384.md) | TinyStories full (reusing the v1 cache, ~36.9M tok trained) | 384 / 12 / 12·32 / 1536 | 28.32M | 66.92M | **1.7055** | dim384/12L + **4-GPU DDP**; val 0.88 lower than v1, longer plots; ~2.8h/4 GPUs. ⚠️ For weak DDP scaling see [KI-1](../known-issues.md) |
| [v3-tinystories-dim512](03-v3-tinystories-dim512.md) | TinyStories full (reusing the v1 cache, ~245.8M tok trained, ~0.53 epoch) | 512 / 16 / 16·32 / 2048 | 67.13M | 118.59M | **1.3027** | dim512/16L + **single-GPU batched (T10)**; val 0.40 lower than v2, continuous narratives with motivation and turning points; ~2.65h/single GPU at ~26K tok/s. T10 fixed the root cause of KI-1 (launch-bound); running single-GPU sidesteps KI-5 |
| [v4-tinystories-dim768](04-v4-tinystories-dim768.md) | TinyStories full (reusing the v1 cache, ~720.9M tok trained, ~1.54 epoch) | 768 / 18 / 24·32 / 2048 | 127.43M | 204.63M | **1.1690** | dim768/18L + **8-GPU DDP fp32**; val 0.13 lower than v3, more concrete detail and more complete structure; ~84min/8 GPUs at ~145K tok/s. Validates T11 caching-allocator multi-GPU scaling at dim768; ⚠️ fp32 per-rank batch 32 OOM = bf16 (KI-2) trigger |
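
## Next rung (proposal)

- **v4** (not yet dispatched): see the "v4 proposal" at the end of `03-v3-*.md`: scale up to dim640–768/20–24L (~130–200M core) + ~600M–1B tokens, targeting val ~1.0–1.1; multi-GPU first requires fixing KI-5 (bucketed all-reduce), KI-2/3 (bf16/recomputation) get enabled once the model grows, and the corpus starts broadening along the data ladder (TinyStories + general high-quality text).
- **v5** (not yet dispatched): see the "v5 proposal" at the end of `04-v4-*.md`: first adopt **bf16 (KI-2, already triggered by v4: dim768 fp32 batch-32 OOM)** to recover the batch-256 sweet spot; on data, v4 is only at ~1.54 epochs and still underfit, so **more TinyStories tokens / starting to broaden the corpus** (TinyStories + general high-quality text) should keep lowering val; as needed, move to process-per-GPU for more linear 8-GPU scaling and switch to a better-fitting tokenizer (KI-4).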