docs: run v4 — TinyStories, dim768, val 1.17

Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep /
arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128
per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8
DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU /
v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run
validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32
batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table
to v0/v1/v2/v3/v4 and the next-rung proposal to v5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:37 +08:00
parent 734e119db3
commit ff79fee3c5
2 changed files with 206 additions and 3 deletions


@@ -0,0 +1,201 @@
# Scaling Run v4: TinyStories + dim768/18L + 8-GPU DDP fp32 (T11): Design Document
## Goal
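Building on v3 (dim512/16L, core 67.13M, ~245.8M training tokens, single-GPU batched), v4 keeps scaling along two axes, **model and data**, and moves training **back to multi-GPU**. This time multi-GPU is not the v2-era DDP whose scaling was held back by KI-1; it is the right choice now that the **T11 caching allocator** has landed and 8-GPU scaling is near-linear:
1. **Scale the model**: dim 512→768, layers 16→18, heads 16→24 (head_dim stays 32), bringing the **transformer core to ~127M
   parameters** (×1.9 capacity). The vocabulary is unchanged, so with the gpt2 50257 vocab at dim768, embed+lm_head adds a fixed ~77.19M, which is reported separately.
2. **Scale the data**: v3 trained on ~245.8M tokens (~0.53 epoch) and was still underfit (val kept falling through the last step). v4 trains on **720.9M
   tokens** (×2.9), still reusing the full TinyStories token-id stream cached by v1 (468M tokens), for **~1.54 epochs**. It is the first run to
   pass 1 epoch and enter the multi-pass regime on TinyStories.
3. **8-GPU DDP fp32 (T11, near-linear multi-GPU)**: the assumed root causes of KI-1 (weak DDP scaling), exposed by v2, were falsified one layer at a time in T10/T11 and the real causes fixed.
   T10 fixed the launch-bound single-sequence forward; T11's per-device size-classed caching allocator then removed the serialization on per-op
   `cudaMalloc`: **8 GPUs 49K→461K tok/s, scaling 1.3×→5×, all 8 GPUs at 95-99% util**. v4 therefore returns to multi-GPU with confidence,
   sustaining ~144,650 tok/s and finishing 720.9M tokens in ~84 min.
4. After training, save to the registry (`~/projects/tiny-models/v4-tinystories-dim768/`), export to xserv format to verify the model is servable, and report
   the **concrete improvement over v3** (val loss on the same held-out set plus side-by-side samples from the same prompts).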
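> **Engineering significance of this run**: at a real scaling size (127M core / 720.9M tokens / 84 min) it **validated the multi-GPU scaling of the T11 caching allocator
> at dim768**: ~145K tok/s on 8 GPUs throughout at 95-99% util, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s).
> v4 is also the **concrete trigger point for bf16 (KI-2)**: at dim768 in fp32, per-rank batch 32 (global 256) OOMs in 32GB of GPU memory,
> forcing a drop to per-rank 16 (global 128). After bf16 was deferred throughout the tiny v0-v3 scale, this is the first
> hard constraint where fp32 does not fit.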
## Data
| Item | v3 | v4 |
|----|----|----|
| Source | TinyStories **full train** (reusing the v1 cache) | same |
| Tokens (corpus) | 468,260,367 | same |
| **Training tokens consumed** | ~245.8M (30000 steps × 8192) | **~720.9M** (22000 steps × 32768) |
| Epochs | ~0.53 | **~1.54** (first to pass 1 epoch) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936MB) | **reused directly** |
| held-out val | last 1,000,000 tokens of the full stream | **the same 1M tokens** (exactly the same held-out set as v0/v1/v2/v3, for a fair comparison) |
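**Cache reuse**: `Corpus::load_cached` loads from `<corpus>.u16.bin`, so 467.26M train tokens are available at startup (the last 1M are held out for val).
The held-out val set is still the last 1M tokens of the full stream (`split_tail`), the same held-out set as v0-v3, so **val loss is directly comparable across v0-v4**.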
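
The tail split described above can be sketched in a few lines. This is a minimal illustration, not the project's `split_tail`: the Python function below only borrows the name, and the toy `array("H")` stream stands in for the cached u16 token ids.

```python
from array import array

def split_tail(tokens, val_tokens):
    """Hold out the last `val_tokens` ids as the validation set (hypothetical helper)."""
    if not 0 < val_tokens < len(tokens):
        raise ValueError("val_tokens must be in (0, len(tokens))")
    cut = len(tokens) - val_tokens
    return tokens[:cut], tokens[cut:]

# Toy stand-in for the cached stream: unsigned 16-bit ids, 2 bytes each.
stream = array("H", range(10))
train, val = split_tail(stream, 3)
print(len(train), list(val))

# Same arithmetic at the sizes reported in the table.
CORPUS_TOKENS = 468_260_367
VAL_TOKENS = 1_000_000
train_tokens = CORPUS_TOKENS - VAL_TOKENS
print(train_tokens)
```

Because the split is a fixed offset from the end of the same cached stream, every run that reuses the cache sees the same validation tokens.
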
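**Data ladder**: v4 is the **first to pass 1 epoch** (~1.54). With the core at 127M and ~1.54 epochs the model is still underfit (val still falling at the last step),
which indicates that on TinyStories the 127M core has not yet reached its capacity limit. The next rung (v5) is the right point to start **broadening the corpus** (TinyStories + some
general high-quality text) or to **keep squeezing TinyStories for more epochs**, while re-evaluating the tokenizer (KI-4).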
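
The token budget and epoch figures follow from steps × global batch × seq_len. The short check below assumes the epoch count is measured against the 467.26M train tokens (corpus minus the 1M held-out tail); that denominator is an inference from the reported ~0.53 and ~1.54, not something the source states.

```python
TRAIN_CORPUS_TOKENS = 468_260_367 - 1_000_000  # assumed denominator for "epochs"

def budget(steps, global_batch, seq_len):
    """Return (tokens per step, total training tokens, epochs over the train split)."""
    tokens_per_step = global_batch * seq_len
    total = steps * tokens_per_step
    return tokens_per_step, total, total / TRAIN_CORPUS_TOKENS

v3 = budget(steps=30_000, global_batch=32, seq_len=256)   # 8192 tokens/step
v4 = budget(steps=22_000, global_batch=128, seq_len=256)  # 32768 tokens/step

for name, (per_step, total, epochs) in (("v3", v3), ("v4", v4)):
    print(f"{name}: {per_step} tok/step, {total / 1e6:.1f}M tokens, {epochs:.2f} epochs")
```
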
## Architecture
v4 = a larger, isomorphic tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + untied lm_head, MHA).
The forward graph is structurally identical to v0/v1/v2/v3; only the dims grow. **No structural changes**.
| Dimension | v3 | v4 |
|------|----|----|
| dim (= heads·head_dim) | 512 | **768** |
| n_layers | 16 | **18** |
| n_heads | 16 | **24** |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 2048 | 2048 |
| vocab | 50257 | 50257 |
| **core params** (excluding embed+lm_head) | 67,127,296 (≈67.13M) | **127,432,704 (≈127.43M, ×1.90)** |
| embed + lm_head (2×vocab×dim) | 51,463,168 (≈51.46M) | 77,194,752 (≈77.19M) |
| **total params** | 118,590,464 (≈118.59M) | **204,627,456 (≈204.63M)** |
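**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. With the gpt2 50257 vocab at dim768,
embedding + lm_head take a fixed ~77.19M. These two tables are a function of **vocabulary size**, not model capacity, so the ladder is tracked by **core**:
v4 core is 127.43M. Note that embed/lm_head still account for ~38% (77.19M) of v4's 204.63M total, slightly down from v3's 43%
(the share dilutes as dim grows), but this is still the gpt2 large-vocab share problem (see docs/known-issues.md KI-4).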
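
The core/embedding split can be reproduced from the totals in the table. The sketch below is plain arithmetic on the reported numbers; the Python function names are illustrative and are not the project's `Config` API.

```python
VOCAB = 50257

def embed_lm_head_params(vocab, dim):
    """Untied embedding + lm_head: two vocab x dim tables."""
    return 2 * vocab * dim

def core_params(total, vocab, dim):
    """Core = total parameters minus the two vocabulary tables."""
    return total - embed_lm_head_params(vocab, dim)

runs = {"v3": (118_590_464, 512), "v4": (204_627_456, 768)}
report = {}
for name, (total, dim) in runs.items():
    tables = embed_lm_head_params(VOCAB, dim)
    core = core_params(total, VOCAB, dim)
    report[name] = (core, tables, tables / total)
    print(f"{name}: core={core:,} tables={tables:,} share={tables / total:.1%}")

print(f"core ratio v4/v3 = {report['v4'][0] / report['v3'][0]:.2f}")
```
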
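**Architecture changes vs v3**: pure scaling (dim 512→768 / layers 16→18 / heads 16→24; head_dim and ffn unchanged), with no structural changes.
The ladder is already parameterized: v4 only changes the `--dim/--heads/--layers/--ffn/--steps` flags and does not touch model code.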
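## Trainer: 8-GPU DDP fp32 (backed by the T11 caching allocator)
v2 used DDP (T8) on 4 GPUs, and its scaling was held back by KI-1 (all-reduce share too high) because global_batch=32 was too small. The T10/T11 investigation
falsified KI-1's premises one layer at a time and fixed the real causes: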
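- **T10 (batched forward)**: the real reason the v2-era single GPU was slow was not communication but the single-sequence forward launching every op separately
  (util 0-15%). Flattened linears + fused batched causal SDPA → single GPU 1653→25627 tok/s.
- **T11 (caching allocator)**: profiling falsified the "bucketed all-reduce" hypothesis (all-reduce accounts for only 7%); the real cause was per-op `cudaMalloc`
  serialization. A per-device size-classed caching allocator (blocks returned on Drop, thread-safe) → **single GPU 40K→93K tok/s (2.3×),
  8 GPUs 49K→461K tok/s (9.4×), scaling 1.3×→5×, all 8 GPUs at 95-99% util**.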
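
The idea behind a size-classed caching allocator can be shown with a toy model. This is a sketch under stated assumptions, not the T11 implementation: `bytearray` stands in for a `cudaMalloc` call, the power-of-two size classes and the 256-byte minimum are illustrative choices, and an explicit `free` replaces the Drop-based return.

```python
import threading
from collections import defaultdict

class CachingAllocator:
    """Toy per-device pool: freed blocks are kept in per-size-class free lists."""

    def __init__(self):
        self._free = defaultdict(list)   # size class -> cached blocks
        self._lock = threading.Lock()    # guards the free lists and the counter
        self.raw_allocs = 0              # how often the slow path was taken

    @staticmethod
    def size_class(nbytes):
        # Assumption: round up to the next power of two, minimum 256 bytes.
        size = 256
        while size < nbytes:
            size *= 2
        return size

    def alloc(self, nbytes):
        cls = self.size_class(nbytes)
        with self._lock:
            if self._free[cls]:
                return self._free[cls].pop()   # fast path: reuse a cached block
            self.raw_allocs += 1
        return bytearray(cls)                  # slow path: stand-in for cudaMalloc

    def free(self, block):
        with self._lock:
            self._free[len(block)].append(block)

pool = CachingAllocator()
for _ in range(1000):          # e.g. one temporary buffer per op
    buf = pool.alloc(3000)
    pool.free(buf)
print(pool.raw_allocs)
```

In this toy loop only the first request reaches the slow path; every later request of the same size class is served from the free list, which is the effect that removes per-op allocation from the critical path.
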
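v4 therefore returns to 8-GPU DDP fp32 with confidence (thread-per-GPU; device gradients are all-reduced and averaged, then each rank runs a local GpuAdamW step,
so parameters stay bit-identical across ranks). It sustains **~144,650 tok/s** and finishes 720.9M tokens in ~84 min (~40× faster than the v2-era 4-GPU DDP
at ~3.6K tok/s).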
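
Why averaged gradients keep replicas identical can be seen in a small simulation. This is an illustrative sketch only: ranks are plain Python lists, `all_reduce_mean` is a hypothetical helper rather than a real collective, and a plain SGD update is swapped in for AdamW to keep it short.

```python
def all_reduce_mean(per_rank_grads):
    """Every rank ends up with the same element-wise mean gradient."""
    world = len(per_rank_grads)
    n = len(per_rank_grads[0])
    mean = [sum(g[i] for g in per_rank_grads) / world for i in range(n)]
    return [list(mean) for _ in range(world)]

def sgd_step(params, grads, lr):
    # Stand-in for the local optimizer step; the source uses AdamW.
    return [p - lr * g for p, g in zip(params, grads)]

WORLD = 8
replicas = [[1.0, 2.0] for _ in range(WORLD)]                   # identical start
local_grads = [[float(r), 2.0 * r] for r in range(WORLD)]       # differ per rank

synced = all_reduce_mean(local_grads)
replicas = [sgd_step(p, g, lr=0.5) for p, g in zip(replicas, synced)]

print(synced[0], replicas[0])
print(all(r == replicas[0] for r in replicas))
```

Each rank applies the same deterministic update to the same starting parameters using the same averaged gradient, so no parameter broadcast is needed after the step.
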
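⚠️ **Batch constraint (the bf16 trigger point)**: at dim768 in fp32, **per-rank batch 32 (global 256) OOMs** in a single GPU's 32GB of memory,
forcing a drop to **per-rank 16 (global 128)**. After bf16 (KI-2) was deferred throughout the tiny v0-v3 scale, this is the first hard constraint where fp32
does not fit. bf16 (halving activations) can recover the batch-256 sweet spot. The trigger point has been backfilled into docs/known-issues.md KI-2.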
## Hyperparameters
| Item | Value | Notes |
|----|----|----|
| optimizer | hand-written AdamW (step on GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1/v2/v3) |
| warmup | **1100 steps** (steps/20; lr peaks at 6.00e-4 at step 1100) | |
| grad clip | global-norm 1.0 | gnorm ~0.20-0.21 throughout, stable |
| steps | **22000** | ~84 min @ 8 GPUs |
| global batch | **128** (per-rank 16 × world 8) | **8-GPU DDP**; per-rank 32 OOMs (see above) |
| seq_len | **256** | same as v2/v3 |
| tokens/step | 128×256 = 32768 | total training tokens ≈ **720.9M** (~1.54 epochs) |
| world size | **8** (RTX 5090, sm_120) | near-linear multi-GPU after T11 fixed KI-5 |
| Precision | f32 (training) | converted to BF16 when exporting to xserv (see T9) |
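
The warmup-then-cosine schedule in the table can be sketched as follows. The endpoints (peak 6e-4 at step 1100, floor 6e-5) come from the table; the exact interpolation, including whether the trainer starts from lr 0 and where the cosine ends, is an assumption made for this example.

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=1100, total=22000):
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup:
        return max_lr * step / warmup
    progress = min((step - warmup) / (total - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 550, 1100, 11550, 22000):
    print(s, f"{lr_at(s):.2e}")
```
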
**Compute**: dash5, 8× RTX 5090, ~144,650 tok/s throughout (~130K at startup, steady ~145K from step 50); wall-clock
**84 minutes**.
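
As a sanity check, the wall-clock and speedup figures can be derived from the throughput numbers. The estimate below counts training tokens only, so it lands slightly under the reported 84 minutes; the source does not break down the remainder, and attributing it to validation passes and startup would be a guess.

```python
train_tokens = 22_000 * 128 * 256     # steps x global batch x seq_len
steady_tok_per_s = 144_650            # reported steady-state throughput

train_minutes = train_tokens / steady_tok_per_s / 60
speedup_vs_v2 = 145_000 / 3_600       # ~145K tok/s vs the v2-era ~3.6K tok/s

print(f"pure training time: {train_minutes:.1f} min")
print(f"speedup vs v2-era 4-GPU DDP: {speedup_vs_v2:.1f}x")
```
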
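## Results
- **train loss**: start **11.0689** → end **~1.14** (last batch 1.1432; smooth decline throughout)
- **best / final val loss** (held-out 1M tokens, step 21999): **1.1690** (close to the ~1.0-1.1 target)
- val loss curve (sampled about every 2000 steps; monotonically decreasing, still falling at the last step, **no overfitting**):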
| step | 499 | 1999 | 3999 | 5999 | 7999 | 9999 | 11999 | 13999 | 15999 | 17999 | 19999 | 21999 |
|------|-----|------|------|------|------|------|-------|-------|-------|-------|-------|-------|
| val | 2.5217 | 1.6493 | 1.4875 | 1.4056 | 1.3571 | 1.3161 | 1.2697 | 1.2414 | 1.2177 | 1.1978 | 1.1762 | **1.1690** |
val falls all the way to the last step with no rebound = still **underfit**; more steps/data or a larger model can keep lowering it (the v5 levers).
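
The "monotonic, still falling" reading of the curve can be checked directly from the table values. This is a small sketch over the reported checkpoints only; it says nothing about steps between them.

```python
val_curve = [
    (499, 2.5217), (1999, 1.6493), (3999, 1.4875), (5999, 1.4056),
    (7999, 1.3571), (9999, 1.3161), (11999, 1.2697), (13999, 1.2414),
    (15999, 1.2177), (17999, 1.1978), (19999, 1.1762), (21999, 1.1690),
]

# Drop in val loss between consecutive checkpoints, keyed by the later step.
drops = [
    (step_b, round(val_a - val_b, 4))
    for (_, val_a), (step_b, val_b) in zip(val_curve, val_curve[1:])
]
monotone = all(d > 0 for _, d in drops)
best_step, best_val = min(val_curve, key=lambda sv: sv[1])

print(monotone, best_step, best_val)
print(drops[-1])
```

The best value sits at the final checkpoint and the last interval still shows a positive drop, which is what the underfit conclusion rests on.
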
### Samples (greedy, sampled directly from xtrain, same prompts)
```
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the sunshine. One day, she saw a big, scary dog. The dog barked
loudly and Lily got scared. She
[One day] → One day, a little girl named Lily went to the park with her mom. She saw a
big tree with a swing. Lily wanted to play on the swing, but she was too
small. She asked her
[The little] → The little girl was so happy that she had found the perfect place to hide.
She stayed there for a long time, until it was time to go home. She said
goodbye to the tree and ran back home
```
## Improvement over v3
**best val loss (the best held-out 1M-token value reported by each version's own training run; same held-out set)**:
| Model | core params | training tokens | **best val loss** | Notes |
|------|-----------|-----------|-------------------|------|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
| v2 | 28.32M | ~36.9M | 1.7055 | dim384/12L + 4-GPU DDP |
| v3 | 67.13M | ~245.8M (~0.53 ep) | 1.3027 | dim512/16L + single-GPU batched; val 0.40 lower than v2 |
| v4 | 127.43M (**×1.90** vs v3) | ~720.9M (**×2.9** vs v3, ~1.54 ep) | **1.1690** | dim768/18L + **8-GPU DDP fp32**; val **0.13** lower than v3 |
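**Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30 / v4 1.17**, decreasing monotonically at every rung on the same 1M-token held-out set.
Note that the val drop from v3→v4 (0.13) is smaller than v2→v3 (0.40). Diminishing returns are expected (the lower the loss, the harder it is to lower further),
and v4 is still underfit (still falling at the last step), which indicates the 127M core has not yet hit its capacity ceiling on TinyStories: more tokens or a broader corpus still have headroom.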
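
The per-rung improvements quoted above come straight from the best-val column. A short check, using only the numbers in the table:

```python
best_val = {"v0": 3.8050, "v1": 2.5847, "v2": 1.7055, "v3": 1.3027, "v4": 1.1690}

names = list(best_val)
deltas = {
    f"{a}->{b}": round(best_val[a] - best_val[b], 4)
    for a, b in zip(names, names[1:])
}
for rung, drop in deltas.items():
    print(rung, drop)

# Diminishing returns: each rung's drop is smaller than the previous one.
diminishing = all(x > y for x, y in zip(list(deltas.values()), list(deltas.values())[1:]))
print(diminishing)
```
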
### Side-by-side samples (greedy, 40 tok, served by xserv, same prompts)
| prompt | v3 | v4 |
|--------|----|----|
| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran** | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She** |
| `One day` | One day, a little girl named Lily went to the park with her mom. **They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said,** | One day, a little girl named Lily went to the park with her mom. **She saw a big tree with a swing. Lily wanted to play on the swing, but she was too small. She asked her** |
| `The little` | The little girl was so **excited. She ran to the kitchen and grabbed a spoon. She started to stir the soup. She stirred and stirred until it was all mixed together.** | The little girl was so **happy that she had found the perfect place to hide. She stayed there for a long time, until it was time to go home. She said goodbye to the tree and ran back home** |
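**Conclusion**: v3 (67M core / 245.8M tokens) can already write continuous narratives with motivation and turns. From the **same openings**, v4 (127M core / 720.9M tokens /
~1.54 epochs) gives more concrete plots and finer motivation ("too small" rather than a generic "scared"; a complete arc of "perfect place to
hide → stayed → said goodbye → ran back home") and wraps up more naturally. **best val 1.30→1.17
(0.13 lower) + samples going from "narratives with motivation" to "short stories with more concrete detail and more complete structure"**: v4 is a clear, quantifiable improvement over v3.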
## xserv validation
Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16; see T9 `docs/08`):
**201 tensors** = 18 layers × 11 + embed + norm + lm_head. After saving to the registry, load with `xserv-cli` and generate greedily:
```
$ xserv-cli ~/projects/tiny-models/v4-tinystories-dim768 --max-tokens 40
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She
xserv> One day, a little girl named Lily went to the park with her mom. She saw a big tree with
a swing. Lily wanted to play on the swing, but she was too small. She asked her
xserv> The little girl was so happy that she had found the perfect place to hide. She stayed there
for a long time, until it was time to go home. She said goodbye to the tree and ran back home
```
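
The tensor count and the [in,out]→[out,in] transpose from the export step can be illustrated in isolation. The tensor names below follow the usual Hugging Face Qwen3 layout and are an assumption here, since the source does not list them; the nested-list transpose stands in for whatever tensor type the exporter actually uses.

```python
# Assumed HF-style names: 11 tensors per decoder layer.
PER_LAYER = [
    "input_layernorm.weight",
    "self_attn.q_proj.weight",
    "self_attn.k_proj.weight",
    "self_attn.v_proj.weight",
    "self_attn.o_proj.weight",
    "self_attn.q_norm.weight",
    "self_attn.k_norm.weight",
    "post_attention_layernorm.weight",
    "mlp.gate_proj.weight",
    "mlp.up_proj.weight",
    "mlp.down_proj.weight",
]

def tensor_names(n_layers):
    """All exported tensor names: per-layer tensors + embed + final norm + lm_head."""
    names = ["model.embed_tokens.weight"]
    for i in range(n_layers):
        names += [f"model.layers.{i}.{suffix}" for suffix in PER_LAYER]
    names += ["model.norm.weight", "lm_head.weight"]
    return names

def transpose_2d(w):
    """Turn an [in, out] matrix (list of rows) into [out, in]."""
    return [list(col) for col in zip(*w)]

names = tensor_names(18)
print(len(names))
print(transpose_2d([[1, 2, 3], [4, 5, 6]]))
```
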
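**token-match**: xserv (BF16) vs xtrain's own greedy decoding (F32): **all 3 prompts match exactly, token for token** (zero divergence
within 40 tok), a tighter closed loop than v3 (2/3 matched). At v4's scale (127M core) and a 40-tok length, BF16 drift still does not flip any greedy
choice, so the loop closes.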
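
A token-match check of this kind reduces to finding the first position where two greedy token sequences differ. The helper below is a hypothetical sketch with made-up token ids, not the project's comparison tool.

```python
def first_divergence(a, b):
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))   # one sequence is a strict prefix of the other
    return None

# Made-up ids purely for illustration.
f32_tokens = [7454, 2402, 257, 640, 11]
bf16_same = [7454, 2402, 257, 640, 11]
bf16_flip = [7454, 2402, 257, 1310, 11]

print(first_divergence(f32_tokens, bf16_same))
print(first_divergence(f32_tokens, bf16_flip))
```

With greedy decoding, a single flipped argmax changes every later token's context, so reporting the first divergence index is usually more informative than a match percentage.
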
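## v5 proposal
v4's val curve decreases monotonically to the last step (no overfitting) = still **underfit**; a larger model, more tokens, or a broader corpus can still lower it. Suggestions for v5: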
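- **bf16 (KI-2, now triggered)**: v4 is the explicit trigger point for bf16, since dim768 fp32 OOMs at per-rank batch 32. Adopt bf16
  mixed precision first (fp32 master); halving activations recovers the batch-256 sweet spot, which should raise throughput further and stabilize convergence. This is the lever v5
  should pull first.
- **Data**: v4 is only at ~1.54 epochs and still underfit, so **more TinyStories tokens** (a few more epochs) will most likely still lower val.
  At the same time, with the core at 127M, this is the right point on the data ladder to **start broadening the corpus** (TinyStories + some general high-quality text). Both are worthwhile:
  first use multi-epoch TinyStories to test whether data is the ceiling, then decide whether to switch corpora.
- **Open levers (enable as needed)**:
  - **process-per-GPU (higher 8-GPU linearity)**: v4's ~145K tok/s on 8 GPUs is already near-linear, but ~7% all-reduce + PCIe overhead remains;
    if v5 wants higher 8-GPU linearity, it can move from single-process thread-per-GPU to process-per-GPU.
  - **KI-4 (large-vocab share)**: at dim768, embed/lm_head still take 77.19M / 204.63M ≈ 38%; growing the core further dilutes
    the share, but for better efficiency a smaller or better-fitting tokenizer could be considered.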
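The ladder is already parameterized: v5 only needs to change the `--dim/--heads/--layers/--ffn/--steps` flags. Once bf16 lands, the fp32 and bf16 paths will coexist
(the pool is already dtype-agnostic, so bf16 can be layered on cleanly; see the T12 backlog).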
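
Why bf16 mixed precision is usually paired with an fp32 master copy can be shown with a small numeric sketch. It is an illustration of the general technique, not the planned v5 design: `to_bf16` emulates bf16 storage by rounding an fp32 bit pattern to its top 16 bits (round-to-nearest-even, NaN and overflow not handled), and the update size 2^-10 is chosen only to make the effect visible.

```python
import struct

def to_bf16(x):
    """Round a float to the nearest bf16 value and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)      # round to nearest, ties to even
    bits &= 0xFFFF0000                       # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

UPDATE = 2.0 ** -10     # smaller than half the bf16 spacing near 1.0 (2**-7)

bf16_only = 1.0         # weight stored and updated in bf16
master = 1.0            # higher-precision master weight
for _ in range(8):
    bf16_only = to_bf16(bf16_only + UPDATE)  # each tiny update rounds away
    master += UPDATE                         # the master copy accumulates it

print(bf16_only, master, to_bf16(master))
```

The bf16-only weight never moves, while the master copy accumulates the same updates and its bf16 view eventually steps to the next representable value. bf16 values take 2 bytes instead of 4, which is where the halved activation footprint comes from.
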


@@ -21,10 +21,12 @@ the val loss column gives each version's **best val as reported by its own training run**, held-out
| [v1-tinystories-dim256](01-v1-tinystories-dim256.md) | TinyStories **full train** (468.3M tok, u16 cache) | 256 / 8 / 8·32 / 1024 | 8.39M | 34.13M | **2.5847** | full data + dim256/8L; val 1.22 lower, samples coherent across a passage; ~25.9min/single GPU |
| [v2-tinystories-dim384](02-v2-tinystories-dim384.md) | TinyStories full (reusing the v1 cache, trained ~36.9M tok) | 384 / 12 / 12·32 / 1536 | 28.32M | 66.92M | **1.7055** | dim384/12L + **4-GPU DDP**; val 0.88 lower than v1, longer plots; ~2.8h/4 GPUs. ⚠️ weak DDP scaling, see [KI-1](../known-issues.md) |
| [v3-tinystories-dim512](03-v3-tinystories-dim512.md) | TinyStories full (reusing the v1 cache, trained ~245.8M tok, ~0.53 epoch) | 512 / 16 / 16·32 / 2048 | 67.13M | 118.59M | **1.3027** | dim512/16L + **single-GPU batched (T10)**; val 0.40 lower than v2, continuous narratives with motivation and turns; ~2.65h/single GPU at ~26K tok/s. T10 fixed the KI-1 root cause (launch-bound); single GPU sidesteps KI-5 |
| [v4-tinystories-dim768](04-v4-tinystories-dim768.md) | TinyStories full (reusing the v1 cache, trained ~720.9M tok, ~1.54 epoch) | 768 / 18 / 24·32 / 2048 | 127.43M | 204.63M | **1.1690** | dim768/18L + **8-GPU DDP fp32**; val 0.13 lower than v3, more concrete detail and more complete structure; ~84min/8 GPUs at ~145K tok/s. Validates T11 caching allocator multi-GPU scaling at dim768; ⚠️ fp32 per-rank batch 32 OOM = bf16 (KI-2) trigger point |
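## Next rung (proposal)
- **v4** (pending dispatch): see "v4 proposal" at the end of `03-v3-*.md`: scale to dim 640-768 / 20-24L (~130-200M core) +
  ~600M-1B tokens, target val ~1.0-1.1; multi-GPU requires fixing KI-5 (bucketed all-reduce) first; once the model grows, enable KI-2/3
  (bf16/recomputation) and start broadening the corpus along the data ladder (TinyStories + general high-quality text).
- **v5** (pending dispatch): see "v5 proposal" at the end of `04-v4-*.md`: first adopt **bf16 (KI-2; triggered by v4: dim768 fp32 batch-32
  OOM)** to recover the batch-256 sweet spot; on data, v4 is only at ~1.54 epochs and still underfit, so **more TinyStories tokens / starting to broaden
  the corpus** (TinyStories + general high-quality text) should keep lowering val; as needed, use process-per-GPU to raise 8-GPU linearity and switch to a better-fitting tokenizer
  (KI-4).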