docs: run v3 — TinyStories, dim512, val 1.30

Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full
TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what
changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup
1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40
best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side
samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by
staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 03:37:45 +08:00
parent 64b2a8c09e
commit a78502e0f0


@@ -0,0 +1,198 @@
# Scaling Run v3 Design Document: TinyStories + dim512/16L + single-GPU batched (T10)
## Goal
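Building on v2 (dim384/12L, 28.32M core params, ~37M training tokens, 4-GPU DDP), v3 scales further along
both the **model** and **data** axes and moves training back to a **single GPU**. Single-GPU is not a
regression here; it is the right choice now that the **T10 batched forward** has landed:
1. **Scale the model**: dim 384→512, layers 12→16, heads 12→16, bringing the **transformer core to ~67M
params** (×2.4 capacity). The vocabulary is unchanged, so with the gpt2 50257-token vocab, embed + lm_head
add a fixed ~51.46M at dim512; this is reported separately.
2. **Scale the data**: v2 trained on only ~37M tokens and was still underfit (val kept falling through the
final step). v3 trains on **245.8M tokens** (×6.7), still reusing the full TinyStories token-id stream
cached in v1 (468M tokens): **~0.53 epoch, no repeats**.
3. **Single-GPU batched (T10), sidestepping KI-5**: the presumed root cause of KI-1 (weak DDP scaling),
exposed by v2, was disproved and fixed by T10. The real bottleneck was not all-reduce but the **per-op
launch overhead of single-sequence forward** (GPU util 0-15%). T10 rewrote forward as `[B·S, dim]`
flattened linears plus a fused batched causal SDPA (cuBLAS strided-batched; attention takes only 3
launches), lifting single-GPU throughput **1653→25627 tok/s (15.5×)** and util to 37-54%. v3 is therefore
**fast enough on one GPU** and avoids KI-5, DDP's not-yet-bucketed all-reduce, which only matters with
multiple GPUs.
4. After training, save to the registry (`~/projects/tiny-models/v3-tinystories-dim512/`), export to xserv
format to verify the model can be served, and report the **concrete improvement over v2** (val loss on the
same held-out set plus side-by-side samples from the same prompts).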
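> **Engineering significance of this run**: it validates T10's batched forward at a real scaling size
> (67M core / 245.8M tokens / 2.65 h). Batched forward is the throughput foundation (~26K tok/s on one GPU
> ≈ 7× the 4-GPU DDP of the KI-1 era) and stays numerically correct (batched == single-sequence
> equivalence, all grad-checks green, xserv round trip holds). v3 runs on a **single GPU** throughout, so
> KI-5 is never triggered.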
## Data
| Item | v2 | v3 |
|----|----|----|
| Source | TinyStories **full train** (reusing the v1 cache) | same |
| Tokens (corpus) | 468,260,367 | same |
| **Tokens consumed in training** | ~36.9M (4500 steps × 8192) | **~245.8M** (30000 steps × 8192) |
| Epoch fraction | ~0.08 | **~0.53** (still <1 epoch, no repeats) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936MB) | **reused as-is** |
| held-out val | last 1,000,000 tokens of the full corpus | **the same 1M tokens** (held-out set identical to v0/v1/v2, for a fair comparison) |
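**Cache reuse**: `Corpus::load_cached` reads `<corpus>.u16.bin`, so the corpus loads immediately at startup
(467.26M train tokens; the last 1M serve as val). The held-out val set is still the last 1M tokens of the
full corpus (`split_tail`), the same held-out set as v0/v1/v2, so **val loss is directly comparable from v0
through v3**.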
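
As a rough illustration of the tail split and the cache size quoted in the table, here is a minimal Python sketch. The function is a hypothetical stand-in written for this note; it is not the project's `split_tail`, whose actual signature is not shown in this document. The constants are taken from the table above.

```python
def split_tail(tokens, n_val):
    """Hold out the last n_val tokens as validation; the rest is train."""
    if not 0 < n_val < len(tokens):
        raise ValueError("n_val must be between 1 and len(tokens) - 1")
    return tokens[:-n_val], tokens[-n_val:]


CORPUS_TOKENS = 468_260_367   # corpus size from the table above
VAL_TOKENS = 1_000_000        # held-out tail
BYTES_PER_TOKEN = 2           # u16 token ids

train_tokens = CORPUS_TOKENS - VAL_TOKENS
cache_bytes = CORPUS_TOKENS * BYTES_PER_TOKEN

# Tiny demonstration on a toy stream of 10 token ids.
train, val = split_tail(list(range(10)), 3)
print(train, val)                 # [0, 1, 2, 3, 4, 5, 6] [7, 8, 9]
print(train_tokens, cache_bytes)  # 467260367 936520734
```

Because the split is positional, every run that holds out the same tail of the same cached stream evaluates on identical tokens, which is what makes the val numbers comparable across versions.
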
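**Data ladder**: v3 still trains on TinyStories (~0.53 epoch, not yet exhausted; the model is also still in
tiny-LM territory, with the core now at 67M). There is still headroom before TinyStories becomes the
capacity ceiling (around ~100M+ core), so v3 keeps the corpus; once the core grows further, the data ladder
moves to broader high-quality corpora.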
## Architecture
v3 is a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + untied
lm_head, MHA). The forward graph is isomorphic to v0/v1/v2; only the dims grow. **No structural changes.**
| Dimension | v2 | v3 |
|------|----|----|
| dim (= heads·head_dim) | 384 | **512** |
| n_layers | 12 | **16** |
| n_heads | 12 | **16** |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 1536 | **2048** |
| vocab | 50257 | 50257 |
| **core params** (excl. embed+lm_head) | 28,322,304 (≈28.32M) | **67,127,296 (≈67.13M, ×2.37)** |
| embed + lm_head (2×vocab×dim) | 38,597,376 (≈38.60M) | 51,463,168 (≈51.46M) |
| **total params** | 66,919,680 (≈66.92M) | **118,590,464 (≈118.59M)** |
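**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. With the gpt2 50257-token
vocab at dim512, embedding + lm_head take a fixed ~51.46M. These two tables are a function of **vocabulary
size**, not model capacity, so the ladder reports **core** (v3 core 67.13M). Note that embed/lm_head still
account for ~43% (51.46M) of v3's 118.59M total params (the large-gpt2-vocab share issue is tracked in
docs/known-issues.md KI-4). The share falls as dim grows, but at dim512 it is still close to half of all
parameters.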
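
The accounting above can be checked with a few lines of arithmetic. This is a sketch in Python, not the project's `Config::core_params()`; the function names are chosen for illustration, and the totals are copied from the architecture table.

```python
def embed_and_head_params(vocab, dim):
    """Untied embedding + lm_head: two vocab x dim tables."""
    return 2 * vocab * dim


def core_params(total_params, vocab, dim):
    """Core = everything except the two vocabulary-sized tables."""
    return total_params - embed_and_head_params(vocab, dim)


VOCAB = 50257
runs = {
    "v2": {"dim": 384, "total": 66_919_680},
    "v3": {"dim": 512, "total": 118_590_464},
}

for name, cfg in runs.items():
    tables = embed_and_head_params(VOCAB, cfg["dim"])
    core = core_params(cfg["total"], VOCAB, cfg["dim"])
    share = tables / cfg["total"]
    print(f"{name}: tables={tables:,} core={core:,} table share={share:.1%}")
```

The vocabulary tables grow linearly with dim while the core grows roughly quadratically, which is why their share of the total shrinks as the ladder climbs.
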
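**Architecture changes vs v2**: pure scale-up (dim 384→512 / layers 12→16 / heads 12→16 / ffn 1536→2048),
no structural changes. The ladder is already parameterized: v3 only changes the
`--dim/--heads/--layers/--ffn/--steps` flags and leaves the model code untouched.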
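## Trainer: single-GPU batched (T10)
v2 used DDP (T8, 4 GPUs), but global_batch=32 was too small and scaling was held back by KI-1 (all-reduce
taking too large a share of step time). T10's investigation found that **the premise of KI-1 was wrong**:
the real reason a single GPU managed only ~1653 tok/s in the v2 era was not communication but that
**single-sequence forward launched every op separately**, leaving the GPU mostly idle (util 0-15%). T10's
fixes:
- **flatten linears**: merge the per-sequence `[B][S,dim]` matmuls into one large GEMM, `[B·S, dim] @ W`.
- **fused batched causal SDPA**: use `cublasStridedBatched` for QKᵀ and softmax·V, so the whole attention
takes **3 launches** (instead of per-seq, per-op).
- **RoPE per-seq**: pos = `row % S` (after flattening the batch, the position is the row index within its
sequence).
Results (docs/09-batched-forward.md): **single GPU 1653→25627 tok/s (15.5×), batch32 ~40K tok/s (24×),
util 0-15%→37-54%**. All gates are green: 15 grad-checks / PyTorch parity at B>1 / **batched ==
single-sequence equivalence** / overfit / DDP consistency / xserv round trip. v3 therefore **trains on a
single GPU**: ~26K tok/s ≈ **7×** v2's 4-GPU DDP (~3.6K tok/s), without triggering KI-5 (unbucketed DDP
all-reduce, which only needs fixing on a return to multi-GPU).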
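
To make the flatten-linears and `row % S` ideas concrete, here is a small pure-Python sketch. It is only a CPU toy under simplified assumptions (tiny shapes, list-of-lists matrices, helper names invented for this note); the real T10 code runs as GPU kernels and is not reproduced here. It shows that one matmul over the flattened `[B·S, dim]` matrix gives the same result as B separate per-sequence matmuls.

```python
def matmul(x, w):
    """Plain row-by-column matmul: x is [rows, d_in], w is [d_in, d_out]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def per_sequence_linear(batch, w):
    """One matmul per sequence (B separate calls)."""
    return [matmul(seq, w) for seq in batch]


def flattened_linear(batch, w):
    """Stack all sequences into [B*S, d_in] and do a single matmul."""
    seq_len = len(batch[0])
    flat = [row for seq in batch for row in seq]   # [B*S, d_in]
    out = matmul(flat, w)                          # one large GEMM
    # Reshape back to [B][S, d_out] only to compare with the per-sequence path.
    return [out[i:i + seq_len] for i in range(0, len(out), seq_len)]


def positions(batch_size, seq_len):
    """Position of each flattened row within its own sequence: row % S."""
    return [row % seq_len for row in range(batch_size * seq_len)]


B, S, D_IN, D_OUT = 2, 3, 4, 2
batch = [[[float(b * 100 + s * 10 + d) for d in range(D_IN)] for s in range(S)]
         for b in range(B)]
w = [[float(i - j) for j in range(D_OUT)] for i in range(D_IN)]

assert per_sequence_linear(batch, w) == flattened_linear(batch, w)
print(positions(B, S))  # [0, 1, 2, 0, 1, 2]
```

A linear layer treats every row independently, so stacking rows changes only how many kernel launches are needed, not the math. Attention and RoPE are the parts that must still respect sequence boundaries, hence the per-sequence position index.
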
## Hyperparameters
| Item | Value | Notes |
|----|----|----|
| optimizer | hand-written AdamW (step runs on GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1/v2) |
| warmup | **1500 steps** (steps/20) | |
| grad clip | global-norm 1.0 | gnorm stays at ~0.35-0.5 throughout, stable |
| steps | **30000** | ~2.65 hours on a single GPU |
| batch | **32** | **single-GPU batched** (T10): one forward consumes all 32 sequences, rather than summing over multiple forwards |
| seq_len | **256** | same as v2 |
| tokens/step | 32×256 = 8192 | total training tokens ≈ **245.8M** (~0.53 epoch) |
| world size | **1** (single RTX 5090, sm_120) | avoids KI-5 |
| precision | f32 (training) | converted to BF16 when exporting to xserv (see T9) |
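
For readers who want to see the shape of the schedule, the following Python sketch implements a generic linear-warmup plus cosine-decay curve using the values from the table. The exact conventions used by the trainer (for example whether warmup starts from 0 or from `max_lr / warmup`, or whether the clip uses an epsilon) are not stated here, so the choices below are assumptions and the function names are illustrative.

```python
import math

MAX_LR = 6e-4
MIN_LR = 6e-5
WARMUP_STEPS = 1500
TOTAL_STEPS = 30000


def lr_at(step):
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        # Assumed convention: ramp from MAX_LR / WARMUP_STEPS up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def clip_scale(grad_norm, max_norm=1.0):
    """Global-norm clipping: the factor applied to every gradient."""
    return min(1.0, max_norm / (grad_norm + 1e-6))


for step in (0, 1499, 1500, 15750, 29999):
    print(step, f"{lr_at(step):.3e}")

print(clip_scale(0.5), clip_scale(2.0))
```

Under this definition, a gradient norm in the reported ~0.35-0.5 range gives a clip factor of 1.0, so a 1.0 threshold would act as a safety net rather than an active constraint.
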
**Compute**: one RTX 5090 on dash5, ~26,000 tok/s throughout (~28K at startup, ~26K steady state),
wall-clock ≈ **2.65 hours**.
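
The token budget and wall-clock figures follow from the hyperparameters by simple arithmetic. The sketch below assumes a flat 26,000 tok/s, which is only the approximate steady-state rate, so the time estimate is a ballpark figure.

```python
BATCH = 32
SEQ_LEN = 256
STEPS = 30000
TRAIN_TOKENS_IN_CORPUS = 467_260_367  # corpus minus the 1M-token val tail
THROUGHPUT_TOK_S = 26_000             # approximate steady-state rate

tokens_per_step = BATCH * SEQ_LEN
total_tokens = tokens_per_step * STEPS
epoch_fraction = total_tokens / TRAIN_TOKENS_IN_CORPUS
hours = total_tokens / THROUGHPUT_TOK_S / 3600

print(tokens_per_step, total_tokens)  # 8192 245760000
print(f"epoch fraction ~{epoch_fraction:.2f}, wall-clock ~{hours:.2f} h")
```

The estimate comes out at about 2.6 hours, close to the reported ≈2.65 h, with the remainder plausibly spent on periodic validation and startup.
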
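## Results
- **train loss**: start **10.9118** → end **~1.40** (final batch 1.3993; steady decline throughout)
- **best / final val loss** (held-out 1M tokens, step 29999): **1.3027**, **beating the ~1.4 target**
- val loss curve (sampled every 3000 steps; monotonically decreasing, still falling at the final step, **no overfitting**):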
| step | 999 | 2999 | 5999 | 8999 | 11999 | 14999 | 17999 | 20999 | 23999 | 26999 | 29999 |
|------|-----|------|------|------|-------|-------|-------|-------|-------|-------|-------|
| val | 2.5205 | 1.8738 | 1.6878 | 1.5757 | 1.5080 | 1.4594 | 1.4077 | 1.3688 | 1.3389 | 1.3163 | **1.3027** |
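Val falls all the way to the final step with no rebound, so the model is still **underfit**; more
steps/data or a larger model should push it lower (the v4 levers).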
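
The "monotone, still falling" reading of the curve can be verified directly from the table. A short Python check, with the values copied from the rows above:

```python
val_curve = [
    (999, 2.5205), (2999, 1.8738), (5999, 1.6878), (8999, 1.5757),
    (11999, 1.5080), (14999, 1.4594), (17999, 1.4077), (20999, 1.3688),
    (23999, 1.3389), (26999, 1.3163), (29999, 1.3027),
]

losses = [loss for _, loss in val_curve]
deltas = [round(b - a, 4) for a, b in zip(losses, losses[1:])]

monotone = all(d < 0 for d in deltas)
best_step, best_loss = min(val_curve, key=lambda p: p[1])

print("monotonically decreasing:", monotone)
print("best:", best_step, best_loss)
print("last two improvements:", -deltas[-2], -deltas[-1])
```

The per-interval gain is shrinking (0.0226, then 0.0136 over the last two 3000-step intervals) but has not reached zero, which is consistent with the best value landing on the final step.
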
### Samples (greedy, sampled directly from xtrain, same prompts)
```
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the park. One day, she saw a big, scary dog. The dog barked
loudly and scared her. She ran
[The little] → The little girl was so excited. She wanted to try it out. She asked her mom
if she could go outside and play. Her mom said yes, so the little girl went
outside. The little girl
[One day] → One day, a little girl named Lily went to the park with her mom. They saw a
big tree with a swing. Lily wanted to play on the swing, but she was scared.
Her mom said,
```
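Sampling at temperature 0.8 is equally coherent (multiple characters and complete plots, e.g. Lily crying
after breaking a cushion, or discovering a new world in a book); see `RUN.md` / the training logs.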
## Improvement over v2
**best val loss (the best held-out 1M-token value reported by each version's own training run; same held-out set)**
| Model | core params | training tokens | **best val loss** | Notes |
|------|-----------|-----------|-------------------|------|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full dataset + dim256/8L, single GPU |
| v2 | 28.32M (×3.37) | ~36.9M (×7.2) | 1.7055 | dim384/12L + DDP; val 0.88 lower than v1 |
| v3 | 67.13M (**×2.37** vs v2) | ~245.8M (**×6.7** vs v2) | **1.3027** | dim512/16L + single-GPU batched; val **0.40** lower than v2 |
**Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30**. Every rung decreases monotonically on the same
1M-token held-out set.
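
The step-to-step ratios in the table can be recomputed from its rounded entries. The sketch below uses only the rounded values shown above, so the last digit of a ratio may differ slightly from one computed on exact parameter counts.

```python
ladder = {
    # name: (core params, training tokens, best val loss), values from the table
    "v1": (8.39e6, 5.1e6, 2.5847),
    "v2": (28.32e6, 36.9e6, 1.7055),
    "v3": (67.13e6, 245.8e6, 1.3027),
}

names = list(ladder)
for prev, cur in zip(names, names[1:]):
    p_core, p_tok, p_val = ladder[prev]
    c_core, c_tok, c_val = ladder[cur]
    print(
        f"{prev}->{cur}: core x{c_core / p_core:.1f}, "
        f"tokens x{c_tok / p_tok:.1f}, val -{p_val - c_val:.2f}"
    )
```

Each rung multiplies both capacity and data, so the val improvements here cannot be attributed to either axis on its own.
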
### Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts)
| prompt | v2 | v3 |
|--------|----|----|
| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big** | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran** |
| `One day` | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. **The little girl was scared and ran away.** | One day, **a little girl named Lily went to the park with her mom. They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said,** |
| `The little` | The little girl was so happy and she thanked the man for his help. **She said goodbye and went home with a smile on her face.** | The little girl was so **excited. She wanted to try it out. She asked her mom if she could go outside and play. Her mom said yes, so the little girl went outside.** |
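**Conclusion**: v2 (28.32M core / 37M tokens) can already write multi-step plots, but its story beats are
formulaic and it wraps up quickly. Given the **same openings**, v3 (67.13M core / 245.8M tokens) develops
more specific plots with more internal causality (sees a dog → dog barks → gets scared → runs away; wants
to use the swing → is scared → mom speaks up), with more coherent character motivation and turns, and
denser stories overall. **Best val 1.71→1.30 (0.40 lower), plus samples moving from "multi-step plots" to
"continuous narrative with motivation and turns"**: v3 is a clear, quantifiable improvement over v2.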
## xserv validation
Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16; see T9
`docs/08`). **179 tensors** = 16 layers × 11 + embed + norm + lm_head. After saving to the registry, load
with `xserv-cli` and generate greedily:
```
$ xserv-cli ~/projects/tiny-models/v3-tinystories-dim512 --max-tokens 40
Model: qwen3, layers=16, hidden=512, heads=16/16 kv, vocab=50257
Loaded 179 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran
xserv> The little girl was so excited. She ran to the kitchen and grabbed a spoon. She started
to stir the soup. She stirred and stirred until it was all mixed together. ...
xserv> One day, a little girl named Lily went to the park with her mom. They saw a big tree with
a swing. Lily wanted to play on the swing, but she was scared. Her mom said,
```
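
The tensor count reported in the transcript can be reproduced by enumerating names. The sketch below assumes the usual Hugging Face Qwen3 checkpoint naming; the project's own mapping lives in T9 `docs/08` and is not shown here, so treat the names as an assumption. The `transpose` helper is a pure-Python illustration of the [in,out]→[out,in] layout change.

```python
def hf_qwen3_tensor_names(n_layers):
    """Tensor names in the Hugging Face Qwen3 layout (assumed for this sketch)."""
    per_layer = [
        "input_layernorm.weight",
        "self_attn.q_proj.weight",
        "self_attn.k_proj.weight",
        "self_attn.v_proj.weight",
        "self_attn.o_proj.weight",
        "self_attn.q_norm.weight",
        "self_attn.k_norm.weight",
        "post_attention_layernorm.weight",
        "mlp.gate_proj.weight",
        "mlp.up_proj.weight",
        "mlp.down_proj.weight",
    ]
    names = ["model.embed_tokens.weight"]
    for i in range(n_layers):
        names += [f"model.layers.{i}.{suffix}" for suffix in per_layer]
    names += ["model.norm.weight", "lm_head.weight"]
    return names


def transpose(w):
    """[in, out] -> [out, in], the layout change applied to 2D weights on export."""
    return [list(col) for col in zip(*w)]


names = hf_qwen3_tensor_names(16)
print(len(names))                          # 179 = 16 * 11 + 3
print(transpose([[1, 2, 3], [4, 5, 6]]))   # [[1, 4], [2, 5], [3, 6]]
```

Counting tensors is a cheap sanity check after export: a missing QK-norm or a tied lm_head would change the total.
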
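**token-match**: comparing xserv (BF16) against xtrain's own greedy output (F32), **2 of 3 prompts match
token for token** ("Once upon a time", "One day"). The third ("The little") diverges after "so excited."
(xtrain continues "She wanted to try it out…", xserv continues "She ran to the kitchen…"): a tiny
difference in a single logit flips the greedy choice and the sequences diverge from there, the same source
as the ~0.5% BF16 drift observed in v1/v2. The round trip still holds at v3 scale (67M core): most prompts
match token for token, and a few diverge late because of BF16.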
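
The mechanism behind a greedy divergence can be shown with a toy example. In the real pipeline the drift comes from BF16 weights propagating through the whole network; the sketch below simplifies this by rounding two made-up logits directly to BF16, just to show how a narrow F32 margin can vanish. All values and helper names are illustrative.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def greedy(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def first_divergence(a, b):
    """Index of the first differing token, or None if the common prefix matches."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


# Toy logits: token 1 wins narrowly in F32, but the gap vanishes in BF16.
logits_f32 = [10.01, 10.02, 3.0]
logits_bf16 = [to_bf16(v) for v in logits_f32]

print(greedy(logits_f32), greedy(logits_bf16))        # 1 0
print(first_divergence([5, 7, 9, 2], [5, 7, 4, 2]))   # 2
```

Once a single greedy choice flips, every later token is conditioned on a different prefix, so the two continuations are not expected to rejoin. A token-match check therefore reports the first divergence index rather than a per-token accuracy.
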
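## v4 proposal
v3's val curve falls monotonically to the final step (no overfitting), so the model is still **underfit**;
a larger model or more tokens should lower it further. Suggested v4:
- **Model**: dim 640-768 / 20-24 layers / ffn 2560-3072 → core ≈ **130-200M** (×2-3 capacity). The
vocabulary is unchanged, so embed+lm_head is ~77M at dim768.
- **Data/steps**: raise training tokens from 245.8M to ~600M-1B (which starts to enter multi-epoch
territory on TinyStories, or mixes in broader high-quality corpora per the data ladder), targeting val
~1.0-1.1.
- **Open levers (enable as needed)**:
  - **KI-5 (unbucketed DDP all-reduce)**: if v4 goes back to multi-GPU, implement bucketed / overlapped
  all-reduce first; otherwise a single all-reduce over all parameters of a larger model will eat into
  scaling again. v3 deliberately avoided it by staying single-GPU.
  - **KI-2/KI-3 (bf16 with fp32 master weights / activation recomputation)**: as the model grows, memory
  and compute pressure rise, and bf16 mixed precision plus recomputation start to pay off noticeably
  (deferred at the tiny v0-v3 scale; rationale in docs/06).
  - **KI-4 (large-vocab share)**: at dim512, embed/lm_head still take 51.46M / 118.59M ≈ 43%; growing the
  core further dilutes the share, but for better efficiency a smaller / better-fitting tokenizer is worth
  considering.
  - **Data ladder**: once the core reaches ~100M+, TinyStories approaches its capacity ceiling, so v4 is
  the right point to start **broadening the corpus** (TinyStories plus some general high-quality text)
  and to evaluate the tokenizer at the same time.

The ladder is already parameterized: v4 only needs different `--dim/--heads/--layers/--ffn/--steps` flags;
for multi-GPU, layer DDP on top (KI-5 must be fixed first).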
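
To size the proposal, the vocabulary-table cost and the step counts for the suggested token budgets can be estimated as follows. This sketch assumes v4 keeps v3's 8192 tokens per step and the same TinyStories train split, neither of which is decided above; the helper names are illustrative.

```python
VOCAB = 50257
TOKENS_PER_STEP = 32 * 256            # v3 setting; v4 may choose differently
TRAIN_TOKENS_IN_CORPUS = 467_260_367  # TinyStories train split used so far


def embed_and_head_params(vocab, dim):
    """Untied embedding + lm_head."""
    return 2 * vocab * dim


def steps_for(token_budget, tokens_per_step=TOKENS_PER_STEP):
    """Optimizer steps needed to consume a token budget (rounded up)."""
    return -(-token_budget // tokens_per_step)


for dim in (640, 768):
    print(f"dim{dim}: embed+lm_head = {embed_and_head_params(VOCAB, dim):,}")

for budget in (600_000_000, 1_000_000_000):
    epochs = budget / TRAIN_TOKENS_IN_CORPUS
    print(f"{budget:,} tokens: {steps_for(budget):,} steps, ~{epochs:.2f} epochs")
```

Both budgets exceed one pass over the current train split (roughly 1.3 and 2.1 epochs), which is the arithmetic behind the "multi-epoch territory" remark.
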