docs: run v3 — TinyStories, dim512, val 1.30
Per-run design doc docs/runs/03-v3-tinystories-dim512.md:

- data: 245.8M tok, full TinyStories, ~0.53 epoch
- arch: dim512 16L, core 67.13M vs total 118.59M, what changed vs v2
- hyperparams: 30000 steps, batch 32, seq 256, lr 6e-4→6e-5, warmup 1500 + cosine, clip 1.0, single-GPU batched via T10
- results: train 10.91→1.40, best val 1.3027, ~26K tok/s
- improvement vs v2: 1.71→1.30 with side-by-side samples

Notes v3 validated T10 batched forward at scale and avoided KI-5 by staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
198
docs/runs/03-v3-tinystories-dim512.md
Normal file
@@ -0,0 +1,198 @@
# Scaling Run v3: TinyStories + dim512/16L + single-GPU batched (T10) - Design Document

## Goal

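v3 builds on v2 (dim384/12L, 28.32M core, ~37M training tokens, 4-GPU DDP) by scaling further along two axes, **model and data**, and moves training back to a **single GPU**. Going single-GPU is not a regression here; it is the right choice now that **T10 batched forward** has landed:

1. **Scale the model**: dim 384→512, layers 12→16, heads 12→16, bringing the **transformer core to ~67M parameters** (×2.37 capacity). The vocabulary is unchanged, so with the gpt2 50257 vocab at dim512, embed + lm_head adds a fixed ~51.46M, which is reported separately.
2. **Scale the data**: v2 trained on only ~37M tokens (still underfit, with val falling through the last step); v3 trains on **245.8M tokens** (×6.7), still reusing the full TinyStories token-id stream cached in v1 (468M tokens): **~0.53 epoch, no repetition**.
3. **Single-GPU batched (T10, sidestepping KI-5)**: T10 disproved and fixed the presumed root cause of KI-1 (weak DDP scaling) exposed by v2. The real bottleneck was not all-reduce but the **per-op launch overhead of single-sequence forward** (GPU util 0-15%). T10 rewrites forward as `[B·S, dim]` flattened linears + fused batched causal SDPA (cuBLAS strided-batched, only 3 launches for attention), raising single-GPU throughput from **1653 to 25627 tok/s (15.5×)** and util to 37-54%. v3 is therefore **fast enough on one GPU** and sidesteps KI-5 (DDP all-reduce is not yet bucketed), which only matters on multiple GPUs.
4. After training, store the model in the registry (`~/projects/tiny-models/v3-tinystories-dim512/`) and export it to xserv format to verify it is servable, then report the **concrete improvement over v2** (val loss on the same held-out set + side-by-side samples from the same prompts).

> **Engineering significance of this run**: it validates T10's batched forward at real scaling-run size (67M core / 245.8M tokens / 2.65 h). Batched forward is the throughput foundation (single-GPU ~26K tok/s ≈ 7× the 4-GPU DDP of the KI-1 era) and stays numerically correct (batched == single-sequence equivalence, all grad-checks green, xserv loop closed). v3 runs on a **single GPU** throughout, so KI-5 is never triggered.
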
## Data

| Item | v2 | v3 |
|----|----|----|
| Source | TinyStories **full train split** (reusing the v1 cache) | same |
| Corpus tokens | 468,260,367 | same |
| **Tokens consumed in training** | ~36.9M (4500 steps × 8192) | **~245.8M** (30000 steps × 8192) |
| Epoch fraction | ~0.08 | **~0.53** (still <1 epoch, no repetition) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936 MB) | **reused as-is** |
| held-out val | last 1,000,000 tokens of the full stream | **same 1M tokens** (identical held-out set to v0/v1/v2, for a fair comparison) |

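**Cache reuse**: `Corpus::load_cached` reads `<corpus>.u16.bin` and loads 467.26M train tokens at startup (the last 1M are held out for val). The held-out val set is still the last 1M tokens of the full stream (`split_tail`), the same set as v0/v1/v2, so **val loss is directly comparable across v0–v3**.

**Data ladder**: v3 still feeds TinyStories (~0.53 epoch, not exhausted, and the model is still in tiny-LM range). The core has reached 67M, which leaves headroom before TinyStories becomes the capacity ceiling (expected at ~100M+ core), so v3 keeps the corpus. Once the core grows further, move up the data ladder to broader high-quality corpora.
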
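The token and epoch figures in this section follow from simple arithmetic. A minimal sketch that reproduces them, assuming the epoch fraction is taken against the train stream after the 1M-token held-out tail is removed (`token_budget` is a hypothetical helper for illustration, not part of xtrain):

```python
def token_budget(steps: int, batch: int, seq_len: int, train_tokens: int):
    """Return (tokens per step, total tokens consumed, fraction of one epoch)."""
    tokens_per_step = batch * seq_len
    total = steps * tokens_per_step
    return tokens_per_step, total, total / train_tokens

# Corpus size from the table, minus the 1M-token held-out tail.
TRAIN_TOKENS = 468_260_367 - 1_000_000

v2 = token_budget(steps=4500, batch=32, seq_len=256, train_tokens=TRAIN_TOKENS)
v3 = token_budget(steps=30000, batch=32, seq_len=256, train_tokens=TRAIN_TOKENS)

print(f"v2: {v2[1] / 1e6:.1f}M tokens, {v2[2]:.2f} epoch")
print(f"v3: {v3[1] / 1e6:.1f}M tokens, {v3[2]:.2f} epoch")
print(f"v3 / v2 token ratio: {v3[1] / v2[1]:.1f}x")
```

The exact v3 count is 245,760,000 tokens, which the document rounds to 245.8M.
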
## Architecture

v3 is a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + untied lm_head, MHA). The forward graph is isomorphic to v0/v1/v2; only the dims grow. **No structural changes.**

| Dimension | v2 | v3 |
|------|----|----|
| dim (= heads·head_dim) | 384 | **512** |
| n_layers | 12 | **16** |
| n_heads | 12 | **16** |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 1536 | **2048** |
| vocab | 50257 | 50257 |
| **core params** (excluding embed + lm_head) | 28,322,304 (≈28.32M) | **67,127,296 (≈67.13M, ×2.37)** |
| embed + lm_head (2×vocab×dim) | 38,597,376 (≈38.60M) | 51,463,168 (≈51.46M) |
| **total params** | 66,919,680 (≈66.92M) | **118,590,464 (≈118.59M)** |

**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. With the gpt2 50257 vocab at dim512, embedding + lm_head take a fixed ~51.46M. These two tables are a function of **vocabulary size**, not model capacity, so the ladder is measured by **core** (v3 core 67.13M). Note that embed/lm_head still account for ~43% of v3's 118.59M total (51.46M); this is the gpt2 large-vocab share problem (see docs/known-issues.md KI-4). The share shrinks as dim grows, but at dim512 it is still over two-fifths of all parameters.

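The accounting above can be checked with a few lines. This is a sketch only: the core counts are copied from the table rather than derived from the architecture, and `embed_lm_head_params` is a hypothetical helper name, not the project's `Config` API.

```python
VOCAB = 50257  # gpt2 BPE vocabulary size

def embed_lm_head_params(vocab: int, dim: int) -> int:
    """Embedding table plus untied lm_head: two [vocab, dim] matrices."""
    return 2 * vocab * dim

# Core counts are taken from the table above, not recomputed here.
runs = {
    "v2": {"dim": 384, "core": 28_322_304},
    "v3": {"dim": 512, "core": 67_127_296},
}

for name, run in runs.items():
    tables = embed_lm_head_params(VOCAB, run["dim"])
    total = run["core"] + tables
    run["tables"], run["total"] = tables, total
    print(f"{name}: embed+lm_head={tables:,} total={total:,} "
          f"share={tables / total:.1%}")
```

Running it shows the share of the vocabulary tables falling from v2 to v3, which is the dilution effect described for KI-4.
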
**Architecture changes vs v2**: pure scale-up (dim 384→512 / layers 12→16 / heads 12→16 / ffn 1536→2048), no structural changes. The ladder is already parameterized, so v3 only changes the `--dim/--heads/--layers/--ffn/--steps` flags and leaves the model code untouched.

## Trainer: single-GPU batched (T10)

v2 used 4-GPU DDP (T8), and its poor scaling was attributed to KI-1: with global_batch=32 too small, all-reduce took too large a share of each step. The T10 investigation found that **the premise of KI-1 was false**: the real reason a single GPU only reached ~1653 tok/s in the v2 era was not communication but that **single-sequence forward launched every op separately** (GPU mostly idle, util 0-15%). T10's fixes:

- **flatten linears**: merge the per-sequence matmuls over `[B][S,dim]` into one large GEMM, `[B·S, dim] @ W`.
- **fused batched causal SDPA**: use `cublasStridedBatched` for QKᵀ and softmax·V, so the whole attention takes **3 launches** (instead of per-seq, per-op launches).
- **RoPE per-seq**: pos = `row % S` (after flattening the batch, each row's position is its row index within its own sequence).

Results (docs/09-batched-forward.md): **single-GPU 1653→25627 tok/s (15.5×), ~40K tok/s at batch 32 (24×), util 0-15%→37-54%**. All gates are green (15 grad-checks / PyTorch B>1 cross-check / **batched == single-sequence equivalence** / overfit / DDP consistency / xserv loop). v3 therefore **trains on a single GPU**: ~26K tok/s ≈ **7×** v2's 4-GPU DDP (~3.6K tok/s), without triggering KI-5 (DDP all-reduce not bucketed, which only matters when going back to multi-GPU).

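The flatten-linears and `row % S` ideas can be illustrated on the CPU. The sketch below is pure Python with tiny made-up sizes and a hypothetical `matmul` helper; it shows only the indexing equivalence and does not reflect the actual CUDA/cuBLAS implementation in T10.

```python
import random

def matmul(x, w):
    """Plain row-major matmul: x is [rows, k], w is [k, n]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

B, S, DIM, OUT = 3, 4, 5, 2  # tiny illustrative sizes, not the v3 config
rng = random.Random(0)
batch = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(S)] for _ in range(B)]
w = [[rng.uniform(-1, 1) for _ in range(OUT)] for _ in range(DIM)]

# Per-sequence path: B separate [S, DIM] @ W products (one launch each on a GPU).
per_seq = [matmul(seq, w) for seq in batch]

# Flattened path: stack all sequences into [B*S, DIM] and do a single product.
flat = [row for seq in batch for row in seq]
flat_out = matmul(flat, w)

# Row r of the flattened output belongs to sequence r // S at position r % S.
positions = [r % S for r in range(B * S)]
unflattened = [flat_out[b * S:(b + 1) * S] for b in range(B)]

print(positions)
print(unflattened == per_seq)
```

Because each output row depends only on its own input row, the flattened product gives the same values as the per-sequence products; the position index recovered by `r % S` is what a per-sequence RoPE needs after flattening.
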
## Hyperparameters

| Item | Value | Notes |
|----|----|----|
| optimizer | hand-written AdamW (step runs on the GPU) | wd=0.1, β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1/v2) |
| warmup | **1500 steps** (steps/20) | |
| grad clip | global-norm 1.0 | gnorm stays at ~0.35–0.5 throughout, stable |
| steps | **30000** | ~2.65 hours on a single GPU |
| batch | **32** | **single-GPU batched** (T10: one forward takes 32 sequences, not several single-sequence passes summed) |
| seq_len | **256** | same as v2 |
| tokens/step | 32×256 = 8192 | total training tokens ≈ **245.8M** (~0.53 epoch) |
| world size | **1** (single RTX 5090, sm_120) | sidesteps KI-5 |
| Precision | f32 (training) | converted to BF16 on xserv export (see T9) |

**Compute**: single RTX 5090 on dash5, ~26,000 tok/s throughout (~28K at startup, ~26K steady state), wall-clock ≈ **2.65 hours**.

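For reference, a linear-warmup plus cosine-decay schedule with the values from the table can be sketched as follows. The exact formula used by xtrain is not given in this document, so the shape below (ramp from 0, peak exactly at the end of warmup, cosine over the remaining steps) is an assumption, and `lr_at` is a hypothetical name.

```python
import math

def lr_at(step: int, total_steps: int = 30000, warmup: int = 1500,
          max_lr: float = 6e-4, min_lr: float = 6e-5) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr (assumed form)."""
    if step < warmup:
        # Assumption: the ramp starts at 0 and reaches max_lr at step == warmup.
        return max_lr * step / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / (total_steps - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 750, 1500, 15750, 30000):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 6e-4 at step 1500 and lands on 6e-5 at the final step; a real implementation may differ by one step in where the peak falls.
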
## Results

- **train loss**: start **10.9118** → end **~1.40** (last batch 1.3993; steady decline throughout)
- **best / final val loss (held-out 1M tokens, step 29999)**: **1.3027** (**beats the ~1.4 target**)
- val loss curve (sampled at step 999 and then every 3000 steps; monotonically decreasing, still falling at the last step, **no overfitting**):

| step | 999 | 2999 | 5999 | 8999 | 11999 | 14999 | 17999 | 20999 | 23999 | 26999 | 29999 |
|------|-----|------|------|------|-------|-------|-------|-------|-------|-------|-------|
| val | 2.5205 | 1.8738 | 1.6878 | 1.5757 | 1.5080 | 1.4594 | 1.4077 | 1.3688 | 1.3389 | 1.3163 | **1.3027** |

Val falls through the last step with no rebound, so the model is still **underfit**: more steps/data (or a larger model) can push it lower (the v4 levers).

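The "monotonic, still falling at the last step" reading can be checked directly from the table. A small sketch using only the values listed above:

```python
steps = [999, 2999, 5999, 8999, 11999, 14999, 17999, 20999, 23999, 26999, 29999]
val = [2.5205, 1.8738, 1.6878, 1.5757, 1.5080, 1.4594,
       1.4077, 1.3688, 1.3389, 1.3163, 1.3027]

# Drop in val loss between consecutive checkpoints (positive = still improving).
drops = [round(a - b, 4) for a, b in zip(val, val[1:])]

monotonic = all(d > 0 for d in drops)
best_step = steps[val.index(min(val))]

for s, d in zip(steps[1:], drops):
    print(f"step {s:>5}: -{d:.4f}")
print("monotonic:", monotonic, "| best step:", best_step)
```

Every interval shows a positive drop and the minimum sits at the final checkpoint, which is what supports the underfit conclusion.
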
### Samples (greedy, sampled directly from xtrain, same prompts)

```
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the park. One day, she saw a big, scary dog. The dog barked
loudly and scared her. She ran
[The little] → The little girl was so excited. She wanted to try it out. She asked her mom
if she could go outside and play. Her mom said yes, so the little girl went
outside. The little girl
[One day] → One day, a little girl named Lily went to the park with her mom. They saw a
big tree with a swing. Lily wanted to play on the swing, but she was scared.
Her mom said,
```

Sampling at temperature 0.8 is equally coherent (multiple characters, complete plots, e.g. Lily crying after breaking a cushion, or discovering a new world in a book); see `RUN.md` / the training log.

## Improvement over v2

**best val loss (best value on the held-out 1M tokens as reported by each version's own training run; same held-out set)**:

| Model | core params | training tokens | **best val loss** | Notes |
|------|-----------|-----------|-------------------|------|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3 MB slice, samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full dataset + dim256/8L, single GPU |
| v2 | 28.32M (×3.37) | ~36.9M (×7.2) | 1.7055 | dim384/12L + DDP, val 0.88 lower than v1 |
| v3 | 67.13M (**×2.37** vs v2) | ~245.8M (**×6.7** vs v2) | **1.3027** | dim512/16L + single-GPU batched, val **0.40** lower than v2 |

**Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30.** Each rung is lower than the previous one on the same 1M-token held-out set.

### Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts)

| prompt | v2 | v3 |
|--------|----|----|
| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big** | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran** |
| `One day` | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. **The little girl was scared and ran away.** | One day, **a little girl named Lily went to the park with her mom. They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said,** |
| `The little` | The little girl was so happy and she thanked the man for his help. **She said goodbye and went home with a smile on her face.** | The little girl was so **excited. She wanted to try it out. She asked her mom if she could go outside and play. Her mom said yes, so the little girl went outside.** |

**Conclusion**: v2 (28.32M core / 37M tokens) can already write multi-step plots, but the beats are formulaic and the endings come quickly. v3 (67.13M core / 245.8M tokens), given the **same openings**, develops more specific plots with more internal causality (sees a dog → dog barks → gets scared → runs away; wants to play on the swing → but is scared → mom speaks up), with more coherent character motivation and turns, and higher story density. **best val 1.71→1.30 (0.40 lower) + samples moving from "multi-step plots" to "continuous narrative with motivation and turns"**: v3 is a clear, quantifiable improvement over v2.

## xserv validation

Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 `docs/08`; **179 tensors** = 16 layers × 11 + embed + norm + lm_head), store the result in the registry, then load it with `xserv-cli` and generate greedily:

```
$ xserv-cli ~/projects/tiny-models/v3-tinystories-dim512 --max-tokens 40
Model: qwen3, layers=16, hidden=512, heads=16/16 kv, vocab=50257
Loaded 179 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran
xserv> The little girl was so excited. She ran to the kitchen and grabbed a spoon. She started
to stir the soup. She stirred and stirred until it was all mixed together. ...
xserv> One day, a little girl named Lily went to the park with her mom. They saw a big tree with
a swing. Lily wanted to play on the swing, but she was scared. Her mom said,
```

**token-match**: comparing xserv (BF16) against xtrain's own greedy decoding (F32), **2 of the 3 prompts match token for token** ("Once upon a time", "One day"). The third ("The little") diverges after "so excited." (xtrain continues "She wanted to try it out…", xserv continues "She ran to the kitchen…"). A tiny difference in a single logit flips the greedy choice and the sequences diverge from there, the same source as the ~0.5% BF16 drift observed in v1/v2. The loop still closes at v3 scale (67M core): most prompts match token for token, and a few diverge late because of BF16.

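A token-match check of this kind reduces to finding the first index where two greedy outputs differ. The sketch below uses whitespace-split words in place of real token ids (a simplification for illustration) and a hypothetical `first_divergence` helper; it also reproduces the tensor count stated above.

```python
def first_divergence(a, b):
    """Index of the first position where two sequences differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other, or they are identical.
    return None if len(a) == len(b) else min(len(a), len(b))

# Words stand in for token ids here; the real check compares token ids.
xtrain = "The little girl was so excited. She wanted to try it out.".split()
xserv = "The little girl was so excited. She ran to the kitchen and grabbed a spoon.".split()

i = first_divergence(xtrain, xserv)
print(i, "shared prefix:", " ".join(xtrain[:i]))
print("xtrain:", xtrain[i], "| xserv:", xserv[i])

# Export tensor count: 11 tensors per layer plus embed, final norm, lm_head.
n_layers, per_layer, extras = 16, 11, 3
print(n_layers * per_layer + extras)
```

Once the greedy choice flips at one position, everything after it is conditioned on a different prefix, so only the location of the first mismatch is informative.
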
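## v4 proposal

v3's val curve falls monotonically through the last step (no overfitting), so the model is still **underfit**: a larger model / more tokens can push val lower. Proposed v4:

- **Model**: dim 640–768 / 20–24 layers / ffn 2560–3072 → core ≈ **130–200M** (×2–3 capacity). The vocabulary is unchanged, so embed + lm_head is ~77M at dim768.
- **Data/steps**: raise training tokens from 245.8M to ~600M–1B (entering the multi-epoch regime on TinyStories, or mixing in broader high-quality corpora per the data ladder), targeting val ~1.0–1.1.
- **Open levers (enable as needed)**:
  - **KI-5 (DDP all-reduce not bucketed)**: if v4 goes back to multi-GPU, implement bucketed / overlapped all-reduce first; otherwise a single all-reduce over all parameters of a larger model will eat into scaling again. v3 deliberately sidestepped this by staying single-GPU.
  - **KI-2/KI-3 (bf16 with fp32 master weights / activation recomputation)**: as the model grows, memory and compute pressure rise, and bf16 mixed precision + recomputation start to pay off (deferred at the tiny v0–v3 scale; rationale in docs/06).
  - **KI-4 (large-vocab share)**: at dim512, embed/lm_head still take 51.46M / 118.59M ≈ 43%. Growing the core dilutes the share, but for better efficiency consider a smaller / better-fitted tokenizer.
  - **Data ladder**: once the core reaches ~100M+, TinyStories approaches the capacity ceiling, so v4 is the right point to start **broadening the corpus** (TinyStories + some general high-quality text) and to evaluate the tokenizer alongside.

The ladder is already parameterized: v4 only needs different `--dim/--heads/--layers/--ffn/--steps` flags, with DDP added on top for multi-GPU (after fixing KI-5).
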
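To size the proposal, the sketch below works out the vocabulary-table cost at the proposed dims and the step counts for the proposed token budgets. It assumes v4 keeps v3's batch 32 and seq_len 256 (8192 tokens/step), which the proposal does not state; the helper names are hypothetical.

```python
import math

VOCAB = 50257
TOKENS_PER_STEP = 32 * 256               # assumption: same batch and seq_len as v3
TRAIN_TOKENS = 468_260_367 - 1_000_000   # TinyStories stream minus the held-out tail

def embed_lm_head_params(dim: int) -> int:
    """Embedding table plus untied lm_head: two [vocab, dim] matrices."""
    return 2 * VOCAB * dim

def steps_for(tokens: float) -> int:
    """Smallest number of steps that consumes at least `tokens` tokens."""
    return math.ceil(tokens / TOKENS_PER_STEP)

for dim in (640, 768):
    print(f"dim{dim}: embed+lm_head = {embed_lm_head_params(dim) / 1e6:.2f}M")

for target in (600e6, 1e9):
    print(f"{target / 1e6:.0f}M tokens: {steps_for(target)} steps, "
          f"{target / TRAIN_TOKENS:.2f} epochs of TinyStories")
```

Both token targets exceed one pass over the 467.26M-token train stream, which is why the proposal pairs them with either multi-epoch training or a broader corpus.
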