From a78502e0f04ef3cf58b9789dd8fa3f911948b104 Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Tue, 16 Jun 2026 03:37:45 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20run=20v3=20=E2=80=94=20TinyStories,=20d?=
 =?UTF-8?q?im512,=20val=201.30?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup 1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40 best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder).

Co-Authored-By: Claude Opus 4.8
---
 docs/runs/03-v3-tinystories-dim512.md | 198 ++++++++++++++++++++++++++
 1 file changed, 198 insertions(+)
 create mode 100644 docs/runs/03-v3-tinystories-dim512.md

diff --git a/docs/runs/03-v3-tinystories-dim512.md b/docs/runs/03-v3-tinystories-dim512.md
new file mode 100644
index 0000000..88865f3
--- /dev/null
+++ b/docs/runs/03-v3-tinystories-dim512.md
@@ -0,0 +1,198 @@
+# Scaling Run v3: TinyStories + dim512/16L + single-GPU batched (T10) - Design Document
+
+## Goal
+
+Building on v2 (dim384/12L, 28.32M core, ~37M training tokens, 4-GPU DDP), v3 scales further along two axes, **model and data**, and moves
+training back to a **single GPU**. Single-GPU is not a step backward here; it is the right choice now that **T10 batched forward** has landed:
+
+1. **Bigger model**: dim 384→512, layers 12→16, heads 12→16, bringing the **transformer core to ~67M parameters** (×2.4 capacity).
+   The vocabulary is unchanged, so embed+lm_head adds a fixed ~51.46M at dim512 (gpt2 vocab of 50257), which is reported separately.
+2. **More data**: v2 trained on only ~37M tokens (still underfit; val kept falling through the last step); v3 trains on **245.8M tokens** (×6.7),
+   still reusing the full TinyStories token-id stream cached in v1 (468M tokens): **~0.53 epoch, no repeats**.
+3. 
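**Single-GPU batched (T10, sidestepping KI-5)**: T10 disproved and fixed the presumed root cause of KI-1 (weak DDP scaling) exposed by v2. The real bottleneck
+   was not all-reduce but **per-op launch overhead in the single-sequence forward** (GPU util 0-15%). T10 rewrites the forward as
+   `[B·S, dim]` flattened linears + fused batched causal SDPA (cuBLAS strided-batched, only 3 launches for attention),
+   taking single-GPU throughput from **1653 to 25627 tok/s (15.5×)** and util up to 37-54%. v3 is therefore **fast enough on one GPU** and sidesteps KI-5,
+   DDP's not-yet-bucketed all-reduce (which only matters on multi-GPU).
+4. After training, save to the registry (`~/projects/tiny-models/v3-tinystories-dim512/`) and export to xserv format to verify the model is servable, then report
+   the **concrete improvement over v2** (val loss on the same held-out set + side-by-side samples on the same prompts).
+
+> **Engineering significance of this run**: it validates T10's batched forward at real scaling size (67M core / 245.8M tokens / 2.65 h).
+> Batched forward is the throughput foundation (single-GPU ~26K tok/s ≈ 7× the 4-GPU DDP of the KI-1 era) and it stays numerically correct (batched ==
+> single-sequence equivalence, all grad-checks green, xserv closed loop holds). v3 runs on a **single GPU** throughout, so KI-5 is never triggered.
+
+## Data
+
+| Item | v2 | v3 |
+|----|----|----|
+| Source | **Full TinyStories train** (reusing the v1 cache) | Same |
+| Tokens (corpus) | 468,260,367 | Same |
+| **Training tokens consumed** | ~36.9M (4500 steps × 8192) | **~245.8M** (30000 steps × 8192) |
+| Fraction of an epoch | ~0.08 | **~0.53** (still <1 epoch, no repeats) |
+| Tokenizer | gpt2 BPE (vocab 50257) | Same |
+| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936MB) | **Reused directly** |
+| Held-out val | Last 1,000,000 tokens of the full corpus | **Same 1M tokens** (identical held-out set to v0/v1/v2, for a fair comparison) |
+
+**Cache reuse**: `Corpus::load_cached` reads the `.u16.bin` file and loads 467.26M train tokens at startup (the last 1M are held out for val).
+Held-out val is still the last 1M tokens of the full corpus (`split_tail`), the same held-out set as v0/v1/v2, so **val loss is directly comparable across v0–v3**.
+
+**Data ladder**: v3 still trains on TinyStories (~0.53 epoch, not yet exhausted, and the model is still in tiny-LM territory). The core is now 67M, which leaves
+headroom before TinyStories becomes the capacity ceiling (expected at ~100M+ core), so v3 keeps the same corpus. Once the core grows further, the data ladder
+moves up to broader, high-quality corpora.
+
+## Architecture
+
+v3 = a larger, isomorphic tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA).
+The forward graph has exactly the same structure as v0/v1/v2; only the dims grow. **No structural changes.**
+
+| Dimension | v2 | v3 |
+|------|----|----|
+| dim (= heads·head_dim) | 384 | **512** |
+| n_layers | 12 | **16** |
+| n_heads | 12 | **16** |
+| head_dim | 32 | 32 |
+| ffn_hidden (SwiGLU) | 1536 | **2048** |
+| vocab | 50257 | 50257 |
+| **core params** (excluding embed+lm_head) | 28,322,304 (≈28.32M) | **67,127,296 (≈67.13M, ×2.37)** |
+| embed + lm_head (2×vocab×dim) | 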
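38,597,376 (≈38.60M) | 51,463,168 (≈51.46M) |
+| **Total params** | 66,919,680 (≈66.92M) | **118,590,464 (≈118.59M)** |
+
+**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. At dim512, the gpt2 vocab of 50257 makes
+embedding + lm_head a fixed ~51.46M. These two tables are a function of **vocabulary size**, not model capacity, so the ladder is measured by **core**
+(v3 core: 67.13M). Note that embed/lm_head still make up ~43% (51.46M) of v3's 118.59M total, which is the large-gpt2-vocab share problem
+(see docs/known-issues.md KI-4). The share shrinks as dim grows, but at dim512 it is still close to half of all parameters.
+
+**Architecture changes vs v2**: pure scale-up (dim 384→512 / layers 12→16 / heads 12→16 / ffn 1536→2048), no structural changes.
+The ladder is already parameterized: v3 only changes the `--dim/--heads/--layers/--ffn/--steps` flags and leaves the model code untouched.
+
+## Trainer: single-GPU batched (T10)
+
+v2 used 4-GPU DDP (T8), whose scaling was held back by KI-1 (all-reduce share too high) because global_batch=32 was too small. The T10 investigation found that
+**the premise of KI-1 was wrong**: the real reason a single GPU only reached ~1653 tok/s in the v2 era was not communication, but that **each op in the
+single-sequence forward launched separately** (the GPU sat mostly idle, util 0-15%). T10's fix:
+
+- **flatten linears**: merge the per-sequence `[B][S,dim]` matmuls into one large GEMM, `[B·S, dim] @ W`.
+- **fused batched causal SDPA**: use `cublasStridedBatched` for QKᵀ / softmax·V, so the whole attention takes **3
+  launches** (rather than per-seq, per-op).
+- **RoPE per-seq**: pos = `row % S` (after flattening the batch, the position is the row index within each sequence).
+
+Effect (docs/09-batched-forward.md): **single GPU 1653→25627 tok/s (15.5×), ~40K tok/s at batch32 (24×),
+util 0-15%→37-54%**. All gates are green (15 grad-checks / PyTorch B>1 cross-check / **batched == single-sequence equivalence** / overfit /
+DDP consistency / xserv closed loop). v3 therefore **trains on a single GPU**: ~26K tok/s ≈ **7×** v2's 4-GPU DDP (~3.6K tok/s), without triggering
+KI-5 (unbucketed DDP all-reduce, which only matters when going back to multi-GPU).
+
+## Hyperparameters
+
+| Item | Value | Notes |
+|----|----|----|
+| optimizer | hand-written AdamW (step runs on GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
+| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1/v2) |
+| warmup | **1500 steps** (steps/20) | |
+| grad clip | global-norm 1.0 | gnorm stays at ~0.35–0.5 throughout, stable |
+| steps | **30000** | ~2.65 hours on a single GPU |
+| batch | **32** | **single-GPU batched** (T10: one forward consumes 32 sequences, not several passes summed) |
+| seq_len | **256** | same as v2 |
+| tokens/step | 32×256 = 8192 | total training tokens ≈ **245.8M** (~0.53 epoch) |
+| world size | **1** (single RTX 5090, sm_120) | avoids KI-5 |
+| precision | f32 (training) | converted to BF16 on xserv export (see T9) |
+
+**Compute**: dash5, single RTX 5090, sustained ~26,000 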
tok/s (~28K at startup, ~26K at steady state), wall-clock ≈ **2.65 hours**.
+
+## Results
+
+- **train loss**: start **10.9118** → end **~1.40** (last batch 1.3993; steady decline throughout)
+- **best / final val loss (held-out 1M tokens, step 29999)**: **1.3027** (**beats the ~1.4 target**)
+- val loss curve (sampled roughly every 3000 steps; monotonically decreasing, still falling at the last step, **no overfitting**):
+
+| step | 999 | 2999 | 5999 | 8999 | 11999 | 14999 | 17999 | 20999 | 23999 | 26999 | 29999 |
+|------|-----|------|------|------|-------|-------|-------|-------|-------|-------|-------|
+| val | 2.5205 | 1.8738 | 1.6878 | 1.5757 | 1.5080 | 1.4594 | 1.4077 | 1.3688 | 1.3389 | 1.3163 | **1.3027** |
+
+Val keeps falling through the last step with no rebound, so the model is still **underfit**: more steps/data (or a larger model) would lower it further (levers for v4).
+
+### Samples (greedy, sampled directly from xtrain, same prompts)
+
+```
+[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
+                     outside in the park. One day, she saw a big, scary dog. The dog barked
+                     loudly and scared her. She ran
+[The little]       → The little girl was so excited. She wanted to try it out. She asked her mom
+                     if she could go outside and play. Her mom said yes, so the little girl went
+                     outside. The little girl
+[One day]          → One day, a little girl named Lily went to the park with her mom. They saw a
+                     big tree with a swing. Lily wanted to play on the swing, but she was scared.
+                     Her mom said,
+```
+
+Sampling at temperature 0.8 is equally coherent (multiple characters and complete plots, e.g. Lily crying after breaking a cushion, or discovering a new world in a book); see `RUN.md` /
+the training log.
+
+## Improvement over v2
+
+**best val loss (the best held-out 1M-token value reported by each version's own training run, same held-out set)**:
+
+| Model | core params | training tokens | **best val loss** | Notes |
+|------|-----------|-----------|-------------------|------|
+| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
+| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
+| v2 | 28.32M (×3.37) | ~36.9M (×7.2) | 1.7055 | dim384/12L + DDP; val 0.88 lower than v1 |
+| v3 | 67.13M (**×2.37** vs v2) | ~245.8M (**×6.7** vs v2) | **1.3027** | dim512/16L + single-GPU batched; val **0.40** lower than v2 |
+
+**Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30**. Val loss drops at every rung, measured on the same 1M-token held-out set.
+
+### Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts)
+
+| prompt | v2 | v3 |
+|--------|----|----|
+| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big** | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran** |
+| `One day` | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. **The little girl was scared and ran away.** | One day, **a little girl named Lily went to the park with her mom. They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said,** |
+| `The little` | The little girl was so happy and she thanked the man for his help. **She said goodbye and went home with a smile on her face.** | The little girl was so **excited. She wanted to try it out. She asked her mom if she could go outside and play. 
Her mom said yes, so the little girl went outside.** |
+
+**Conclusion**: v2 (28.32M core / 37M tokens) can already write multi-step plots, but its story beats are formulaic and it wraps up quickly. v3 (67.13M core /
+245.8M tokens), given the **same openings**, develops more specific plots with more internal causality (sees a dog → dog barks → gets scared → runs away; wants to play on the swing → but is scared →
+her mom speaks up), with more coherent character motivation and turns, and denser stories. **best val 1.71→1.30 (0.40 lower) + samples going from "multi-step plots"
+to "continuous narrative with motivation and turns"**: v3 is a clear, quantifiable improvement over v2.
+
+## xserv verification
+
+Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 `docs/08`;
+**179 tensors** = 16 layers × 11 + embed + norm + lm_head), save into the registry, then load with `xserv-cli` and generate greedily:
+
+```
+$ xserv-cli ~/projects/tiny-models/v3-tinystories-dim512 --max-tokens 40
+Model: qwen3, layers=16, hidden=512, heads=16/16 kv, vocab=50257
+Loaded 179 tensors
+xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
+       park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran
+xserv> The little girl was so excited. She ran to the kitchen and grabbed a spoon. She started
+       to stir the soup. She stirred and stirred until it was all mixed together. ...
+xserv> One day, a little girl named Lily went to the park with her mom. They saw a big tree with
+       a swing. Lily wanted to play on the swing, but she was scared. Her mom said,
+```
+
+**token-match**: comparing xserv (BF16) against xtrain's own greedy decoding (F32), **2 of the 3 prompts match token for token**
+("Once upon a time", "One day"); for the third ("The little"), after "so excited." 
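the outputs diverge (xtrain continues
+"She wanted to try it out…", xserv continues "She ran to the kitchen…"): a tiny difference in a single logit flips the greedy choice, and the sequences diverge from there.
+This has the same source as the ~0.5% BF16 drift observed in v1/v2. The closed loop still holds at v3 scale (67M core): most prompts match token for token, and a few
+diverge toward the end because of BF16.
+
+## v4 proposal
+
+v3's val curve falls monotonically through the last step (no overfitting), so the model is still **underfit**: a larger model / more tokens would push it lower. Suggested v4:
+
+- **Model**: dim 640–768 / 20–24 layers / ffn 2560–3072 → core ≈ **130–200M** (×2–3 capacity). With the vocabulary unchanged,
+  embed+lm_head is ~77M at dim768.
+- **Data/steps**: raise training tokens from 245.8M to ~600M–1B (entering TinyStories' multi-epoch regime, or, following the data ladder,
+  mixing in broader high-quality corpora), targeting val ~1.0–1.1.
+- **Open levers (enable as needed)**:
+  - **KI-5 (unbucketed DDP all-reduce)**: if v4 goes back to multi-GPU, implement bucketed / overlapped all-reduce first; otherwise a single
+    all-reduce over all parameters of a large model will eat into scaling again. v3 deliberately avoided this by staying on one GPU.
+  - **KI-2/KI-3 (bf16 with fp32 master / activation recomputation)**: as the model grows, memory and compute pressure rise, and bf16 mixed precision +
+    recomputation start to pay off noticeably (deferred at the tiny v0–v3 scale; rationale in docs/06).
+  - **KI-4 (large-vocab share)**: at dim512, embed/lm_head are still 51.46M / 118.59M ≈ 43%; scaling the core further dilutes
+    the share, but for better efficiency a smaller / better-fitting tokenizer is worth considering.
+  - **Data ladder**: once the core reaches ~100M+, TinyStories approaches its capacity ceiling, so v4 is the right point to start **broadening the corpus** (TinyStories +
+    some general high-quality data) and to evaluate the tokenizer at the same time.
+
+The ladder is already parameterized: v4 only needs new `--dim/--heads/--layers/--ffn/--steps` flags; for multi-GPU, layer DDP on top (after fixing KI-5).
+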
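Reviewer note: the parameter and token accounting quoted throughout this design doc, and the warmup + cosine schedule from its hyperparameter table, can be sanity-checked with a short script. This is a minimal Python sketch, not xtrain code. All function names are hypothetical, and the exact shape of the schedule (linear ramp over the warmup steps, half-cosine afterwards) is an assumption, since the doc only states the endpoints and the warmup length.

```python
import math

VOCAB = 50257                 # gpt2 BPE vocab, from the doc's tables
TRAIN_TOKENS = 467_260_367    # 468,260,367 corpus tokens minus the 1M held-out tail


def embed_lm_head_params(dim, vocab=VOCAB):
    """Parameters in the separate embedding and lm_head tables (2 x vocab x dim)."""
    return 2 * vocab * dim


def tokens_consumed(steps, batch=32, seq_len=256):
    """Training tokens consumed: steps x batch x seq_len."""
    return steps * batch * seq_len


def epoch_fraction(steps, train_tokens=TRAIN_TOKENS):
    """Fraction of the training split seen after `steps` steps."""
    return tokens_consumed(steps) / train_tokens


def lr_at(step, total_steps=30000, warmup=1500, max_lr=6e-4, min_lr=6e-5):
    """Assumed schedule: linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these definitions, `embed_lm_head_params(512)` gives 51,463,168 and `tokens_consumed(30000)` gives 245,760,000, matching the figures in the doc; `embed_lm_head_params(768)` gives 77,194,752, consistent with the ~77M quoted for a dim768 v4.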