From a78502e0f04ef3cf58b9789dd8fa3f911948b104 Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Tue, 16 Jun 2026 03:37:45 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20run=20v3=20=E2=80=94=20TinyStories,=20d?=
 =?UTF-8?q?im512,=20val=201.30?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full
TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what
changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup
1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40
best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side
samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by
staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/runs/03-v3-tinystories-dim512.md | 198 ++++++++++++++++++++++++++
 1 file changed, 198 insertions(+)
 create mode 100644 docs/runs/03-v3-tinystories-dim512.md
diff --git a/docs/runs/03-v3-tinystories-dim512.md b/docs/runs/03-v3-tinystories-dim512.md
new file mode 100644
index 0000000..88865f3
--- /dev/null
+++ b/docs/runs/03-v3-tinystories-dim512.md
@@ -0,0 +1,198 @@
+# Scaling Run v3: TinyStories + dim512/16L + single-GPU batched (T10) - Design Document
+
+## Goal
+
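+v3 builds on v2 (dim384/12L, 28.32M core, ~37M training tokens, 4-GPU DDP) and scales up along two axes, **model and data**,
+while moving training back to a **single GPU**. This is not a step backwards: it is the right choice now that the **T10 batched forward** has landed:
+
+1. **Scale the model**: dim 384→512, layers 12→16, heads 12→16, bringing the **transformer core to ~67M parameters** (capacity ×2.37).
+   The vocabulary is unchanged, so embed + lm_head (gpt2 vocab of 50257) contribute a fixed ~51.46M at dim512, reported separately.
+2. **Scale the data**: v2 trained on only ~37M tokens (still underfitting, with val falling through the last step); v3 trains on **245.8M tokens** (×6.7),
+   still reusing the full TinyStories token-id stream cached in v1 (468M tokens): **~0.53 epoch, no repetition**.
+3. **Single-GPU batched (T10, sidestepping KI-5)**: T10 disproved the assumed root cause of KI-1 (weak DDP scaling), which surfaced in v2, and fixed the real one. The bottleneck
+   was not all-reduce but the **per-op launch overhead of the single-sequence forward** (GPU util 0-15%). T10 rewrites the forward as
+   `[B·S, dim]` flattened linears + fused batched causal SDPA (cuBLAS strided-batched, only 3 launches for attention),
+   raising single-GPU throughput **1653→25627 tok/s (15.5×)** and util to 37-54%. v3 is therefore **fast enough on one GPU** and sidesteps KI-5,
+   the not-yet-bucketed DDP all-reduce, which only matters with multiple GPUs.
+4. After training, save to the registry (`~/projects/tiny-models/v3-tinystories-dim512/`) and export to xserv format to verify the model is servable, then report
+   the **concrete improvement over v2** (val loss on the same held-out set + side-by-side samples from the same prompts).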
+
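+> **Engineering significance of this run**: it validates T10's batched forward at real scaling size (67M core / 245.8M tokens / 2.65 h).
+> It is the throughput foundation (~26K tok/s on one GPU ≈ 7× the 4-GPU DDP of the KI-1 era) and stays numerically correct (batched ==
+> single-sequence equivalence, all grad-checks green, xserv closed loop holds). v3 runs on a **single GPU** throughout, so KI-5 is never triggered.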
+
+## Data
+
+| Item | v2 | v3 |
+|----|----|----|
+| Source | TinyStories **full train split** (reusing the v1 cache) | same |
+| Tokens (corpus) | 468,260,367 | same |
+| **Tokens consumed in training** | ~36.9M (4500 steps × 8192) | **~245.8M** (30000 steps × 8192) |
+| Fraction of an epoch | ~0.08 | **~0.53** (still <1 epoch, no repetition) |
+| tokenizer | gpt2 BPE (vocab 50257) | same |
+| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936MB) | **reused as-is** |
+| held-out val | last 1,000,000 tokens of the full stream | **the same 1M tokens** (identical held-out set to v0/v1/v2, for a fair comparison) |
+
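+**Cache reuse**: `Corpus::load_cached` reads `<corpus>.u16.bin`, so 467.26M train tokens are loaded at startup (the last 1M are held out for val).
+The held-out val set is still the last 1M tokens of the full stream (`split_tail`), the same set as v0/v1/v2, so **val loss is directly comparable across v0–v3**.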
+
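+**Data ladder**: v3 still trains on TinyStories (~0.53 epoch, not yet exhausted, and the model is still in tiny-LM range). The core is now 67M, which leaves
+headroom before TinyStories becomes the capacity ceiling (expected at ~100M+ core), so v3 keeps the same corpus. Once the core grows further, the data ladder
+moves on to a broader, high-quality corpus.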
+
+## Architecture
+
+v3 is a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA).
+The forward graph is isomorphic to v0/v1/v2; only the dims are larger. **No structural changes.**
+
+| Dimension | v2 | v3 |
+|------|----|----|
+| dim (= heads·head_dim) | 384 | **512** |
+| n_layers | 12 | **16** |
+| n_heads | 12 | **16** |
+| head_dim | 32 | 32 |
+| ffn_hidden (SwiGLU) | 1536 | **2048** |
+| vocab | 50257 | 50257 |
+| **core params** (excluding embed + lm_head) | 28,322,304 (≈28.32M) | **67,127,296 (≈67.13M, ×2.37)** |
+| embed + lm_head (2×vocab×dim) | 38,597,376 (≈38.60M) | 51,463,168 (≈51.46M) |
+| **total params** | 66,919,680 (≈66.92M) | **118,590,464 (≈118.59M)** |
+
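+**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. At dim512 the gpt2 vocab of 50257 makes
+embedding + lm_head a fixed ~51.46M. These two tables are a function of **vocabulary size**, not of model capacity, so the ladder is measured by **core**
+(v3 core: 67.13M). Note that embed/lm_head still account for ~43% (51.46M) of v3's 118.59M total, which is the large-gpt2-vocabulary share issue
+(see docs/known-issues.md KI-4): the share falls as dim grows, but at dim512 it is still close to half of all parameters.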
+
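+**Architecture changes vs v2**: pure scale-up (dim 384→512 / layers 12→16 / heads 12→16 / ffn 1536→2048), no structural changes.
+The ladder is already parameterized, so v3 only changes the `--dim/--heads/--layers/--ffn/--steps` flags and leaves the model code untouched.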
+
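+## Trainer: single-GPU batched (T10)
+
+v2 used 4-GPU DDP (T8), and its scaling was held back by KI-1 (all-reduce share too high), attributed to global_batch=32 being too small. The T10 investigation found that
+**the premise of KI-1 was false**: the real reason a single GPU managed only ~1653 tok/s in the v2 era was not communication but **every op of the single-sequence forward
+being launched separately** (the GPU sat idle most of the time, util 0-15%). T10's fixes: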
+
+- **flatten linears**: merge the per-sequence `[B][S,dim]` matmuls into one large GEMM, `[B·S, dim] @ W`.
+- **fused batched causal SDPA**: use `cublasStridedBatched` for QKᵀ / softmax·V, so the whole attention takes **3
+  launches** (rather than per-seq, per-op).
+- **RoPE per-seq**: pos = `row % S` (after flattening the batch, the position is the row index within its sequence).
+
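+Results (docs/09-batched-forward.md): **single GPU 1653→25627 tok/s (15.5×), ~40K tok/s at batch 32 (24×),
+util 0-15%→37-54%**. All gates are green (15 grad-checks / PyTorch B>1 comparison / **batched == single-sequence equivalence** / overfit /
+DDP consistency / xserv closed loop). v3 therefore **trains on a single GPU**: ~26K tok/s ≈ **7×** v2's 4-GPU DDP (~3.6K tok/s), without triggering
+KI-5 (unbucketed DDP all-reduce, relevant only when going back to multi-GPU).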
+
+## Hyperparameters
+
+| Item | Value | Notes |
+|----|----|----|
+| optimizer | hand-written AdamW (step runs on GPU) | wd=0.1, β/eps use the xtrain-optim defaults |
+| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1/v2) |
+| warmup | **1500 steps** (steps/20) | |
+| grad clip | global-norm 1.0 | gnorm stays at ~0.35–0.5 throughout, stable |
+| steps | **30000** | ~2.65 hours on a single GPU |
+| batch | **32** | **single-GPU batched** (T10: one forward consumes 32 sequences, rather than a SUM over multiple passes) |
+| seq_len | **256** | same as v2 |
+| tokens/step | 32×256 = 8192 | total training tokens ≈ **245.8M** (~0.53 epoch) |
+| world size | **1** (single RTX 5090, sm_120) | sidesteps KI-5 |
+| precision | f32 (training) | converted to BF16 when exporting to xserv (see T9) |
+
+**Compute**: a single RTX 5090 on dash5, ~26,000 tok/s throughout (~28K at startup, ~26K steady state), wall-clock ≈ **2.65 hours**.
+
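+## Results
+
+- **train loss**: start **10.9118** → end **~1.40** (last batch 1.3993; smooth decline throughout)
+- **best / final val loss (held-out 1M tokens, step 29999)**: **1.3027** (**beats the ~1.4 target**)
+- val loss curve (step 999, then every 3000 steps from step 2999; monotonically decreasing, still falling at the last step, **no overfitting**):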
+
+| step | 999 | 2999 | 5999 | 8999 | 11999 | 14999 | 17999 | 20999 | 23999 | 26999 | 29999 |
+|------|-----|------|------|------|-------|-------|-------|-------|-------|-------|-------|
+| val  | 2.5205 | 1.8738 | 1.6878 | 1.5757 | 1.5080 | 1.4594 | 1.4077 | 1.3688 | 1.3389 | 1.3163 | **1.3027** |
+
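+val keeps falling through the last step with no rebound, so the model is still **underfitting**: more steps/data (or a larger model) can push it lower (the v4 levers).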
+
+### Samples (greedy, sampled directly from xtrain, same prompts)
+
+```
+[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
+                     outside in the park. One day, she saw a big, scary dog. The dog barked
+                     loudly and scared her. She ran
+[The little]       → The little girl was so excited. She wanted to try it out. She asked her mom
+                     if she could go outside and play. Her mom said yes, so the little girl went
+                     outside. The little girl
+[One day]          → One day, a little girl named Lily went to the park with her mom. They saw a
+                     big tree with a swing. Lily wanted to play on the swing, but she was scared.
+                     Her mom said,
+```
+
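+Sampling at temperature 0.8 is equally coherent (multiple characters, complete plots, e.g. Lily crying after breaking a cushion, or discovering a new world in a book); see `RUN.md` /
+the training log.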
+
+## Improvement over v2
+
+**best val loss (best held-out 1M-token value reported by each version's own training run, same held-out set)**:
+
+| Model | core params | training tokens | **best val loss** | Notes |
+|------|-----------|-----------|-------------------|------|
+| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice, samples degenerate into loops |
+| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
+| v2 | 28.32M (×3.37) | ~36.9M (×7.2) | 1.7055 | dim384/12L + DDP, val 0.88 lower than v1 |
+| v3 | 67.13M (**×2.37** vs v2) | ~245.8M (**×6.7** vs v2) | **1.3027** | dim512/16L + single-GPU batched, val **0.40** lower than v2 |
+
+**Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30.** Each rung is lower than the last on the same 1M-token held-out set.
+
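+### Side-by-side samples (greedy, 40 tokens, same prompts; served by xserv, except that the v3 `The little` row matches the xtrain F32 output)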
+
+| prompt | v2 | v3 |
+|--------|----|----|
+| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big** | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran** |
+| `One day` | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. **The little girl was scared and ran away.** | One day, **a little girl named Lily went to the park with her mom. They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said,** |
+| `The little` | The little girl was so happy and she thanked the man for his help. **She said goodbye and went home with a smile on her face.** | The little girl was so **excited. She wanted to try it out. She asked her mom if she could go outside and play. Her mom said yes, so the little girl went outside.** |
+
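+**Conclusion**: v2 (28.32M core / 37M tokens) can already write multi-step plots, but the beats are formulaic and the endings come quickly. From the **same openings**, v3 (67.13M core /
+245.8M tokens) develops plots that are more specific and have more internal causality (sees a dog → the dog barks → she is scared → she runs; wants to swing → but is scared →
+her mom speaks up), with more coherent character motivation and turns, and denser storytelling. **best val 1.71→1.30 (0.40 lower) + samples moving from "multi-step plots"
+to "continuous narrative with motivation and turns"**: v3 is a clear, quantifiable improvement over v2.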
+
+## xserv verification
+
+Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 `docs/08`;
+**179 tensors** = 16 layers × 11 + embed + norm + lm_head), store it in the registry, then load it with `xserv-cli` and generate greedily:
+
+```
+$ xserv-cli ~/projects/tiny-models/v3-tinystories-dim512 --max-tokens 40
+Model: qwen3, layers=16, hidden=512, heads=16/16 kv, vocab=50257
+Loaded 179 tensors
+xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
+       park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran
+xserv> The little girl was so excited. She ran to the kitchen and grabbed a spoon. She started
+       to stir the soup. She stirred and stirred until it was all mixed together. ...
+xserv> One day, a little girl named Lily went to the park with her mom. They saw a big tree with
+       a swing. Lily wanted to play on the swing, but she was scared. Her mom said,
+```
+
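+**token-match**: comparing xserv (BF16) against xtrain's own greedy decoding (F32), **2 of the 3 prompts match token for token**
+("Once upon a time", "One day"). The third ("The little") diverges after "so excited." (xtrain continues
+"She wanted to try it out…", xserv continues "She ran to the kitchen…"): a tiny difference in a single logit flips the greedy choice and the sequences diverge from there,
+the same source as the ~0.5% BF16 drift observed in v1/v2. The closed loop still holds at v3 scale (67M core): most prompts match token for token, and a few
+diverge late because of BF16.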
+
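+## v4 proposal
+
+v3's val curve falls monotonically through the last step (no overfitting), so the model is still **underfitting**: a larger model / more tokens can go lower. Suggested v4: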
+
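+- **Model**: dim 640–768 / 20–24 layers / ffn 2560–3072 → core ≈ **130–200M** (capacity ×2–3). The vocabulary is unchanged, so
+  embed + lm_head come to ~77M at dim768.
+- **Data/steps**: raise training tokens from 245.8M to ~600M–1B (entering the multi-epoch regime on TinyStories, or, following the data ladder,
+  mixing in a broader high-quality corpus), targeting a val of ~1.0–1.1.
+- **Open levers (enable as needed)**:
+  - **KI-5 (unbucketed DDP all-reduce)**: if v4 goes back to multi-GPU, first implement bucketed / overlapped all-reduce; otherwise a single
+    all-reduce over all parameters of a larger model will eat into scaling again. v3 deliberately avoided this by staying on one GPU.
+  - **KI-2/KI-3 (bf16 with fp32 master / activation recomputation)**: as the model grows, memory and compute pressure rise, and bf16 mixed precision +
+    recomputation start to pay off noticeably (deferred at the tiny v0–v3 scale; rationale in docs/06).
+  - **KI-4 (large-vocabulary share)**: at dim512, embed/lm_head still take 51.46M / 118.59M ≈ 43%. Growing the core further dilutes
+    that share, but for better efficiency a smaller or better-fitting tokenizer is worth considering.
+  - **Data ladder**: once the core reaches ~100M+, TinyStories approaches its capacity ceiling, so v4 is the right point to start **broadening the corpus** (TinyStories +
+    some general high-quality text) and to re-evaluate the tokenizer at the same time.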
+
+The ladder is already parameterized, so v4 only needs different `--dim/--heads/--layers/--ffn/--steps` flags; multi-GPU then layers DDP on top (KI-5 must be fixed first).