Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup 1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40 best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scaling Run v3: TinyStories + dim512/16L + single-GPU batched (T10) - Design Document
Goal
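v3 builds on v2 (dim384/12L, 28.32M core, ~37M training tokens, 4-GPU DDP) and keeps scaling along two axes, model and data, while moving training back to a single GPU. Going single-GPU is not a step backwards here: it is the right choice now that T10's batched forward has landed: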
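- Bigger model: dim 384→512, layers 12→16, heads 12→16, which brings the transformer core to ~67M parameters (×2.4 capacity). The vocabulary is unchanged, so with the gpt2 50257-token vocab at dim512 the embedding + lm_head add a fixed ~51.46M parameters, reported separately.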
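- More data: v2 trained on only ~37M tokens and was still underfitting (val loss kept falling through the last step). v3 trains on 245.8M tokens (×6.7), still reusing the full TinyStories token-id stream cached in v1 (468M tokens): ~0.53 epoch, no repeated data.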
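- Single-GPU batched (T10, avoiding KI-5): the presumed root cause of KI-1 (weak DDP scaling), exposed by v2, was disproved and fixed by T10. The real bottleneck was not all-reduce but the per-op launch overhead of the single-sequence forward (GPU util 0-15%). T10 rewrote the forward as [B·S, dim] flattened linears + fused batched causal SDPA (cuBLAS strided-batched, only 3 launches for attention), taking single-GPU throughput from 1653 to 25627 tok/s (15.5×) and util to 37-54%. v3 is therefore fast enough on one GPU and sidesteps KI-5, DDP's not-yet-bucketed all-reduce, which only matters with multiple GPUs.
- After training, save to the registry (~/projects/tiny-models/v3-tinystories-dim512/), export to xserv format to verify the model is servable, and report the concrete improvement over v2 (val loss on the same held-out set + side-by-side samples from the same prompts).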
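Engineering significance of this run: it validates T10's batched forward at a real scaling size (67M core / 245.8M tokens / 2.65 h). Batched forward is the throughput foundation (~26K tok/s on one GPU, about 7× the 4-GPU DDP of the KI-1 era), and it stays numerically correct (batched == single-sequence equivalence, all grad-checks green, xserv round trip holds). v3 ran on a single GPU throughout, so KI-5 is never triggered.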
Data
| Item | v2 | v3 |
|---|---|---|
| Source | full TinyStories train split (reusing the v1 cache) | same |
| Tokens (corpus) | 468,260,367 | same |
| Training tokens consumed | ~36.9M (4500 steps × 8192) | ~245.8M (30000 steps × 8192) |
| Epoch fraction | ~0.08 | ~0.53 (still <1 epoch, no repeats) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | data/tinystories-train.txt.u16.bin (u16, 936 MB) | reused as-is |
| held-out val | last 1,000,000 tokens of the full stream | the same 1M tokens (the identical held-out set used by v0/v1/v2, for a fair comparison) |
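Cache reuse: Corpus::load_cached reads <corpus>.u16.bin and loads 467.26M train tokens at startup (the last 1M are held out for val).
The held-out val set is still the last 1M tokens of the full stream (split_tail), the same held-out set as v0/v1/v2, so val loss is directly comparable across v0–v3.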
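The cache-and-split step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the `.u16.bin` file is a flat array of little-endian u16 token ids with no header (gpt2's 50257 ids fit in u16), and both function names are hypothetical stand-ins for `Corpus::load_cached` and `split_tail`.

```rust
/// Decode a token cache into ids.
/// Assumption: flat little-endian u16 values, no header; a trailing odd byte is ignored.
fn decode_u16_le(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Hold out the last `val_len` tokens as the validation set.
/// Returns (train, val); if the stream is shorter than `val_len`, train is empty.
fn split_tail(tokens: &[u16], val_len: usize) -> (&[u16], &[u16]) {
    let cut = tokens.len().saturating_sub(val_len);
    tokens.split_at(cut)
}
```

With the corpus size from the table, a 1,000,000-token tail leaves 468,260,367 − 1,000,000 = 467,260,367 train tokens, which is the 467.26M quoted above. Because the tail is taken from a fixed cached stream, every run sees the same val tokens.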
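Data ladder: v3 still trains on TinyStories (~0.53 epoch, so the corpus is not exhausted, and the model is still in tiny-LM territory). The core is now 67M, which leaves headroom before TinyStories becomes the capacity ceiling (expected at ~100M+ core), so v3 keeps the same corpus. Once the core grows further, the data ladder moves on to broader, high-quality corpora.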
Architecture
v3 is a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA). The forward graph is identical to v0/v1/v2; only the dims grow. No structural changes.
| Dimension | v2 | v3 |
|---|---|---|
| dim (= heads·head_dim) | 384 | 512 |
| n_layers | 12 | 16 |
| n_heads | 12 | 16 |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 1536 | 2048 |
| vocab | 50257 | 50257 |
| core params (excluding embed + lm_head) | 28,322,304 (≈28.32M) | 67,127,296 (≈67.13M, ×2.37) |
| embed + lm_head (2×vocab×dim) | 38,597,376 (≈38.60M) | 51,463,168 (≈51.46M) |
| total params | 66,919,680 (≈66.92M) | 118,590,464 (≈118.59M) |
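How core is measured: Config::core_params() = num_params() − 2·vocab·dim. With the gpt2 50257-token vocab at dim512, embedding + lm_head take a fixed ~51.46M parameters. These two tables are a function of vocabulary size, not of model capacity, so the ladder is measured by core (v3 core: 67.13M). Note that embed/lm_head still make up ~43% (51.46M) of v3's 118.59M total. This is the large-gpt2-vocab share problem (see docs/known-issues.md, KI-4): the share shrinks as dim grows, but at dim512 it is still close to half of all parameters.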
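The core-counting rule is simple enough to check by hand. The sketch below applies it to the totals from the table; the free functions are hypothetical (the source only names `Config::core_params()` and `num_params()`), and the totals are taken as given rather than recomputed from the layer shapes.

```rust
/// Parameters in the two vocab-sized tables (embedding + separate lm_head).
fn vocab_table_params(vocab: u64, dim: u64) -> u64 {
    2 * vocab * dim
}

/// Core parameters: the total minus the two vocab tables,
/// following the `num_params() - 2*vocab*dim` rule quoted above.
fn core_params(total: u64, vocab: u64, dim: u64) -> u64 {
    total - vocab_table_params(vocab, dim)
}

/// Fraction of all parameters that sit in the vocab tables.
fn vocab_share(total: u64, vocab: u64, dim: u64) -> f64 {
    vocab_table_params(vocab, dim) as f64 / total as f64
}
```

For v3, `core_params(118_590_464, 50_257, 512)` gives 67,127,296, matching the table. The same inputs give a vocab share of about 43.4%, against about 57.7% for v2 (`vocab_share(66_919_680, 50_257, 384)`), which is the shrinking-share effect behind KI-4.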
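Architecture changes vs v2: pure scale-up (dim 384→512 / layers 12→16 / heads 12→16 / ffn 1536→2048), no structural changes.
The ladder is already parameterized, so v3 only changes the --dim/--heads/--layers/--ffn/--steps flags and leaves the model code untouched.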
Trainer: single-GPU batched (T10)
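v2 used 4-GPU DDP (T8), and its scaling was held back by KI-1 (all-reduce taking too large a share), which was attributed to global_batch=32 being too small. The T10 investigation disproved that premise: the real reason a single GPU managed only ~1653 tok/s in the v2 era was not communication, but that every op of the single-sequence forward was launched on its own (the GPU sat idle most of the time, util 0-15%). T10's fixes: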
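- flatten linears: merge the per-sequence matmuls over [B][S,dim] into one large GEMM, [B·S, dim] @ W.
- fused batched causal SDPA: use cublasStridedBatched for QKᵀ and softmax·V, so the whole attention takes 3 launches (instead of per-seq, per-op).
- RoPE per-seq: pos = row % S (after flattening the batch, the position is the row index within its own sequence).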
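The indexing behind these three points can be sketched on the CPU. This is an illustration under assumptions, not the T10 kernels: it uses a naive row-major matmul with W stored as [in, out], and the function names are hypothetical. It shows that treating B contiguous [S, dim] blocks as one [B·S, dim] matrix gives the same result as per-sequence matmuls, and how position and causal visibility are recovered from the flattened row index.

```rust
/// RoPE position of a flattened row: its index within its own sequence.
fn rope_pos(row: usize, seq_len: usize) -> usize {
    row % seq_len
}

/// Which sequence a flattened row belongs to.
fn seq_index(row: usize, seq_len: usize) -> usize {
    row / seq_len
}

/// Causal visibility between flattened rows: same sequence, key not in the future.
fn can_attend(query_row: usize, key_row: usize, seq_len: usize) -> bool {
    seq_index(query_row, seq_len) == seq_index(key_row, seq_len)
        && rope_pos(key_row, seq_len) <= rope_pos(query_row, seq_len)
}

/// Naive row-major matmul: x is [rows, k], w is [k, n], result is [rows, n].
fn matmul(x: &[f32], w: &[f32], rows: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; rows * n];
    for r in 0..rows {
        for c in 0..n {
            let mut acc = 0.0f32;
            for i in 0..k {
                acc += x[r * k + i] * w[i * n + c];
            }
            out[r * n + c] = acc;
        }
    }
    out
}

/// Per-sequence path: one matmul per sequence, results concatenated.
fn linear_per_seq(x: &[f32], w: &[f32], b: usize, s: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(b * s * n);
    for seq in 0..b {
        let chunk = &x[seq * s * k..(seq + 1) * s * k];
        out.extend(matmul(chunk, w, s, k, n));
    }
    out
}

/// Flattened path: the same buffer viewed as a single [B*S, k] matrix.
fn linear_flat(x: &[f32], w: &[f32], b: usize, s: usize, k: usize, n: usize) -> Vec<f32> {
    matmul(x, w, b * s, k, n)
}
```

Each output row depends only on its own input row, so the two linear paths agree exactly here; the gain on the GPU comes from issuing one large GEMM instead of B small ones. Attention is different: rows must not see other sequences, which is why it needs the batched, causally masked form rather than a plain flatten.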
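Results (docs/09-batched-forward.md): single-GPU throughput 1653→25627 tok/s (15.5×), ~40K tok/s at batch 32 (24×), util 0-15%→37-54%. All gates are green (15 grad-checks / PyTorch B>1 cross-check / batched == single-sequence equivalence / overfit / DDP consistency / xserv round trip). v3 therefore trains on a single GPU: ~26K tok/s is about 7× v2's 4-GPU DDP (~3.6K tok/s), and KI-5 (unbucketed DDP all-reduce, only relevant when going back to multi-GPU) is not triggered.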
Hyperparameters
| Item | Value | Notes |
|---|---|---|
| optimizer | hand-written AdamW (step runs on the GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr 6e-4 → min_lr 6e-5 (same as v1/v2) |
| warmup | 1500 steps (steps/20) | |
| grad clip | global-norm 1.0 | gnorm stays at ~0.35–0.5 throughout, stable |
| steps | 30000 | ~2.65 hours on a single GPU |
| batch | 32 | single-GPU batched (T10: one forward consumes all 32 sequences, rather than summing several passes) |
| seq_len | 256 | same as v2 |
| tokens/step | 32×256 = 8192 | total training tokens ≈ 245.8M (~0.53 epoch) |
| world size | 1 (single RTX 5090, sm_120) | avoids KI-5 |
| precision | f32 (training) | converted to BF16 on xserv export (see T9) |
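The schedule and clip rows can be sketched as two small functions. The exact formulas used by the trainer are not given in this document, so this is the common form under stated assumptions: the warmup ramp is taken as `(step + 1) / warmup`, the cosine is applied over the remaining steps, and `total_steps > warmup` is required. Both function names are hypothetical.

```rust
use std::f64::consts::PI;

/// Linear warmup to `max_lr`, then cosine decay down to `min_lr`.
/// Assumptions: ramp is (step + 1) / warmup; total_steps > warmup.
fn lr_at(step: u64, total_steps: u64, warmup: u64, max_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        return max_lr * (step + 1) as f64 / warmup as f64;
    }
    // Fraction of the decay phase completed, clamped so late steps stay at min_lr.
    let progress = ((step - warmup) as f64 / (total_steps - warmup) as f64).min(1.0);
    let cosine = 0.5 * (1.0 + (PI * progress).cos());
    min_lr + (max_lr - min_lr) * cosine
}

/// Factor applied to every gradient under global-norm clipping.
fn clip_scale(global_grad_norm: f64, max_norm: f64) -> f64 {
    if global_grad_norm > max_norm {
        max_norm / global_grad_norm
    } else {
        1.0
    }
}
```

With the values from the table, `lr_at(step, 30000, 1500, 6e-4, 6e-5)` peaks at 6e-4 at the end of warmup and passes 3.3e-4 halfway through the decay phase. For a global norm in the reported 0.35–0.5 range, `clip_scale(norm, 1.0)` returns 1.0, so under this formulation the clip leaves those gradients unscaled.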
Compute: a single RTX 5090 on dash5, ~26,000 tok/s throughout (~28K at startup, ~26K steady state), wall-clock ≈ 2.65 hours.
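The token budget and wall-clock figures follow from a few multiplications. The helpers below are a hypothetical sketch that only reuses numbers stated in this document (batch 32, seq_len 256, 30000 steps, 467,260,367 train tokens, ~26K tok/s).

```rust
/// Tokens consumed by one optimizer step.
fn tokens_per_step(batch: u64, seq_len: u64) -> u64 {
    batch * seq_len
}

/// Tokens consumed by the whole run.
fn total_tokens(steps: u64, batch: u64, seq_len: u64) -> u64 {
    steps * tokens_per_step(batch, seq_len)
}

/// Fraction of one pass over the training stream.
fn epoch_fraction(consumed: u64, train_tokens: u64) -> f64 {
    consumed as f64 / train_tokens as f64
}

/// Training time in hours at a constant throughput (ignores eval and startup).
fn wall_clock_hours(consumed: u64, tok_per_sec: f64) -> f64 {
    consumed as f64 / tok_per_sec / 3600.0
}
```

`total_tokens(30000, 32, 256)` is 245,760,000, i.e. about 0.53 of the 467.26M train tokens. At a flat 26,000 tok/s that is roughly 2.63 hours of pure training time, in line with the reported ~2.65 hours.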
Results
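- train loss: start 10.9118 → end ~1.40 (last batch 1.3993; steady decline throughout)
- best / final val loss (held-out 1M tokens, step 29999): 1.3027 (beats the ~1.4 target)
- val loss curve (sampled every 3000 steps, plus step 999; monotonically decreasing, still falling at the last step, no overfitting):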
| step | 999 | 2999 | 5999 | 8999 | 11999 | 14999 | 17999 | 20999 | 23999 | 26999 | 29999 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| val | 2.5205 | 1.8738 | 1.6878 | 1.5757 | 1.5080 | 1.4594 | 1.4077 | 1.3688 | 1.3389 | 1.3163 | 1.3027 |
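Val loss falls all the way to the last step with no rebound, so the model is still underfitting: more steps/data (or a larger model) would push it lower (the v4 levers).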
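The "monotonic, still falling" reading of the table can be checked mechanically. A minimal sketch, with the curve copied from the table above and hypothetical helper names:

```rust
/// (step, val loss) pairs copied from the table above.
const VAL_CURVE: [(u32, f64); 11] = [
    (999, 2.5205),
    (2999, 1.8738),
    (5999, 1.6878),
    (8999, 1.5757),
    (11999, 1.5080),
    (14999, 1.4594),
    (17999, 1.4077),
    (20999, 1.3688),
    (23999, 1.3389),
    (26999, 1.3163),
    (29999, 1.3027),
];

/// True if every checkpoint has lower val loss than the one before it.
fn is_strictly_decreasing(curve: &[(u32, f64)]) -> bool {
    curve.windows(2).all(|w| w[1].1 < w[0].1)
}

/// Val-loss improvement over the final interval, if there are at least two points.
fn last_drop(curve: &[(u32, f64)]) -> Option<f64> {
    match curve {
        [.., prev, last] => Some(prev.1 - last.1),
        _ => None,
    }
}
```

On this curve the check passes and the final 3000 steps still gain about 0.0136 (1.3163 → 1.3027), which is the basis for calling the run underfit rather than converged.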
Samples (greedy, sampled directly from xtrain, same prompts)
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the park. One day, she saw a big, scary dog. The dog barked
loudly and scared her. She ran
[The little] → The little girl was so excited. She wanted to try it out. She asked her mom
if she could go outside and play. Her mom said yes, so the little girl went
outside. The little girl
[One day] → One day, a little girl named Lily went to the park with her mom. They saw a
big tree with a swing. Lily wanted to play on the swing, but she was scared.
Her mom said,
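Sampling at temperature 0.8 is just as coherent (multiple characters, complete plots, e.g. Lily crying after breaking a cushion, or discovering a new world in a book); see RUN.md / the training log.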
Improvement over v2
best val loss (the best held-out 1M-token value reported by each version's own training run, all on the same held-out set):
| Model | core params | training tokens | best val loss | Notes |
|---|---|---|---|---|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3 MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full dataset + dim256/8L, single GPU |
| v2 | 28.32M (×3.37) | ~36.9M (×7.2) | 1.7055 | dim384/12L + DDP; val 0.88 below v1 |
| v3 | 67.13M (×2.37 vs v2) | ~245.8M (×6.7 vs v2) | 1.3027 | dim512/16L + single-GPU batched; val 0.40 below v2 |
Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30. Every rung improves on the previous one, measured on the same 1M-token held-out set.
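The multipliers and deltas quoted in the table can be reproduced from its rounded entries. A small sketch with hypothetical names; values are in millions as printed above, so the ratios are only as precise as that rounding.

```rust
/// One rung of the ladder, copied from the table above (values in millions).
struct Run {
    name: &'static str,
    core_params_m: f64,
    train_tokens_m: f64,
    best_val: f64,
}

const LADDER: [Run; 4] = [
    Run { name: "v0", core_params_m: 0.041, train_tokens_m: 0.72, best_val: 3.8050 },
    Run { name: "v1", core_params_m: 8.39, train_tokens_m: 5.1, best_val: 2.5847 },
    Run { name: "v2", core_params_m: 28.32, train_tokens_m: 36.9, best_val: 1.7055 },
    Run { name: "v3", core_params_m: 67.13, train_tokens_m: 245.8, best_val: 1.3027 },
];

/// For each consecutive pair: (newer run, core ratio, token ratio, val-loss drop).
fn rung_deltas(ladder: &[Run]) -> Vec<(&'static str, f64, f64, f64)> {
    ladder
        .windows(2)
        .map(|w| {
            (
                w[1].name,
                w[1].core_params_m / w[0].core_params_m,
                w[1].train_tokens_m / w[0].train_tokens_m,
                w[0].best_val - w[1].best_val,
            )
        })
        .collect()
}
```

For v1→v2 this gives roughly ×3.37 core, ×7.2 tokens and a 0.88 drop; for v2→v3 roughly ×2.37 core, ×6.7 tokens and a 0.40 drop, matching the table.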
Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts)
| prompt | v2 | v3 |
|---|---|---|
| Once upon a time | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big | …a little girl named Lily. She loved to play outside in the park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran |
| One day | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. The little girl was scared and ran away. | One day, a little girl named Lily went to the park with her mom. They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said, |
| The little | The little girl was so happy and she thanked the man for his help. She said goodbye and went home with a smile on her face. | The little girl was so excited. She wanted to try it out. She asked her mom if she could go outside and play. Her mom said yes, so the little girl went outside. |
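Conclusion: v2 (28.32M core / 37M tokens) can already write multi-step plots, but its beats are fairly formulaic and it wraps up quickly. Given the same openings, v3 (67.13M core / 245.8M tokens) develops more specific plots with more internal causality (sees a dog → the dog barks → she is scared → she runs; wants to use the swing → but is scared → her mom speaks up), with more coherent character motivation and turns, and denser stories. Best val 1.71→1.30 (0.40 lower), plus samples moving from "multi-step plots" to "continuous narrative with motivation and turns": v3 is a clear, quantifiable improvement over v2.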
xserv verification
Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 docs/08; 179 tensors = 16 layers × 11 + embed + norm + lm_head), store the result in the registry, then load it with xserv-cli and generate greedily:
$ xserv-cli ~/projects/tiny-models/v3-tinystories-dim512 --max-tokens 40
Model: qwen3, layers=16, hidden=512, heads=16/16 kv, vocab=50257
Loaded 179 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran
xserv> The little girl was so excited. She ran to the kitchen and grabbed a spoon. She started
to stir the soup. She stirred and stirred until it was all mixed together. ...
xserv> One day, a little girl named Lily went to the park with her mom. They saw a big tree with
a swing. Lily wanted to play on the swing, but she was scared. Her mom said,
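The two numeric transforms in the export step described above, the [in,out]→[out,in] transpose and the F32→BF16 conversion, can be sketched as follows. This is not the T9 exporter: the rounding mode (round-to-nearest-even) is an assumption, all function names are hypothetical, and the per-layer count of 11 tensors is taken from the 179-tensor breakdown above.

```rust
/// Transpose a row-major [rows, cols] matrix into row-major [cols, rows].
fn transpose(w: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; w.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = w[r * cols + c];
        }
    }
    out
}

/// f32 -> bf16 bit pattern, round-to-nearest-even (assumed rounding mode).
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep NaN a NaN: force a mantissa bit that survives truncation.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add just under half a bf16 ULP, plus one if the kept LSB is odd (ties to even).
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding_bias) >> 16) as u16
}

/// bf16 bit pattern -> f32 (exact: bf16 is the top 16 bits of an f32).
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Expected tensor count: 11 per layer plus embed, final norm and lm_head.
fn expected_tensor_count(n_layers: usize) -> usize {
    n_layers * 11 + 3
}
```

BF16 keeps 8 significant bits, so under this rounding each normal-range weight moves by at most 2^-8 (about 0.4%) in relative terms. `expected_tensor_count(16)` is 179, matching the `Loaded 179 tensors` line in the transcript.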
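token-match: comparing xserv (BF16) against xtrain's own greedy decoding (F32), 2 of the 3 prompts match token for token ("Once upon a time", "One day"). The third ("The little") diverges after "so excited." (xtrain continues "She wanted to try it out…", xserv continues "She ran to the kitchen…"): a tiny difference in a single logit flips one greedy pick, after which the sequences drift apart. This has the same origin as the ~0.5% BF16 drift observed in v1/v2. The round trip still holds at v3 scale (67M core): most prompts match token for token, and a few diverge toward the end because of BF16.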
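A token-match check of this kind reduces to finding the first index where two greedy sequences differ. The sketch below is hypothetical (the document does not show how the comparison was run) and also illustrates why a small logit shift is enough to fork a greedy decode.

```rust
/// Index of the first position where two token sequences differ.
/// Returns None if they are identical; a strict prefix counts as diverging at its end.
fn first_divergence(a: &[u32], b: &[u32]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Greedy pick: index of the largest logit (first one wins ties; 0 for empty input).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}
```

With made-up logits `[1.000, 1.002, 0.5]` the greedy pick is index 1; nudging the first entry to 1.003 flips it to index 0. Once one token differs, every later step is conditioned on a different prefix, so the two continuations need not meet again.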
v4 proposal
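v3's val curve decreases monotonically through the last step (no overfitting), which means the model is still underfitting: a larger model or more tokens would lower it further. Suggested v4: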
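- Model: dim 640–768 / 20–24 layers / ffn 2560–3072 → core ≈ 130–200M (×2–3 capacity). Vocabulary unchanged → embed+lm_head is ~77M at dim768.
- Data/steps: raise training tokens from 245.8M to ~600M–1B (entering multi-epoch territory on TinyStories, or mixing in broader high-quality corpora per the data ladder), targeting val ≈ 1.0–1.1.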
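- Open levers (enable as needed):
  - KI-5 (unbucketed DDP all-reduce): if v4 goes back to multi-GPU, implement bucketed / overlapped all-reduce first; otherwise a single all-reduce over all parameters of a larger model will again hurt scaling. v3 deliberately avoided this by staying single-GPU.
  - KI-2/KI-3 (bf16 with fp32 master weights / activation recomputation): as the model grows, memory and compute pressure rise, and bf16 mixed precision + recomputation start to pay off noticeably (deferred at the tiny v0–v3 scale; rationale in docs/06).
  - KI-4 (large-vocab share): at dim512, embed/lm_head still take 51.46M / 118.59M ≈ 43%. Growing the core further dilutes the share, but for better efficiency a smaller / better-fitted tokenizer is worth considering.
  - Data ladder: once the core reaches ~100M+, TinyStories approaches its capacity ceiling, so v4 is a good point to start broadening the corpus (TinyStories + some general high-quality data) and to re-evaluate the tokenizer at the same time.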
The ladder is already parameterized: v4 only needs different --dim/--heads/--layers/--ffn/--steps flags, with DDP layered on top for multi-GPU (after fixing KI-5).
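To size the v4 candidates before running anything, the core can be approximated from the large matrices alone. This is a rough estimator under assumptions, not `Config::core_params()`: it counts four dim×dim attention projections per layer (MHA, heads·head_dim = dim) and three dim×ffn SwiGLU matrices, and ignores all norm weights. Function names are hypothetical.

```rust
/// Approximate core size from the big matrices only (norm weights ignored).
/// Per layer: 4 attention projections of dim x dim, 3 SwiGLU matrices of dim x ffn.
fn approx_core_params(dim: u64, n_layers: u64, ffn: u64) -> u64 {
    n_layers * (4 * dim * dim + 3 * dim * ffn)
}

/// Embedding + separate lm_head: two vocab x dim tables.
fn embed_plus_lm_head(vocab: u64, dim: u64) -> u64 {
    2 * vocab * dim
}
```

As a sanity check, `approx_core_params(512, 16, 2048)` gives 67,108,864, within 0.03% of v3's reported 67,127,296. For v4, dim 640 / 20 layers / ffn 2560 comes out at about 131M and dim 768 / 20 layers / ffn 3072 at about 189M; dim 768 with 24 layers lands near 226M, above the quoted range. `embed_plus_lm_head(50_257, 768)` is 77,194,752, the ~77M figure in the proposal.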