xtrain/docs/runs/03-v3-tinystories-dim512.md
Gahow Wang a78502e0f0 docs: run v3 — TinyStories, dim512, val 1.30
Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full
TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what
changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup
1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40
best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side
samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by
staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 03:37:45 +08:00

Scaling Run v3: TinyStories + dim512/16L + single-GPU batched (T10): Design Document

Goal

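Building on v2 (dim384/12L, core 28.32M, trained on ~37M tokens, DDP on 4 GPUs), v3 keeps scaling along both the model and data axes and moves training back to a single GPU. Single-GPU here is not a step backwards; it is the right choice now that the T10 batched forward has landed:

  1. Scale the model: dim 384→512, layers 12→16, heads 12→16, bringing the transformer core to ~67M parameters (capacity ×2.4). The vocabulary is unchanged, so embed + lm_head add a fixed ~51.46M at dim512 because of the gpt2 50257 vocab; this is reported separately.
  2. Scale the data: v2 trained on only ~37M tokens and was still underfitting (val kept falling through the final step). v3 trains on 245.8M tokens (×6.7), still reusing the full TinyStories token-id stream cached in v1 (468M tokens), which is ~0.53 epoch with no repetition.
  3. Single-GPU batched (T10), sidestepping KI-5: the root cause assumed for KI-1 (weak DDP scaling), which v2 exposed, was disproved by T10 and the real one fixed. The real bottleneck was not all-reduce but the per-op launch overhead of the single-sequence forward (GPU util 0-15%). T10 changed the forward to [B·S, dim] flattened linears plus a fused batched causal SDPA (cuBLAS strided-batched, attention in only 3 launches), lifting single-GPU throughput 1653→25627 tok/s (15.5×) and util to 37-54%. v3 is therefore fast enough on one GPU and sidesteps KI-5, the DDP all-reduce that is not yet bucketed (only multi-GPU needs it).
  4. After training, save to the registry (~/projects/tiny-models/v3-tinystories-dim512/) and export to xserv format to verify the model is servable, then report the concrete improvement over v2 (val loss on the same held-out set plus side-by-side samples from the same prompts).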

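Engineering significance of this run: it validates the T10 batched forward at a real scaling size (67M core / 245.8M tokens / 2.65 h). The batched forward is the throughput foundation (single-GPU ~26K tok/s ≈ 7× the 4-GPU DDP of the KI-1 era) and it preserves numerical correctness (batched == single-sequence equivalence, all grad-checks green, xserv round trip holds). v3 runs on a single GPU throughout, so KI-5 is never triggered.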

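Data

| | v2 | v3 |
|---|---|---|
| Source | TinyStories full train (reusing the v1 cache) | same as v2 |
| Corpus tokens | 468,260,367 | same as v2 |
| Tokens consumed in training | ~36.9M (4500 steps × 8192) | ~245.8M (30000 steps × 8192) |
| Epoch fraction | ~0.08 | ~0.53 (still <1 epoch, no repetition) |
| Tokenizer | gpt2 BPE (vocab 50257) | same as v2 |
| Cache | data/tinystories-train.txt.u16.bin (u16, 936MB) | reused directly |
| Held-out val | final 1,000,000 tokens of the full stream | the same 1M tokens (identical held-out set to v0/v1/v2, for a fair comparison) |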

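Cache reuse: Corpus::load_cached (<corpus>.u16.bin) loads the 467.26M train tokens at startup, with the final 1M held out for val. The held-out val set is still the final 1M tokens of the full stream (split_tail), the same held-out set as v0/v1/v2, so val loss is directly comparable across v0-v3.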
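A minimal sketch of the cache-and-split step described above, written in Python for illustration rather than taken from xtrain. The helper names `load_u16_tokens` and `split_tail`, and the assumption that the cache is a flat array of native-endian u16 token ids, are hypothetical; only the "last N tokens become val" rule comes from this doc.

```python
from array import array
import os
import tempfile

def load_u16_tokens(path):
    """Read a flat file of unsigned 16-bit token ids (assumed cache layout)."""
    tokens = array("H")  # "H" = unsigned short, 2 bytes on common platforms
    n = os.path.getsize(path) // tokens.itemsize
    with open(path, "rb") as f:
        tokens.fromfile(f, n)
    return tokens

def split_tail(tokens, n_val):
    """Hold out the last n_val tokens as validation; the rest is train."""
    if not 0 < n_val < len(tokens):
        raise ValueError("n_val must be in (0, len(tokens))")
    return tokens[:-n_val], tokens[-n_val:]

# Tiny round trip with a throwaway file standing in for the real cache.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "toy.u16.bin")
    with open(path, "wb") as f:
        array("H", range(10)).tofile(f)
    train, val = split_tail(load_u16_tokens(path), n_val=3)

print(list(train), list(val))
```

Because the split is a fixed tail of a fixed token stream, every run that uses the same cache and the same `n_val` sees the same validation tokens. The gpt2 vocabulary (50257 ids) fits in 16 bits, and at 2 bytes per token 468,260,367 tokens come to roughly 936 MB.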

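Data ladder: v3 still feeds TinyStories (~0.53 epoch, not yet exhausted), and the model is still in the tiny-LM range (core is at 67M, with headroom before TinyStories becomes the capacity ceiling, which needs ~100M+ core), so v3 does not switch corpora. Once the core grows further, move up the data ladder to a broader, high-quality corpus.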

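Architecture

v3 = a larger, isomorphic tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA). The forward graph is fully isomorphic to v0/v1/v2; only the dims grow. No structural changes.

| Dimension | v2 | v3 |
|---|---|---|
| dim (= heads·head_dim) | 384 | 512 |
| n_layers | 12 | 16 |
| n_heads | 12 | 16 |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 1536 | 2048 |
| vocab | 50257 | 50257 |
| core params (excluding embed + lm_head) | 28,322,304 (≈28.32M) | 67,127,296 (≈67.13M, ×2.37) |
| embed + lm_head (2×vocab×dim) | 38,597,376 (≈38.60M) | 51,463,168 (≈51.46M) |
| total params | 66,919,680 (≈66.92M) | 118,590,464 (≈118.59M) |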

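How core is measured: Config::core_params() = num_params() − 2·vocab·dim. With the gpt2 50257 vocab, embedding + lm_head occupy a fixed ~51.46M at dim512. These two tables are a function of vocabulary size, not of model capacity, so the ladder is tracked by core (v3 core 67.13M). Note: embed/lm_head still account for ~43% (51.46M) of v3's 118.59M total, which is the gpt2 large-vocab share problem (see docs/known-issues.md KI-4). The share falls as dim grows, but at dim512 it is still nearly half of the parameters.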
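The accounting above can be checked with a few lines of arithmetic. This is an illustrative Python sketch, not the xtrain `Config` code; the function names are hypothetical, and the totals are the values reported in this doc rather than recomputed from layer shapes.

```python
VOCAB = 50257  # gpt2 BPE vocabulary size, as reported in this doc

def embed_lm_head_params(vocab, dim):
    """Untied embedding + lm_head: two vocab x dim matrices."""
    return 2 * vocab * dim

def core_params(num_params, vocab, dim):
    """core = num_params - 2*vocab*dim, the relation stated above."""
    return num_params - embed_lm_head_params(vocab, dim)

# (total params, dim) per run, copied from the architecture table.
runs = {"v2": (66_919_680, 384), "v3": (118_590_464, 512)}

for name, (total, dim) in runs.items():
    table = embed_lm_head_params(VOCAB, dim)
    core = core_params(total, VOCAB, dim)
    print(f"{name}: core={core:,} embed+lm_head={table:,} share={table / total:.1%}")
```

The embed + lm_head term grows linearly with dim while the core grows roughly quadratically, which is why the vocabulary share shrinks as the model widens.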

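Architecture changes vs v2: pure scaling (dim 384→512 / layers 12→16 / heads 12→16 / ffn 1536→2048), no structural changes. The ladder is already parameterized: v3 only changes the --dim/--heads/--layers/--ffn/--steps flags and leaves the model code untouched.

Trainer: single-GPU batched (T10)

v2 used DDP (T8) on 4 GPUs, where scaling was held back by KI-1 (all-reduce share too high) because global_batch=32 was too small. The T10 investigation found that KI-1's premise was wrong: the real reason v2-era single-GPU throughput was only ~1653 tok/s was not communication, but that the single-sequence forward launched every op separately, leaving the GPU idle most of the time (util 0-15%). The T10 fixes:

  • flatten linears: merge the per-sequence [B][S,dim] matmuls into a single large GEMM, [B·S, dim] @ W.
  • fused batched causal SDPA: use cublasStridedBatched for QKᵀ / softmax·V, so the whole attention takes 3 launches (rather than per-seq, per-op launches).
  • RoPE per-seq: pos = row % S (after the batch is flattened, the position is the row index within its own sequence).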
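The first and third bullets can be illustrated on the CPU. The sketch below is plain Python with hypothetical helper names; it only shows that flattening `[B][S, dim]` to `[B·S, dim]` gives the same linear-layer result as one matmul per sequence, and how `pos = row % S` recovers per-sequence positions. It does not model GPU launches or the fused attention kernel.

```python
def matmul(x, w):
    """Plain matmul: x is [n, k], w is [k, m], both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def forward_per_sequence(batch, w):
    """One matmul per sequence: B separate [S, dim] @ W calls."""
    return [matmul(seq, w) for seq in batch]

def forward_flattened(batch, w):
    """Flatten [B][S, dim] to [B*S, dim] and do a single matmul."""
    seq_len = len(batch[0])
    flat = [row for seq in batch for row in seq]
    out = matmul(flat, w)
    # Reshape back to [B][S, out_dim] only so the two paths can be compared.
    return [out[i:i + seq_len] for i in range(0, len(out), seq_len)]

def rope_positions(batch_size, seq_len):
    """Position of each flattened row inside its own sequence: row % S."""
    return [row % seq_len for row in range(batch_size * seq_len)]

B, S, DIM, OUT = 2, 3, 4, 2
batch = [[[float(b * 100 + s * 10 + d) for d in range(DIM)]
          for s in range(S)] for b in range(B)]
w = [[float(i - j) for j in range(OUT)] for i in range(DIM)]

assert forward_flattened(batch, w) == forward_per_sequence(batch, w)
print(rope_positions(B, S))  # [0, 1, 2, 0, 1, 2]
```

A linear layer treats every row independently, so stacking rows from different sequences is safe. Attention and RoPE are the two places where sequence boundaries matter, which is why they get the batched SDPA and the `row % S` position rule respectively.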

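Results (docs/09-batched-forward.md): single-GPU 1653→25627 tok/s (15.5×), ~40K tok/s at batch 32 (24×), util 0-15%→37-54%. All gates green: 15 grad-checks / PyTorch B>1 cross-check / batched == single-sequence equivalence / overfit / DDP consistency / xserv round trip. v3 therefore trains on a single GPU (~26K tok/s ≈ 7× v2's 4-GPU DDP at ~3.6K tok/s) and does not trigger KI-5 (DDP all-reduce is unbucketed; only a return to multi-GPU would need it fixed).

Hyperparameters

| Item | Value | Notes |
|---|---|---|
| optimizer | hand-written AdamW (GPU-side step) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr 6e-4 → min_lr 6e-5 (same as v1/v2) |
| warmup | 1500 steps | steps/20 |
| grad clip | global-norm 1.0 | gnorm ~0.35-0.5 throughout, stable |
| steps | 30000 | ~2.65 hours on a single GPU |
| batch | 32 | single-GPU batched (T10): one forward consumes 32 sequences, not several passes summed |
| seq_len | 256 | same as v2 |
| tokens/step | 32×256 = 8192 | total training tokens ≈ 245.8M (~0.53 epoch) |
| world size | 1 (single RTX 5090, sm_120) | sidesteps KI-5 |
| precision | f32 (training) | converted to BF16 on xserv export (see T9) |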
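The LR schedule row can be sketched as a single function. This is a generic linear-warmup plus cosine-decay formula in Python, using the values from the table; the exact conventions xtrain uses (whether warmup counts from `step` or `step + 1`, and whether decay reaches `min_lr` exactly at the last step) are not stated in this doc and are assumptions here.

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, warmup=1500, total=30000):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    Assumed conventions: warmup ramps with (step + 1) / warmup, and the
    cosine phase spans steps [warmup, total].
    """
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 749, 1499, 15750, 29999):
    print(s, f"{lr_at(s):.3e}")
```

Halfway through the cosine phase (step 15750 under these assumptions) the rate sits at the midpoint between `max_lr` and `min_lr`, and it approaches `min_lr` at the end of the run.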

Compute: a single RTX 5090 on dash5, ~26,000 tok/s throughout (~28K at startup, ~26K steady state), wall-clock ≈ 2.65 hours.
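The token budget, epoch fraction and wall-clock figure follow from the hyperparameters by simple arithmetic. The Python sketch below (hypothetical function name) recomputes them from values stated in this doc; the throughput is the rounded steady-state figure, so the hours come out approximate.

```python
TRAIN_TOKENS = 467_260_367  # corpus tokens minus the 1M held-out tail

def training_budget(steps, batch, seq_len, tok_per_s):
    """Return (tokens per step, total tokens, wall-clock hours)."""
    tokens_per_step = batch * seq_len
    total_tokens = steps * tokens_per_step
    hours = total_tokens / tok_per_s / 3600
    return tokens_per_step, total_tokens, hours

tokens_per_step, total_tokens, hours = training_budget(30000, 32, 256, 26_000)
epochs = total_tokens / TRAIN_TOKENS

print(tokens_per_step, total_tokens, round(hours, 2), round(epochs, 2))
```

The same function applied to v2's 4500 steps gives about 36.9M tokens, or roughly 0.08 epoch.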

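Results

  • train loss: start 10.9118 → end ~1.40 (final batch 1.3993; steady decline throughout)
  • best / final val loss (held-out 1M tokens, step 29999): 1.3027 (beats the ~1.4 target)
  • val loss curve (sampled every 3000 steps; monotonically decreasing, still falling at the final step, no overfitting):

| step | 999 | 2999 | 5999 | 8999 | 11999 | 14999 | 17999 | 20999 | 23999 | 26999 | 29999 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| val | 2.5205 | 1.8738 | 1.6878 | 1.5757 | 1.5080 | 1.4594 | 1.4077 | 1.3688 | 1.3389 | 1.3163 | 1.3027 |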

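Val falls all the way to the final step with no rebound, so the model is still underfitting; more steps/data or a larger model can push it lower (the levers for v4).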
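The "still falling at the final step" reading can be checked directly from the sampled curve. This is a small Python sketch over the values reported above; the helper names are hypothetical.

```python
val_curve = [
    (999, 2.5205), (2999, 1.8738), (5999, 1.6878), (8999, 1.5757),
    (11999, 1.5080), (14999, 1.4594), (17999, 1.4077), (20999, 1.3688),
    (23999, 1.3389), (26999, 1.3163), (29999, 1.3027),
]

def deltas(curve):
    """Change in val loss between consecutive checkpoints."""
    return [round(b[1] - a[1], 4) for a, b in zip(curve, curve[1:])]

def is_monotone_decreasing(curve):
    """True if every checkpoint improves on the previous one."""
    return all(d < 0 for d in deltas(curve))

print(is_monotone_decreasing(val_curve))
print(deltas(val_curve)[-3:])
```

Every interval improves on the previous checkpoint, and the last 3000 steps still lower val by a little over 0.01, although the gains shrink as the learning rate decays.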

Samples (greedy, taken directly from xtrain, same prompts):

[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
                     outside in the park. One day, she saw a big, scary dog. The dog barked
                     loudly and scared her. She ran
[The little]       → The little girl was so excited. She wanted to try it out. She asked her mom
                     if she could go outside and play. Her mom said yes, so the little girl went
                     outside. The little girl
[One day]          → One day, a little girl named Lily went to the park with her mom. They saw a
                     big tree with a swing. Lily wanted to play on the swing, but she was scared.
                     Her mom said,

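Sampling at temperature 0.8 is equally coherent (multiple characters, complete plots, e.g. Lily crying after breaking a cushion, or discovering a new world in a book); see RUN.md / the training logs.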

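Improvement over v2

best val loss (the best value on the held-out 1M tokens as reported by each version's own training run; same held-out set):

| Model | core params | training tokens | best val loss | Notes |
|---|---|---|---|---|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
| v2 | 28.32M (×3.37) | ~36.9M (×7.2) | 1.7055 | dim384/12L + DDP; val 0.88 lower than v1 |
| v3 | 67.13M (×2.37 vs v2) | ~245.8M (×6.7 vs v2) | 1.3027 | dim512/16L + single-GPU batched; val 0.40 lower than v2 |

Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30. Each rung decreases monotonically on the same 1M-token held-out set.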

Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts):

| prompt | v2 | v3 |
|---|---|---|
| Once upon a time | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big | …a little girl named Lily. She loved to play outside in the park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran |
| One day | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. The little girl was scared and ran away. | One day, a little girl named Lily went to the park with her mom. They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said, |
| The little | The little girl was so happy and she thanked the man for his help. She said goodbye and went home with a smile on her face. | The little girl was so excited. She wanted to try it out. She asked her mom if she could go outside and play. Her mom said yes, so the little girl went outside. |

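Conclusion: v2 (28.32M core / 37M tokens) can already write multi-step plots, but the plot beats are formulaic and the endings come quickly. From the same openings, v3 (67.13M core / 245.8M tokens) develops more specific plots with more internal causality (sees a dog → the dog barks → she is scared → she runs; wants to play on the swing → but is scared → her mom speaks up), with more coherent character motivation and turns, and higher story density. With best val 1.71→1.30 (0.40 lower) and samples moving from "multi-step plots" to "continuous narrative with motivation and turns", v3 is a clear, quantifiable improvement over v2.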

xserv verification

Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 docs/08; 179 tensors = 16 layers × 11 + embed + norm + lm_head), store in the registry, then load with xserv-cli and generate greedily:

$ xserv-cli ~/projects/tiny-models/v3-tinystories-dim512 --max-tokens 40
Model: qwen3, layers=16, hidden=512, heads=16/16 kv, vocab=50257
Loaded 179 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
       park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran
xserv> The little girl was so excited. She ran to the kitchen and grabbed a spoon. She started
       to stir the soup. She stirred and stirred until it was all mixed together. ...
xserv> One day, a little girl named Lily went to the park with her mom. They saw a big tree with
       a swing. Lily wanted to play on the swing, but she was scared. Her mom said,
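The export steps named above (2D transpose, BF16 conversion, tensor count) can be sketched in a few lines. This is an illustrative Python sketch with hypothetical helper names, not the T9 exporter; in particular, round-to-nearest-even for the F32→BF16 conversion is an assumption, and NaN handling is ignored.

```python
import struct

def transpose_2d(w):
    """[in, out] -> [out, in] for a weight stored as a list of rows."""
    return [list(col) for col in zip(*w)]

def f32_to_bf16_bits(x):
    """Top 16 bits of the float32 encoding, rounded to nearest even.

    Assumed rounding mode; NaN inputs are not handled specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF

def expected_tensor_count(n_layers, per_layer=11, extra=3):
    """n_layers x 11 per-layer tensors + embed + final norm + lm_head."""
    return n_layers * per_layer + extra

w_in_out = [[1, 2, 3], [4, 5, 6]]   # shape [in=2, out=3]
w_out_in = transpose_2d(w_in_out)   # shape [out=3, in=2]

print(w_out_in)
print(hex(f32_to_bf16_bits(1.0)), expected_tensor_count(16))
```

BF16 keeps the float32 exponent but only 8 bits of significand precision, so `1.0 + 2**-9` and `1.0` map to the same BF16 value. Small differences like this are what make near-tied logits occasionally flip under greedy decoding.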

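token-match (xserv BF16 vs xtrain's own greedy F32): 2 of the 3 prompts match token for token ("Once upon a time", "One day"); the third ("The little") diverges after "so excited." (xtrain continues "She wanted to try it out…", xserv continues "She ran to the kitchen…"). A tiny difference in a single logit flips the greedy choice and the sequences diverge from there, the same origin as the ~0.5% BF16 drift observed in v1/v2. The round trip still holds at v3 scale (67M core): most prompts match token for token, and a few diverge late because of BF16.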
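A token-match check of this kind reduces to finding the first position where two greedy outputs differ. The Python sketch below uses a hypothetical helper name and whitespace-split words as a stand-in for real gpt2 token ids, so the index it reports is a word index, not a token index.

```python
def first_divergence(a, b):
    """Index of the first differing element, or None if the sequences match.

    If one sequence is a strict prefix of the other, the divergence is
    reported at the length of the shorter one.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# Words stand in for token ids here (illustrative only).
xtrain_out = "The little girl was so excited. She wanted to try it out.".split()
xserv_out = "The little girl was so excited. She ran to the kitchen".split()

print(first_divergence(xtrain_out, xserv_out))   # 7
print(first_divergence(xtrain_out, xtrain_out))  # None
```

Once greedy decoding picks a different token, every later step is conditioned on a different prefix, so a single flipped choice is enough to explain a fully different continuation.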

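v4 proposal

v3's val curve decreases monotonically through the final step (no overfitting), so the model is still underfitting; a larger model or more tokens can go lower. Suggested v4:

  • Model: dim 640-768 / 20-24 layers / ffn 2560-3072 → core ≈ 130-200M (capacity ×2-3). Vocabulary unchanged → embed+lm_head ~77M at dim768.
  • Data/steps: raise training tokens from 245.8M to ~600M-1B (entering the multi-epoch regime on TinyStories, or mixing in a broader high-quality corpus per the data ladder), targeting val ~1.0-1.1.
  • Open levers (enable as needed):
    • KI-5 (DDP all-reduce not bucketed): if v4 goes back to multi-GPU, implement bucketed / overlapped all-reduce first; otherwise a single all-reduce over all parameters of a large model will eat into scaling again. v3 deliberately avoided this by staying on a single GPU.
    • KI-2/KI-3 (bf16 with fp32 master weights / activation recomputation): as the model grows, memory and compute pressure rise, and bf16 mixed precision + recomputation start to pay off noticeably (deferred at the tiny v0-v3 scale; rationale in docs/06).
    • KI-4 (large-vocab share): at dim512, embed/lm_head still take 51.46M / 118.59M ≈ 43%. Growing the core further dilutes the share, but for better efficiency consider a smaller / better-fitting tokenizer.
    • Data ladder: once core reaches ~100M+, TinyStories approaches the capacity ceiling; v4 is a good point to start broadening the corpus (TinyStories + some general high-quality data) and to evaluate the tokenizer at the same time.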

The ladder is already parameterized: v4 only needs changes to the --dim/--heads/--layers/--ffn/--steps flags; for multi-GPU, layer DDP on top (fix KI-5 first).
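For the KI-5 lever, the core of "bucketing" is grouping gradient tensors so that several medium-sized all-reduces replace one all-parameter all-reduce. The Python sketch below shows only that grouping step, under assumed names and a greedy size cap; it is a generic illustration, not xtrain's planned fix, and it performs no communication.

```python
def bucket_by_size(sizes, cap):
    """Greedily pack tensor sizes (in elements) into buckets of at most cap.

    Returns lists of tensor indices. A tensor larger than cap gets a
    bucket of its own.
    """
    buckets, current, used = [], [], 0
    for i, n in enumerate(sizes):
        if current and used + n > cap:
            buckets.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        buckets.append(current)
    return buckets

# Toy gradient sizes; real buckets would be sized in bytes or megabytes.
print(bucket_by_size([4, 4, 10, 2, 3], cap=8))  # [[0, 1], [2], [3, 4]]
```

With buckets in place, a bucket's all-reduce can start as soon as all of its gradients are ready, which is what allows communication to overlap with the rest of the backward pass.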