xtrain/docs/runs/04-v4-tinystories-dim768.md
Gahow Wang ff79fee3c5 docs: run v4 — TinyStories, dim768, val 1.17
Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep /
arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128
per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8
DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU /
v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run
validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32
batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table
to v0/v1/v2/v3/v4 and the next-rung proposal to v5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:37 +08:00


Scaling Run v4: TinyStories + dim768/18L + 8-GPU DDP fp32 (T11): Design Document

Goal

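v4 builds on v3 (dim512/16L, core 67.13M, ~245.8M training tokens, single-GPU batched) and keeps scaling along both the model and data axes, while moving training back to multiple GPUs. This time multi-GPU is not the v2-era DDP whose scaling was held down by KI-1; it is the right choice now that the T11 caching allocator has landed and 8-GPU scaling is near-linear: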

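  1. Scale the model: dim 512→768, layers 16→18, heads 16→24 (head_dim stays 32), bringing the transformer core to ~127M parameters (×1.9 capacity). The vocabulary is unchanged, so embed + lm_head adds a fixed ~77.19M at dim768 because of the gpt2 50257 vocab; it is reported separately.
  2. Scale the data: v3 trained on ~245.8M tokens (~0.53 epoch) and was still underfitting (val kept falling through the last step). v4 trains on 720.9M tokens (×2.9), still reusing the full TinyStories token-id stream cached in v1 (468M tokens), for ~1.54 epochs. This is the first run to pass 1 epoch and enter the multi-pass regime on TinyStories.
  3. 8-GPU DDP fp32 (T11, near-linear multi-GPU): the presumed root cause of KI-1 (weak DDP scaling), exposed in v2, was disproved step by step and the real causes were fixed in T10/T11. T10 fixed the launch-bound single-sequence forward; T11's per-device size-classed caching allocator then removed the per-op cudaMalloc serialization. 8-GPU throughput went 49K→461K tok/s (scaling 1.3×→5×) with all 8 GPUs at 95-99% util. v4 therefore returns to multi-GPU with confidence: steady state is ~144,650 tok/s throughout, and 720.9M tokens train in ~84 min.
  4. After training, store the model in the registry (~/projects/tiny-models/v4-tinystories-dim768/), export it to xserv format to verify that it can be served, and report the concrete improvement over v3 (val loss on the same held-out set plus side-by-side samples from the same prompts).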

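Engineering significance of this run: it validates the multi-GPU scaling of the T11 caching allocator at dim768 at a real scaling size (127M core / 720.9M tokens / 84 min). All 8 GPUs sustain ~145K tok/s at 95-99% util, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s). v4 is also the concrete trigger point for bf16 (KI-2): dim768 fp32 with per-rank batch 32 (global 256) OOMs in 32GB of GPU memory, forcing a drop to per-rank 16 (global 128). After bf16 was deferred throughout the tiny v0-v3 runs, this is the first hard constraint where fp32 does not fit.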

Data

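| | v3 | v4 |
|---|---|---|
| Source | TinyStories full train split (reusing the v1 cache) | same |
| Token count (corpus) | 468,260,367 | same |
| Tokens consumed in training | ~245.8M (30000 steps × 8192) | ~720.9M (22000 steps × 32768) |
| Epochs | ~0.53 | ~1.54 (first run to pass 1 epoch) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | data/tinystories-train.txt.u16.bin (u16, 936MB) | reused as-is |
| held-out val | last 1,000,000 tokens of the full stream | the same 1M tokens (identical held-out set to v0/v1/v2/v3, for a fair comparison) |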
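The token and epoch figures in the table follow from simple arithmetic. The sketch below reproduces them; the function names are illustrative and not part of xtrain.

```rust
/// Tokens consumed by a run: optimizer steps × tokens per step.
fn tokens_consumed(steps: u64, tokens_per_step: u64) -> u64 {
    steps * tokens_per_step
}

/// Number of passes over the corpus (fractional epochs).
fn epochs(consumed: u64, corpus_tokens: u64) -> f64 {
    consumed as f64 / corpus_tokens as f64
}
```

With the table's inputs, `tokens_consumed(22_000, 128 * 256)` gives 720,896,000 tokens, and dividing by the 468,260,367-token corpus gives roughly 1.54 epochs; the v3 inputs (30000 steps × 8192) give roughly 0.53.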

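Cache reuse: Corpus::load_cached (<corpus>.u16.bin) loads 467.26M train tokens at startup and keeps the last 1M for val. The held-out val set is still the last 1M tokens of the full stream (split_tail), the same held-out set as v0-v3, so val loss is directly comparable across v0-v4.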
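As a rough illustration of the tail split, here is a minimal sketch. It assumes the cache is a flat array of little-endian u16 token ids with no header (an assumption; the actual on-disk layout behind Corpus::load_cached is not described here), and the function names are hypothetical stand-ins rather than the xtrain API.

```rust
/// Decode raw bytes into u16 token ids.
/// Assumption: little-endian, no header; a trailing odd byte is ignored.
fn decode_u16_le(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Split off the last `val_len` tokens as the held-out set.
/// Returns (train, val); panics if the stream is shorter than `val_len`.
fn split_tail_sketch(tokens: &[u16], val_len: usize) -> (&[u16], &[u16]) {
    assert!(val_len <= tokens.len(), "val_len exceeds corpus length");
    tokens.split_at(tokens.len() - val_len)
}
```

Because the split depends only on the stream and `val_len`, every run that reuses the same cache with the same `val_len` sees the same held-out tokens, which is what makes the val numbers comparable across versions.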

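Data ladder: v4 is the first run to pass 1 epoch (~1.54). With the core at 127M and ~1.54 epochs it is still underfitting (val is still falling at the last step), which indicates the 127M core has not yet hit its capacity ceiling on the TinyStories corpus. The next rung (v5) is a suitable point either to start broadening the corpus (TinyStories plus some general high-quality text) or to keep squeezing TinyStories for more epochs, with the tokenizer (KI-4) evaluated at the same time.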

Architecture

v4 = a larger, isomorphic tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA). The forward graph has exactly the same structure as v0/v1/v2/v3; only the dims grow. There are no structural changes.

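| Dimension | v3 | v4 |
|---|---|---|
| dim (= heads·head_dim) | 512 | 768 |
| n_layers | 16 | 18 |
| n_heads | 16 | 24 |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 2048 | 2048 |
| vocab | 50257 | 50257 |
| core params (excluding embed + lm_head) | 67,127,296 (≈67.13M) | 127,432,704 (≈127.43M, ×1.90) |
| embed + lm_head (2×vocab×dim) | 51,463,168 (≈51.46M) | 77,194,752 (≈77.19M) |
| total params | 118,590,464 (≈118.59M) | 204,627,456 (≈204.63M) |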

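How core is measured: Config::core_params() = num_params() − 2·vocab·dim. With the gpt2 50257 vocab, embedding + lm_head take a fixed ~77.19M at dim768. These two tables are a function of vocabulary size, not of model capacity, so the ladder is tracked by core: v4 core is 127.43M. Note that embed/lm_head still make up ~38% (77.19M) of v4's 204.63M total, slightly down from v3's 43% (the share is diluted as dim grows), but this is still the gpt2 large-vocab share problem (see docs/known-issues.md KI-4).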
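The accounting above can be checked with a few lines. This is a standalone sketch using the totals from the table as inputs; the free functions are illustrative and are not the actual Config methods.

```rust
/// Parameters in the input embedding plus the separate lm_head: 2 · vocab · dim.
fn embed_lm_head_params(vocab: u64, dim: u64) -> u64 {
    2 * vocab * dim
}

/// Core parameters: total minus embed + lm_head.
fn core_params(total: u64, vocab: u64, dim: u64) -> u64 {
    total - embed_lm_head_params(vocab, dim)
}

/// Fraction of the total taken by embed + lm_head, in [0, 1].
fn vocab_share(total: u64, vocab: u64, dim: u64) -> f64 {
    embed_lm_head_params(vocab, dim) as f64 / total as f64
}
```

For v4, `core_params(204_627_456, 50257, 768)` returns 127,432,704 and `vocab_share` is about 0.377; for v3 (total 118,590,464, dim 512) the share is about 0.434.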

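Architecture change versus v3: pure scale-up (dim 512→768 / layers 16→18 / heads 16→24; head_dim and ffn unchanged), with no structural changes. The ladder is already parameterized: v4 only changes the --dim/--heads/--layers/--ffn/--steps flags and does not touch model code.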

Trainer: 8-GPU DDP fp32 (backed by the T11 caching allocator)

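v2 used DDP (T8) on 4 GPUs; because global_batch=32 was too small, scaling was held down by KI-1 (all-reduce share too high). The T10/T11 investigation falsified the premises of KI-1 one layer at a time and fixed the actual causes: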

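  • T10 (batched forward): the real reason single-GPU training was slow in the v2 era was not communication but that the single-sequence forward launched every op separately (util 0-15%). Flattened linears + fused batched causal SDPA took single-GPU throughput 1653→25627 tok/s.
  • T11 (caching allocator): profiling falsified the "bucketed all-reduce" hypothesis (all-reduce accounts for only 7%); the real cause was per-op cudaMalloc serialization. A per-device size-classed caching allocator (buffers returned on Drop, thread-safe) took single-GPU throughput 40K→93K tok/s (2.3×) and 8-GPU throughput 49K→461K tok/s (9.4×), scaling 1.3×→5×, with all 8 GPUs at 95-99% util.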
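To make the T11 idea concrete, here is a minimal CPU-only sketch of a size-classed caching pool whose buffers return themselves on Drop. It is not the xtrain implementation: `Vec<u8>` stands in for a device allocation, the power-of-two size classes and the 256-byte minimum are assumptions, and all names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// One pool per device: free buffers bucketed by size class.
#[derive(Default)]
struct Pool {
    free: Mutex<HashMap<usize, Vec<Vec<u8>>>>,
    fresh_allocs: AtomicUsize, // counts fall-throughs to a "real" allocation
}

fn new_pool() -> Arc<Pool> {
    Arc::new(Pool::default())
}

/// Round a request up to its size class (assumed: next power of two, min 256).
fn size_class(bytes: usize) -> usize {
    bytes.max(256).next_power_of_two()
}

/// A buffer that hands itself back to its pool when dropped.
struct Buf {
    data: Option<Vec<u8>>,
    class: usize,
    pool: Arc<Pool>,
}

fn alloc(pool: &Arc<Pool>, bytes: usize) -> Buf {
    let class = size_class(bytes);
    // Try the free list first; the lock is released at the end of this statement.
    let reused = pool.free.lock().unwrap().get_mut(&class).and_then(|list| list.pop());
    let data = reused.unwrap_or_else(|| {
        // Cache miss: this is where a real pool would call the device allocator.
        pool.fresh_allocs.fetch_add(1, Ordering::Relaxed);
        vec![0u8; class]
    });
    Buf { data: Some(data), class, pool: Arc::clone(pool) }
}

impl Drop for Buf {
    fn drop(&mut self) {
        if let Some(buf) = self.data.take() {
            self.pool.free.lock().unwrap().entry(self.class).or_default().push(buf);
        }
    }
}

fn fresh_allocs(pool: &Pool) -> usize {
    pool.fresh_allocs.load(Ordering::Relaxed)
}
```

In this sketch, a 1000-byte buffer that is dropped and followed by a 600-byte request reuses the same 1024-byte block, so only one fresh allocation happens. After the first few steps of a training loop with stable shapes, nearly every request would be served from the free lists.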

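v4 therefore returns to 8-GPU DDP fp32 with confidence: thread-per-GPU, all-reduce averages the gradients on device, and each rank then runs its local GpuAdamW step, so parameters stay bit-identical across ranks. Steady state is ~144,650 tok/s throughout and 720.9M tokens train in ~84 min, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s).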
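The reason parameters stay bit-identical is that every rank applies the same deterministic update to the same averaged gradient. The CPU-only sketch below illustrates this with plain vectors standing in for device buffers. The update rule is plain SGD purely to keep the example short (the run itself uses AdamW), and the function names are hypothetical.

```rust
/// Average per-rank gradients element-wise and write the mean back to every rank.
/// CPU stand-in for an all-reduce with mean reduction.
fn all_reduce_mean(grads: &mut [Vec<f32>]) {
    let world = grads.len();
    if world == 0 {
        return;
    }
    let n = grads[0].len();
    let mut mean = vec![0.0f32; n];
    for g in grads.iter() {
        assert_eq!(g.len(), n, "all ranks must hold gradients of the same shape");
        for (m, x) in mean.iter_mut().zip(g) {
            *m += *x;
        }
    }
    for m in mean.iter_mut() {
        *m /= world as f32;
    }
    for g in grads.iter_mut() {
        g.copy_from_slice(&mean);
    }
}

/// Local optimizer step on one rank (SGD here for brevity, not AdamW).
fn local_step(params: &mut [f32], grad: &[f32], lr: f32) {
    for (p, g) in params.iter_mut().zip(grad) {
        *p -= lr * *g;
    }
}
```

Starting from identical parameters, two ranks that call `all_reduce_mean` and then `local_step` with the same learning rate end up with parameters whose `to_bits()` values match exactly.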

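⚠️ Batch constraint (bf16 trigger point): dim768 fp32 with per-rank batch 32 (global 256) OOMs in a single GPU's 32GB of memory, forcing a drop to per-rank 16 (global 128). After bf16 (KI-2) was deferred throughout the tiny v0-v3 runs, this is the first hard constraint where fp32 does not fit. bf16 (halving activations) can recover the batch-256 sweet spot. The trigger point has been backfilled into docs/known-issues.md under KI-2.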

Hyperparameters

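| Item | Value | Notes |
|---|---|---|
| optimizer | hand-written AdamW (step on GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr 6e-4 → min_lr 6e-5 (same as v1/v2/v3) |
| warmup | 1100 steps (steps/20) | lr peaks at 6.00e-4 at step 1100 |
| grad clip | global-norm 1.0 | gnorm ~0.20-0.21 throughout, stable |
| steps | 22000 | ~84 min on 8 GPUs |
| global batch | 128 (per-rank 16 × world 8) | 8-GPU DDP; per-rank 32 OOMs (see above) |
| seq_len | 256 | same as v2/v3 |
| tokens/step | 128×256 = 32768 | total training tokens ≈ 720.9M (~1.54 epochs) |
| world size | 8 (RTX 5090, sm_120) | near-linear multi-GPU after T11 fixed KI-5 |
| precision | f32 (training) | converted to BF16 on xserv export (see T9) |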
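For readers who want the schedule and clipping rows spelled out, here is a minimal sketch of a linear-warmup plus cosine-decay learning rate and of global-norm gradient clipping. It is not taken from xtrain: the step indexing (lr equals max_lr exactly at `step == warmup`) is an assumption chosen to match the "peaks at step 1100" note, and the function names are illustrative.

```rust
use std::f64::consts::PI;

/// Linear warmup to `max_lr` over `warmup` steps, then cosine decay to
/// `min_lr` at `total` steps. Assumes total > warmup > 0.
fn lr_at(step: u64, warmup: u64, total: u64, max_lr: f64, min_lr: f64) -> f64 {
    assert!(warmup > 0 && total > warmup);
    if step < warmup {
        return max_lr * step as f64 / warmup as f64;
    }
    let progress = ((step - warmup) as f64 / (total - warmup) as f64).min(1.0);
    min_lr + 0.5 * (max_lr - min_lr) * (1.0 + (PI * progress).cos())
}

/// Scale all gradients in place so their global L2 norm is at most `max_norm`.
/// Returns the norm measured before clipping.
fn clip_global_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    let sum_sq: f64 = grads
        .iter()
        .flatten()
        .map(|g| (*g as f64) * (*g as f64))
        .sum();
    let norm = sum_sq.sqrt() as f32;
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut().flatten() {
            *g *= scale;
        }
    }
    norm
}
```

With the table's values, `lr_at(1100, 1100, 22000, 6e-4, 6e-5)` sits at the 6e-4 peak and `lr_at(22000, ...)` ends at 6e-5. Since the reported gnorm stays around 0.20-0.21, a 1.0 threshold would leave the gradients unscaled in this sketch.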

Compute: dash5, 8× RTX 5090. ~144,650 tok/s throughout (~130K at startup, steady ~145K from step 50); wall-clock ≈ 84 minutes.
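As a sanity check on these figures, the sketch below converts a token budget and a steady-state throughput into minutes. The helper names are illustrative.

```rust
/// Minutes needed to process `tokens` at a constant `tok_per_s`.
fn steady_state_minutes(tokens: u64, tok_per_s: f64) -> f64 {
    tokens as f64 / tok_per_s / 60.0
}

/// Throughput ratio between two runs.
fn speedup(new_tok_per_s: f64, old_tok_per_s: f64) -> f64 {
    new_tok_per_s / old_tok_per_s
}
```

At 144,650 tok/s, 720,896,000 tokens take about 83 minutes of pure steady-state time, which is consistent with the ~84 minute wall-clock once the slower startup is included. The ratio 144,650 / 3,600 is about 40, matching the ~40× figure quoted against the v2-era run.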

Results

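  • train loss: start 11.0689 → end ~1.14 (last batch 1.1432; smooth decline throughout)
  • best / final val loss (held-out 1M tokens, step 21999): 1.1690 (close to the ~1.0-1.1 target)
  • val loss curve (sampled every ~2000 steps; monotonically decreasing, still falling at the last step, no overfitting):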
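| step | 499 | 1999 | 3999 | 5999 | 7999 | 9999 | 11999 | 13999 | 15999 | 17999 | 19999 | 21999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| val | 2.5217 | 1.6493 | 1.4875 | 1.4056 | 1.3571 | 1.3161 | 1.2697 | 1.2414 | 1.2177 | 1.1978 | 1.1762 | 1.1690 |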

val falls all the way to the last step with no rebound = still underfitting; more steps/data or a larger model can push it lower (the v5 levers).
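The "monotonically decreasing, still falling" reading of the curve can be checked mechanically. The sketch below hard-codes the samples reported above; the helper names are illustrative.

```rust
/// (step, val loss) samples copied from the curve above.
const VAL_CURVE: [(u32, f64); 12] = [
    (499, 2.5217), (1999, 1.6493), (3999, 1.4875), (5999, 1.4056),
    (7999, 1.3571), (9999, 1.3161), (11999, 1.2697), (13999, 1.2414),
    (15999, 1.2177), (17999, 1.1978), (19999, 1.1762), (21999, 1.1690),
];

/// True if every sample is strictly lower than the one before it.
fn strictly_decreasing(curve: &[(u32, f64)]) -> bool {
    curve.windows(2).all(|w| w[1].1 < w[0].1)
}

/// Loss drop between the last two samples, if there are at least two.
fn last_drop(curve: &[(u32, f64)]) -> Option<f64> {
    match curve {
        [.., prev, last] => Some(prev.1 - last.1),
        _ => None,
    }
}
```

On these samples `strictly_decreasing` holds, and the final interval (step 19999 to 21999) still lowers val by about 0.007.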

Samples (greedy, sampled directly from xtrain, same prompts):

[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
                     outside in the sunshine. One day, she saw a big, scary dog. The dog barked
                     loudly and Lily got scared. She
[One day]          → One day, a little girl named Lily went to the park with her mom. She saw a
                     big tree with a swing. Lily wanted to play on the swing, but she was too
                     small. She asked her
[The little]       → The little girl was so happy that she had found the perfect place to hide.
                     She stayed there for a long time, until it was time to go home. She said
                     goodbye to the tree and ran back home

Improvement over v3

best val loss (the best held-out 1M-token value reported by each version's own training run; same held-out set):

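| Model | core params | Training tokens | best val loss | Notes |
|---|---|---|---|---|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full dataset + dim256/8L, single GPU |
| v2 | 28.32M | ~36.9M | 1.7055 | dim384/12L + 4-GPU DDP |
| v3 | 67.13M | ~245.8M (~0.53 ep) | 1.3027 | dim512/16L + single-GPU batched; val 0.40 lower than v2 |
| v4 | 127.43M (×1.90 vs v3) | ~720.9M (×2.9 vs v3, ~1.54 ep) | 1.1690 | dim768/18L + 8-GPU DDP fp32; val 0.13 lower than v3 |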

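Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30 / v4 1.17. Every rung decreases monotonically on the same 1M-token held-out set. Note that the v3→v4 val drop (0.13) is smaller than v2→v3 (0.40): diminishing marginal returns are expected (the lower the loss, the harder it is to reduce further). v4 is also still underfitting (val is still falling at the last step), which indicates the 127M core has not yet hit its capacity ceiling on TinyStories; more tokens or a broader corpus still have room to help.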
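The per-rung drops quoted above can be recomputed from the best val losses. A small sketch, with illustrative helper names:

```rust
/// Drop in best val loss from each rung to the next.
fn rung_drops(best_val: &[f64]) -> Vec<f64> {
    best_val.windows(2).map(|w| w[0] - w[1]).collect()
}

/// True if each drop is smaller than the previous one (diminishing returns).
fn diminishing(drops: &[f64]) -> bool {
    drops.windows(2).all(|w| w[1] < w[0])
}
```

Feeding in `[3.8050, 2.5847, 1.7055, 1.3027, 1.1690]` yields drops of roughly 1.22, 0.88, 0.40, and 0.13, each smaller than the one before.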

Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts):

| prompt | v3 | v4 |
|---|---|---|
| Once upon a time | …a little girl named Lily. She loved to play outside in the park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She |
| One day | One day, a little girl named Lily went to the park with her mom. They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said, | One day, a little girl named Lily went to the park with her mom. She saw a big tree with a swing. Lily wanted to play on the swing, but she was too small. She asked her |
| The little | The little girl was so excited. She ran to the kitchen and grabbed a spoon. She started to stir the soup. She stirred and stirred until it was all mixed together. | The little girl was so happy that she had found the perfect place to hide. She stayed there for a long time, until it was time to go home. She said goodbye to the tree and ran back home |

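Conclusion: v3 (67M core / 245.8M tokens) can already write continuous narratives with motivation and turns. From the same openings, v4 (127M core / 720.9M tokens / ~1.54 epochs) produces more concrete plots and finer-grained motivation ("too small" instead of a generic "scared"; a complete arc of "perfect place to hide → stayed → said goodbye → ran back home") and wraps up more naturally. With best val 1.30→1.17 (0.13 lower) and samples moving from "narrative with motivation" to "short stories with more concrete detail and more complete structure", v4 is a clear, quantifiable improvement over v3.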

xserv validation

Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16; see T9 docs/08). 201 tensors = 18 layers × 11 + embed + norm + lm_head. After storing it in the registry, load it with xserv-cli and generate greedily:

$ xserv-cli ~/projects/tiny-models/v4-tinystories-dim768 --max-tokens 40
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
       sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She
xserv> One day, a little girl named Lily went to the park with her mom. She saw a big tree with
       a swing. Lily wanted to play on the swing, but she was too small. She asked her
xserv> The little girl was so happy that she had found the perfect place to hide. She stayed there
       for a long time, until it was time to go home. She said goodbye to the tree and ran back home
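The export steps named above (2D transpose, BF16 conversion, tensor count) can be sketched as follows. This is not the T9 exporter: row-major storage and round-to-nearest-even are assumptions (the exporter's rounding mode is not stated here), and the function names are hypothetical.

```rust
/// Transpose a row-major [rows, cols] matrix into row-major [cols, rows].
/// For a weight stored as [in, out], pass rows = in and cols = out.
fn transpose(src: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0.0f32; src.len()];
    for r in 0..rows {
        for c in 0..cols {
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    dst
}

/// Convert f32 to BF16 bits (the upper 16 bits of the f32 pattern).
/// Assumption: round-to-nearest-even; NaNs are kept as quiet NaNs.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        return ((bits >> 16) as u16) | 0x0040;
    }
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding_bias) >> 16) as u16
}

/// Widen BF16 bits back to f32 by zero-filling the low 16 bits.
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Tensor count for the export: 11 per layer plus embed, final norm, lm_head.
fn expected_tensor_count(layers: usize) -> usize {
    layers * 11 + 3
}
```

`expected_tensor_count(18)` gives the 201 tensors reported by xserv-cli. BF16 keeps only 8 bits of significand precision, which is why the token-level comparison against the F32 path is a meaningful check.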

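token-match (xserv BF16 vs. xtrain's own greedy F32): all 3 prompts match token for token (zero divergence within 40 tokens), a tighter closed loop than v3 (2/3 matched). At v4 scale (127M core) and 40-token length, BF16 drift still does not flip any greedy choice, so the loop closes.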
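A token-match check of this kind reduces to finding the first position where two greedy token sequences differ. A minimal sketch, assuming token ids are available as `u32` slices from both paths (the function name is illustrative):

```rust
/// Index of the first position where the two sequences differ.
/// Returns None when they are identical; a length mismatch counts as a
/// divergence at the end of the shorter sequence.
fn first_divergence(a: &[u32], b: &[u32]) -> Option<usize> {
    let common = a.len().min(b.len());
    (0..common)
        .find(|&i| a[i] != b[i])
        .or(if a.len() != b.len() { Some(common) } else { None })
}
```

A prompt counts as fully matched when `first_divergence` returns `None` for its two 40-token continuations.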

v5 proposal

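v4's val curve falls monotonically all the way to the last step (no overfitting) = still underfitting; a larger model, more tokens, or a broader corpus can still lower it. Suggestions for v5: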

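  • bf16 (KI-2, now triggered): v4 is the clear trigger point for bf16, since dim768 fp32 OOMs at per-rank batch 32. Adopt bf16 mixed precision (fp32 master) first; halving activations recovers the batch-256 sweet spot, which should raise throughput further and make convergence more stable. This is the first lever v5 should pull.
  • Data: v4 is only at ~1.54 epochs and still underfitting, so more TinyStories tokens (a few more epochs) will most likely still lower val. With the core already at 127M, this is also a suitable point on the data ladder to start broadening the corpus (TinyStories plus some general high-quality text). Both are worthwhile: first use multi-epoch TinyStories to test whether data is the ceiling, then decide whether to change the corpus.
  • Open levers (enable as needed):
    • process-per-GPU (higher 8-GPU linearity): v4's ~145K tok/s on 8 GPUs is already near-linear, but ~7% of all-reduce + PCIe overhead remains. If v5 wants to push 8-GPU scaling closer to linear, it can move from single-process thread-per-GPU to process-per-GPU.
    • KI-4 (large-vocab share): at dim768, embed/lm_head still take 77.19M / 204.63M ≈ 38%. Scaling the core further dilutes the share, but for better efficiency a smaller or better-fitting tokenizer is worth considering.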

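The ladder is already parameterized: v5 only needs to change the --dim/--heads/--layers/--ffn/--steps flags. Once bf16 lands, the fp32 and bf16 paths coexist (the pool is already dtype-agnostic), so it layers on cleanly (see the T12 backlog).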