Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep / arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128 per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8 DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU / v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32 batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table to v0/v1/v2/v3/v4 and the next-rung proposal to v5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scaling Run v4: TinyStories + dim768/18L + 8-GPU DDP fp32 (T11): Design Document
Goal
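Building on v3 (dim512/16L, core 67.13M, ~245.8M training tokens, single-GPU batched), v4 keeps scaling along both the model and data axes and returns training to multi-GPU. This time multi-GPU is not the v2-era DDP whose scaling was held back by KI-1, but the right choice now that the T11 caching allocator has landed and 8-GPU scaling is near-linear:
- Model scale-up: dim 512→768, layers 16→18, heads 16→24 (head_dim stays 32), bringing the transformer core to ~127M parameters (capacity ×1.9). The vocabulary is unchanged, so embed + lm_head add a fixed ~77.19M at dim768 because of the gpt2 50257-token vocab; that figure is reported separately.
- Data scale-up: v3 trained on ~245.8M tokens (~0.53 epoch, still underfit, with val falling all the way to the last step). v4 trains on 720.9M tokens (×2.9), still reusing the full TinyStories token-id stream cached in v1 (468M tokens), for ~1.54 epochs. This is the first run to pass 1 epoch and enter multi-pass territory on TinyStories.
- 8-GPU DDP fp32 (T11, near-linear multi-GPU): the presumed causes of KI-1 (weak DDP scaling), exposed by v2, were disproved one layer at a time and the real ones fixed in T10/T11. T10 fixed the launch-bound single-sequence forward, and T11's per-device size-classed caching allocator then removed the per-op cudaMalloc serialization, taking 8 GPUs from 49K to 461K tok/s (scaling 1.3×→5×, all 8 GPUs at 95-99% util). v4 can therefore return to multi-GPU with confidence: steady state is ~144,650 tok/s throughout, and the 720.9M tokens train in ~84 min.
- After training, save to the registry (~/projects/tiny-models/v4-tinystories-dim768/) and export to xserv format to verify the model is servable, then report the concrete improvement over v3 (val loss on the same held-out set plus side-by-side samples from the same prompts).
Engineering significance of this version: at a real scaling size (127M core / 720.9M tokens / 84 min) it validates the multi-GPU scaling of the T11 caching allocator at dim768, with ~145K tok/s on 8 GPUs throughout at 95-99% util, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s). v4 is also the concrete trigger for bf16 (KI-2): dim768 fp32 OOMs at per-rank batch 32 (global 256) in 32GB of GPU memory and had to drop to per-rank 16 (global 128). After bf16 was deferred throughout the tiny-scale v0–v3 runs, this is the first hard constraint where fp32 does not fit.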
Data
| Item | v3 | v4 |
|---|---|---|
| Source | Full TinyStories train (reusing the v1 cache) | Same |
| Tokens (corpus) | 468,260,367 | Same |
| Training tokens consumed | ~245.8M (30000 steps × 8192) | ~720.9M (22000 steps × 32768) |
| Epoch fraction | ~0.53 | ~1.54 (first time past 1 epoch) |
| tokenizer | gpt2 BPE (vocab 50257) | Same |
| Cache | data/tinystories-train.txt.u16.bin (u16, 936MB) | Reused directly |
| held-out val | Last 1,000,000 tokens of the full corpus | Same 1M tokens (exactly the held-out set used by v0/v1/v2/v3, for a fair comparison) |
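Cache reuse: Corpus::load_cached reads <corpus>.u16.bin and loads 467.26M train tokens at startup (the last 1M are held out for val).
The held-out val set is still the last 1M tokens of the full corpus (split_tail), the same set as v0–v3, so val loss is directly comparable across v0–v4.
Data ladder: v4 is the first run past 1 epoch (~1.54). With a 127M core and ~1.54 epochs it is still underfit (val is still falling at the last step), which suggests the 127M core has not yet reached its capacity ceiling on TinyStories. The next rung (v5) is the right point either to start broadening the corpus (TinyStories plus some general high-quality text) or to keep squeezing more TinyStories epochs, while evaluating the tokenizer (KI-4) in parallel.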
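The token and epoch figures in the data table can be re-derived from the step counts. The sketch below is only an arithmetic check: the variable and function names are illustrative, and it assumes the epoch fraction is measured against the train split (corpus minus the 1M-token val tail), which is the reading that reproduces both rounded figures.

```python
# Values copied from the data table; names are illustrative, not xtrain APIs.
CORPUS_TOKENS = 468_260_367               # full TinyStories token-id stream
VAL_TOKENS = 1_000_000                    # tail held out for validation
TRAIN_TOKENS = CORPUS_TOKENS - VAL_TOKENS # 467,260,367 train tokens

def consumed(steps: int, tokens_per_step: int) -> int:
    """Tokens seen by the trainer over a run."""
    return steps * tokens_per_step

v3_tokens = consumed(30_000, 8_192)
v4_tokens = consumed(22_000, 32_768)

v3_epochs = v3_tokens / TRAIN_TOKENS      # assumption: epochs over the train split
v4_epochs = v4_tokens / TRAIN_TOKENS
cache_bytes = CORPUS_TOKENS * 2           # u16 cache: 2 bytes per token id

print(f"v3: {v3_tokens / 1e6:.1f}M tokens, {v3_epochs:.2f} epochs")
print(f"v4: {v4_tokens / 1e6:.1f}M tokens, {v4_epochs:.2f} epochs")
print(f"v4/v3 token ratio: {v4_tokens / v3_tokens:.1f}x")
print(f"cache size: {cache_bytes / 1e6:.1f} MB")
```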
Architecture
v4 = a larger, isomorphic tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + untied lm_head, MHA). The forward graph has exactly the same structure as v0/v1/v2/v3; only the dims grow. No structural changes.
| Dimension | v3 | v4 |
|---|---|---|
| dim (= heads·head_dim) | 512 | 768 |
| n_layers | 16 | 18 |
| n_heads | 16 | 24 |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 2048 | 2048 |
| vocab | 50257 | 50257 |
| core params (excluding embed + lm_head) | 67,127,296 (≈67.13M) | 127,432,704 (≈127.43M, ×1.90) |
| embed + lm_head (2×vocab×dim) | 51,463,168 (≈51.46M) | 77,194,752 (≈77.19M) |
| Total params | 118,590,464 (≈118.59M) | 204,627,456 (≈204.63M) |
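How core is measured: Config::core_params() = num_params() − 2·vocab·dim. With the gpt2 50257 vocab at dim768, embedding + lm_head take a fixed ~77.19M. These two tables are a function of vocabulary size, not of model capacity, so the ladder is measured by core (v4 core 127.43M). Note: embed/lm_head still make up ~38% (77.19M) of v4's 204.63M total, slightly down from v3's 43% (the share dilutes as dim grows), but this is still the gpt2 large-vocab share problem (see docs/known-issues.md KI-4).
Architecture changes versus v3: pure scale-up (dim 512→768 / layers 16→18 / heads 16→24, head_dim and ffn unchanged), no structural changes.
The ladder is parameterized, so v4 only changes the --dim/--heads/--layers/--ffn/--steps flags and does not touch model code.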
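The core/table split described above is simple enough to check by hand. The following sketch applies the stated rule (core = total − 2·vocab·dim) to the totals from the table; `split_params` is a hypothetical helper written for this check, not the actual `Config::core_params()` implementation.

```python
VOCAB = 50_257  # gpt2 BPE vocabulary size

def split_params(total: int, dim: int) -> tuple[int, int]:
    """Return (core, embed_plus_lm_head) for an untied embed/lm_head pair."""
    tables = 2 * VOCAB * dim          # embedding table + lm_head, vocab x dim each
    return total - tables, tables

v3_core, v3_tables = split_params(118_590_464, 512)
v4_core, v4_tables = split_params(204_627_456, 768)

v3_share = v3_tables / 118_590_464    # fraction of all params in embed + lm_head
v4_share = v4_tables / 204_627_456

print(f"v3 core {v3_core:,}  tables {v3_tables:,}  share {v3_share:.1%}")
print(f"v4 core {v4_core:,}  tables {v4_tables:,}  share {v4_share:.1%}")
print(f"core growth v3 -> v4: x{v4_core / v3_core:.2f}")
```

The share falls as dim grows because the core scales roughly with dim² per layer while the two vocabulary tables scale only linearly with dim.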
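Trainer: 8-GPU DDP fp32 (backed by the T11 caching allocator)
v2 used DDP (T8) on 4 GPUs, and its scaling was held back by KI-1 (all-reduce share too high) because global_batch=32 was too small. The T10/T11 investigation then disproved the premises of KI-1 one layer at a time and fixed the real causes:
- T10 (batched forward): the real reason single-GPU training was slow in the v2 era was not communication but the single-sequence forward launching every op separately (util 0-15%). Flattened linears + fused batched causal SDPA took a single GPU from 1653 to 25627 tok/s.
- T11 (caching allocator): profiling ruled out "bucketed all-reduce" as the fix (all-reduce is only 7% of the time); the real cause was per-op cudaMalloc serialization. A per-device size-classed caching allocator (blocks returned on Drop, thread-safe) took a single GPU from 40K to 93K tok/s (2.3×) and 8 GPUs from 49K to 461K tok/s (9.4×, scaling 1.3×→5×, all 8 GPUs at 95-99% util).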
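To make the T11 idea concrete, here is a minimal toy model of a size-classed caching allocator. It is a sketch under stated assumptions, not the xtrain implementation: the power-of-two size classes, the 256-byte minimum, and the class and method names are all invented for illustration, and thread safety and Drop-based return are not modelled. The `raw_allocs` counter stands in for calls to the slow driver allocation.

```python
from collections import defaultdict

class CachingAllocator:
    """Toy size-classed cache: freed blocks are kept per class and reused."""

    def __init__(self) -> None:
        self.free_lists = defaultdict(list)  # size class -> cached block ids
        self.raw_allocs = 0                  # stand-in for driver malloc calls
        self._next_id = 0

    @staticmethod
    def size_class(nbytes: int) -> int:
        """Round up to a power of two, minimum 256 bytes (assumed policy)."""
        size = 256
        while size < nbytes:
            size *= 2
        return size

    def alloc(self, nbytes: int) -> tuple[int, int]:
        cls = self.size_class(nbytes)
        if self.free_lists[cls]:
            return cls, self.free_lists[cls].pop()  # cache hit: no raw call
        self.raw_allocs += 1                         # cache miss: slow path
        self._next_id += 1
        return cls, self._next_id

    def free(self, block: tuple[int, int]) -> None:
        cls, block_id = block
        self.free_lists[cls].append(block_id)       # back to the cache

allocator = CachingAllocator()
for _ in range(1000):                    # 1000 simulated training steps
    big = allocator.alloc(3_000_000)     # e.g. an activation buffer
    small = allocator.alloc(700)         # e.g. a small scratch buffer
    allocator.free(big)
    allocator.free(small)

print("raw allocations:", allocator.raw_allocs)
```

Because every step requests the same shapes, only the first step pays for raw allocations; all later requests are served from the per-class free lists, which is the effect the T11 bullet describes.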
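v4 therefore returns to 8-GPU DDP fp32 with confidence: thread-per-GPU, all-reduce to average the device gradients, then a local GpuAdamW step on each rank, with parameters bit-identical across ranks. Steady state is ~144,650 tok/s throughout and the 720.9M tokens train in ~84 min, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s).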
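The reason parameters stay bit-identical is that every rank applies the same deterministic update to the same averaged gradient. A minimal pure-Python simulation of that loop follows. It is a toy, not the GpuAdamW code: `local_grad` is a made-up gradient function, the β and eps values are commonly used defaults assumed here (the document only says xtrain-optim defaults are used), and wd=0.1 and lr=6e-4 are taken from the hyperparameters.

```python
WORLD = 8

def allreduce_mean(per_rank_grads):
    """Average gradients element-wise; every rank receives the same result."""
    world = len(per_rank_grads)
    return [sum(col) / world for col in zip(*per_rank_grads)]

def local_grad(params, rank):
    """Hypothetical per-rank gradient: each rank sees a different data shard."""
    return [p - 0.1 * (rank + 1) for p in params]

def adamw_step(state, grads, t, lr=6e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """Standard AdamW update in place; b1/b2/eps are assumed values."""
    for i, g in enumerate(grads):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = state["m"][i] / (1 - b1 ** t)        # bias correction
        v_hat = state["v"][i] / (1 - b2 ** t)
        update = m_hat / (v_hat ** 0.5 + eps) + wd * state["p"][i]
        state["p"][i] -= lr * update                 # decoupled weight decay

# Every rank starts from the same parameters and optimizer state.
ranks = [{"p": [0.5, -0.3, 1.2], "m": [0.0] * 3, "v": [0.0] * 3}
         for _ in range(WORLD)]

for t in range(1, 6):
    grads = [local_grad(r["p"], rank) for rank, r in enumerate(ranks)]
    avg = allreduce_mean(grads)          # identical on all ranks
    for r in ranks:
        adamw_step(r, avg, t)            # local step, no parameter broadcast

identical = all(r["p"] == ranks[0]["p"] for r in ranks)
print("parameters identical across ranks:", identical)
```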
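⚠️ Batch constraint (bf16 trigger): dim768 fp32 OOMs at per-rank batch 32 (global 256) in 32GB of memory per GPU and had to drop to per-rank 16 (global 128). After bf16 (KI-2) was deferred throughout the tiny-scale v0–v3 runs, this is the first hard constraint where fp32 does not fit. bf16 (halving activations) can recover the batch-256 sweet spot. The KI-2 trigger in docs/known-issues.md has been backfilled.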
Hyperparameters
| Item | Value | Notes |
|---|---|---|
| optimizer | Hand-written AdamW (step on the GPU) | wd=0.1, β/eps use the xtrain-optim defaults |
| LR schedule | Linear warmup → cosine decay | max_lr 6e-4 → min_lr 6e-5 (same as v1/v2/v3) |
| warmup | 1100 steps (steps/20; lr peaks at 6.00e-4 at step 1100) | |
| grad clip | global-norm 1.0 | gnorm ~0.20–0.21 throughout, stable |
| steps | 22000 | ~84 min on 8 GPUs |
| global batch | 128 (per-rank 16 × world 8) | 8-GPU DDP; per-rank 32 OOMs (see above) |
| seq_len | 256 | Same as v2/v3 |
| tokens/step | 128×256 = 32768 | Total training tokens ≈ 720.9M (~1.54 epochs) |
| world size | 8 (RTX 5090, sm_120) | Near-linear multi-GPU after T11 fixed KI-5 |
| Precision | f32 (training) | Converted to BF16 when exporting to xserv (see T9) |
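The LR schedule row can be sketched as a small function. This is one common formulation of linear warmup followed by cosine decay, using the constants from the table; the exact step indexing and boundary handling in xtrain are not shown in this document, so treat those details as assumptions.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP, TOTAL = 1_100, 22_000     # warmup = steps / 20

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR (assumed form)."""
    if step < WARMUP:
        return MAX_LR * step / WARMUP
    progress = min(1.0, (step - WARMUP) / (TOTAL - WARMUP))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0 over decay
    return MIN_LR + (MAX_LR - MIN_LR) * cosine

for step in (0, 550, 1_100, 11_550, 22_000):
    print(f"step {step:>6}: lr = {lr_at(step):.3e}")
```

With this form the peak of 6.00e-4 lands exactly at step 1100, matching the warmup row, and the final step reaches min_lr 6e-5.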
Compute: dash5, 8× RTX 5090, ~144,650 tok/s throughout (~130K at startup, steady state ~145K from step 50), wall-clock ≈ 84 minutes.
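As a sanity check on the compute line, the wall-clock time and the speedup over v2 follow from the reported throughputs. The sketch assumes constant steady-state throughput, so it comes out about a minute under the reported ~84 min; the remainder is presumably startup ramp and other overhead not modelled here.

```python
TOKENS = 22_000 * 32_768      # total training tokens (steps x tokens/step)
V4_TOK_S = 144_650            # v4 steady-state throughput, 8 GPUs
V2_TOK_S = 3_600              # v2-era 4-GPU DDP throughput

minutes = TOKENS / V4_TOK_S / 60   # pure steady-state estimate
speedup = V4_TOK_S / V2_TOK_S

print(f"steady-state wall-clock: {minutes:.1f} min")
print(f"throughput vs v2: x{speedup:.1f}")
```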
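Results
- train loss: start 11.0689 → end ~1.14 (last batch 1.1432; steady decline throughout)
- best / final val loss (held-out 1M tokens, step 21999): 1.1690 (close to the ~1.0-1.1 target)
- val loss curve (sampled every ~2000 steps; monotonically decreasing, still falling at the last step, no overfitting):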
| step | 499 | 1999 | 3999 | 5999 | 7999 | 9999 | 11999 | 13999 | 15999 | 17999 | 19999 | 21999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| val | 2.5217 | 1.6493 | 1.4875 | 1.4056 | 1.3571 | 1.3161 | 1.2697 | 1.2414 | 1.2177 | 1.1978 | 1.1762 | 1.1690 |
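val falls all the way to the last step with no rebound, so the model is still underfit; more steps/data (or a larger model) can push it lower (the v5 levers).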
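The "monotonic, still falling" reading of the curve can be checked directly from the table. The snippet below only recomputes differences between consecutive val samples; the numbers are copied from the table above.

```python
steps = [499, 1999, 3999, 5999, 7999, 9999,
         11999, 13999, 15999, 17999, 19999, 21999]
val = [2.5217, 1.6493, 1.4875, 1.4056, 1.3571, 1.3161,
       1.2697, 1.2414, 1.2177, 1.1978, 1.1762, 1.1690]

# Change in val loss between consecutive samples (negative = improving).
deltas = [round(b - a, 4) for a, b in zip(val, val[1:])]
monotonic = all(d < 0 for d in deltas)

for (s0, s1), d in zip(zip(steps, steps[1:]), deltas):
    print(f"{s0:>5} -> {s1:>5}: {d:+.4f}")
print("monotonically decreasing:", monotonic)
```

Every interval improves, though the last one (step 19999 → 21999) improves by only about 0.007, the smallest step in the table.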
Samples (greedy, sampled directly from xtrain, same prompts)
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the sunshine. One day, she saw a big, scary dog. The dog barked
loudly and Lily got scared. She
[One day] → One day, a little girl named Lily went to the park with her mom. She saw a
big tree with a swing. Lily wanted to play on the swing, but she was too
small. She asked her
[The little] → The little girl was so happy that she had found the perfect place to hide.
She stayed there for a long time, until it was time to go home. She said
goodbye to the tree and ran back home
Improvement over v3
best val loss (the best held-out 1M-token value reported by each version's own training run, same held-out set):
| Model | core params | Training tokens | best val loss | Notes |
|---|---|---|---|---|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | Full data + dim256/8L, single GPU |
| v2 | 28.32M | ~36.9M | 1.7055 | dim384/12L + 4-GPU DDP |
| v3 | 67.13M | ~245.8M (~0.53 ep) | 1.3027 | dim512/16L + single-GPU batched; val 0.40 below v2 |
| v4 | 127.43M (×1.90 vs v3) | ~720.9M (×2.9 vs v3, ~1.54 ep) | 1.1690 | dim768/18L + 8-GPU DDP fp32; val 0.13 below v3 |
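Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30 / v4 1.17; every rung is lower than the last on the same 1M-token held-out set. Note that the v3→v4 drop (0.13) is smaller than v2→v3 (0.40): diminishing returns are expected (the lower the loss, the harder it is to reduce further), and v4 is still underfit (still falling at the last step), which suggests the 127M core has not yet reached its capacity ceiling on TinyStories, so more tokens or a broader corpus still have room.
Side-by-side samples (greedy, 40 tok, served by xserv, same prompts)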
| prompt | v3 | v4 |
|---|---|---|
| Once upon a time | …a little girl named Lily. She loved to play outside in the park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She |
| One day | One day, a little girl named Lily went to the park with her mom. They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said, | One day, a little girl named Lily went to the park with her mom. She saw a big tree with a swing. Lily wanted to play on the swing, but she was too small. She asked her |
| The little | The little girl was so excited. She ran to the kitchen and grabbed a spoon. She started to stir the soup. She stirred and stirred until it was all mixed together. | The little girl was so happy that she had found the perfect place to hide. She stayed there for a long time, until it was time to go home. She said goodbye to the tree and ran back home |
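Conclusion: v3 (67M core / 245.8M tokens) can already write continuous narrative with motivation and turns. From the same openings, v4 (127M core / 720.9M tokens / ~1.54 epochs) produces more specific plots and finer motivation ("too small" instead of a generic "scared", and a complete arc of "perfect place to hide → stayed → said goodbye → ran back home") with more natural endings. With best val 1.30→1.17 (0.13 lower) and samples moving from "narrative with motivation" to "short stories with more concrete detail and more complete structure", v4 is a clear, quantifiable improvement over v3.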
xserv validation
Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 docs/08; 201 tensors = 18 layers × 11 + embed + norm + lm_head), store in the registry, then load with xserv-cli and generate greedily:
$ xserv-cli ~/projects/tiny-models/v4-tinystories-dim768 --max-tokens 40
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She
xserv> One day, a little girl named Lily went to the park with her mom. She saw a big tree with
a swing. Lily wanted to play on the swing, but she was too small. She asked her
xserv> The little girl was so happy that she had found the perfect place to hide. She stayed there
for a long time, until it was time to go home. She said goodbye to the tree and ran back home
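The tensor count and the weight transpose mentioned for the export step can be illustrated with a small sketch. The tensor names below follow the Hugging Face Qwen3 checkpoint layout as I understand it; the actual xtrain-side names and the mapping code live in T9 and are not shown in this document, so the helpers here are hypothetical. BF16 conversion is omitted.

```python
N_LAYERS = 18

# 11 tensors per decoder layer (assumed HF Qwen3 naming, no biases).
PER_LAYER = [
    "input_layernorm.weight",
    "self_attn.q_proj.weight", "self_attn.k_proj.weight",
    "self_attn.v_proj.weight", "self_attn.o_proj.weight",
    "self_attn.q_norm.weight", "self_attn.k_norm.weight",
    "post_attention_layernorm.weight",
    "mlp.gate_proj.weight", "mlp.up_proj.weight", "mlp.down_proj.weight",
]

def hf_tensor_names(n_layers: int) -> list[str]:
    """Enumerate exported tensor names: embed + layers + final norm + lm_head."""
    names = ["model.embed_tokens.weight"]
    for i in range(n_layers):
        names += [f"model.layers.{i}.{suffix}" for suffix in PER_LAYER]
    names += ["model.norm.weight", "lm_head.weight"]
    return names

def transpose(weight: list[list[float]]) -> list[list[float]]:
    """[in, out] -> [out, in] for a 2D weight stored as a list of rows."""
    return [list(col) for col in zip(*weight)]

names = hf_tensor_names(N_LAYERS)
print(len(names), "tensors")

w_in_out = [[1.0, 2.0, 3.0],      # toy weight with in=2, out=3
            [4.0, 5.0, 6.0]]
print(transpose(w_in_out))        # shape becomes out=3, in=2
```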
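token-match: comparing xserv (BF16) against xtrain's own greedy decoding (F32), all 3 prompts match token for token (zero divergence within 40 tok), a tighter closed loop than v3 (2/3 matched). At v4's scale (127M core) and 40-tok length, BF16 drift still does not flip any greedy choice, so the loop closes.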
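A token-match check of this kind reduces to finding the first index where two greedy token-id sequences differ. The sketch below shows one way to write that comparison; the function name and the token ids are made up for illustration and are not taken from the actual runs.

```python
from typing import Optional, Sequence

def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))   # one sequence is a strict prefix
    return None

# Hypothetical token ids standing in for F32 (xtrain) and BF16 (xserv) output.
f32_tokens = [7454, 2402, 257, 640, 11, 612, 373]
bf16_same = list(f32_tokens)
bf16_drift = [7454, 2402, 257, 640, 11, 340, 373]   # flips at position 5

print(first_divergence(f32_tokens, bf16_same))    # None: full match
print(first_divergence(f32_tokens, bf16_drift))   # 5: first flipped token
```

A prompt counts as matched when the result is None over the full 40 generated tokens.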
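v5 proposal
v4's val curve falls monotonically to the last step (no overfitting), so the model is still underfit, and a larger model / more tokens / a broader corpus can still lower it. Suggestions for v5:
- bf16 (KI-2, now triggered): v4 is the clear trigger for bf16, since dim768 fp32 OOMs at per-rank batch 32. Adopt bf16 mixed precision (fp32 master) first; halving activations should be enough to recover the batch-256 sweet spot (higher throughput, steadier convergence). This is the first lever v5 should pull.
- Data: v4 is only at ~1.54 epochs and still underfit, so more TinyStories tokens (a few more epochs) will most likely lower val further. With the core already at 127M, this is also the right point on the data ladder to start broadening the corpus (TinyStories plus some general high-quality text). Both are worthwhile: first use multi-epoch TinyStories to test whether data is the ceiling, then decide whether to switch corpora.
- Open levers (enable as needed):
  - process-per-GPU (higher 8-GPU linearity): v4's ~145K tok/s on 8 GPUs is already near-linear, but ~7% of all-reduce + PCIe remains; if v5 wants to push 8-GPU scaling closer to linear, it can move from single-process thread-per-GPU to process-per-GPU.
  - KI-4 (large-vocab share): at dim768 embed/lm_head still take 77.19M / 204.63M ≈ 38%; scaling the core further dilutes the share, but for better efficiency a smaller or better-fitting tokenizer could be considered.
The ladder is parameterized, so v5 only needs changes to the --dim/--heads/--layers/--ffn/--steps flags. Once bf16 lands, the fp32 and bf16 paths coexist (the pool is already dtype-agnostic, so they layer cleanly; see the T12 backlog).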
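To illustrate why the bf16 lever proposed above halves activation memory and what it costs in precision, the sketch below converts float32 values to bfloat16 bit patterns by keeping the top 16 bits with round-to-nearest-even. This is a generic illustration of the format, not xtrain or xserv code; NaN handling is deliberately omitted, and the activation shape is just an example built from the v4 per-rank batch, seq_len, and dim.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the float32 pattern, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)    # rounding bias (NaN not handled)
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    """Expand a bf16 bit pattern back to float32 by zero-filling the low bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def roundtrip(x: float) -> float:
    return bf16_bits_to_f32(f32_to_bf16_bits(x))

print(hex(f32_to_bf16_bits(1.0)))    # 0x3f80
print(roundtrip(3.14159))            # 3.140625: only 8 significand bits survive

# Bytes for one [per-rank batch, seq_len, dim] activation tensor.
elements = 16 * 256 * 768
fp32_bytes = elements * 4
bf16_bytes = elements * 2
print(fp32_bytes, bf16_bytes)
```

The coarser rounding is the reason the proposal keeps an fp32 master copy of the weights: small optimizer updates would otherwise be lost when added to bf16 parameters.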