xtrain/docs/runs/04-v4-tinystories-dim768.md
Gahow Wang ff79fee3c5 docs: run v4 — TinyStories, dim768, val 1.17
Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep /
arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128
per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8
DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU /
v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run
validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32
batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table
to v0/v1/v2/v3/v4 and the next-rung proposal to v5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:37 +08:00

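# Scaling Run v4: TinyStories + dim768/18L + 8-GPU DDP fp32 (T11), Design Document

## Goal

Building on v3 (dim512/16L, core 67.13M, trained on ~245.8M tokens, single-GPU batched), v4 keeps scaling along two axes, **model and data**, and moves training **back to multiple GPUs**. This time multi-GPU is not the v2-era DDP whose scaling was held back by KI-1; it is the right choice now that the **T11 caching allocator** has landed and 8-GPU training scales near-linearly:

1. **Scale the model**: dim 512→768, layers 16→18, heads 16→24 (head_dim stays 32), bringing the **transformer core to ~127M parameters** (capacity ×1.9). The vocabulary is unchanged, so with the gpt2 50257 vocab, embed + lm_head adds a fixed ~77.19M at dim768, which is reported separately.
2. **Scale the data**: v3 trained on ~245.8M tokens (~0.53 epoch) and was still underfitting (val kept falling through the last step). v4 trains on **720.9M tokens** (×2.9), still reusing the full TinyStories token-id stream cached by v1 (468M tokens), for **~1.54 epochs**. This is the first run to pass 1 epoch and enter the multi-pass regime on TinyStories.
3. **8-GPU DDP fp32 (T11, near-linear multi-GPU)**: the premises behind KI-1 (weak DDP scaling), exposed by v2, were disproved one layer at a time and the real causes fixed by T10/T11. T10 fixed the single-sequence launch-bound problem; T11's per-device size-classed caching allocator then removed the per-op `cudaMalloc` serialization: **8-GPU 49K→461K tok/s, scaling 1.3×→5×, all 8 GPUs at 95-99% util**. v4 can therefore return to multi-GPU with confidence: steady-state ~144,650 tok/s throughout, finishing 720.9M tokens in ~84 min.
4. After training, save to the registry (`~/projects/tiny-models/v4-tinystories-dim768/`), export to xserv format and verify that it can be served, and report the **concrete improvement over v3** (val loss on the same held-out set plus side-by-side samples from the same prompts).

> **Engineering significance of this run**: at a real scaling size (127M core / 720.9M tokens / 84 min) it **validated the multi-GPU scalability of the T11 caching allocator at dim768**: ~145K tok/s on 8 GPUs throughout at 95-99% util, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s).
> v4 is also the **concrete trigger for bf16 (KI-2)**: with dim768 fp32 in 32GB of GPU memory, per-rank batch 32 (global 256) OOMs, forcing a drop to per-rank 16 (global 128). After bf16 was deferred throughout the tiny-scale v0-v3 runs, this is the first hard constraint where fp32 does not fit.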
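## Data

| Item | v3 | v4 |
|----|----|----|
| Source | TinyStories **full train** (reusing the v1 cache) | same |
| Tokens (corpus) | 468,260,367 | same |
| **Tokens consumed in training** | ~245.8M (30000 steps × 8192) | **~720.9M** (22000 steps × 32768) |
| Epochs | ~0.53 | **~1.54** (first run past 1 epoch) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936MB) | **reused as-is** |
| held-out val | last 1,000,000 tokens of the full corpus | **the same 1M tokens** (exactly the same held-out set as v0/v1/v2/v3, for a fair comparison) |

**Cache reuse**: `Corpus::load_cached` reads `<corpus>.u16.bin` and loads 467.26M train tokens at startup (the last 1M are held out for val). The held-out val set is still the last 1M tokens of the full corpus (`split_tail`), the same held-out set as v0-v3, so **val loss is directly comparable across v0-v4**.

**Data ladder**: v4 is the **first run past 1 epoch** (~1.54). With the core at 127M, the model is still underfitting at ~1.54 epochs (val still falling at the last step), which indicates that the 127M core has not yet reached its capacity ceiling on TinyStories. The next rung (v5) is a suitable point either to start **broadening the corpus** (TinyStories plus some general high-quality text) or to **keep squeezing TinyStories over more epochs**, while evaluating the tokenizer (KI-4) in parallel.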
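The held-out split and the epoch figures in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `split_tail` here is a hypothetical stand-in written for this example (the real function's signature is not shown in this document), and `tokens_consumed` and `epochs` are helper names introduced for illustration.

```rust
/// Hold out the last `val_len` tokens as validation; the rest is training data.
/// Hypothetical stand-in for `split_tail`; the real signature is an assumption.
fn split_tail(tokens: &[u16], val_len: usize) -> (&[u16], &[u16]) {
    let cut = tokens.len().saturating_sub(val_len);
    tokens.split_at(cut)
}

/// Tokens consumed by a run: optimizer steps times tokens per step.
fn tokens_consumed(steps: u64, tokens_per_step: u64) -> u64 {
    steps * tokens_per_step
}

/// Number of passes over the training split.
fn epochs(consumed: u64, train_tokens: u64) -> f64 {
    consumed as f64 / train_tokens as f64
}

fn main() {
    // Toy stream of 10 token ids with the last 3 held out.
    let stream: Vec<u16> = (0..10).collect();
    let (train, val) = split_tail(&stream, 3);
    println!("train = {:?}, val = {:?}", train, val);

    // Figures from the table above: corpus size minus the 1M-token tail.
    let train_tokens: u64 = 468_260_367 - 1_000_000;
    let v3 = tokens_consumed(30_000, 8_192);
    let v4 = tokens_consumed(22_000, 32_768);
    println!("v3: {} tokens, {:.2} epochs", v3, epochs(v3, train_tokens));
    println!("v4: {} tokens, {:.2} epochs", v4, epochs(v4, train_tokens));
}
```

With the table's numbers this gives 245,760,000 tokens (about 0.53 epoch) for v3 and 720,896,000 tokens (about 1.54 epochs) for v4.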
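## Architecture

v4 is a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + untied lm_head, MHA). The forward graph is identical to v0/v1/v2/v3; only the dims are larger. **No structural changes**.

| Dimension | v3 | v4 |
|------|----|----|
| dim (= heads·head_dim) | 512 | **768** |
| n_layers | 16 | **18** |
| n_heads | 16 | **24** |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 2048 | 2048 |
| vocab | 50257 | 50257 |
| **core parameters** (excluding embed+lm_head) | 67,127,296 (≈67.13M) | **127,432,704 (≈127.43M, ×1.90)** |
| embed + lm_head (2×vocab×dim) | 51,463,168 (≈51.46M) | 77,194,752 (≈77.19M) |
| **total parameters** | 118,590,464 (≈118.59M) | **204,627,456 (≈204.63M)** |

**How core is counted**: `Config::core_params() = num_params() - 2·vocab·dim`. With the gpt2 50257 vocab at dim768, embedding + lm_head take a fixed ~77.19M. These two tables are a function of **vocabulary size**, not model capacity, so the ladder is tracked by **core**: v4 core is 127.43M. Note that embed/lm_head still make up ~38% (77.19M) of v4's 204.63M total, slightly down from 43% in v3 (the share is diluted as dim grows), but this remains the gpt2 large-vocab overhead problem (see docs/known-issues.md KI-4).

**Architectural changes vs v3**: pure scaling (dim 512→768 / layers 16→18 / heads 16→24; head_dim and ffn unchanged), no structural changes. The ladder is parameterized: v4 only changes the `--dim/--heads/--layers/--ffn/--steps` flags and does not touch model code.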
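The core/vocabulary split described above is plain subtraction, so it can be checked against the table. The sketch below takes the totals from the table as inputs rather than recomputing them from the layer shapes. The free functions `vocab_params` and `core_params` are names chosen for this example and only mirror the formula quoted for `Config::core_params()`; they are not the project's actual API.

```rust
/// Parameters in the two vocabulary-sized tables (embedding + untied lm_head).
fn vocab_params(vocab: u64, dim: u64) -> u64 {
    2 * vocab * dim
}

/// Core parameters: total minus the vocabulary-sized tables.
fn core_params(total: u64, vocab: u64, dim: u64) -> u64 {
    total - vocab_params(vocab, dim)
}

fn main() {
    let vocab = 50_257;
    // (name, dim, total parameters) as reported in the table above.
    for (name, dim, total) in [("v3", 512u64, 118_590_464u64), ("v4", 768, 204_627_456)] {
        let vp = vocab_params(vocab, dim);
        let core = core_params(total, vocab, dim);
        let share = 100.0 * vp as f64 / total as f64;
        println!("{}: core = {}, embed+lm_head = {} ({:.1}% of total)", name, core, vp, share);
    }
}
```

The shares printed round to the 43% (v3) and 38% (v4) quoted above.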
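## Trainer: 8-GPU DDP fp32 (with the T11 caching allocator)

v2 used DDP (T8) on 4 GPUs and, because global_batch=32 was too small, its scaling was held back by KI-1 (all-reduce share too high). The T10/T11 investigation disproved the premises of KI-1 one layer at a time and fixed the real causes:

- **T10 (batched forward)**: the real reason single-GPU training was slow in the v2 era was not communication but that a single-sequence forward launched every op separately (util 0-15%). Flattened linears + fused batched causal SDPA → single-GPU 1653→25627 tok/s.
- **T11 (caching allocator)**: profiling ruled out the "bucketed all-reduce" hypothesis (all-reduce accounts for only 7%) → the real cause was per-op `cudaMalloc` serialization. A per-device size-classed caching allocator (buffers returned on Drop, thread-safe) → **single-GPU 40K→93K tok/s (2.3×), 8-GPU 49K→461K tok/s (9.4×), scaling 1.3×→5×, all 8 GPUs at 95-99% util**.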
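To make the T11 idea concrete, here is a minimal sketch of a size-classed caching pool whose buffers return themselves on `Drop`. It is not the project's allocator: `Pool`, `Block`, and `size_class` are hypothetical names, the power-of-two size classes are an assumption, and a host `Vec<u8>` stands in for device memory so the example runs without CUDA. A real implementation would keep one such pool per device and cache pointers obtained from `cudaMalloc`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Round a request up to its size class: next power of two, at least 256 bytes.
fn size_class(bytes: usize) -> usize {
    bytes.max(256).next_power_of_two()
}

/// One pool per device. `Vec<u8>` stands in for a device buffer here.
#[derive(Default)]
struct Pool {
    free: Mutex<HashMap<usize, Vec<Vec<u8>>>>, // size class -> cached buffers
    fresh_allocs: AtomicUsize,                 // slow-path allocations so far
}

/// A buffer handle that hands its memory back to the pool when dropped.
struct Block {
    buf: Option<Vec<u8>>,
    class: usize,
    pool: Arc<Pool>,
}

impl Pool {
    fn alloc(pool: &Arc<Pool>, bytes: usize) -> Block {
        let class = size_class(bytes);
        // Fast path: reuse a cached buffer of the same size class.
        let cached = pool.free.lock().unwrap().get_mut(&class).and_then(|v| v.pop());
        let buf = cached.unwrap_or_else(|| {
            // Slow path: the expensive allocation the cache is meant to avoid.
            pool.fresh_allocs.fetch_add(1, Ordering::Relaxed);
            vec![0u8; class]
        });
        Block { buf: Some(buf), class, pool: Arc::clone(pool) }
    }
}

impl Block {
    fn len(&self) -> usize {
        self.buf.as_ref().map_or(0, |b| b.len())
    }
}

impl Drop for Block {
    fn drop(&mut self) {
        if let Some(buf) = self.buf.take() {
            let mut free = self.pool.free.lock().unwrap();
            free.entry(self.class).or_default().push(buf);
        }
    }
}

fn main() {
    let pool = Arc::new(Pool::default());
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let pool = Arc::clone(&pool);
            std::thread::spawn(move || {
                for _ in 0..1000 {
                    // The block goes back to the pool at the end of each iteration.
                    let block = Pool::alloc(&pool, 3000);
                    assert_eq!(block.len(), 4096);
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    // At most one slow-path allocation per concurrent worker, not one per request.
    println!("fresh allocations: {}", pool.fresh_allocs.load(Ordering::Relaxed));
}
```

In this toy, 4,000 requests trigger at most 4 slow-path allocations, which is the effect the per-op `cudaMalloc` fix relies on: steady-state training requests the same buffer sizes every step, so nearly all of them hit the cache.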
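v4 therefore returns to 8-GPU DDP fp32 with confidence: thread-per-GPU, an all-reduce of the device gradients followed by averaging, then a local GpuAdamW step on each rank, which keeps parameters bit-identical across ranks. Steady-state throughput is **~144,650 tok/s** throughout, finishing 720.9M tokens in ~84 min (~40× faster than the v2-era 4-GPU DDP at ~3.6K tok/s).

⚠️ **Batch constraint (bf16 trigger)**: with dim768 fp32 in 32GB of memory per GPU, **per-rank batch 32 (global 256) OOMs**, forcing a drop to **per-rank 16 (global 128)**. After bf16 (KI-2) was deferred throughout the tiny-scale v0-v3 runs, this is the first hard constraint where fp32 does not fit. bf16 (halving activations) would recover the batch-256 sweet spot. The trigger has been back-filled into docs/known-issues.md KI-2.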
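The DDP step described earlier in this section (all-reduce, average, then an identical local optimizer step on every rank) can be simulated on the host to show why replicas stay identical. This is a sketch under stated assumptions: ranks are modeled as plain vectors instead of GPU threads, `all_reduce_mean` and `sgd_step` are hypothetical names, and plain SGD stands in for the AdamW update to keep the example short.

```rust
/// Sum-all-reduce followed by division by the world size, simulated on the host.
/// Each inner Vec is one rank's gradient; afterwards every rank holds the mean.
fn all_reduce_mean(grads: &mut [Vec<f32>]) {
    let world = grads.len();
    if world == 0 {
        return;
    }
    let n = grads[0].len();
    let mut sum = vec![0.0f32; n];
    for g in grads.iter() {
        assert_eq!(g.len(), n, "all ranks must hold gradients of the same shape");
        for (s, x) in sum.iter_mut().zip(g) {
            *s += *x;
        }
    }
    for s in sum.iter_mut() {
        *s /= world as f32;
    }
    for g in grads.iter_mut() {
        g.copy_from_slice(&sum);
    }
}

/// Stand-in for the local optimizer step (SGD instead of AdamW, for brevity).
fn sgd_step(params: &mut [f32], grad: &[f32], lr: f32) {
    for (p, g) in params.iter_mut().zip(grad) {
        *p -= lr * *g;
    }
}

fn main() {
    let world = 4;
    // Identical starting parameters on every rank, different local gradients.
    let mut params = vec![vec![1.0f32, -2.0, 0.5]; world];
    let mut grads: Vec<Vec<f32>> = (0..world)
        .map(|r| vec![r as f32, 1.0, -0.5 * r as f32])
        .collect();

    all_reduce_mean(&mut grads);
    for (p, g) in params.iter_mut().zip(&grads) {
        sgd_step(p, g, 6e-4);
    }

    // Same inputs and same arithmetic on every rank give the same bits.
    let identical = params.windows(2).all(|w| w[0] == w[1]);
    println!("replicas identical after the step: {}", identical);
}
```

Because every rank applies the same deterministic update to the same averaged gradient, no parameter broadcast is needed after the step.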
## Hyperparameters

| Item | Value | Notes |
|----|----|----|
| optimizer | hand-written AdamW (GPU-side step) | wd=0.1, β/eps use xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1/v2/v3) |
| warmup | **1100 steps** (steps/20; lr peaks at 6.00e-4 at step 1100) | |
| grad clip | global-norm 1.0 | gnorm steady at ~0.20-0.21 throughout |
| steps | **22000** | ~84 min @ 8 GPUs |
| global batch | **128** (per-rank 16 × world 8) | **8-GPU DDP** (per-rank 32 OOMs, see above) |
| seq_len | **256** | same as v2/v3 |
| tokens/step | 128×256 = 32768 | total training tokens ≈ **720.9M** (~1.54 epochs) |
| world size | **8** (RTX 5090, sm_120) | near-linear multi-GPU after the T11 fix for KI-5 |
| precision | f32 (training) | converted to BF16 on xserv export (see T9) |

**Compute**: dash5, 8× RTX 5090, ~144,650 tok/s throughout (~130K at startup, steady at ~145K from step 50); wall-clock **84 minutes**.
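The schedule and clipping rows in the table can be sketched as follows. The exact formula used by the trainer is not given in this document, so `lr_at` is one common form of linear warmup plus cosine decay, chosen so that the rate peaks at step 1100 as the table states; `clip_global_norm` is likewise a generic global-norm clip. Both names are hypothetical.

```rust
use std::f64::consts::PI;

/// Linear warmup to `max_lr` over `warmup` steps, then cosine decay to `min_lr` at `total`.
/// One common formulation; the trainer's exact off-by-one convention is an assumption.
fn lr_at(step: u64, warmup: u64, total: u64, max_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        return max_lr * step as f64 / warmup as f64;
    }
    let span = (total - warmup).max(1) as f64;
    let progress = ((step - warmup) as f64 / span).min(1.0);
    min_lr + 0.5 * (max_lr - min_lr) * (1.0 + (PI * progress).cos())
}

/// Scale `grad` so its L2 norm is at most `max_norm`; returns the norm before clipping.
fn clip_global_norm(grad: &mut [f32], max_norm: f32) -> f32 {
    let norm = grad.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grad.iter_mut() {
            *g *= scale;
        }
    }
    norm
}

fn main() {
    let (warmup, total, max_lr, min_lr) = (1_100, 22_000, 6e-4, 6e-5);
    for step in [0, 550, 1_100, 11_550, 22_000] {
        println!("step {:>5}: lr = {:.3e}", step, lr_at(step, warmup, total, max_lr, min_lr));
    }

    // A toy gradient with norm 0.2, similar in size to the reported gnorm.
    let mut grad = vec![0.12f32, 0.16];
    let norm = clip_global_norm(&mut grad, 1.0);
    println!("gnorm = {:.2}, grad after clip = {:?}", norm, grad);
}
```

With the reported gnorm of ~0.20-0.21, a threshold of 1.0 leaves gradients unscaled, so in this run the clip acts as a safety net rather than an active constraint.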
## Results

- **train loss**: start **11.0689** → end **~1.14** (last batch 1.1432; smooth decline throughout)
- **best / final val loss (held-out 1M tokens, step 21999)**: **1.1690** (close to the ~1.0-1.1 target)
- val loss curve (sampled every ~2000 steps; monotonically decreasing, still falling at the last step, **no overfitting**):

| step | 499 | 1999 | 3999 | 5999 | 7999 | 9999 | 11999 | 13999 | 15999 | 17999 | 19999 | 21999 |
|------|-----|------|------|------|------|------|-------|-------|-------|-------|-------|-------|
| val | 2.5217 | 1.6493 | 1.4875 | 1.4056 | 1.3571 | 1.3161 | 1.2697 | 1.2414 | 1.2177 | 1.1978 | 1.1762 | **1.1690** |

val falls all the way to the last step with no rebound, so the model is still **underfitting**; more steps/data or a larger model can push it lower (the v5 levers).
### Samples (greedy, sampled directly from xtrain, same prompts)
```
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the sunshine. One day, she saw a big, scary dog. The dog barked
loudly and Lily got scared. She
[One day] → One day, a little girl named Lily went to the park with her mom. She saw a
big tree with a swing. Lily wanted to play on the swing, but she was too
small. She asked her
[The little] → The little girl was so happy that she had found the perfect place to hide.
She stayed there for a long time, until it was time to go home. She said
goodbye to the tree and ran back home
```
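## Improvement over v3

**best val loss (the best held-out 1M-token value reported by each version's own training run; same held-out set)**:

| Model | core params | training tokens | **best val loss** | Notes |
|------|-----------|-----------|-------------------|------|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
| v2 | 28.32M | ~36.9M | 1.7055 | dim384/12L + 4-GPU DDP |
| v3 | 67.13M | ~245.8M (~0.53 ep) | 1.3027 | dim512/16L + single-GPU batched; val 0.40 lower than v2 |
| v4 | 127.43M (**×1.90** vs v3) | ~720.9M (**×2.9** vs v3, ~1.54 ep) | **1.1690** | dim768/18L + **8-GPU DDP fp32**; val **0.13** lower than v3 |

**Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30 / v4 1.17**, decreasing monotonically at every rung on the same 1M-token held-out set. Note that the val drop from v3→v4 (0.13) is smaller than from v2→v3 (0.40): diminishing returns are expected (the lower the loss, the harder it is to lower further). v4 is also still underfitting (val still falling at the last step), which indicates that the 127M core has not reached its capacity ceiling on TinyStories; more tokens or a broader corpus still have headroom.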
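The per-rung drops and the monotonicity claims can be checked directly from the numbers reported in this document. The helper names `drops` and `strictly_decreasing` below are introduced for illustration only.

```rust
/// Differences between consecutive entries (previous minus next).
fn drops(vals: &[f64]) -> Vec<f64> {
    vals.windows(2).map(|w| w[0] - w[1]).collect()
}

/// True when every entry is lower than the one before it.
fn strictly_decreasing(vals: &[f64]) -> bool {
    vals.windows(2).all(|w| w[1] < w[0])
}

fn main() {
    // Best val loss per version, from the table above.
    let ladder = [3.8050, 2.5847, 1.7055, 1.3027, 1.1690];
    let rungs = ["v0->v1", "v1->v2", "v2->v3", "v3->v4"];
    for (rung, d) in rungs.iter().zip(drops(&ladder)) {
        println!("{}: val drops by {:.2}", rung, d);
    }

    // v4 val curve, sampled every ~2000 steps (from the Results section).
    let v4_curve = [
        2.5217, 1.6493, 1.4875, 1.4056, 1.3571, 1.3161,
        1.2697, 1.2414, 1.2177, 1.1978, 1.1762, 1.1690,
    ];
    println!("ladder strictly decreasing: {}", strictly_decreasing(&ladder));
    println!("v4 curve strictly decreasing: {}", strictly_decreasing(&v4_curve));
}
```

The last two drops come out as 0.40 and 0.13, matching the figures quoted in the table.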
### Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts)

| prompt | v3 | v4 |
|--------|----|----|
| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran** | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She** |
| `One day` | One day, a little girl named Lily went to the park with her mom. **They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said,** | One day, a little girl named Lily went to the park with her mom. **She saw a big tree with a swing. Lily wanted to play on the swing, but she was too small. She asked her** |
| `The little` | The little girl was so **excited. She ran to the kitchen and grabbed a spoon. She started to stir the soup. She stirred and stirred until it was all mixed together.** | The little girl was so **happy that she had found the perfect place to hide. She stayed there for a long time, until it was time to go home. She said goodbye to the tree and ran back home** |

**Conclusion**: v3 (67M core / 245.8M tokens) can already write continuous narratives with motivation and turns. Given the **same openings**, v4 (127M core / 720.9M tokens / ~1.54 epochs) produces more specific plots and finer-grained motivation ("too small" rather than a generic "scared"; the complete arc "perfect place to hide → stayed → said goodbye → ran back home") and wraps up more naturally. **best val 1.30→1.17 (0.13 lower) + samples going from "narrative with motivation" to "short stories with more specific detail and more complete structure"**: v4 is a clear, quantifiable improvement over v3.
## xserv validation

Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 `docs/08`): **201 tensors** = 18 layers × 11 + embed + norm + lm_head. After saving to the registry, load with `xserv-cli` and generate greedily:
```
$ xserv-cli ~/projects/tiny-models/v4-tinystories-dim768 --max-tokens 40
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She
xserv> One day, a little girl named Lily went to the park with her mom. She saw a big tree with
a swing. Lily wanted to play on the swing, but she was too small. She asked her
xserv> The little girl was so happy that she had found the perfect place to hide. She stayed there
for a long time, until it was time to go home. She said goodbye to the tree and ran back home
```
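**token-match**: xserv (BF16) vs xtrain's own greedy decoding (F32): **all 3 prompts match token for token** (zero divergence within 40 tokens), a tighter closed loop than v3 (2/3 matching). At v4 scale (127M core) and 40-token length, BF16 drift still does not flip any greedy choice, so the loop closes.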
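The two numeric transforms in the export step (transposing each 2D weight and narrowing F32 to BF16) and the token-match check are small enough to sketch. This is not the exporter from T9: the function names are hypothetical, weights are assumed to be row-major, and round-to-nearest-even is an assumed rounding mode for the BF16 conversion.

```rust
/// Transpose a row-major [rows, cols] matrix into a row-major [cols, rows] one.
fn transpose(src: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0.0f32; src.len()];
    for r in 0..rows {
        for c in 0..cols {
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    dst
}

/// Convert f32 to bf16 bits (the top 16 bits of the f32) with round-to-nearest-even.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        return ((bits >> 16) as u16) | 0x0040; // keep NaN a NaN after truncation
    }
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding_bias) >> 16) as u16
}

/// Widen bf16 bits back to f32 (exact, the low 16 bits become zero).
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Tensor count as stated above: 11 per layer plus embed, final norm, lm_head.
fn tensor_count(layers: usize) -> usize {
    layers * 11 + 3
}

/// Index of the first position where two greedy token sequences differ, if any.
fn first_divergence(a: &[u32], b: &[u32]) -> Option<usize> {
    let common = a.len().min(b.len());
    (0..common)
        .find(|&i| a[i] != b[i])
        .or(if a.len() != b.len() { Some(common) } else { None })
}

fn main() {
    // A toy [in=2, out=3] weight becomes [out=3, in=2].
    let w = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    let wt = transpose(&w, 2, 3);
    let bf16: Vec<u16> = wt.iter().map(|&x| f32_to_bf16_bits(x)).collect();
    println!("transposed = {:?}", wt);
    println!("bf16 bits  = {:04x?}", bf16);
    println!("round trip of 1.0 = {}", bf16_bits_to_f32(f32_to_bf16_bits(1.0)));
    println!("tensors for 18 layers = {}", tensor_count(18));
    println!("first divergence = {:?}", first_divergence(&[5, 6, 7], &[5, 6, 7]));
}
```

A `None` from `first_divergence` on all three prompts corresponds to the "zero divergence within 40 tokens" result reported above.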
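## v5 proposal

v4's val curve falls monotonically to the last step (no overfitting), so the model is still **underfitting**; a larger model, more tokens, or a broader corpus can all lower it further. Recommendations for v5:

- **bf16 (KI-2, now triggered)**: v4 is the clear trigger for bf16, since dim768 fp32 at per-rank batch 32 OOMs. Adopt bf16 mixed precision first (fp32 master); halving activations recovers the batch-256 sweet spot (higher throughput, more stable convergence). This is the first lever v5 should pull.
- **Data**: v4 is only at ~1.54 epochs and still underfitting, so **more TinyStories tokens** (a few more epochs) will most likely still lower val. With the core already at 127M, this is also a suitable point on the data ladder to **start broadening the corpus** (TinyStories plus some general high-quality text). Both are worthwhile: first use multi-epoch TinyStories to test whether data is the ceiling, then decide whether to switch corpora.
- **Open levers (enable as needed)**:
  - **process-per-GPU (better 8-GPU linearity)**: v4's ~145K tok/s on 8 GPUs is already near-linear, but ~7% of all-reduce + PCIe overhead remains. If v5 wants better 8-GPU linearity, it can switch from single-process thread-per-GPU to process-per-GPU.
  - **KI-4 (large-vocab overhead)**: at dim768, embed/lm_head still take 77.19M / 204.63M ≈ 38%. Scaling the core further dilutes this share, but for better efficiency consider a smaller or better-fitting tokenizer.

The ladder is parameterized: v5 only needs to change the `--dim/--heads/--layers/--ffn/--steps` flags. Once bf16 lands, the fp32 and bf16 paths coexist (the pool is already dtype-agnostic, so they can be layered cleanly; see the T12 backlog).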