xtrain/docs/runs/04-v4-tinystories-dim768.md

# Scaling Run v4: TinyStories + dim768/18L + 8-GPU DDP fp32 (T11) - Design Document

## Goal

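Building on v3 (dim512/16L, 67.13M core, ~245.8M training tokens, single-GPU batched), v4 scales further
along both the **model** and **data** axes and moves training **back to multi-GPU**. This is not the v2-era DDP
whose scaling was held back by KI-1; it is the right choice now that the **T11 caching allocator** has landed
and 8-GPU training scales near-linearly: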

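1. **Scale the model**: dim 512→768, layers 16→18, heads 16→24 (head_dim stays 32), bringing the
   **transformer core to ~127M parameters** (capacity ×1.9). The vocabulary is unchanged, so embed + lm_head
   add a fixed ~77.19M at dim768 because of the gpt2 50257 vocab; that figure is reported separately.
2. **Scale the data**: v3 trained on ~245.8M tokens (~0.53 epoch, still underfit, val falling through the
   last step); v4 trains on **720.9M tokens** (×2.9), still reusing the full TinyStories token-id stream
   cached in v1 (468M tokens), for **~1.54 epochs**. This is the first run to cross 1 epoch and enter the
   multi-pass regime on TinyStories.
3. **8-GPU DDP fp32 (T11, near-linear multi-GPU)**: the presumed root cause of KI-1 (weak DDP scaling),
   exposed by v2, was falsified layer by layer in T10/T11 and the real causes were fixed. T10 fixed the
   launch-bound single-sequence path; T11's per-device size-classed caching allocator then removed the
   per-op `cudaMalloc` serialization, taking **8 GPUs from 49K to 461K tok/s (scaling 1.3×→5×, all 8 GPUs
   at 95-99% util)**. v4 can therefore return to multi-GPU with confidence: steady state ~144,650 tok/s for
   the whole run, finishing 720.9M tokens in ~84 min.
4. After training, save to the registry (`~/projects/tiny-models/v4-tinystories-dim768/`) and export to
   xserv format to verify the model is servable, then report the **concrete improvement over v3** (val loss
   on the same held-out set plus side-by-side sampling on the same prompts).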

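> **Engineering significance of this run**: at a real scaling size (127M core / 720.9M tokens / 84 min) it
> **validates the multi-GPU scaling of the T11 caching allocator at dim768**: ~145K tok/s across 8 GPUs for
> the whole run at 95-99% util, ~40× faster than the v2-era 4-GPU DDP (~3.6K tok/s). v4 is also the
> **concrete trigger for bf16 (KI-2)**: dim768 fp32 OOMs at per-rank batch 32 (global 256) in 32GB of GPU
> memory, forcing a drop to per-rank 16 (global 128). After v0–v3 kept deferring bf16 at tiny scale, this is
> the first hard constraint where fp32 does not fit.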

## Data

| Item | v3 | v4 |
|----|----|----|
| Source | TinyStories **full train split** (v1 cache reused) | same |
| Tokens (corpus) | 468,260,367 | same |
| **Training tokens consumed** | ~245.8M (30000 steps × 8192) | **~720.9M** (22000 steps × 32768) |
| Epochs | ~0.53 | **~1.54** (first to cross 1 epoch) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936MB) | **reused as-is** |
| held-out val | last 1,000,000 tokens of the full stream | **same 1M tokens** (identical held-out set to v0/v1/v2/v3, for a fair comparison) |

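**Cache reuse**: `Corpus::load_cached` reads `<corpus>.u16.bin` and has 467.26M train tokens loaded at startup
(the last 1M are held out for val). The held-out val set is still the last 1M tokens of the full stream
(`split_tail`), the same set as v0–v3, so **val loss is directly comparable across v0–v4**.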
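
As a rough illustration of the cache layout described above, the sketch below decodes a flat `u16` token file and splits off a held-out tail. The helper names, the little-endian byte order, and the toy file are assumptions made for the example; the real `Corpus::load_cached` and `split_tail` may be implemented differently. A `u16` is wide enough because the gpt2 vocab (50257) is below 65536.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Decode a flat token-id cache into memory.
/// Hypothetical helper; little-endian layout is an assumption.
fn load_u16_tokens(path: &Path) -> io::Result<Vec<u16>> {
    let bytes = fs::read(path)?;
    if bytes.len() % 2 != 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "odd byte length"));
    }
    Ok(bytes
        .chunks_exact(2)
        .map(|b| u16::from_le_bytes([b[0], b[1]]))
        .collect())
}

/// Return (train, val), where val is the last `val_len` tokens of the stream.
fn split_tail(tokens: &[u16], val_len: usize) -> (&[u16], &[u16]) {
    let cut = tokens.len().saturating_sub(val_len);
    tokens.split_at(cut)
}

fn main() -> io::Result<()> {
    // Write a tiny fake cache so the sketch runs without the real corpus.
    let path = std::env::temp_dir().join("toy-corpus.u16.bin");
    let toy: Vec<u16> = (0..10u16).collect();
    let bytes: Vec<u8> = toy.iter().flat_map(|t| t.to_le_bytes()).collect();
    fs::write(&path, bytes)?;

    let tokens = load_u16_tokens(&path)?;
    let (train, val) = split_tail(&tokens, 3);
    println!("train = {:?}, val = {:?}", train, val);
    Ok(())
}
```

Because the split is a fixed-length tail of a fixed file, every run that reuses the cache sees the same val tokens, which is what makes the cross-version comparison fair.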

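**Data ladder**: v4 is the **first run to cross 1 epoch** (~1.54). With the core at 127M, ~1.54 epochs still
underfits (val is still falling at the last step), which indicates the 127M core has not yet hit its capacity
ceiling on TinyStories. The next rung (v5) is a reasonable point either to start **broadening the corpus**
(TinyStories plus some general high-quality text) or to **keep squeezing TinyStories for more epochs**, and to
evaluate the tokenizer at the same time (KI-4).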
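
The token and epoch figures in this section follow from simple arithmetic, sketched below. The function names are illustrative, and the choice of the 467.26M train stream (corpus minus the 1M held-out tail) as the epoch denominator is an assumption that happens to reproduce the rounded figures in the table.

```rust
/// Tokens consumed by a run: optimizer steps × tokens per step.
fn tokens_consumed(steps: u64, tokens_per_step: u64) -> u64 {
    steps * tokens_per_step
}

/// Number of passes over the training stream.
fn epochs(consumed: u64, train_tokens: u64) -> f64 {
    consumed as f64 / train_tokens as f64
}

fn main() {
    // Corpus size minus the 1M-token held-out tail (assumed denominator).
    let train_tokens: u64 = 468_260_367 - 1_000_000;

    let v3 = tokens_consumed(30_000, 8_192);
    let v4 = tokens_consumed(22_000, 128 * 256); // global batch × seq_len

    println!("v3: {} tokens, {:.2} epochs", v3, epochs(v3, train_tokens));
    println!("v4: {} tokens, {:.2} epochs", v4, epochs(v4, train_tokens));
    println!("v4 / v3 = {:.2}x", v4 as f64 / v3 as f64);
}
```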

## Architecture

v4 = a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate
lm_head, MHA). The forward graph is isomorphic to v0/v1/v2/v3; only the dims grow. **No structural changes**.

| Dimension | v3 | v4 |
|------|----|----|
| dim (= heads·head_dim) | 512 | **768** |
| n_layers | 16 | **18** |
| n_heads | 16 | **24** |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 2048 | 2048 |
| vocab | 50257 | 50257 |
| **core parameters** (excluding embed + lm_head) | 67,127,296 (≈67.13M) | **127,432,704 (≈127.43M, ×1.90)** |
| embed + lm_head (2×vocab×dim) | 51,463,168 (≈51.46M) | 77,194,752 (≈77.19M) |
| **Total parameters** | 118,590,464 (≈118.59M) | **204,627,456 (≈204.63M)** |

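**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. With the gpt2 50257 vocab,
embedding + lm_head take a fixed ~77.19M at dim768. These two tables are a function of **vocabulary size**,
not model capacity, so the ladder is measured by **core** (v4 core 127.43M). Note that embed/lm_head still
account for ~38% (77.19M) of v4's 204.63M total, slightly down from v3's 43% (the share is diluted as dim
grows), but this is still the gpt2 large-vocab share problem (see docs/known-issues.md KI-4).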
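
The core/embedding split above can be checked with a few lines of arithmetic. This is a sketch that takes the total parameter counts from the table as given; the free functions are stand-ins for illustration and are not the actual `Config` methods.

```rust
/// Parameters in the embedding table plus the separate lm_head.
fn embed_params(vocab: u64, dim: u64) -> u64 {
    2 * vocab * dim
}

/// Core parameters: everything except embed + lm_head.
fn core_params(total: u64, vocab: u64, dim: u64) -> u64 {
    total - embed_params(vocab, dim)
}

fn main() {
    let vocab = 50_257u64;
    // (name, dim, total parameters as reported in the table)
    for (name, dim, total) in [("v3", 512u64, 118_590_464u64), ("v4", 768, 204_627_456)] {
        let embed = embed_params(vocab, dim);
        let core = core_params(total, vocab, dim);
        let share = 100.0 * embed as f64 / total as f64;
        println!("{name}: core = {core}, embed+lm_head = {embed} ({share:.1}% of total)");
    }
}
```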

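**Architecture changes vs v3**: pure scaling (dim 512→768 / layers 16→18 / heads 16→24; head_dim and ffn
unchanged), no structural changes. The ladder is parameterized, so v4 only changes the
`--dim/--heads/--layers/--ffn/--steps` flags and touches no model code.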

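## Trainer: 8-GPU DDP fp32 (backed by the T11 caching allocator)

v2 ran DDP (T8) on 4 GPUs, and its scaling was held back by KI-1 (all-reduce share too high), attributed at
the time to global_batch=32 being too small. The T10/T11 investigations falsified the premises of KI-1 one
layer at a time and fixed the actual causes: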

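- **T10 (batched forward)**: the real reason a single GPU was slow in the v2 era was not communication but
  the single-sequence forward launching every op separately (util 0-15%). Flattened linears + fused batched
  causal SDPA took a single GPU from 1653 to 25627 tok/s.
- **T11 (caching allocator)**: profiling falsified the "bucketed all-reduce" hypothesis (all-reduce was only
  7%); the real cause was per-op `cudaMalloc` serialization. A per-device size-classed caching allocator
  (buffers returned on Drop, thread-safe) gave **single GPU 40K→93K tok/s (2.3×) and 8 GPUs 49K→461K tok/s
  (9.4×, scaling 1.3×→5×, all 8 GPUs at 95-99% util)**.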
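
To make the T11 idea concrete, here is a minimal host-side sketch of a size-classed caching pool whose buffers return themselves on `Drop`. It is not the xtrain allocator: plain `Vec<u8>` stands in for device memory, power-of-two size classes are an assumption, and all names (`Pool`, `Buf`, `alloc`) are hypothetical. The point is only that after warm-up, repeated per-op allocations are served from the free lists instead of hitting the slow allocation path.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Toy pool: free buffers bucketed by size class.
#[derive(Default)]
struct Pool {
    free: Mutex<HashMap<usize, Vec<Vec<u8>>>>,
    /// How often we fell through to a fresh allocation (the expensive path).
    fresh_allocs: AtomicUsize,
}

/// A buffer that hands its storage back to the pool when dropped.
struct Buf {
    data: Option<Vec<u8>>,
    class: usize,
    pool: Arc<Pool>,
}

/// Round a request up to its size class (next power of two; an assumption).
fn size_class(bytes: usize) -> usize {
    bytes.max(1).next_power_of_two()
}

fn alloc(pool: &Arc<Pool>, bytes: usize) -> Buf {
    let class = size_class(bytes);
    // Try the free list for this class first; the lock is released at the end
    // of this statement.
    let cached = pool.free.lock().unwrap().get_mut(&class).and_then(|v| v.pop());
    let data = cached.unwrap_or_else(|| {
        pool.fresh_allocs.fetch_add(1, Ordering::Relaxed);
        vec![0u8; class]
    });
    Buf { data: Some(data), class, pool: Arc::clone(pool) }
}

impl Drop for Buf {
    fn drop(&mut self) {
        if let Some(data) = self.data.take() {
            self.pool.free.lock().unwrap().entry(self.class).or_default().push(data);
        }
    }
}

fn main() {
    let pool = Arc::new(Pool::default());
    for _ in 0..1000 {
        let a = alloc(&pool, 3000); // served from the 4096-byte class
        let b = alloc(&pool, 100); // served from the 128-byte class
        drop((a, b)); // both buffers go back to their free lists
    }
    println!("fresh allocations: {}", pool.fresh_allocs.load(Ordering::Relaxed));
}
```

The `Mutex` around the free lists is what makes the toy pool safe to share across threads; a per-device pool, as described above, keeps that lock from being contended across GPUs.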

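v4 therefore returns to 8-GPU DDP fp32 with confidence: thread-per-GPU, with device gradients all-reduced to
their mean and each rank then running a local GpuAdamW step, so parameters stay bit-identical across ranks.
Steady state is **~144,650 tok/s** for the whole run, finishing 720.9M tokens in ~84 min, ~40× faster than
the v2-era 4-GPU DDP (~3.6K tok/s).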
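
The data-parallel step described above can be sketched on the CPU as follows. This is a simulation under stated assumptions: ranks are plain vectors in one thread, `all_reduce_mean` is a hypothetical stand-in for the real collective, and a plain SGD update replaces GpuAdamW for brevity. It shows why replicas that start equal and apply the same averaged gradient stay identical.

```rust
/// Replace every rank's gradient with the element-wise mean over all ranks.
fn all_reduce_mean(grads: &mut [Vec<f32>]) {
    let world = grads.len();
    let n = grads[0].len();
    let mut mean = vec![0.0f32; n];
    for g in grads.iter() {
        for (m, x) in mean.iter_mut().zip(g) {
            *m += *x;
        }
    }
    for m in mean.iter_mut() {
        *m /= world as f32;
    }
    for g in grads.iter_mut() {
        g.copy_from_slice(&mean);
    }
}

/// Stand-in optimizer step (plain SGD, not AdamW).
fn sgd_step(params: &mut [f32], grad: &[f32], lr: f32) {
    for (p, g) in params.iter_mut().zip(grad) {
        *p -= lr * *g;
    }
}

fn main() {
    // Two simulated ranks with identical initial parameters.
    let mut params = vec![vec![0.5f32, -1.0]; 2];
    // Each rank computed a different local gradient on its own micro-batch.
    let mut grads = vec![vec![1.0f32, 3.0], vec![3.0, 5.0]];

    all_reduce_mean(&mut grads);
    for (p, g) in params.iter_mut().zip(&grads) {
        sgd_step(p, g, 0.1);
    }

    // Same inputs, same arithmetic on every rank: replicas stay equal.
    assert_eq!(params[0], params[1]);
    println!("{:?}", params);
}
```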

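⚠️ **Batch constraint (the bf16 trigger)**: dim768 fp32 **OOMs at per-rank batch 32 (global 256)** in a single
GPU's 32GB of memory, forcing a drop to **per-rank 16 (global 128)**. After v0–v3 kept deferring bf16 (KI-2)
at tiny scale, this is the first hard constraint where fp32 does not fit; bf16 (halving activations) can
recover the batch-256 sweet spot. The trigger has been backfilled into docs/known-issues.md KI-2.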

## Hyperparameters

| Item | Value | Notes |
|----|----|----|
| optimizer | hand-written AdamW (step runs on GPU) | wd=0.1, β/eps use xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1/v2/v3) |
| warmup | **1100 steps** (steps/20; lr peaks at 6.00e-4 at step 1100) | |
| grad clip | global-norm 1.0 | gnorm ~0.20–0.21 throughout, stable |
| steps | **22000** | ~84 min @ 8 GPUs |
| global batch | **128** (per-rank 16 × world 8) | **8-GPU DDP**; per-rank 32 OOMs (see above) |
| seq_len | **256** | same as v2/v3 |
| tokens/step | 128×256 = 32768 | total training tokens ≈ **720.9M** (~1.54 epochs) |
| world size | **8** (RTX 5090, sm_120) | near-linear multi-GPU after T11 fixed KI-5 |
| Precision | f32 (training) | converted to BF16 on xserv export (see T9) |

**Compute**: dash5, 8× RTX 5090, ~144,650 tok/s for the whole run (~130K right from startup, ~145K steady
state from step 50), wall-clock ≈ **84 minutes**.
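
For reference, a linear-warmup plus cosine-decay schedule with the values from the table can be written as below. The exact formula used by xtrain is not given in this document, so treat this as one common formulation (an assumption), with `lr_at` a hypothetical name. It does reproduce the stated property that the learning rate peaks at 6e-4 at step 1100.

```rust
use std::f64::consts::PI;

/// Linear warmup to `max_lr` over `warmup` steps, then cosine decay to `min_lr`
/// at step `total`. One common formulation; the trainer's may differ in detail.
fn lr_at(step: u64, total: u64, warmup: u64, max_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        max_lr * step as f64 / warmup as f64
    } else {
        let progress = (step - warmup) as f64 / (total - warmup) as f64;
        min_lr + 0.5 * (max_lr - min_lr) * (1.0 + (PI * progress).cos())
    }
}

fn main() {
    let (total, max_lr, min_lr) = (22_000u64, 6e-4, 6e-5);
    let warmup = total / 20; // 1100 steps
    for step in [0, 550, 1_100, 11_550, 22_000] {
        println!("step {:>6}: lr = {:.3e}", step, lr_at(step, total, warmup, max_lr, min_lr));
    }
}
```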

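## Results

- **train loss**: start **11.0689** → end **~1.14** (last batch 1.1432; smooth decline throughout)
- **best / final val loss (held-out 1M tokens, step 21999)**: **1.1690** (close to the ~1.0-1.1 target)
- val loss curve (sampled every ~2000 steps; monotonically decreasing, still falling at the last step, **no overfitting**):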

| step | 499 | 1999 | 3999 | 5999 | 7999 | 9999 | 11999 | 13999 | 15999 | 17999 | 19999 | 21999 |
|------|-----|------|------|------|------|------|-------|-------|-------|-------|-------|-------|
| val  | 2.5217 | 1.6493 | 1.4875 | 1.4056 | 1.3571 | 1.3161 | 1.2697 | 1.2414 | 1.2177 | 1.1978 | 1.1762 | **1.1690** |

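val keeps falling through the last step with no rebound, so the model is still **underfit**; more steps/data
(or a larger model) can push it lower (the v5 levers).

### Sampling (greedy, sampled directly in xtrain, same prompts)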

```
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
                     outside in the sunshine. One day, she saw a big, scary dog. The dog barked
                     loudly and Lily got scared. She
[One day]          → One day, a little girl named Lily went to the park with her mom. She saw a
                     big tree with a swing. Lily wanted to play on the swing, but she was too
                     small. She asked her
[The little]       → The little girl was so happy that she had found the perfect place to hide.
                     She stayed there for a long time, until it was time to go home. She said
                     goodbye to the tree and ran back home
```

## Improvement over v3

**best val loss (best held-out 1M-token value reported by each version's own training run, same held-out set)**:

| Model | core params | Training tokens | **best val loss** | Notes |
|------|-----------|-----------|-------------------|------|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
| v2 | 28.32M | ~36.9M | 1.7055 | dim384/12L + 4-GPU DDP |
| v3 | 67.13M | ~245.8M (~0.53 ep) | 1.3027 | dim512/16L + single-GPU batched; val 0.40 below v2 |
| v4 | 127.43M (**×1.90** vs v3) | ~720.9M (**×2.9** vs v3, ~1.54 ep) | **1.1690** | dim768/18L + **8-GPU DDP fp32**; val **0.13** below v3 |

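**Full val ladder: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30 / v4 1.17**, decreasing monotonically at every rung
on the same 1M-token held-out set. Note that the v3→v4 drop (0.13) is smaller than v2→v3 (0.40): diminishing
returns are expected (the lower the loss, the harder it is to reduce further), and v4 is still underfit
(still falling at the last step), which indicates the 127M core has not hit its capacity ceiling on
TinyStories; more tokens or a broader corpus still have headroom.

### Side-by-side sampling (greedy 40 tok, served by xserv, same prompts)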

| prompt | v3 | v4 |
|--------|----|----|
| `Once upon a time` | …a little girl named Lily. She loved to play outside in the **park. One day, she saw a big, scary dog. The dog barked loudly and scared her. She ran** | …a little girl named Lily. She loved to play outside in the **sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She** |
| `One day` | One day, a little girl named Lily went to the park with her mom. **They saw a big tree with a swing. Lily wanted to play on the swing, but she was scared. Her mom said,** | One day, a little girl named Lily went to the park with her mom. **She saw a big tree with a swing. Lily wanted to play on the swing, but she was too small. She asked her** |
| `The little` | The little girl was so **excited. She ran to the kitchen and grabbed a spoon. She started to stir the soup. She stirred and stirred until it was all mixed together.** | The little girl was so **happy that she had found the perfect place to hide. She stayed there for a long time, until it was time to go home. She said goodbye to the tree and ran back home** |

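**Conclusion**: v3 (67M core / 245.8M tokens) can already write continuous narratives with motivation and
turns; v4 (127M core / 720.9M tokens / ~1.54 epochs), given the **same openings**, produces more concrete
plots and finer-grained motivation ("too small" rather than a generic "scared"; the complete arc "perfect
place to hide → stayed → said goodbye → ran back home") and more natural endings. **best val 1.30→1.17
(0.13 lower) plus samples moving from "narrative with motivation" to "short stories with more concrete
detail and more complete structure"**: v4 is a clear, quantifiable improvement over v3.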

## xserv validation

Export to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9
`docs/08`; **201 tensors** = 18 layers × 11 + embed + norm + lm_head), store in the registry, then load with
`xserv-cli` and generate greedily:

```
$ xserv-cli ~/projects/tiny-models/v4-tinystories-dim768 --max-tokens 40
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
       sunshine. One day, she saw a big, scary dog. The dog barked loudly and Lily got scared. She
xserv> One day, a little girl named Lily went to the park with her mom. She saw a big tree with
       a swing. Lily wanted to play on the swing, but she was too small. She asked her
xserv> The little girl was so happy that she had found the perfect place to hide. She stayed there
       for a long time, until it was time to go home. She said goodbye to the tree and ran back home
```
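
The two numeric transforms named in the export step at the top of this section (2D transpose and F32→BF16) can be sketched as below. This is illustrative only: the function names are hypothetical, the round-to-nearest-even rounding mode is an assumption, and the actual T9 exporter (including name mapping and safetensors writing) is not shown.

```rust
/// Transpose a row-major [rows, cols] matrix into row-major [cols, rows].
fn transpose(src: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0.0f32; src.len()];
    for r in 0..rows {
        for c in 0..cols {
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    dst
}

/// Convert f32 to BF16 bits, rounding to nearest even (assumed rounding mode).
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep NaN a NaN by forcing a mantissa bit in the truncated value.
        return ((bits >> 16) as u16) | 0x0040;
    }
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding) >> 16) as u16
}

/// Widen BF16 bits back to f32 (exact).
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

fn main() {
    // A toy [in=2, out=3] weight becomes [out=3, in=2].
    let w = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];
    println!("{:?}", transpose(&w, 2, 3));

    // BF16 keeps 8 significant bits, so most values move slightly.
    let x = 1.1690f32;
    println!("{} -> {}", x, bf16_bits_to_f32(f32_to_bf16_bits(x)));

    // Tensor count for 18 layers: 11 per layer plus embed, final norm, lm_head.
    println!("tensors = {}", 18 * 11 + 3);
}
```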

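**token-match**: comparing xserv (BF16) against xtrain's own greedy decoding (F32), **all 3 prompts match
exactly, token for token** (zero divergence within 40 tok), a tighter closed loop than v3 (2/3 matching).
At v4 scale (127M core) and 40-tok length, BF16 drift still does not flip any greedy choice, so the loop
closes.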
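
A token-match check of this kind reduces to finding the first index where two token-id sequences differ. The sketch below shows one way to do that; `first_divergence` and the toy ids are hypothetical and are not the actual comparison tooling or real model output.

```rust
/// Index of the first position where two token sequences differ,
/// or `None` if they are identical (length mismatch counts as divergence).
fn first_divergence(a: &[u32], b: &[u32]) -> Option<usize> {
    a.iter()
        .zip(b)
        .position(|(x, y)| x != y)
        .or_else(|| {
            if a.len() != b.len() {
                Some(a.len().min(b.len()))
            } else {
                None
            }
        })
}

fn main() {
    // Toy ids standing in for the F32 (trainer) and BF16 (server) greedy outputs.
    let f32_tokens = [7u32, 42, 9, 13];
    let bf16_same = [7u32, 42, 9, 13];
    let bf16_drift = [7u32, 42, 8, 13];

    println!("{:?}", first_divergence(&f32_tokens, &bf16_same));
    println!("{:?}", first_divergence(&f32_tokens, &bf16_drift));
}
```

Under greedy decoding, a single flipped argmax changes every later token, so the first divergence index is the informative number to report.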

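## v5 proposal

v4's val curve decreases monotonically through the last step (no overfitting), so the model is still
**underfit**; a larger model, more tokens, or a broader corpus can all push it lower. Suggestions for v5: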

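- **bf16 (KI-2, now triggered)**: v4 is the explicit trigger for bf16, since dim768 fp32 OOMs at per-rank
  batch 32. Adopt bf16 mixed precision first (fp32 master); halving activations recovers the batch-256 sweet
  spot (higher throughput, steadier convergence). This is the first lever v5 should pull.
- **Data**: v4 is only at ~1.54 epochs and still underfit, so **more TinyStories tokens** (several more
  epochs) will very likely lower val further; meanwhile, with the core at 127M, this is a reasonable point
  on the data ladder to **start broadening the corpus** (TinyStories plus some general high-quality text).
  Both are worthwhile: first use multi-epoch TinyStories to test whether data is the ceiling, then decide
  whether to change corpus.
- **Open levers (enable as needed)**:
  - **process-per-GPU (better 8-GPU linearity)**: v4's ~145K tok/s on 8 GPUs is already near-linear, but
    ~7% all-reduce + PCIe overhead remains; if v5 wants better 8-GPU linearity, it can move from
    single-process thread-per-GPU to process-per-GPU.
  - **KI-4 (large-vocab share)**: at dim768, embed/lm_head still take 77.19M / 204.63M ≈ 38%; growing the
    core further dilutes the share, but for better efficiency consider a smaller or better-fitted tokenizer.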

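The ladder is parameterized, so v5 only needs different `--dim/--heads/--layers/--ffn/--steps` flags. Once
bf16 lands, the fp32 and bf16 paths will coexist (the pool is already dtype-agnostic, so bf16 layers on
cleanly; see the T12 backlog).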