docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71)
Scaling run v2 design doc + comparison-table update.

v2 = dim384/12L/12h SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4 RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58 and v0 3.80.

Exported to xserv (135 BF16 tensors) and archived in the dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts (3rd diverges late under BF16 drift).

Records the DDP weak-scaling caveat (global batch too small → all-reduce dominates) and links docs/known-issues KI-1; the v3 proposal applies KI-1's fix (much larger global batch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
196
docs/runs/02-v2-tinystories-dim384.md
Normal file
@@ -0,0 +1,196 @@
# Scaling Run v2: TinyStories + dim384/12L + multi-GPU DDP (Design Document)

## Goal
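
Building on v1 (dim256/8L, 8.39M core parameters, the full TinyStories corpus but only ~5.1M tokens actually trained, single GPU), v2 scales along three axes at once, **model + data + parallelism**, and is the first **multi-GPU DDP training** run:

1. **Larger model**: dim 256→384, layers 8→12, heads 8→12, bringing the **transformer core to ~28M parameters** (×3.4 capacity). The vocabulary is unchanged, so with the gpt2 50257-token vocab at dim384, embed + lm_head add a fixed ~38.6M, which is reported separately.
2. **More data**: v1 consumed only ~5.1M tokens (underfit; val loss was still falling at the last step). v2 trains on **~37M tokens** (×7.2), reusing the full TinyStories token-id stream cached by v1 (no re-tokenizing of the 2GB corpus).
3. **Multi-GPU DDP**: train on **4× RTX 5090** with T8's `xtrain-distributed` (NCCL data parallelism), using multiple GPUs to bring wall-clock back into a bounded range.
4. **Trainer parity**: wire T8's `train_ddp` up to what `bin/train` already has (parameterized arch, token cache, held-out val evaluation, warmup→cosine, grad-clip, best-val checkpoint), so that single-GPU and DDP share one eval/checkpoint code path.
5. After training, store the model in the registry (`~/projects/tiny-models/v2-tinystories-dim384/`), export it to xserv format to verify that it can be served, and report the **concrete improvement over v1** (val loss on the same held-out set + side-by-side samples from the same prompts).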
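
> Scope (escape hatch evaluated): the single-sequence model design (one independent forward pass per sequence, with per-op launch overhead) limits global DDP throughput at dim384/seq256 to ≈ **3.6K tok/s on 4 GPUs** (GPU utilization is low; the known bottleneck is described in docs/06, and this run's small global batch further inflates the all-reduce share, see KI-1). To get a "clear, quantifiable improvement over v1" within a bounded time (~2.8 hours) on a shared machine, v2 trains **4500 steps ≈ 37M tokens** and does not try to squeeze out anything beyond those 37M tokens. The purpose of v2 is to validate DDP trainer parity plus a clear improvement over v1 (val < 2.2), not to saturate the model.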

## Data

| Item | v1 | v2 |
|----|----|----|
| Source | TinyStories **full train split** | Same (reuses the v1 cache) |
| Corpus tokens | 468,260,367 | Same |
| **Tokens consumed in training** | ~5.12M (2500 steps × 2048) | **~36.9M** (4500 steps × 8192) |
| Tokenizer | gpt2 BPE (vocab 50257) | Same |
| Cache | `data/tinystories-train.txt.u16.bin` (u16, 936MB) | **Reused as-is** (no re-tokenization) |
| Held-out val | Last 1,000,000 tokens of the full corpus | **The same 1M tokens** (exactly the same held-out set as v1, for a fair comparison) |
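
**Cache reuse**: `Corpus::load_cached` reads `<corpus>.u16.bin` (written to disk by v1's first run), so v2 loads all 468M tokens immediately at startup and skips from-scratch BPE over the 2GB corpus. The held-out val set is still the last 1M tokens of the full corpus (`split_tail`), the same held-out set as v1, so the v1 and v2 val losses are directly comparable.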
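
The cache and token-budget arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the real `Corpus::load_cached` and `split_tail` signatures are not shown in this document, so `decode_u16_tokens` and `split_tail_sketch` are hypothetical stand-ins, and little-endian storage of the u16 ids is an assumption.

```rust
/// Decode a raw `.u16.bin`-style buffer into token ids.
/// Assumption: ids are stored little-endian, two bytes each.
fn decode_u16_tokens(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Hold out the last `n_val` tokens; returns (train, val).
/// Hypothetical stand-in for the `split_tail` mentioned above.
fn split_tail_sketch(tokens: &[u16], n_val: usize) -> (&[u16], &[u16]) {
    let cut = tokens.len().saturating_sub(n_val);
    tokens.split_at(cut)
}

/// Tokens consumed by a run = optimizer steps × tokens per step.
fn tokens_consumed(steps: u64, tokens_per_step: u64) -> u64 {
    steps * tokens_per_step
}

fn main() {
    // Tiny stand-in buffer: ids 1, 256, 50256 (the largest gpt2 id fits in a u16).
    let bytes = [1u8, 0, 0, 1, 0x50, 0xC4];
    let tokens = decode_u16_tokens(&bytes);
    let (train, val) = split_tail_sketch(&tokens, 1);
    println!("train = {:?}, val = {:?}", train, val);

    // Figures taken from the data table.
    let corpus_tokens: u64 = 468_260_367;
    println!("cache bytes  = {}", corpus_tokens * 2);
    println!("v1 tokens    = {}", tokens_consumed(2500, 2048));
    println!("v2 tokens    = {}", tokens_consumed(4500, 8192));
}
```

Two bytes per token over 468,260,367 tokens gives 936,520,734 bytes, consistent with the 936MB cache size in the table, and 4500 × 8192 = 36,864,000 matches the ~36.9M training-token figure.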

## Architecture

v2 is a larger, isomorphic tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA). The forward graph is identical to v0/v1 with larger dims only; there are no structural changes.

| Dimension | v1 | v2 |
|------|----|----|
| dim (= heads·head_dim) | 256 | **384** |
| n_layers | 8 | **12** |
| n_heads | 8 | **12** |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 1024 | **1536** |
| vocab | 50257 | 50257 |
| **core parameters** (excluding embed + lm_head) | 8,393,472 (≈8.39M) | **28,322,304 (≈28.32M, ×3.37)** |
| embed + lm_head (2×vocab×dim) | 25,731,584 (≈25.7M) | 38,597,376 (≈38.6M) |
| **total parameters** | 34,125,056 (≈34.13M) | **66,919,680 (≈66.92M)** |

**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. With the gpt2 50257-token vocab at dim384, embedding + lm_head take a fixed ~38.6M. These two tables are a function of **vocabulary size**, not of model capacity, so the scaling ladder is measured by **core** (v2's 28.32M core hits the ~27M target). This is also why v2's 66.9M total "looks big" while its effective capacity is the 28.32M core (for the large share taken by the gpt2 vocab, see docs/known-issues.md KI-4).
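
As a quick check of the core/total split, the subtraction can be reproduced from the totals in the table. This is a minimal sketch: the free functions below are illustrative and only mirror the formula quoted for `Config::core_params()`; they are not the project's actual code.

```rust
/// Size of the two vocabulary-sized tables (embedding + separate lm_head).
fn embed_plus_lm_head(vocab: u64, dim: u64) -> u64 {
    2 * vocab * dim
}

/// Core parameters = total parameters minus the two vocabulary tables.
fn core_params(total: u64, vocab: u64, dim: u64) -> u64 {
    total - embed_plus_lm_head(vocab, dim)
}

fn main() {
    const VOCAB: u64 = 50_257;
    // Totals are the "total parameters" row of the architecture table.
    let v1 = core_params(34_125_056, VOCAB, 256);
    let v2 = core_params(66_919_680, VOCAB, 384);
    println!("v1 core = {}", v1);
    println!("v2 core = {}", v2);
    println!("v2 / v1 = {:.2}", v2 as f64 / v1 as f64);
}
```

The results agree with the table: 8,393,472 and 28,322,304 core parameters, a ratio of about 3.37.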
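
**Architecture changes vs. v1**: pure scaling (dim / layers / heads / ffn) with no structural changes. The ladder is already parameterized, so v2 only changes the `--dim/--heads/--layers/--ffn/--steps` flags and leaves the model code untouched.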

## DDP trainer parity (the engineering change in this version, commit 7090b47)

The single-GPU `bin/train` from v1 already has: parameterized arch, token cache, held-out val evaluation, warmup→cosine, grad-clip, and best-val checkpointing. T8's `train_ddp` was only a throughput/correctness driver at the time (hard-coded tiny config, uncached `Corpus::load`, no val or checkpoint). v2 brings it **up to the same level as the single-GPU trainer**:

- **Reuse, not rewrite**: `eval_loss` and `checkpoint::save` both live in `xtrain-train` and DDP calls them directly, so single-GPU and DDP share one eval/checkpoint path (no duplicated logic).
- `DdpConfig` gains `eval_every / eval_batches / ckpt_path`; `train_rank` takes `valid: Option<&Corpus>` and returns `DdpResult { losses, evals, best_val }`.
- **Val/checkpoint on rank 0 only**: after DDP, parameters are bit-identical on every rank (verified in T8), so rank 0 holds the val corpus, runs the gradient-free eval, and writes the best-val checkpoint; the other ranks have nothing to do at this point.
- `launch` hands the val corpus to rank 0 only; `bin/train_ddp` now has the same CLI as `bin/train` (positional tokenizer/corpus + all arch/optimizer/val/ckpt flags) and reuses the u16 cache.
- **T8 semantics unchanged**: all-reduce the device gradients → divide by world size → each rank runs its local GpuAdamW; the cross-rank parameter consistency check still passes (see "Verification").
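
The "all-reduce → divide by world size → local optimizer step" semantics in the last bullet can be simulated on the host. This is a sketch under assumptions, not the `xtrain-distributed` implementation: ranks are plain vectors instead of GPU buffers, the all-reduce is a loop rather than NCCL, and plain SGD stands in for GpuAdamW. All function names are hypothetical.

```rust
/// Host-side stand-in for a sum all-reduce: afterwards every rank holds the
/// element-wise sum of all ranks' gradients (all ranks must have equal length).
fn all_reduce_sum(grads: &mut [Vec<f32>]) {
    if grads.is_empty() {
        return;
    }
    let mut sum = vec![0.0f32; grads[0].len()];
    for g in grads.iter() {
        for (s, x) in sum.iter_mut().zip(g.iter()) {
            *s += *x;
        }
    }
    for g in grads.iter_mut() {
        g.copy_from_slice(&sum);
    }
}

/// All-reduce, then divide by the world size to get the mean gradient.
fn average_gradients(grads: &mut [Vec<f32>]) {
    let world = grads.len() as f32;
    all_reduce_sum(grads);
    for g in grads.iter_mut() {
        for x in g.iter_mut() {
            *x /= world;
        }
    }
}

/// Plain SGD as a placeholder for the per-rank optimizer step.
fn local_step(params: &mut [f32], grad: &[f32], lr: f32) {
    for (p, g) in params.iter_mut().zip(grad.iter()) {
        *p -= lr * *g;
    }
}

fn main() {
    // Four ranks, each with its own local gradient and identical parameters.
    let mut grads = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0], vec![7.0, 8.0]];
    let mut params = vec![vec![0.5f32, -0.5]; 4];
    average_gradients(&mut grads);
    for (p, g) in params.iter_mut().zip(grads.iter()) {
        local_step(p, g, 0.1);
    }
    let consistent = params.iter().all(|p| p == &params[0]);
    println!("mean grad = {:?}, ranks consistent = {}", grads[0], consistent);
}
```

Because every rank applies the same averaged gradient to the same starting parameters, the replicas stay identical, which is what makes evaluating and checkpointing on rank 0 alone sufficient.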

## Hyperparameters

| Item | Value | Notes |
|----|----|----|
| optimizer | hand-written AdamW (step runs on the GPU) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1) |
| warmup | steps/20 = 225 steps | |
| grad clip | global-norm 1.0 | |
| steps | **4500** | bounded (≈2.8 hours on 4 GPUs) |
| batch | **32** (global) | DDP splits it 8 per rank across 4 ranks; the single-sequence model runs multiple forward passes so the tape SUMs the gradients, then scales by 1/b_local at clip time |
| seq_len | **256** | v1 used 128 (longer context + better amortization of per-sequence launch overhead) |
| tokens/step | 32×256 = 8192 | total training tokens ≈ **36.9M** |
| world size | **4** (RTX 5090 ×4, sm_120) | GPUs 0-3 |
| precision | f32 (training) | converted to BF16 on export to xserv (see T9) |
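
The LR schedule row can be illustrated with a small function. This is a sketch of one common warmup→cosine formulation using the table's values; the exact formula in xtrain (for example whether warmup starts at 0 or at max_lr/warmup) is not stated here, so the ramp below is an assumption and `lr_at` is a hypothetical name.

```rust
use std::f64::consts::PI;

/// Linear warmup to `max_lr`, then cosine decay down to `min_lr`.
/// Assumption: the ramp reaches `max_lr` on the last warmup step.
fn lr_at(step: u32, total_steps: u32, warmup: u32, max_lr: f64, min_lr: f64) -> f64 {
    if step < warmup {
        return max_lr * (step as f64 + 1.0) / warmup as f64;
    }
    let decay_steps = total_steps.saturating_sub(warmup).max(1) as f64;
    let progress = ((step - warmup) as f64 / decay_steps).min(1.0);
    min_lr + 0.5 * (max_lr - min_lr) * (1.0 + (PI * progress).cos())
}

fn main() {
    let (total, max_lr, min_lr) = (4500u32, 6e-4, 6e-5);
    let warmup = total / 20; // steps/20 = 225
    for step in [0u32, 112, 224, 225, 2362, 4499] {
        println!("step {:>4}: lr = {:.3e}", step, lr_at(step, total, warmup, max_lr, min_lr));
    }
}
```

The rate rises linearly over the first 225 steps, peaks at 6e-4, and then follows a half cosine toward 6e-5 by the end of the 4500 steps.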
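
**Compute / DDP scaling**: dash5 with 4× RTX 5090, global throughput ≈ **3604 tok/s on 4 GPUs**, wall-clock ≈ **2.8 hours**.

⚠️ **Weak DDP scaling (KI-1)**: 3604 tok/s on 4 GPUs is only ≈ **1.08×** the v1 single-GPU rate (~3310 tok/s), far from linear. The root cause is that this run's `global_batch=32` (only 8 per GPU) is too small: each step performs one NCCL all-reduce over **all parameter gradients**, which is a fixed cost, while each GPU has too little compute per step, so communication/synchronization dominates and eats the scaling. Compare T8's near-linear scaling in the tiny-scale micro-benchmark (1.87× at 2 GPUs / 3.01× at 4, see docs/07-distributed.md): the difference is precisely the batch size. **v3 will first mitigate this by "substantially increasing the global batch"** (amortizing the all-reduce and keeping the GPUs fed), and add bucketed / overlapped all-reduce later. See [docs/known-issues.md](../known-issues.md) KI-1 for details.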
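
The scaling figures above reduce to simple arithmetic, sketched here with the numbers quoted in this section. The helper names are illustrative, and since the v1 rate is given only approximately (~3310 tok/s), the derived ratios are approximate as well.

```rust
/// Speedup of a multi-GPU run over a single-GPU baseline.
fn speedup(multi_tok_s: f64, single_tok_s: f64) -> f64 {
    multi_tok_s / single_tok_s
}

/// Scaling efficiency: achieved speedup divided by the ideal (world size).
fn scaling_efficiency(speedup: f64, world: u32) -> f64 {
    speedup / world as f64
}

/// Wall-clock time in hours for a token budget at a given throughput.
fn wall_clock_hours(tokens: f64, tok_per_s: f64) -> f64 {
    tokens / tok_per_s / 3600.0
}

fn main() {
    let s = speedup(3604.0, 3310.0);
    println!("v2 speedup over v1 single GPU = {:.2}x", s);
    println!("v2 efficiency on 4 GPUs       = {:.0}%", 100.0 * scaling_efficiency(s, 4));
    println!("T8 micro-benchmark efficiency = {:.0}%", 100.0 * scaling_efficiency(3.01, 4));
    println!("v2 wall-clock                 = {:.2} h", wall_clock_hours(36_864_000.0, 3604.0));
}
```

Under these numbers v2 reaches roughly a quarter of ideal 4-GPU scaling, against about three quarters in the T8 micro-benchmark, and the 36.86M-token budget at 3604 tok/s works out to about 2.8 hours.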

## Results

- **train loss**: start **10.8867** → end **1.7171**
- **best val loss (held-out 1M tokens)**: **1.7055** (step 4499)
- val loss curve (every 500 steps; monotonically decreasing, no sign of overfitting):

| step | 499 | 999 | 1499 | 1999 | 2499 | 2999 | 3499 | 3999 | 4499 |
|------|----|----|------|------|------|------|------|------|------|
| val | 2.7340 | 2.3206 | 2.1007 | 1.9800 | 1.8920 | 1.8110 | 1.7622 | 1.7245 | **1.7055** |

Val loss falls all the way to the final step with no rebound, so the model is still **underfit**; more steps or data would lower it further (the lever for v3).
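
Best-val checkpointing over this curve can be sketched as below. The `BestVal` tracker is a hypothetical illustration of the idea (save whenever an eval improves on the best so far), not the actual `xtrain-train` logic; the data is the val curve from the table.

```rust
/// Val curve from the results table: (step, val loss).
const VAL_CURVE: [(u32, f32); 9] = [
    (499, 2.7340), (999, 2.3206), (1499, 2.1007),
    (1999, 1.9800), (2499, 1.8920), (2999, 1.8110),
    (3499, 1.7622), (3999, 1.7245), (4499, 1.7055),
];

/// Tracks the best validation loss seen so far.
struct BestVal {
    best: f32,
    step: Option<u32>,
}

impl BestVal {
    fn new() -> Self {
        BestVal { best: f32::INFINITY, step: None }
    }

    /// Returns true when `val` is a new best, i.e. when a checkpoint would be written.
    fn update(&mut self, step: u32, val: f32) -> bool {
        if val < self.best {
            self.best = val;
            self.step = Some(step);
            true
        } else {
            false
        }
    }
}

/// True if every value is lower than the one before it.
fn is_strictly_decreasing(vals: &[f32]) -> bool {
    vals.windows(2).all(|w| w[1] < w[0])
}

fn main() {
    let mut tracker = BestVal::new();
    let mut saves = 0;
    for &(step, val) in VAL_CURVE.iter() {
        if tracker.update(step, val) {
            saves += 1;
        }
    }
    let vals: Vec<f32> = VAL_CURVE.iter().map(|&(_, v)| v).collect();
    println!("best = {} at step {:?}, saves = {}", tracker.best, tracker.step, saves);
    println!("strictly decreasing = {}", is_strictly_decreasing(&vals));
}
```

Since every eval improves on the previous one, the best value coincides with the final eval, which is the signature of a run that stopped while still underfit.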

### Samples (greedy, sampled directly from xtrain, same prompts)

```
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the sunshine. One day, she saw a big, red apple on the ground.
She picked it up and took a big
[The little] → The little girl was so happy and she thanked the man for his help. She said
goodbye and went home with a smile on her face. <|endoftext|>
[One day] → One day, the little girl was walking in the park when she saw a big, scary
dog. The dog was barking and running around. The little girl was scared and
started to cry. The dog said
```

Sampling at temperature 0.8 is equally coherent (multiple characters, complete plots); see `RUN.md`.

## Improvement over v1

**best val loss (best value on the held-out 1M tokens, as reported by each training run)**:

| Model | core params | Training tokens | **best val loss** | Notes |
|------|-----------|-----------|-------------------|------|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
| v2 | 28.32M (**×3.37**) | ~36.9M (**×7.2**) | **1.7055** | dim384/12L + DDP; val is **0.88** lower than v1 |

> v1 trained with seq128 and v2 with seq256, and the two best-val numbers are the ones reported directly by each training run. For a fully apples-to-apples comparison, both checkpoints were also re-evaluated on the **same held-out set with the same eval settings (seq256 / 64 batch)**: **v1 2.6756 → v2 2.0418** (**0.634** lower). Both measurements show a sizable improvement in the same direction.

### Side-by-side samples (greedy, 40 tokens, served by xserv, same prompts)

| prompt | v1 | v2 |
|--------|----|----|
| `Once upon a time` | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, **scary dog. The dog was scared and didn't know what to do** | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, **red apple on the ground. She picked it up and took a big** |
| `One day` | One day, she saw a big, shiny ball in the park. She wanted to play with it, but she was too scared to go. | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. **The little girl was scared and ran away. The dog chased her** |
| `The little` | The little girl was so happy that she had been able to help. | The little girl was so happy and she thanked the man for his help. **She said goodbye and went home with a smile on her face.** |

**Conclusion**: v1 (8.39M core / 5.1M tokens) can already write coherent short stories, but its sentences are short and it often wraps up after one or two sentences. From the **same openings**, v2 (28.32M core / 36.9M tokens) unfolds longer, more concrete plot chains (picks up the apple → takes a bite; meets a dog → runs away → the dog gives chase), with richer syntax and consistent cross-sentence reference, so story density is clearly higher. With **best val 2.58→1.71 (0.88 lower) and samples moving from "short wrap-ups" to "multi-step plots"**, v2 is a clear, quantifiable improvement over v1.

## xserv verification

The model is exported as HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 `docs/08`; **135 tensors** = 12 layers × 11 + embed + norm + lm_head), stored in the registry, then loaded with `xserv-cli` for greedy generation:

```
$ xserv-cli ~/projects/tiny-models/v2-tinystories-dim384 --max-tokens 40
Model: qwen3, layers=12, hidden=384, heads=12/12 kv, vocab=50257
Loaded 135 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big
xserv> One day, the little girl was walking in the park when she saw a big, scary dog. The dog
was barking and running around. The little girl was scared and ran away. The dog chased her
xserv> The little girl was so happy and she thanked the man for his help. She said goodbye and
went home with a smile on her face. <|endoftext|>
```
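
The export step described before the transcript (2D weight transpose plus the tensor count) can be sketched as follows. This is an illustration only: row-major storage is an assumption, and `transpose_in_out` and `tensor_count` are hypothetical helpers, not the T9 exporter's functions.

```rust
/// Transpose a row-major [n_in, n_out] matrix into row-major [n_out, n_in].
fn transpose_in_out(w: &[f32], n_in: usize, n_out: usize) -> Vec<f32> {
    assert_eq!(w.len(), n_in * n_out, "shape does not match buffer length");
    let mut t = vec![0.0f32; w.len()];
    for i in 0..n_in {
        for o in 0..n_out {
            t[o * n_in + i] = w[i * n_out + o];
        }
    }
    t
}

/// Tensor count = per-layer tensors × layers, plus embed, final norm, and lm_head.
fn tensor_count(layers: usize, per_layer: usize) -> usize {
    layers * per_layer + 3
}

fn main() {
    // A 2×3 [in, out] matrix becomes a 3×2 [out, in] matrix.
    let w = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    println!("transposed = {:?}", transpose_in_out(&w, 2, 3));
    println!("tensors    = {}", tensor_count(12, 11));
}
```

With 12 layers and 11 tensors per layer the count is 135, matching the `Loaded 135 tensors` line in the transcript.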
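
**token-match**: comparing xserv (BF16) against xtrain's own greedy decoding (F32), **2 of the 3 prompts match token for token** ("Once upon a time", "The little"). The third ("One day") diverges late, at "scared and ___", because of BF16 drift (xtrain picks "started to cry", xserv picks "ran away"): a tiny difference in a single logit flips the greedy choice and the sequences diverge from there, the same source as the ~0.5% BF16 drift observed in v1. The closed loop still largely holds at v2 scale (most prompts match token for token; a few diverge near the end because of BF16).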
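
How a tiny numeric difference can flip a greedy choice is easy to demonstrate. The sketch below is a simplification: in the real pipeline the drift comes from BF16 weights and arithmetic inside xserv, whereas here two nearly tied f32 logits are simply rounded to BF16 precision. The logit values and all function names are made up for illustration.

```rust
/// Round a finite f32 to the nearest BF16 value (round-to-nearest-even),
/// returned as an f32. NaN handling is deliberately omitted.
fn to_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    f32::from_bits(bits.wrapping_add(rounding) & 0xFFFF_0000)
}

/// Greedy choice: index of the largest logit (first index wins ties).
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Index of the first position where two token sequences differ, if any.
fn first_divergence(a: &[u32], b: &[u32]) -> Option<usize> {
    let common = a.len().min(b.len());
    (0..common)
        .find(|&i| a[i] != b[i])
        .or(if a.len() != b.len() { Some(common) } else { None })
}

fn main() {
    // Two nearly tied logits: distinct in f32, identical after BF16 rounding.
    let logits = [1.001f32, 1.002];
    let rounded: Vec<f32> = logits.iter().map(|&x| to_bf16(x)).collect();
    println!("f32 pick  = {}", argmax(&logits));
    println!("bf16 pick = {}", argmax(&rounded));

    // Hypothetical token ids: identical prefix, then a split.
    let xtrain = [5u32, 9, 2, 7, 7];
    let xserv = [5u32, 9, 2, 4, 1];
    println!("first divergence = {:?}", first_divergence(&xtrain, &xserv));
}
```

Once one greedy pick differs, every later token is conditioned on a different prefix, so a single flip late in the sequence is enough to explain the "One day" mismatch.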
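
## v3 proposal

v2's val curve decreases monotonically to the final step (no overfitting), so the model is still **underfit**; at the same scale, more steps or data would lower it further. Proposed v3:

- **Fix KI-1 (weak DDP scaling) first**: raise `global_batch` substantially from 32 (e.g. 128–256, i.e. 32–64 per GPU) to amortize the per-step all-reduce and keep the GPUs fed, bringing 4-GPU throughput back toward linear. This is the first lever for speeding up v3.
- **Data/steps**: with the higher throughput, raise training tokens from ~37M to ~150–300M (still within the full TinyStories corpus, under one epoch with no repeats), targeting a further drop in val to ~1.4–1.5.
- **Model**: dim 512 / 16 heads·32 / 16 layers / ffn 2048 → core ≈ **75M** (×2.6 capacity). Vocabulary unchanged → embed + lm_head ~51.5M, total ~126M.
- **Data ladder**: v3 still trains on TinyStories (going from 37M tokens to more steps still does not exhaust it, and the model remains in tiny-LM range). Once the core grows to ~100M+ and TinyStories clearly becomes the capacity ceiling, move up the data ladder to **broader high-quality corpora** (e.g. a mix of TinyStories and some general-purpose text), and at the same time evaluate switching to a better-fitting tokenizer (to ease the KI-4 large-vocab share).

The ladder is already parameterized: v3 only needs changes to the `--dim/--heads/--layers/--ffn/--steps/--batch` flags plus the DDP world size, with no changes to the model code.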
@@ -11,15 +11,18 @@ verified as servable in xserv format.

 ## Comparison table

-The val loss column is measured on the **same held-out 1M tokens** (tail slice of the v1 train split), using `bin/train --eval-ckpt`
-to evaluate each of the two checkpoints: same metric, fair comparison.
+The val loss column gives each version's **best val as reported by its own training run** (held-out 1M tokens, tail slice of the full train split).
+Note: v0/v1 trained with seq128 and v2 with seq256, so the eval windows differ → re-evaluating on the same held-out set with the same eval settings (seq256/64batch)
+gives v1=2.6756→v2=2.0418 (0.634 lower, apples-to-apples); the best-val numbers in the table below point the same way.

 | Version | Data | Architecture (dim/L/heads·hd/ffn) | core params | total params | val loss | Notes |
 |---|---|---|---|---|---|---|
 | [v0-baseline](../../docs/05-training-loop.md) | TinyStories valid 3MB slice (~720K tok) | 32 / 4 / 2·16 / 64 | ~41K | 3.26M | **3.8050** | too small to be usable; samples fall into a "mommy's mommy's mommy" loop |
 | [v1-tinystories-dim256](01-v1-tinystories-dim256.md) | TinyStories **full train split** (468.3M tok, u16 cache) | 256 / 8 / 8·32 / 1024 | 8.39M | 34.13M | **2.5847** | full data + dim256/8L; val 1.22 lower, samples are coherent full stories; ~25.9min on a single GPU |
+| [v2-tinystories-dim384](02-v2-tinystories-dim384.md) | TinyStories full (reuses the v1 cache, trains ~36.9M tok) | 384 / 12 / 12·32 / 1536 | 28.32M | 66.92M | **1.7055** | dim384/12L + **4-GPU DDP**; val 0.88 lower than v1, longer plots; ~2.8h on 4 GPUs. ⚠️ Weak DDP scaling, see [KI-1](../known-issues.md) |

 ## Next tier (proposal)

-- **v2** (to be dispatched): see the "v2 proposal" at the end of `01-v1-*.md`.
+- **v3** (to be dispatched): see the "v3 proposal" at the end of `02-v2-*.md`: first fix KI-1 (raise the global batch to restore DDP scaling),
+  then scale up to dim512/16L (~75M core) + more steps, and move to broader corpora once TinyStories nears its ceiling.
 </content>