Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4 RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58 and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts (3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat (global batch too small → all-reduce dominates) → links docs/known-issues KI-1; v3 proposal applies KI-1's fix (much larger global batch). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Scaling Run v2: TinyStories + dim384/12L + Multi-GPU DDP (Design Document)
Goal
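Building on v1 (dim256/8L, core 8.39M, full TinyStories but trained on only ~5.1M tokens, single GPU), v2 scales up along three axes at once (model, data, and parallelism) and is the first multi-GPU DDP training run:
- Model scale-up: dim 256→384, layers 8→12, heads 8→12, bringing the transformer core to ~28M parameters (capacity ×3.4). The vocabulary is unchanged, so with the gpt2 50257 vocab embed + lm_head adds a fixed ~38.6M at dim384; this is reported separately.
- Data scale-up: v1 consumed only ~5.1M tokens (underfit; val was still falling at the final step). v2 trains on ~37M tokens (×7.2) and reuses the full TinyStories token-id stream that v1 already cached (no re-tokenizing the 2GB corpus).
- Multi-GPU DDP: train on 4 RTX 5090s with T8's `xtrain-distributed` (NCCL data parallelism), bringing multi-GPU wall-clock back into a bounded range.
- Trainer alignment: connect T8's `train_ddp` to what `bin/train` already has (parameterized arch / token cache / held-out val evaluation / warmup→cosine / grad-clip / best-val checkpoint), so single-GPU and DDP share one set of eval/checkpoint logic.
- After training, store the model in the registry (`~/projects/tiny-models/v2-tinystories-dim384/`) and export it to xserv format to verify that it can be served, then report the concrete improvement over v1 (val loss on the same held-out set + side-by-side samples on the same prompts).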
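Scope (escape hatch evaluated): the single-sequence model design (one independent forward per sequence, with per-op launch overhead) gives a DDP global throughput of ≈ 3.6K tok/s on 4 GPUs at dim384/seq256. GPU utilization is low; the known bottleneck is described in docs/06, and this version's small global batch further inflates the all-reduce share (see KI-1). To get a result that is "clearly and quantifiably better than v1" within a bounded time (~2.8 hours) on a shared machine, v2 trains 4500 steps ≈ 37M tokens and does not try to extract more beyond that. The purpose of v2 is to validate the DDP trainer alignment and show a clear improvement over v1 (val < 2.2), not to saturate the model.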
Data
| Item | v1 | v2 |
|---|---|---|
| Source | full TinyStories train | same (reuses the v1 cache) |
| Tokens (corpus) | 468,260,367 | same |
| Tokens consumed in training | ~5.12M (2500 steps × 2048) | ~36.9M (4500 steps × 8192) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | data/tinystories-train.txt.u16.bin (u16, 936MB) | reused directly (no re-tokenization) |
| held-out val | last 1,000,000 tokens of the full corpus | the same 1M tokens (identical held-out set to v1, for a fair comparison) |
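Cache reuse: `Corpus::load_cached` reads `<corpus>.u16.bin` (written to disk by v1's first run), so v2 loads the 468M tokens immediately at startup and skips from-scratch BPE over the 2GB corpus. The held-out val set is still the last 1M tokens of the full stream (`split_tail`), the same held-out set as v1, so the v1 and v2 val losses are directly comparable.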
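The cache-and-tail-split idea can be sketched as follows. This is a minimal illustration, not the actual `Corpus::load_cached` or `split_tail` code: the function names `load_u16_tokens` and `split_tail_tokens` are hypothetical, and the on-disk layout (a flat run of little-endian u16 ids) is an assumption. A u16 is wide enough here because the gpt2 vocab (50257) is below 65536.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read a flat file of token ids stored as little-endian u16
/// (assumed layout; the real cache format is not shown in this document).
fn load_u16_tokens(path: &Path) -> io::Result<Vec<u16>> {
    let bytes = fs::read(path)?;
    if bytes.len() % 2 != 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "u16 cache must contain an even number of bytes",
        ));
    }
    Ok(bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect())
}

/// Split a token stream into (train, val), where val is the last
/// `val_len` tokens. If the stream is shorter than `val_len`,
/// train is empty and val is the whole stream.
fn split_tail_tokens(tokens: &[u16], val_len: usize) -> (&[u16], &[u16]) {
    let cut = tokens.len().saturating_sub(val_len);
    tokens.split_at(cut)
}
```

Because the split depends only on the stream and `val_len`, two runs that load the same cache with the same `val_len` evaluate on exactly the same tokens, which is what makes the v1 and v2 numbers comparable.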
Architecture
v2 = a larger, isomorphic tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA). The forward graph has exactly the same structure as v0/v1; only the dims grow. No structural changes.
| Dimension | v1 | v2 |
|---|---|---|
| dim (= heads·head_dim) | 256 | 384 |
| n_layers | 8 | 12 |
| n_heads | 8 | 12 |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 1024 | 1536 |
| vocab | 50257 | 50257 |
| core parameters (excluding embed + lm_head) | 8,393,472 (≈8.39M) | 28,322,304 (≈28.32M, ×3.37) |
| embed + lm_head (2×vocab×dim) | 25,731,584 (≈25.7M) | 38,597,376 (≈38.6M) |
| Total parameters | 34,125,056 (≈34.13M) | 66,919,680 (≈66.92M) |
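How core is measured: `Config::core_params()` = `num_params()` − 2·vocab·dim. With the gpt2 50257 vocab at dim384, embedding + lm_head take a fixed ~38.6M. These two tables are a function of vocabulary size, not of model capacity, so the ladder is measured by core (v2 core 28.32M hits the ~27M target). This is also why v2's total of 66.9M "looks big" while its effective capacity is the 28.32M core (for the large share taken by the gpt2 vocab, see docs/known-issues.md KI-4).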
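The subtraction above can be checked directly against the totals in the table. The sketch below is only the arithmetic, not the real `Config` type; the free functions and their names are illustrative.

```rust
/// Parameters held by the embedding table plus a separate (untied) lm_head:
/// two [vocab, dim] matrices.
fn embed_and_head_params(vocab: u64, dim: u64) -> u64 {
    2 * vocab * dim
}

/// Core parameters = total minus the two vocabulary-sized tables.
fn core_params(total: u64, vocab: u64, dim: u64) -> u64 {
    total - embed_and_head_params(vocab, dim)
}

/// Fraction of all parameters that sit in embed + lm_head.
fn vocab_share(total: u64, vocab: u64, dim: u64) -> f64 {
    embed_and_head_params(vocab, dim) as f64 / total as f64
}
```

With the reported totals, `core_params(66_919_680, 50257, 384)` gives 28,322,304 and `core_params(34_125_056, 50257, 256)` gives 8,393,472, matching the table. For v2 the two vocabulary tables account for a little under 58% of all parameters.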
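Architecture changes relative to v1: pure scale-up (dim/layers/heads/ffn), no structural changes. The ladder is already parameterized, so v2 only changes the `--dim/--heads/--layers/--ffn/--steps` flags and does not touch model code.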
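DDP trainer alignment (the engineering change in this version, commit 7090b47)
v1's single-GPU `bin/train` already had: parameterized arch, token cache, held-out val evaluation, warmup→cosine, grad-clip, and best-val checkpoint. T8's `train_ddp` was at that point only a throughput/correctness driver (hard-coded tiny config, `Corpus::load` without cache, no val/checkpoint). v2 brings it to the same level as the single-GPU trainer:
- Reuse rather than rewrite: `eval_loss` and `checkpoint::save` both live in `xtrain-train`, and DDP calls them directly, so single-GPU and DDP share one eval/checkpoint path (no duplicated logic).
- `DdpConfig` gains `eval_every / eval_batches / ckpt_path`; `train_rank` takes `valid: Option<&Corpus>` and returns `DdpResult { losses, evals, best_val }`.
- val/checkpoint only on rank 0: after DDP every rank's parameters are bit-identical (verified in T8), so rank 0 holds the val corpus, runs the no-grad eval, and writes the best-val checkpoint; the other ranks have nothing to do at this point.
- `launch` passes the val corpus only to rank 0; `bin/train_ddp` now has the same CLI as `bin/train` (positional tokenizer/corpus + all arch/optimizer/val/ckpt flags) and reuses the u16 cache.
- T8 semantics are unchanged: all-reduce the device gradients → /world → each rank runs its local GpuAdamW; the cross-rank parameter consistency check still passes (see "Validation").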
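The two behaviours in the list above (gradient averaging across ranks, and best-val bookkeeping on rank 0 only) can be illustrated with a CPU-only simulation. This is a sketch under stated assumptions: it does not call NCCL or any `xtrain` API, and `all_reduce_mean` and `BestVal` are hypothetical names introduced for illustration.

```rust
/// Simulated all-reduce (sum) followed by division by world size.
/// `per_rank` holds one gradient vector per rank, all the same length.
/// Afterwards every rank holds the identical averaged gradient.
fn all_reduce_mean(per_rank: &mut [Vec<f32>]) {
    let world = per_rank.len();
    if world == 0 {
        return;
    }
    let n = per_rank[0].len();
    let mut sum = vec![0.0f32; n];
    for grads in per_rank.iter() {
        assert_eq!(grads.len(), n, "all ranks must hold same-shaped gradients");
        for (s, g) in sum.iter_mut().zip(grads) {
            *s += *g;
        }
    }
    let inv_world = 1.0 / world as f32;
    for grads in per_rank.iter_mut() {
        for (g, s) in grads.iter_mut().zip(&sum) {
            *g = *s * inv_world;
        }
    }
}

/// Best-val tracker that only reacts on rank 0.
struct BestVal {
    best: f32,
}

impl BestVal {
    fn new() -> Self {
        Self { best: f32::INFINITY }
    }

    /// Returns true when `val` is a new best on rank 0, i.e. when a
    /// checkpoint would be written. Other ranks always get false.
    fn update(&mut self, rank: usize, val: f32) -> bool {
        if rank != 0 {
            return false;
        }
        if val < self.best {
            self.best = val;
            true
        } else {
            false
        }
    }
}
```

Since every rank applies the same averaged gradient to the same starting parameters, the optimizer step is identical on each rank, which is why evaluating and checkpointing on a single rank loses nothing.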
Hyperparameters
| Item | Value | Notes |
|---|---|---|
| optimizer | hand-written AdamW (step runs on the GPU) | wd=0.1, β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr 6e-4 → min_lr 6e-5 (same as v1) |
| warmup | steps/20 = 225 steps | |
| grad clip | global-norm 1.0 | |
| steps | 4500 | bounded (≈2.8 hours on 4 GPUs) |
| batch | 32 (global) | DDP splits it into 8 per rank across 4 ranks; the single-sequence model runs multiple forwards so the tape SUMs the gradients, which are scaled by ×1/b_local at clip time |
| seq_len | 256 | v1 used 128 (longer context + less per-sequence launch overhead) |
| tokens/step | 32×256 = 8192 | total training tokens ≈ 36.9M |
| world size | 4 (RTX 5090 ×4, sm_120) | GPU 0-3 |
| Precision | f32 (training) | converted to BF16 on export to xserv (see T9) |
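For readers unfamiliar with the schedule and clipping rows, here is a minimal sketch of linear warmup → cosine decay and global-norm clipping. The exact formulas used by the trainer are not given in this document, so the forms below are common textbook variants and an assumption; `lr_at` and `clip_global_norm` are hypothetical names.

```rust
use std::f32::consts::PI;

/// Learning rate at `step`: linear ramp to `max_lr` over `warmup` steps,
/// then cosine decay from `max_lr` down to `min_lr` at `total_steps`.
fn lr_at(step: usize, total_steps: usize, warmup: usize, max_lr: f32, min_lr: f32) -> f32 {
    if step < warmup {
        // Reaches max_lr on the last warmup step.
        return max_lr * (step + 1) as f32 / warmup as f32;
    }
    let span = (total_steps - warmup).max(1) as f32;
    let progress = ((step - warmup) as f32 / span).min(1.0);
    min_lr + 0.5 * (max_lr - min_lr) * (1.0 + (PI * progress).cos())
}

/// Scale `grads` in place so their global L2 norm is at most `max_norm`.
/// Returns the norm measured before clipping.
fn clip_global_norm(grads: &mut [f32], max_norm: f32) -> f32 {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
    norm
}
```

With the values from the table (`total_steps = 4500`, `warmup = 225`, `max_lr = 6e-4`, `min_lr = 6e-5`), the rate peaks at 6e-4 at the end of warmup and falls to 6e-5 at the end of the run. The "global" in global-norm means one norm is computed over all parameter gradients together, so every tensor is scaled by the same factor.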
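Compute / DDP scaling: dash5, 4× RTX 5090, global throughput ≈ 3604 tok/s on 4 GPUs, wall-clock ≈ 2.8 hours.
⚠️ Weak DDP scaling (KI-1): 3604 tok/s on 4 GPUs is only ≈ 1.08× the v1 single-GPU rate (~3310 tok/s), far from near-linear. The root cause is that this version's global_batch=32 (only 8 per GPU) is too small: the NCCL all-reduce over all parameter gradients at every step is a fixed cost, and with so little compute per GPU the communication/synchronization share becomes too high and eats the scaling. T8's tiny-scale micro-benchmark was near-linear (1.87× on 2 GPUs / 3.01× on 4, see docs/07-distributed.md); the difference is exactly the batch size. v3 will first mitigate this by substantially increasing the global batch (amortizing the all-reduce and keeping the GPUs fed), with bucketed / overlapped all-reduce to follow later. See docs/known-issues.md KI-1 for details.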
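The scaling numbers above reduce to a few lines of arithmetic. The sketch below uses only figures quoted in this document; the function names are illustrative, and the v1 rate is the approximate ~3310 tok/s stated above, so the results are approximate as well.

```rust
/// Throughput ratio of the multi-GPU run over the single-GPU baseline.
fn speedup(multi_tok_per_s: f64, single_tok_per_s: f64) -> f64 {
    multi_tok_per_s / single_tok_per_s
}

/// Parallel efficiency: 1.0 means perfectly linear scaling.
fn parallel_efficiency(speedup: f64, world: u32) -> f64 {
    speedup / world as f64
}

/// Wall-clock hours needed to consume `tokens` at `tok_per_s`.
fn wall_clock_hours(tokens: f64, tok_per_s: f64) -> f64 {
    tokens / tok_per_s / 3600.0
}
```

Plugging in 3604 and 3310 tok/s gives a speedup just under 1.1× and an efficiency of roughly 27% on 4 GPUs, against about 75% for the T8 micro-benchmark (3.01× on 4). `wall_clock_hours(36_864_000.0, 3604.0)` comes to about 2.84, consistent with the ≈ 2.8 hours reported.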
Results
- train loss: start 10.8867 → end 1.7171
- best val loss (held-out 1M tokens): 1.7055 (step 4499)
- val loss curve (every 500 steps; monotonically decreasing, no sign of overfitting):
| step | 499 | 999 | 1499 | 1999 | 2499 | 2999 | 3499 | 3999 | 4499 |
|---|---|---|---|---|---|---|---|---|---|
| val | 2.7340 | 2.3206 | 2.1007 | 1.9800 | 1.8920 | 1.8110 | 1.7622 | 1.7245 | 1.7055 |
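val falls all the way to the final step with no rebound, which means the model is still underfit; more steps/data would keep lowering it (the lever for v3).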
Samples (greedy, sampled directly from xtrain, same prompts)
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the sunshine. One day, she saw a big, red apple on the ground.
She picked it up and took a big
[The little] → The little girl was so happy and she thanked the man for his help. She said
goodbye and went home with a smile on her face. <|endoftext|>
[One day] → One day, the little girl was walking in the park when she saw a big, scary
dog. The dog was barking and running around. The little girl was scared and
started to cry. The dog said
Sampling at temperature 0.8 is equally coherent (multiple characters, complete plots); see RUN.md.
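Improvement over v1
best val loss (the best held-out 1M-token value reported by each training run):
| Model | core parameters | training tokens | best val loss | Notes |
|---|---|---|---|---|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice, samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
| v2 | 28.32M (×3.37) | ~36.9M (×7.2) | 1.7055 | dim384/12L + DDP, val 0.88 lower than v1 |
v1 trained with seq128 and v2 with seq256, and the two best-val figures are the ones reported directly by each training run. For a fully apples-to-apples comparison, both checkpoints were also re-evaluated on **the same held-out set + the same eval settings (seq256 / 64 batches)**: v1 2.6756 → v2 2.0418 (0.634 lower). Both measurements show an improvement in the same direction and of substantial size.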
Side-by-side samples (greedy 40 tok, served by xserv, same prompts)
| prompt | v1 | v2 |
|---|---|---|
| Once upon a time | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog was scared and didn't know what to do | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big |
| One day | One day, she saw a big, shiny ball in the park. She wanted to play with it, but she was too scared to go. | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. The little girl was scared and ran away. The dog chased her |
| The little | The little girl was so happy that she had been able to help. | The little girl was so happy and she thanked the man for his help. She said goodbye and went home with a smile on her face. |
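Conclusion: v1 (8.39M core / 5.1M tokens) can already write coherent little stories, but its sentences are short and it often wraps up after one or two of them. From the same opening, v2 (28.32M core / 36.9M tokens) unfolds longer, more specific chains of events (picks up the apple → takes a bite; meets a dog → the dog chases → runs away), with richer syntax, consistent reference across sentences, and noticeably higher story density. With best val going 2.58→1.71 (0.88 lower) and samples moving from "short sentences that wrap up" to "multi-step plots", v2 is a clear, quantifiable improvement over v1.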
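xserv validation
The model is exported as HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 docs/08; 135 tensors = 12 layers × 11 + embed + norm + lm_head), stored in the registry, and then loaded with xserv-cli for greedy generation: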
$ xserv-cli ~/projects/tiny-models/v2-tinystories-dim384 --max-tokens 40
Model: qwen3, layers=12, hidden=384, heads=12/12 kv, vocab=50257
Loaded 135 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big
xserv> One day, the little girl was walking in the park when she saw a big, scary dog. The dog
was barking and running around. The little girl was scared and ran away. The dog chased her
xserv> The little girl was so happy and she thanked the man for his help. She said goodbye and
went home with a smile on her face. <|endoftext|>
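The export steps named above (2D transpose [in,out]→[out,in], BF16 conversion, and the tensor count) can be sketched on plain slices. This is not the T9 exporter: the function names are hypothetical, a row-major layout is assumed, and round-to-nearest-even is one common f32→BF16 rounding choice that the real exporter may or may not use.

```rust
/// Transpose a row-major [rows, cols] matrix into row-major [cols, rows].
fn transpose(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(data.len(), rows * cols);
    let mut out = vec![0.0f32; data.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = data[r * cols + c];
        }
    }
    out
}

/// Convert an f32 to BF16 bits using round-to-nearest-even.
/// BF16 keeps the top 16 bits of the f32 encoding.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep NaN a NaN after truncation by forcing a mantissa bit.
        return ((bits >> 16) as u16) | 0x0040;
    }
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    (rounded >> 16) as u16
}

/// Widen BF16 bits back to f32 (exact, no rounding involved).
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Expected tensor count: per-layer tensors plus the non-layer ones.
fn tensor_count(layers: usize, per_layer: usize, extra: usize) -> usize {
    layers * per_layer + extra
}
```

`tensor_count(12, 11, 3)` gives the 135 tensors reported by xserv-cli. BF16 keeps only 8 bits of significand precision, so each weight carries a relative error of up to about 0.4% after conversion.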
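token-match: comparing xserv (BF16) against xtrain's own greedy decoding (F32), 2 of the 3 prompts match token for token ("Once upon a time", "The little"). The third ("One day") forks late in the sequence at "scared and ___" because of BF16 drift (xtrain picks "started to cry", xserv picks "ran away"): a tiny difference in a single logit flips the greedy choice, and the sequences diverge from there. This has the same origin as the ~0.5% BF16 drift observed in v1. The closed loop still essentially holds at v2 scale (most prompts match token for token; a few fork near the end because of BF16).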
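A token-match check of this kind boils down to finding the first position where two greedy outputs differ. The sketch below is a hypothetical helper, not the tool used for the comparison above; it also includes a small `argmax` to show how a near-tie between two logits makes the greedy choice sensitive to tiny numeric differences.

```rust
/// Index of the first position where two token sequences differ.
/// Returns None when they are identical; a length mismatch counts as a
/// divergence at the end of the shorter sequence.
fn first_divergence(a: &[u32], b: &[u32]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Greedy choice: index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}
```

If two candidate tokens have logits that differ by less than the rounding error introduced by BF16, `argmax` can pick a different token in the two runtimes. Every later token is then conditioned on a different prefix, so the outputs do not re-converge, which matches the late fork seen on the "One day" prompt.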
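v3 proposal
v2's val curve falls monotonically to the final step (no overfitting), which means the model is still underfit; at the same scale, more steps/data would lower it further. Suggested v3:
- Fix KI-1 first (weak DDP scaling): raise `global_batch` substantially from 32 (e.g. 128–256, i.e. 32–64 per GPU) to amortize the per-step all-reduce and keep the GPUs fed, pulling 4-GPU throughput back toward linear. This is the first lever for speeding up v3.
- Data/steps: with the higher throughput, raise training tokens from ~37M to ~150–300M (still within the full TinyStories corpus, so under one epoch with no repeated data), targeting a val of ~1.4–1.5.
- Model: dim 512 / 16 heads·32 / 16 layers / ffn 2048 → core ≈ 75M (capacity ×2.6). Vocabulary unchanged → embed + lm_head ~51.5M, total ~126M.
- Data ladder: v3 still feeds TinyStories (going from 37M tokens to more steps has not exhausted it, and the model is still in tiny-LM territory). Once core grows further to ~100M+ and TinyStories clearly becomes the capacity ceiling, move up the data ladder to a broader high-quality corpus (e.g. TinyStories mixed with some general-purpose text), and at the same time evaluate whether to switch to a better-fitting tokenizer (to ease KI-4, the large vocab share).
The ladder is already parameterized, so v3 only needs different `--dim/--heads/--layers/--ffn/--steps/--batch` flags plus an adjusted DDP world; model code stays untouched.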
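To make the batch and token targets concrete, the planning arithmetic can be sketched as below. The helper names are hypothetical, and keeping seq_len at 256 for v3 is an assumption, since the proposal does not state it.

```rust
/// Per-rank batch for a given global batch, or None when the global
/// batch does not divide evenly across ranks.
fn per_rank_batch(global_batch: u64, world: u64) -> Option<u64> {
    if world == 0 || global_batch % world != 0 {
        None
    } else {
        Some(global_batch / world)
    }
}

/// Tokens consumed by one optimizer step.
fn tokens_per_step(global_batch: u64, seq_len: u64) -> u64 {
    global_batch * seq_len
}

/// Steps needed to consume at least `target_tokens` (rounded up).
fn steps_for_tokens(target_tokens: u64, tokens_per_step: u64) -> u64 {
    (target_tokens + tokens_per_step - 1) / tokens_per_step
}
```

As a sanity check against v2, `steps_for_tokens(36_864_000, tokens_per_step(32, 256))` returns 4500. For v3, a global batch of 128 at seq_len 256 is 32,768 tokens per step, so a 150M-token run would need about 4,578 steps, which is close to v2's step count while consuming roughly four times the data.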