xtrain/docs/runs/02-v2-tinystories-dim384.md
Gahow Wang bf679f6f1f docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71)
Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h
SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M
tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4
RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58
and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the
dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts
(3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat
(global batch too small → all-reduce dominates) → links docs/known-issues
KI-1; v3 proposal applies KI-1's fix (much larger global batch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:38:31 +08:00


Scaling Run v2: TinyStories + dim384/12L + multi-GPU DDP (Design Document)

Goal

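Building on v1 (dim256/8L, core 8.39M, full TinyStories but only ~5.1M tokens consumed, single GPU), v2 scales along three axes at once (model, data, and parallelism) and is the first multi-GPU DDP training run: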

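  1. Model scale-up: dim 256→384, layers 8→12, heads 8→12; the transformer core reaches ~28M parameters (capacity ×3.4). The vocabulary is unchanged, so with the gpt2 50257 vocab embed+lm_head adds a fixed ~38.6M at dim384; it is reported separately.
  2. Data scale-up: v1 consumed only ~5.1M tokens and underfit (val was still falling at the last step); v2 trains on ~37M tokens (×7.2), reusing v1's cached token-id stream of the full TinyStories corpus (no re-tokenizing the 2GB corpus).
  3. Multi-GPU DDP: use T8's xtrain-distributed (NCCL data parallelism) to train on 4 RTX 5090s, bringing multi-GPU wall-clock back into a bounded range.
  4. Trainer parity: wire T8's train_ddp up to everything bin/train already has (parameterized arch / token cache / held-out val evaluation / warmup→cosine / grad-clip / best-val checkpoint), so single-GPU and DDP share one set of eval/checkpoint logic.
  5. After training, store the model in the registry (~/projects/tiny-models/v2-tinystories-dim384/) and export it to xserv format to verify it can be served; report the concrete improvement over v1 (val loss on the same held-out set + side-by-side samples from the same prompts).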

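Scope (escape hatch evaluated): the single-sequence model design (one independent forward per sequence, per-op launch overhead) puts DDP global throughput at ≈ 3.6K tok/s @ 4 GPUs for dim384/seq256 (GPU utilization is low; for the known bottleneck see docs/06), and this run's small global batch further inflates the all-reduce share (see KI-1). To get a "clear, quantifiable improvement over v1" within a bounded time (~2.8 hours) on a shared machine, v2 trains 4500 steps ≈ 37M tokens and does not try to squeeze out anything beyond 37M. The purpose of v2 is to validate DDP trainer parity plus a clear improvement over v1 (val<2.2), not to saturate the model.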

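Data

| | v1 | v2 |
|---|---|---|
| Source | TinyStories full train | same (reuses v1 cache) |
| Tokens (corpus) | 468,260,367 | same |
| Training tokens consumed | ~5.12M (2500 steps × 2048) | ~36.9M (4500 steps × 8192) |
| tokenizer | gpt2 BPE (vocab 50257) | same |
| Cache | data/tinystories-train.txt.u16.bin (u16, 936MB) | reused directly (no re-tokenizing) |
| held-out val | last 1,000,000 tokens of the full corpus | the same 1M tokens (exactly the same held-out set as v1, for a fair comparison) |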

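Cache reuse: Corpus::load_cached (<corpus>.u16.bin) was written to disk by v1's first run, so v2 loads the 468M tokens immediately at startup and skips from-scratch BPE over the 2GB corpus. The held-out val is still the last 1M tokens of the full corpus (split_tail), the same held-out set as v1, so the v1/v2 val losses are directly comparable.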
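The cache-then-split flow described above can be sketched as follows. This is a minimal illustration, not the xtrain implementation: the on-disk layout (headerless little-endian u16 values) is an assumption, and `decode_u16_le`, `load_cached`, and this `split_tail` are hypothetical stand-ins for the real `Corpus` methods.

```rust
use std::fs;
use std::io;

/// Decode a raw byte buffer into u16 token ids.
/// ASSUMPTION: headerless little-endian u16 values; the real cache layout may differ.
fn decode_u16_le(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Hypothetical stand-in for a cached load: read the whole file and decode it.
fn load_cached(path: &str) -> io::Result<Vec<u16>> {
    Ok(decode_u16_le(&fs::read(path)?))
}

/// Split off the last `n_val` tokens as the held-out set; the rest is training data.
fn split_tail(tokens: &[u16], n_val: usize) -> (&[u16], &[u16]) {
    let n_val = n_val.min(tokens.len());
    tokens.split_at(tokens.len() - n_val)
}

fn main() -> io::Result<()> {
    // Write a tiny fake cache so the example runs without the real corpus.
    let path = std::env::temp_dir().join("demo.u16.bin");
    let ids: Vec<u16> = (0..10).collect();
    let bytes: Vec<u8> = ids.iter().flat_map(|t| t.to_le_bytes()).collect();
    fs::write(&path, bytes)?;

    let tokens = load_cached(path.to_str().unwrap())?;
    let (train, val) = split_tail(&tokens, 3);
    println!("train={:?} val={:?}", train, val);
    Ok(())
}
```

Because the split is taken from the tail of the same token stream, any run that reuses the cache with the same `n_val` gets an identical held-out set.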

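Architecture

v2 = a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA). The forward graph is exactly isomorphic to v0/v1; only the dims grow. No structural changes.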

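| Dimension | v1 | v2 |
|---|---|---|
| dim (= heads·head_dim) | 256 | 384 |
| n_layers | 8 | 12 |
| n_heads | 8 | 12 |
| head_dim | 32 | 32 |
| ffn_hidden (SwiGLU) | 1024 | 1536 |
| vocab | 50257 | 50257 |
| core params (excluding embed+lm_head) | 8,393,472 (≈8.39M) | 28,322,304 (≈28.32M, ×3.37) |
| embed + lm_head (2×vocab×dim) | 25,731,584 (≈25.7M) | 38,597,376 (≈38.6M) |
| total params | 34,125,056 (≈34.13M) | 66,919,680 (≈66.92M) |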

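How core is measured: Config::core_params() = num_params() − 2·vocab·dim. With the gpt2 50257 vocab at dim384, embedding + lm_head take a fixed ~38.6M. These two tables are a function of vocabulary size, not model capacity, so the ladder is measured by core (v2's 28.32M core hits the ~27M target). This is also why v2's 66.9M total "looks big" while its effective capacity is the 28.32M core (for the gpt2 large-vocab share issue, see docs/known-issues.md KI-4).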
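The split between core and vocabulary-driven parameters can be checked with a few lines of arithmetic. The sketch below only applies the subtraction stated above to the totals from the architecture table; the function names are illustrative and are not the actual `Config` methods.

```rust
/// Parameters in the token embedding plus a separate lm_head: 2 * vocab * dim.
fn embed_lm_head_params(vocab: u64, dim: u64) -> u64 {
    2 * vocab * dim
}

/// Core parameters as the text defines them: total minus embed + lm_head.
fn core_params(total: u64, vocab: u64, dim: u64) -> u64 {
    total - embed_lm_head_params(vocab, dim)
}

fn main() {
    let vocab: u64 = 50_257;
    // (name, dim, total params) taken from the architecture table.
    let runs = [("v1", 256u64, 34_125_056u64), ("v2", 384, 66_919_680)];
    for (name, dim, total) in runs {
        let tables = embed_lm_head_params(vocab, dim);
        let core = core_params(total, vocab, dim);
        let share = 100.0 * core as f64 / total as f64;
        println!(
            "{}: embed+lm_head={} core={} core share={:.1}%",
            name, tables, core, share
        );
    }
}
```

The core share printed for each run shows how much of the headline parameter count is vocabulary tables rather than transformer capacity.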

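Architecture changes vs. v1: pure scale-up (dim/layers/heads/ffn), no structural changes. The ladder is already parameterized: v2 only changes the --dim/--heads/--layers/--ffn/--steps flags and does not touch model code.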

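DDP trainer parity (this run's engineering change, commit 7090b47)

v1's single-GPU bin/train already has: parameterized arch, token cache, held-out val evaluation, warmup→cosine, grad-clip, and best-val checkpoint. T8's train_ddp was at the time only a throughput/correctness driver (hard-coded tiny config, Corpus::load without caching, no val/checkpoint). v2 brings it up to the same level as the single-GPU trainer:

  • Reuse rather than rewrite: eval_loss / checkpoint::save both live in xtrain-train and DDP calls them directly, so single-GPU and DDP share one eval/checkpoint path (no duplicated logic).
  • DdpConfig gains eval_every / eval_batches / ckpt_path; train_rank takes valid: Option<&Corpus> and returns DdpResult { losses, evals, best_val }.
  • val/checkpoint on rank 0 only: after DDP, parameters are bit-identical on every rank (verified in T8), so rank 0 holds the val corpus, runs the no-grad eval, and writes the best-val checkpoint; the other ranks have nothing to do here.
  • launch passes the val corpus to rank 0 only; bin/train_ddp now has the same CLI as bin/train (positional tokenizer/corpus + all arch/optimizer/val/ckpt flags) and reuses the u16 cache.
  • T8 semantics unchanged: all-reduce device gradients → /world → local GpuAdamW on each rank; the cross-rank parameter consistency check still passes (see "Verification").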
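The data-parallel step and the rank-0-only checkpoint rule in the list above can be illustrated with a CPU-only simulation. This is a sketch under stated assumptions, not the xtrain-distributed code: the all-reduce is simulated in-process instead of going through NCCL, a plain SGD update stands in for GpuAdamW, and all function names are hypothetical.

```rust
/// Simulated all-reduce: sum gradients element-wise across ranks, then divide
/// by the world size, so every rank sees the same averaged gradient.
fn all_reduce_mean(per_rank_grads: &[Vec<f32>]) -> Vec<f32> {
    let world = per_rank_grads.len() as f32;
    let mut out = vec![0.0f32; per_rank_grads[0].len()];
    for grads in per_rank_grads {
        for (acc, g) in out.iter_mut().zip(grads) {
            *acc += *g;
        }
    }
    for acc in out.iter_mut() {
        *acc /= world;
    }
    out
}

/// Plain SGD update as a stand-in for the per-rank optimizer (the run uses AdamW).
fn local_step(params: &mut [f32], grad: &[f32], lr: f32) {
    for (p, g) in params.iter_mut().zip(grad) {
        *p -= lr * g;
    }
}

/// Only rank 0 saves, and only when the val loss improves on the best so far.
fn rank0_should_save(rank: usize, val_loss: f32, best_val: &mut f32) -> bool {
    if rank != 0 || val_loss >= *best_val {
        return false;
    }
    *best_val = val_loss;
    true
}

fn main() {
    let world = 4;
    // Every rank starts from the same parameters.
    let mut params: Vec<Vec<f32>> = vec![vec![1.0, -2.0, 0.5]; world];
    // Each rank computes a different local gradient from its own shard of the batch.
    let grads: Vec<Vec<f32>> = (0..world)
        .map(|r| vec![0.1 * r as f32, 0.2, -0.3 * r as f32])
        .collect();

    let mean = all_reduce_mean(&grads);
    for p in params.iter_mut() {
        local_step(p, &mean, 0.1);
    }
    let consistent = params.iter().all(|p| p == &params[0]);
    println!("params identical across ranks: {}", consistent);

    let mut best_val = f32::INFINITY;
    for rank in 0..world {
        let saved = rank0_should_save(rank, 1.7055, &mut best_val);
        println!("rank {} saves checkpoint: {}", rank, saved);
    }
}
```

Since every rank applies the same averaged gradient to the same starting parameters, the replicas stay identical, which is what makes evaluating and checkpointing on a single rank sufficient.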

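Hyperparameters

| Item | Value | Notes |
|---|---|---|
| optimizer | hand-written AdamW (GPU-side step) | wd=0.1; β/eps use the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr 6e-4 → min_lr 6e-5 (same as v1) |
| warmup | steps/20 = 225 steps | |
| grad clip | global-norm 1.0 | |
| steps | 4500 | bounded (≈2.8 hours @ 4 GPUs) |
| batch | 32 (global) | DDP splits it across 4 ranks, 8 each; the single-sequence model runs multiple forwards so the tape SUMs the gradients, and ×1/b_local is applied at clip time |
| seq_len | 256 | v1 used 128 (longer context + less per-sequence launch overhead) |
| tokens/step | 32×256 = 8192 | total training tokens ≈ 36.9M |
| world size | 4 | RTX 5090 ×4 (sm_120), GPU 0-3 |
| precision | f32 (training) | converted to BF16 when exporting to xserv (see T9) |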
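The schedule and clipping rows of the table can be sketched as below. The table gives only the endpoints (225 warmup steps, 6e-4 → 6e-5) and the clip threshold, so the exact formulas here are assumptions: a linear ramp that reaches max_lr on the last warmup step, a half-cosine over the remaining steps, and the standard rescale-by-norm clip. They are not taken from the xtrain source.

```rust
use std::f32::consts::PI;

/// Learning rate at `step` (0-based): linear warmup, then cosine decay to `min_lr`.
/// ASSUMPTION: the exact warmup/decay formula used by the trainer may differ.
fn lr_at(step: usize, total: usize, warmup: usize, max_lr: f32, min_lr: f32) -> f32 {
    if step < warmup {
        // Linear ramp that reaches max_lr on the last warmup step.
        max_lr * (step + 1) as f32 / warmup as f32
    } else {
        let progress = (step - warmup) as f32 / (total - warmup).max(1) as f32;
        min_lr + 0.5 * (max_lr - min_lr) * (1.0 + (PI * progress).cos())
    }
}

/// Clip gradients in place so their global L2 norm is at most `max_norm`.
/// Returns the norm measured before clipping.
fn clip_global_norm(grads: &mut [f32], max_norm: f32) -> f32 {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
    norm
}

fn main() {
    let (total, max_lr, min_lr) = (4500usize, 6e-4f32, 6e-5f32);
    let warmup = total / 20; // 225 steps, as in the table
    for step in [0, 224, 225, 2250, 4499] {
        println!("step {:>4}: lr = {:.3e}", step, lr_at(step, total, warmup, max_lr, min_lr));
    }

    let mut grads = vec![3.0f32, 4.0];
    let norm = clip_global_norm(&mut grads, 1.0);
    println!("norm before clip = {}, clipped = {:?}", norm, grads);
}
```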

Compute / DDP scaling: dash5, 4× RTX 5090; global throughput ≈ 3604 tok/s @ 4 GPUs; wall-clock ≈ 2.8 hours.

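⚠️ DDP weak scaling (KI-1): 3604 tok/s on 4 GPUs is only ≈ 1.08× v1's single GPU (~3310 tok/s), far from near-linear. The root cause is that this run's global_batch=32 (only 8 per GPU) is too small: each step performs one NCCL all-reduce over all parameter gradients, which is a fixed cost, and with so little compute per GPU the communication/synchronization share becomes too high and eats the scaling. Compare T8's near-linear 1.87×@2 / 3.01×@4 in a tiny-scale micro-benchmark (see docs/07-distributed.md); the difference is exactly the batch size. v3 first mitigates this by significantly enlarging the global batch (amortizing the all-reduce and keeping the GPUs fed), with bucketed / overlapped all-reduce to follow. See docs/known-issues.md KI-1 for details.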
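The scaling numbers above can be turned into speedup and efficiency figures, and the amortization argument can be made concrete with a toy cost model. In the sketch below, `scaling` uses only the throughputs quoted in the text; `modeled_tok_s` is a deliberately simple model (per-sequence compute plus a fixed per-step all-reduce cost), and the timing constants passed to it in `main` are made-up placeholders, not measurements from this run.

```rust
/// Speedup of an N-GPU run over a single-GPU baseline, and efficiency = speedup / N.
fn scaling(tok_s_multi: f64, tok_s_single: f64, world: u32) -> (f64, f64) {
    let speedup = tok_s_multi / tok_s_single;
    (speedup, speedup / world as f64)
}

/// Toy step-time model (ASSUMPTION, not measured): each step costs
/// `b_local * t_seq` seconds of compute plus a fixed `t_comm` seconds of all-reduce.
fn modeled_tok_s(global_batch: u32, world: u32, seq_len: u32, t_seq: f64, t_comm: f64) -> f64 {
    let b_local = (global_batch / world) as f64;
    let step_time = b_local * t_seq + t_comm;
    (global_batch * seq_len) as f64 / step_time
}

fn main() {
    // Throughputs quoted in the text: v2 on 4 GPUs vs. v1 on a single GPU.
    let (speedup, eff) = scaling(3604.0, 3310.0, 4);
    println!("v2 vs v1: speedup {:.2}x, efficiency {:.0}%", speedup, 100.0 * eff);

    // T8 micro-benchmark speedups quoted in the text, expressed as efficiency.
    println!("T8 @2: {:.1}%  T8 @4: {:.1}%", 100.0 * 1.87 / 2.0, 100.0 * 3.01 / 4.0);

    // HYPOTHETICAL timing constants, chosen only to show the trend.
    let (t_seq, t_comm) = (0.01, 0.5);
    for global_batch in [32, 128, 256] {
        let tok_s = modeled_tok_s(global_batch, 4, 256, t_seq, t_comm);
        println!("global batch {:>3}: modeled {:.0} tok/s", global_batch, tok_s);
    }
}
```

In this model the fixed communication term is paid once per step regardless of batch size, so modeled throughput rises as the global batch grows. The real gain depends on the actual compute and all-reduce times.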

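Results

  • train loss: start 10.8867 → end 1.7171
  • best val loss (held-out 1M tokens): 1.7055 (step 4499)
  • val loss curve (every 500 steps; monotonically decreasing, no sign of overfitting):

| step | 499 | 999 | 1499 | 1999 | 2499 | 2999 | 3499 | 3999 | 4499 |
|---|---|---|---|---|---|---|---|---|---|
| val | 2.7340 | 2.3206 | 2.1007 | 1.9800 | 1.8920 | 1.8110 | 1.7622 | 1.7245 | 1.7055 |

val fell all the way to the last step with no rebound, which means the model is still underfitting: more steps/data can keep lowering it (the v3 lever).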
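The two claims drawn from the curve (the best value sits at the last eval, and the curve never rises) can be checked mechanically. The sketch below hard-codes the eval points from the table; the helper names are illustrative. It also prints ln(vocab), the loss of a uniform guess over 50257 tokens, as a sanity reference for the starting train loss.

```rust
/// (step, val loss) pairs copied from the val loss table.
const CURVE: [(u32, f64); 9] = [
    (499, 2.7340), (999, 2.3206), (1499, 2.1007),
    (1999, 1.9800), (2499, 1.8920), (2999, 1.8110),
    (3499, 1.7622), (3999, 1.7245), (4499, 1.7055),
];

/// The eval point with the lowest val loss, if any.
fn best_val(curve: &[(u32, f64)]) -> Option<(u32, f64)> {
    curve.iter().copied().min_by(|a, b| a.1.total_cmp(&b.1))
}

/// True when every eval is strictly lower than the one before it.
fn strictly_decreasing(curve: &[(u32, f64)]) -> bool {
    curve.windows(2).all(|w| w[1].1 < w[0].1)
}

fn main() {
    println!("best val: {:?}", best_val(&CURVE));
    println!("strictly decreasing: {}", strictly_decreasing(&CURVE));
    // Cross-entropy of a uniform guess over the vocabulary, for comparison
    // with the reported starting train loss of 10.8867.
    println!("ln(50257) = {:.4}", (50257f64).ln());
}
```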

Samples (greedy, sampled directly from xtrain, same prompts)

[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
                     outside in the sunshine. One day, she saw a big, red apple on the ground.
                     She picked it up and took a big
[The little]       → The little girl was so happy and she thanked the man for his help. She said
                     goodbye and went home with a smile on her face. <|endoftext|>
[One day]          → One day, the little girl was walking in the park when she saw a big, scary
                     dog. The dog was barking and running around. The little girl was scared and
                     started to cry. The dog said

Sampling at temperature 0.8 is similarly coherent (multiple characters, complete plots); see RUN.md.

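Improvement over v1

best val loss (the best held-out 1M-token value reported by each training run)

| Model | core params | training tokens | best val loss | Notes |
|---|---|---|---|---|
| v0-baseline | 41K | ~0.72M | 3.8050 | 3MB slice; samples degenerate into loops |
| v1 | 8.39M | ~5.1M | 2.5847 | full data + dim256/8L, single GPU |
| v2 | 28.32M (×3.37) | ~36.9M (×7.2) | 1.7055 | dim384/12L + DDP; val 0.88 lower than v1 |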

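v1 trained with seq128 and v2 with seq256; both best-val numbers are the ones reported directly by their own training runs. For a fully apples-to-apples comparison, both checkpoints were also re-evaluated on **the same held-out set + the same eval settings (seq256 / 64 batch)**: v1 2.6756 → v2 2.0418 (0.634 lower). Both measurements show a substantial improvement in the same direction.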

Side-by-side samples (greedy, 40 tok, served by xserv, same prompts)

| prompt | v1 | v2 |
|---|---|---|
| Once upon a time | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog was scared and didn't know what to do | …a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big |
| One day | One day, she saw a big, shiny ball in the park. She wanted to play with it, but she was too scared to go. | One day, the little girl was walking in the park when she saw a big, scary dog. The dog was barking and running around. The little girl was scared and ran away. The dog chased her |
| The little | The little girl was so happy that she had been able to help. | The little girl was so happy and she thanked the man for his help. She said goodbye and went home with a smile on her face. |

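Conclusion: v1 (8.39M core / 5.1M tokens) can already write coherent short stories, but its sentences are short and it often wraps up after one or two sentences. v2 (28.32M core / 36.9M tokens), given the same openings, unfolds longer and more specific plot chains (picks up the apple → takes a bite; meets a dog → the dog chases → runs away), with richer syntax, consistent cross-sentence reference, and noticeably higher story density. With best val 2.58→1.71 (0.88 lower) and samples moving from "short-sentence wrap-up" to "multi-step plot", v2 is a clear, quantifiable improvement over v1.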

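xserv verification

The model is exported to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16; see T9 docs/08), giving 135 tensors = 12 layers × 11 + embed + norm + lm_head. After it is stored in the registry, xserv-cli loads it and generates greedily: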

$ xserv-cli ~/projects/tiny-models/v2-tinystories-dim384 --max-tokens 40
Model: qwen3, layers=12, hidden=384, heads=12/12 kv, vocab=50257
Loaded 135 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
       sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a big
xserv> One day, the little girl was walking in the park when she saw a big, scary dog. The dog
       was barking and running around. The little girl was scared and ran away. The dog chased her
xserv> The little girl was so happy and she thanked the man for his help. She said goodbye and
       went home with a smile on her face. <|endoftext|>

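token-match (xserv BF16 vs. xtrain's own greedy F32): 2 of the 3 prompts match token-for-token ("Once upon a time", "The little"). The 3rd ("One day") forks late, at "scared and ___", because of BF16 drift (xtrain picks "started to cry", xserv picks "ran away"): a tiny difference in a single logit flips the greedy choice and the sequences diverge from there, the same source as the ~0.5% BF16 drift observed in v1. The closed loop still largely holds at v2 scale (most prompts match token-for-token; a few fork late because of BF16).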
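The two numeric steps of the export (transposing 2D weights from [in,out] to [out,in] and narrowing f32 to BF16) and the way a small rounding difference can flip a greedy choice can be sketched as follows. This is an illustration, not the T9 exporter: round-to-nearest-even is an assumed rounding mode, NaN is not special-cased, and the toy applies rounding directly to two made-up logits rather than to model weights.

```rust
/// Transpose a row-major [rows, cols] matrix into row-major [cols, rows].
fn transpose(w: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    let mut out = vec![0.0f32; w.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = w[r * cols + c];
        }
    }
    out
}

/// f32 -> bf16 bit pattern, keeping the top 16 bits with round-to-nearest-even.
/// ASSUMPTION: rounding mode; NaN inputs are not handled specially.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    let round = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(round) >> 16) as u16
}

/// bf16 bit pattern -> f32 (exact: the low 16 mantissa bits are zero).
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// Index of the first maximal element (greedy pick).
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

fn main() {
    // [in=2, out=3] -> [out=3, in=2]
    println!("{:?}", transpose(&[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2, 3));

    // Two made-up logits that differ by less than bf16 resolution near 1.0.
    let logits = [1.0f32, 1.001];
    let rounded: Vec<f32> = logits
        .iter()
        .map(|&x| bf16_bits_to_f32(f32_to_bf16_bits(x)))
        .collect();
    println!("f32 pick = {}, bf16 pick = {}", argmax(&logits), argmax(&rounded));
}
```

BF16 keeps 8 significant bits, so values near 1.0 are spaced 2^-7 apart. Once two candidate logits fall within that spacing, the greedy pick can change, and every later token is conditioned on a different prefix.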

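v3 proposal

v2's val curve falls monotonically all the way to the last step (no overfitting), which means the model is still underfitting: at the same scale, more steps/data would keep lowering it. Suggested v3:

  • Fix KI-1 (DDP weak scaling) first: enlarge global_batch significantly from 32 (e.g. 128–256, i.e. 32–64 per GPU) to amortize the per-step all-reduce and keep the GPUs fed, pulling 4-GPU throughput back toward linear. This is the first lever for speeding up v3.
  • Data/steps: at the higher throughput, raise training tokens from ~37M to ~150–300M (still within the full TinyStories corpus, without repeating an epoch), with the goal of lowering val further to ~1.4–1.5.
  • Model: dim 512 / 16 heads·32 / 16 layers / ffn 2048 → core ≈ 75M (capacity ×2.6). The vocabulary is unchanged → embed+lm_head ~51.5M, total ~126M.
  • Data ladder: v3 still feeds TinyStories (going from 37M to more steps has not exhausted it, and the model is still in tiny-LM range). Once core grows further to ~100M+ and TinyStories clearly becomes the capacity ceiling, move up the data ladder to a broader high-quality corpus (e.g. TinyStories mixed with some general-purpose text), and at the same time evaluate switching to a better-fitting tokenizer (to ease the KI-4 large-vocab share).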

The ladder is already parameterized: v3 only needs to change the --dim/--heads/--layers/--ffn/--steps/--batch flags and adjust the DDP world size, without touching model code.
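The batch and token targets in the proposal translate into per-GPU batch sizes and step counts with simple arithmetic. The sketch below assumes v3 keeps seq_len at 256 and a world size of 4, neither of which the proposal states explicitly; `plan` is a hypothetical helper, not part of the trainer.

```rust
/// Returns (per-GPU batch, tokens per step, steps needed to consume `token_budget`).
/// The step count is rounded up so the budget is fully consumed.
fn plan(global_batch: u64, world: u64, seq_len: u64, token_budget: u64) -> (u64, u64, u64) {
    let per_gpu = global_batch / world;
    let tokens_per_step = global_batch * seq_len;
    let steps = (token_budget + tokens_per_step - 1) / tokens_per_step;
    (per_gpu, tokens_per_step, steps)
}

fn main() {
    // v2 as run: global batch 32, 4 GPUs, seq_len 256, 36,864,000 tokens.
    println!("v2: {:?}", plan(32, 4, 256, 36_864_000));

    // v3 candidates. ASSUMPTION: seq_len stays 256 and world size stays 4.
    for (global_batch, budget) in [(128u64, 150_000_000u64), (256, 300_000_000)] {
        let (per_gpu, tokens_per_step, steps) = plan(global_batch, 4, 256, budget);
        println!(
            "batch {}: {} per GPU, {} tokens/step, {} steps for {} tokens",
            global_batch, per_gpu, tokens_per_step, steps, budget
        );
    }
}
```

Under these assumptions both candidate configurations need a step count close to v2's 4500, so the extra tokens would come from larger steps rather than more of them.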