xtrain/docs/runs/01-v1-tinystories-dim256.md
Gahow Wang 264660527f docs: run v1 — TinyStories full, dim256
docs/runs/01-v1-tinystories-dim256.md + docs/runs/README.md comparison table.
v1: full TinyStories train (468.3M tok, u16-cached) + dim256/8L (core 8.39M).
Same-held-out-set val loss v0 3.8050 → v1 2.5847 (−1.22); v1 samples coherent
stories vs v0's "mommy's mommy's mommy" loop; exports + serves token-identical
in xserv. Single RTX 5090, ~25.9 min, ~3310 tok/s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:09:46 +08:00

# Scaling Run v1: Full TinyStories + dim256/8L - Design Document
## Goal
Building on v0-baseline (dim32/4L, core ~41K params, trained only on a 3 MB slice of TinyStories valid), this run is the first **meaningful scale-up**:

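1. **Scale the data**: from the 3 MB slice to the **full TinyStories train set** (468.3M tokens), and fix the problem that re-running from-scratch BPE over a 2 GB corpus on every run is too slow: tokenize **once**, cache the token-id stream on disk, and read the cache on later runs.
2. **Scale the model**: dim 32→256, layers 4→8, heads 2→8, bringing the **transformer core to ~8M params** (embedding + lm_head add a further ~25.7M because the gpt2 vocab is fixed at 50257; this is expected and reported separately).
3. **Parameterized ladder**: make the model size CLI-configurable (`--heads/--head-dim/--layers/--ffn`) instead of hard-coded, so that v2/v3 only change flags and no longer touch code.
4. After training, save to the registry (`~/projects/tiny-models/v1-tinystories-dim256/`), export to xserv format to verify the model can be served, and report the **concrete improvement over v0** (val loss on the same held-out set + side-by-side samples from the same prompts).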
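
> Scope (escape hatch evaluated): downloading the full 2 GB and tokenizing all of it both turned out to be fast in practice (download near-instant, tokenize ~75 s to produce the 468M-token cache), so v1 **uses the full corpus**. Training runs 2,500 steps (≈5.1M tokens) within a bounded budget and does not try to complete a full epoch. The purpose of v1 is "a clear, quantifiable improvement over v0 + a parameterized ladder + a design document", not squeezing everything out of the model.
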
## Data

| Item | v0-baseline | v1 |
|------|-------------|----|
| Source | 3 MB byte slice of TinyStories **valid** | **Full TinyStories train** (via hf-mirror.com) |
| Raw size | 3 MB | 1.92 GB (`TinyStories-train.txt`, Content-Length 1924281556) |
| Tokens | ~720K | **468,260,367** (≈468M) |
| Tokenizer | reuses xserv's from-scratch GPT-2 BPE (vocab 50257) | same |
| Cache | none (re-tokenized on every run) | **`<corpus>.u16.bin`** (u16 stream of 468M tokens, ≈936 MB); tokenized once on the first run and written to disk, read directly afterwards |
| Validation set | no separate val split | the last **1,000,000 tokens** of the full corpus are held out as val (never touched by training) |

**Download**: `curl -sL https://hf-mirror.com/datasets/roneneldan/TinyStories/resolve/main/TinyStories-train.txt`
(hf-mirror 302-redirects to the xethub CDN, which can be reached directly; direct access to HF is blocked).

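**Cache design (`crates/xtrain-train/src/data.rs`)**: gpt2 vocab=50257 < 65536, so token ids are stored losslessly as **u16**. On the first run, `Corpus::load_cached` tokenizes the whole corpus and writes `<path>.u16.bin` (little-endian `[u16]`, with the path as the key). Later runs read the cache directly and skip BPE. Measured: tokenizing the full 1.92 GB once takes ~75 s, and every run after that loads the cache **near-instantly**. This amortizes the recurring cost of "2 GB × BPE on every run" into a single pass.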
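
The caching scheme described above can be sketched in a few lines. This is a minimal Python illustration, not the Rust code in `data.rs`: the function name `load_cached` only mirrors `Corpus::load_cached`, and both the `tokenize` callback and the headerless file layout are assumptions made for the example.

```python
import os
import sys
from array import array


def load_cached(corpus_path, tokenize):
    """Return token ids for corpus_path, running tokenize only on a cache miss.

    Assumed layout (hypothetical): raw little-endian u16 values, no header,
    stored next to the corpus as <corpus_path>.u16.bin.
    """
    cache_path = corpus_path + ".u16.bin"
    if os.path.exists(cache_path):
        ids = array("H")  # "H" = unsigned 16-bit integer
        with open(cache_path, "rb") as f:
            ids.frombytes(f.read())
        if sys.byteorder == "big":
            ids.byteswap()  # the file is little-endian
        return ids

    with open(corpus_path, encoding="utf-8") as f:
        # array("H", ...) raises OverflowError for any id above 65535,
        # so a vocab that does not fit in u16 fails loudly.
        ids = array("H", tokenize(f.read()))

    on_disk = array("H", ids)
    if sys.byteorder == "big":
        on_disk.byteswap()
    with open(cache_path, "wb") as f:
        f.write(on_disk.tobytes())
    return ids
```

At 2 bytes per token, 468,260,367 tokens come to roughly 936.5 MB, which is consistent with the cache size listed in the table.
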
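**Why this is higher quality and larger than v0**: v0 was fed only a single 3 MB byte slice of the valid set (~720K tokens), grabbed by byte range, so the first and last stories are truncated. It covers very few stories, with highly repetitive vocabulary and sentence patterns, so the model can only memorize a handful of templates (its samples loop on "mommy's mommy's mommy"). v1 uses the full train set (468M tokens, millions of complete short stories), whose coverage of stories, sentence patterns, and vocabulary is orders of magnitude wider, so random-window sampling sees far richer language structure.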
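
The held-out split and random-window sampling mentioned above could look roughly like this. It is a hedged Python sketch: `split_holdout` and `random_window` are hypothetical names, and the next-token target convention (targets are the inputs shifted by one) is an assumption rather than something stated in the run notes.

```python
import random


def split_holdout(ids, val_tokens):
    """Keep the final val_tokens ids as validation; everything before is train."""
    if not 0 < val_tokens < len(ids):
        raise ValueError("val_tokens must be between 1 and len(ids) - 1")
    cut = len(ids) - val_tokens
    return ids[:cut], ids[cut:]


def random_window(train, seq_len, rng):
    """Pick one (inputs, targets) pair; targets are inputs shifted by one token."""
    # randrange upper bound leaves room for the extra target token.
    start = rng.randrange(len(train) - seq_len)
    inputs = train[start:start + seq_len]
    targets = train[start + 1:start + seq_len + 1]
    return inputs, targets


# Toy usage: 100 "tokens", 10 held out, one window of length 8.
train, val = split_holdout(list(range(100)), 10)
x, y = random_window(train, 8, random.Random(0))
```

Because windows are drawn only from `train`, the validation tail is never seen during training, which is what makes the later v0/v1 comparison on that tail meaningful.
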
## Architecture
v1 is a larger, structurally identical tiny Qwen3 (RoPE + RMSNorm + per-head QK-norm + SwiGLU + separate lm_head, MHA). The forward graph has exactly the same structure as v0; only the dims grow.


| Dimension | v0-baseline | v1 |
|-----------|-------------|----|
| dim (= heads·head_dim) | 32 | **256** |
| n_layers | 4 | **8** |
| n_heads | 2 | **8** |
| head_dim | 16 | **32** |
| ffn_hidden (SwiGLU) | 64 | **1024** |
| vocab | 50257 | 50257 |
| **core params** (excl. embed+lm_head) | **41,376** | **8,393,472** (≈8.39M) |
| embed + lm_head (2×vocab×dim) | 3,216,448 | 25,731,584 (≈25.7M) |
| **total params** | 3,257,824 | **34,125,056** (≈34.13M) |

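**How core is measured**: `Config::core_params() = num_params() − 2·vocab·dim`. With the gpt2 50257 vocab at dim 256, embedding + lm_head take a fixed ~25.7M. These two tables are a function of **vocabulary size**, not of model capacity, so the ladder is measured by **core** (v1's 8.39M core hits the ~8M target). This is also why v1's 34M total looks large while its effective capacity is the 8.39M core.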
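
The counts in the table can be reproduced with a short calculation. The per-layer breakdown below (bias-free q/k/v/o projections, three SwiGLU matrices, two RMSNorm vectors, q/k norm vectors of length `head_dim`, plus one final RMSNorm) is an assumption inferred from the architecture list rather than taken from the xtrain source; under that assumption it matches both reported core counts.

```python
def core_params(n_heads, head_dim, n_layers, ffn):
    """Transformer-core parameter count under the assumed layer breakdown."""
    dim = n_heads * head_dim
    attn = 4 * dim * dim              # q, k, v, o projections (MHA, no bias)
    mlp = 3 * dim * ffn               # SwiGLU gate, up, down
    norms = 2 * dim + 2 * head_dim    # two RMSNorms + per-head q/k norm
    return n_layers * (attn + mlp + norms) + dim  # + final RMSNorm


def total_params(vocab, n_heads, head_dim, n_layers, ffn):
    """Core plus the two vocab-sized tables (embedding and lm_head)."""
    dim = n_heads * head_dim
    return core_params(n_heads, head_dim, n_layers, ffn) + 2 * vocab * dim


print(core_params(2, 16, 4, 64))             # 41376     (v0)
print(core_params(8, 32, 8, 1024))           # 8393472   (v1)
print(total_params(50257, 2, 16, 4, 64))     # 3257824   (v0)
print(total_params(50257, 8, 32, 8, 1024))   # 34125056  (v1)
```

Subtracting `2 * vocab * dim` from the total is the same operation as the `core_params()` definition quoted above, just written from the other direction.
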
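**Architecture changes vs v0**: a pure scale-up with no structural changes (QK-norm/RoPE/SwiGLU/MHA were all already present in v0 and were aligned with xserv in T9). The only engineering change is **parameterizing** the sizes (see below).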
### Parameterized ladder (implementation)
`Config::from_arch(vocab, n_heads, head_dim, n_layers, ffn)` derives `dim = n_heads·head_dim`. Both `bin/train` and `bin/export_safetensors` read the architecture from CLI flags (`--heads/--head-dim/--layers/--ffn`); the defaults reproduce the v0 tiny config, so v2/v3 only change flags:
```sh
# v1
./train tokenizer.json data/tinystories-train.txt \
--heads 8 --head-dim 32 --layers 8 --ffn 1024 \
--steps 2500 --batch 16 --seq 128 --max-lr 6e-4 --min-lr 6e-5 \
--val-tokens 1000000 --eval-every 250 --ckpt /tmp/xtrain_v1.ckpt
```
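
To make the flag-to-config path concrete, here is a small Python analogue of the idea. It is only a sketch: the real `Config::from_arch` is Rust, the field layout of `Config` is assumed, and `parse_arch` is a hypothetical helper. The flag names and the v0 defaults come from this document.

```python
import argparse
from dataclasses import dataclass

GPT2_VOCAB = 50257


@dataclass(frozen=True)
class Config:
    vocab: int
    dim: int
    n_heads: int
    head_dim: int
    n_layers: int
    ffn: int

    @classmethod
    def from_arch(cls, vocab, n_heads, head_dim, n_layers, ffn):
        # dim is never passed in: it is always derived from the head shape.
        return cls(vocab, n_heads * head_dim, n_heads, head_dim, n_layers, ffn)


def parse_arch(argv):
    """Read the four architecture flags; defaults reproduce the v0 tiny config."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--heads", type=int, default=2)
    parser.add_argument("--head-dim", type=int, default=16)
    parser.add_argument("--layers", type=int, default=4)
    parser.add_argument("--ffn", type=int, default=64)
    args = parser.parse_args(argv)
    return Config.from_arch(
        GPT2_VOCAB, args.heads, args.head_dim, args.layers, args.ffn
    )


v0 = parse_arch([])
v1 = parse_arch(["--heads", "8", "--head-dim", "32", "--layers", "8", "--ffn", "1024"])
```

Deriving `dim` instead of accepting it as a flag rules out configurations where `dim` is not a multiple of the head count.
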
## Hyperparameters

| Item | Value | Notes |
|----|----|----|
| optimizer | hand-written AdamW (GPU step) | wd=0.1; β/eps at the xtrain-optim defaults |
| LR schedule | linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** |
| warmup | steps/20 = 125 | |
| grad clip | global-norm 1.0 | |
| steps | **2500** | bounded (≈25 min, single GPU) |
| batch | **16** | single-sequence model: gradients are SUMmed on the tape over multiple forwards, and the clip stage applies ×1/batch to take the mean |
| seq_len | **128** | v0 used 64 |
| tokens/step | 16×128 = 2048 | total training tokens ≈ 5.12M |
| Precision | f32 (training) | converted to BF16 on export to xserv (T9) |

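
The schedule and gradient handling in the table can be sketched as follows. This is an illustrative Python version under stated assumptions: the exact warmup formula (linear in `step + 1`) and the order "average first, then clip" are guesses at reasonable conventions, and `lr_at` and `mean_and_clip` are hypothetical names, not xtrain functions.

```python
import math


def lr_at(step, total_steps=2500, max_lr=6e-4, min_lr=6e-5):
    """Linear warmup over total_steps/20 steps, then cosine decay to min_lr."""
    warmup = total_steps // 20  # 125 for a 2500-step run
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def mean_and_clip(grad_sum, batch, max_norm=1.0):
    """Turn summed per-sequence gradients into a mean, then clip by global norm."""
    grads = [g / batch for g in grad_sum]
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads, norm


# Toy usage: a 2-element "gradient" summed over 16 sequences.
clipped, norm_before = mean_and_clip([48.0, 64.0], batch=16)
```

With these definitions the learning rate peaks at the end of warmup (step 124) and reaches `min_lr` when `step` equals `total_steps`.
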
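**Compute**: a single RTX 5090 on dash5 (GPU 1, sm_120), throughput ≈ **3.3K tok/s** (with the single-sequence design, GPU utilization of ~25-29% is a known bottleneck, see docs/06); wall-clock **25.9 min** (1551 s, EXIT=0). A multi-GPU DDP path exists (T8, ~1.87x at 2 GPUs), but a single GPU was already enough for v1 to clearly beat v0, so it was not enabled; it is kept as a speed-up lever for v2.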
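
The throughput and budget figures follow directly from the hyperparameters. The short check below uses only numbers stated in this document; treating "train tokens available" as the corpus minus the 1M held-out tokens is the one assumption.

```python
steps, batch, seq_len = 2500, 16, 128
wall_clock_s = 1551
train_tokens_available = 468_260_367 - 1_000_000  # corpus minus held-out val

tokens_seen = steps * batch * seq_len
throughput = tokens_seen / wall_clock_s
epoch_fraction = tokens_seen / train_tokens_available

print(tokens_seen)               # 5120000
print(round(throughput))         # 3301
print(f"{epoch_fraction:.2%}")   # 1.10%
```

So the run sees only about 1% of the available training tokens, which is why it is described as bounded rather than as a full epoch.
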
## Results
- **train loss**: start 10.8590 → end 2.6247
- **best val loss (held-out 1M tokens)**: **2.5847** (step 2499)
- val loss curve (every 250 steps):

| step | 249 | 499 | 749 | 999 | 1249 | 1499 | 1749 | 1999 | 2249 | 2499 |
|------|----|----|----|----|------|------|------|------|------|------|
| val | 3.8609 | 3.3534 | 3.1114 | 2.9702 | 2.8498 | 2.7643 | 2.7046 | 2.6496 | 2.6124 | **2.5847** |
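
Val loss decreases monotonically with no sign of overfitting. It keeps falling through the final step, which indicates the model is still underfit at 2,500 steps: more steps/data would lower it further (a v2 lever).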
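
The "monotonic, still improving" reading of the curve can be checked numerically. The snippet uses the ten val-loss values from the table and computes the drop over each 250-step interval.

```python
val = [3.8609, 3.3534, 3.1114, 2.9702, 2.8498,
       2.7643, 2.7046, 2.6496, 2.6124, 2.5847]

# Improvement over each 250-step interval (positive = loss went down).
drops = [round(a - b, 4) for a, b in zip(val, val[1:])]
print(drops)
# [0.5075, 0.242, 0.1412, 0.1204, 0.0855, 0.0597, 0.055, 0.0372, 0.0277]
```

Every interval improves, and the gains shrink steadily; the last interval still lowers val loss by about 0.028, so the curve had not flattened when training stopped.
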
### Samples (greedy, generated directly by xtrain, same prompts)
```
[Once upon a time] → Once upon a time, there was a little girl named Lily. She loved to play
outside in the sunshine. One day, she saw a big, scary dog. The dog was
scared and didn't know what to do
[The little] → The little girl was so happy that she had been able to help.
<|endoftext|> Once upon a time there was a little girl named Lucy. She was
three years old and loved to explore. One day,
[One day] → One day, she saw a big, shiny ball in the park. She wanted to play with it,
but she was too scared to go. She went to the park and saw a big, scary dog
```
Sampling at temperature 0.8 is equally coherent (multiple characters, complete plots); see `RUN.md`.
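
For readers unfamiliar with the two decoding modes mentioned here, this is a minimal Python sketch of greedy selection versus temperature sampling over one logit vector. It illustrates the general technique only; the function names are hypothetical and nothing here is taken from the xtrain or xserv samplers.

```python
import math
import random


def greedy(logits):
    """Always pick the highest-scoring token id (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_with_temperature(logits, temperature, rng):
    """Sample a token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]


logits = [0.0, 5.0, 0.0]
rng = random.Random(0)
picks = [sample_with_temperature(logits, 0.8, rng) for _ in range(200)]
```

Greedy decoding is deterministic, which is what makes a token-for-token comparison between two implementations possible; temperature sampling trades that determinism for variety.
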
## Improvement over v0
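**Val loss on the same held-out set (the last 1M tokens of the v1 train corpus)**: `bin/train --eval-ckpt` loads each model's checkpoint and computes cross-entropy on the **same held-out 1M tokens**, putting both models on one metric for a fair comparison:
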

| Model | core params | Training data | **val loss (same 1M held-out)** |
|-------|-------------|---------------|---------------------------------|
| v0-baseline | 41K | 3 MB slice (~720K tok) | **3.8050** |
| v1 | 8.39M (**×203**) | full 468M (**×650**) | **2.5847** (**−1.22**) |

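
To put the loss values in perspective, the sketch below computes token-level cross-entropy from logits and converts a loss to perplexity. It is a generic Python illustration, not the `--eval-ckpt` implementation; it assumes the reported losses are mean cross-entropy in nats, which is consistent with the starting train loss of 10.8590 being close to ln(50257).

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of target under softmax(logits), in nats."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[target]


def perplexity(loss):
    """Perplexity corresponding to a mean cross-entropy in nats."""
    return math.exp(loss)


# A model that knows nothing (uniform over the 50257-token vocab):
print(round(cross_entropy([0.0] * 50257, 0), 4))  # 10.8249
print(round(perplexity(3.8050), 1))               # 44.9  (v0)
print(round(perplexity(2.5847), 1))               # 13.3  (v1)
```

Under that assumption, the 1.22 drop in val loss corresponds to perplexity falling from roughly 45 to roughly 13 on the same held-out tokens.
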
### Side-by-side samples (greedy, 40 tok, served by xserv, same prompts)
| prompt | v0-baseline | v1 |
|--------|-------------|----|
| `Once upon a time` | a little girl named Lily. **Timmy** loved to play with her mommy. One day, **Timmy's mommy's mommy's mommy**. "I'm sorry, I | a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, scary dog. The dog was scared and didn't know what to do |
| `The little` | The little girl named Lily. She loved to play with her mommy. One day, **Timmy's mommy's mommy's mommy**. "I'm sorry, I can't have a good time | The little girl was so happy that she had been able to help. |
| `One day` | One day, **Timmy's mommy's mommy's mommy**. "I'm sorry, I can't be careful and be careful. I'm sorry, I can't have a good time. | One day, she saw a big, shiny ball in the park. She wanted to play with it, but she was too scared to go. |
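
**Conclusion**: v0 (41K core / 3 MB data) learned only a handful of templates. Its subjects and references break down (Lily↔Timmy mixed up), and it immediately falls into the **"mommy's mommy's mommy"** degenerate loop: it memorized a few n-grams and has no coherent story modeling. v1 (8.39M core / full data) holds a single protagonist steadily and writes complete short stories with a setting (sunshine/park), a plot (saw a dog → scared), and cross-sentence consistency, and it correctly emits `<|endoftext|>` to separate the next story. That is exactly what TinyStories wants a tiny model to learn.
**Val loss lower by 1.22 + samples going from "looping repetition" to "coherent narrative"**: v1 is a clear, quantifiable improvement over v0.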
## xserv verification
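The model is exported to HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, per T9 and `docs/08`; 91 tensors). After it is saved to the registry, `xserv-cli` loads it and generates greedily. The output **matches xtrain's own greedy generation token for token**, so the closed loop still holds at v1 scale: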
```
$ xserv-cli ~/projects/tiny-models/v1-tinystories-dim256 --max-tokens 40
Model: qwen3, layers=8, hidden=256, heads=8/8 kv, vocab=50257
Loaded 91 tensors
xserv> Once upon a time, there was a little girl named Lily. She loved to play outside in the
sunshine. One day, she saw a big, scary dog. The dog was scared and didn't know what to do
xserv> The little girl was so happy that she had been able to help.
xserv> One day, she saw a big, shiny ball in the park. She wanted to play with it, but she was
too scared to go.
```
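
The two tensor transformations named in the export step, the [in,out]→[out,in] transpose and the f32→BF16 conversion, can be illustrated in plain Python. This is a sketch under assumptions: the exporter's actual rounding mode is not stated in this document (round-to-nearest-even is assumed here, and NaN is not special-cased), and the 11-tensors-per-layer breakdown is inferred from the architecture list because it matches the reported total of 91.

```python
import struct


def transpose(weight):
    """[in, out] nested list -> [out, in] nested list."""
    return [list(col) for col in zip(*weight)]


def f32_to_bf16_bits(x):
    """Keep the top 16 bits of the f32 pattern, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Expand a BF16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def tensor_count(n_layers):
    # Assumed per layer: q, k, v, o, gate, up, down, 2 RMSNorms, q_norm, k_norm.
    # Plus embedding, final norm, and lm_head.
    return 11 * n_layers + 3


print(transpose([[1, 2, 3], [4, 5, 6]]))   # [[1, 4], [2, 5], [3, 6]]
print(hex(f32_to_bf16_bits(1.0)))          # 0x3f80
print(bf16_bits_to_f32(f32_to_bf16_bits(0.1)))  # 0.10009765625
print(tensor_count(8))                     # 91
```

BF16 keeps the f32 exponent range but only 8 bits of precision, so a value such as 0.1 shifts slightly; token-identical greedy output after export suggests that this rounding did not flip any argmax for these prompts.
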
## v2 proposal
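v1's val curve falls monotonically through the final step with no overfitting, i.e. the model is **underfit**: at the same scale, more steps/data would still lower the loss. The suggestion for v2 is to pull on two axes at once:
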
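- **Data/steps**: raise training tokens from ~5M to ~50-100M, using DDP on 2-4 GPUs to bring wall-clock back to ~30 min (the T8 path is ready; `train_ddp` only needs to be wired up to the parameterized config + cache + best-val checkpoint).
- **Model**: dim 384 / 12 heads·32 / 12 layers / ffn 1536 → core ≈ **27M** (still tiny, but ~3x the capacity). The vocab is unchanged, so embed+lm_head ≈ 38.6M and the total ≈ 66M.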
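
The ladder is already parameterized: v2 only changes the `--heads/--head-dim/--layers/--ffn/--steps` flags and launches with DDP, without touching model code. Expected outcome: a further clear drop in val loss (target < 2.2), with more stable samples over longer contexts and more complex plots.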
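
As a rough sizing aid for the proposal, the snippet below applies a per-layer parameter breakdown to the suggested v2 shape and converts the token targets into step counts. Both parts rest on assumptions: the breakdown is inferred (it reproduces the v0 and v1 core counts but is not taken from the source), and the step counts assume tokens/step stays at 2048, which the proposal does not specify.

```python
import math


def core_params(n_heads, head_dim, n_layers, ffn):
    """Assumed breakdown: 4 attention + 3 SwiGLU matrices, norms, final norm."""
    dim = n_heads * head_dim
    per_layer = 4 * dim * dim + 3 * dim * ffn + 2 * dim + 2 * head_dim
    return n_layers * per_layer + dim


vocab = 50257
core = core_params(n_heads=12, head_dim=32, n_layers=12, ffn=1536)
tables = 2 * vocab * 12 * 32          # embedding + lm_head at dim 384
print(core)                           # 28321920
print(tables)                         # 38597376
print(core + tables)                  # 66919296

tokens_per_step = 16 * 128            # assumed unchanged from v1
print(math.ceil(50_000_000 / tokens_per_step))    # 24415
print(math.ceil(100_000_000 / tokens_per_step))   # 48829
```

Under this counting the proposed shape lands near 28.3M core, slightly above the ~27M quoted in the list, so both figures are best read as estimates. At v1's batch and sequence length, 50-100M tokens means roughly 10-20x the v1 step count.
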