xtrain/docs/runs/06-v6-fineweb-edu-dim768.md

# Scaling Run v6: Leaving TinyStories for Pure FineWeb-edu Real Web Text + dim768/18L (same as v4/v5) + 8-GPU DDP bf16 (Design Document)

## Goal

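v5 gave a clean conclusion: **dim768/127M-core is already near data saturation on TinyStories** (same arch,
×3.5 data, val down only 5%, flat tail). v5 closed with the judgment that "the bottleneck is the **corpus**,
not capacity": squeezing more epochs out of TinyStories yields thin returns, and the real lever is
**switching to a broader, more realistic corpus**.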

v6 is the first step in acting on that judgment: **the first version to leave TinyStories entirely and train
on pure FineWeb-edu** (real educational web text).

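1. **Architecture fully frozen = v4/v5** (dim 768 / 24 heads × 32 head_dim / 18 layers / SwiGLU ffn 2048,
   core 127.43M, total 204.63M). **Not a single weight dimension changes.** Just as v5 "only changed the data
   volume", v6 "only changes the data source", making "TinyStories → FineWeb-edu" the sole variable under test.
2. **Data graduates from a toy corpus to the real web**: v0–v5 all trained on TinyStories (GPT-synthesized,
   vocabulary-controlled stories for young children); v6 switches to FineWeb-edu (3 parquet shards from the
   `sample/10BT` subset, 2.255B tokens of real educational web text). The aim is not "a lower val" but
   **exposing the model to a real-world language distribution** (history, science, technology, expository
   writing), to test "richer data → richer language".
3. **bf16 + 8 GPUs at global batch 256 (same sweet spot as v5)**: reuses T12's bf16 mixed precision (fp32
   master), steady state ~218K tok/s, ~1.9h to train 2.29B tokens (~1.02 epoch).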

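> ### ⚠️ Methodology note (the most important point in this version)
>
> **v6's val loss (FineWeb-edu, 3.0652) and the v0–v5 val losses (TinyStories, ~1.1) are not measured on the
> same yardstick and cannot be compared directly.** TinyStories is synthetic children's stories with controlled
> vocabulary and syntax, so its **entropy is low**: a well-trained model can push val down to ~1.1. FineWeb-edu
> is real web text with open-ended topics, vocabulary, and sentence structure, so its **entropy is much higher
> to begin with**; a val of ~3.0 for a same-size model is **entirely expected, not a regression**.
>
> So **v6 should not be read as "val 3.07 is worse than v5's 1.11"**. This version is judged on two criteria:
> **(a) sample quality under general prompts** (can v6 write coherent real-world English instead of falling
> back into children's stories?);
> **(b) transfer eval** (v6's performance on the TinyStories held-out set, quantifying what "switching to
> general data" costs on the original distribution).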

## Data (what actually changes in v6)

| Item | v5 | v6 |
|----|----|----|
| Source | Full TinyStories train split (synthetic children's stories) | **FineWeb-edu** (HuggingFaceFW/fineweb-edu, `sample/10BT`, real educational web pages) |
| Corpus size | 468.26M tokens | **2,254,904,418 tokens** (3 parquet shards) |
| **Tokens consumed in training** | ~2.49B (38000 steps) | **~2.29B** (35000 steps × global 256 × seq 256) |
| Epochs | ~5.33 | **~1.02** |
| tokenizer | gpt2 BPE (vocab 50257) | **Same (deliberately unchanged, to isolate the data variable)** |
| Cache | `data/tinystories-train.txt.u16.bin` | `data/fineweb-edu.txt.u16.bin` (4.51GB u16) |
| held-out val | Last 1M tokens of TinyStories | **Last 1M tokens of FineWeb-edu (not comparable with v0–v5; different distribution)** |
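
The token budget in the table can be reproduced with a few lines of arithmetic. The sketch below uses only
figures stated in this document; the variable names are illustrative, and the epoch count is taken against the
full corpus (subtracting the 1M held-out tokens does not change it at two decimals).

```python
# Token-budget arithmetic for v6, using only figures stated in this document.
STEPS = 35_000
GLOBAL_BATCH = 256          # per-rank 32 x world 8
SEQ_LEN = 256
CORPUS_TOKENS = 2_254_904_418
GPT2_VOCAB = 50_257

tokens_per_step = GLOBAL_BATCH * SEQ_LEN
train_tokens = STEPS * tokens_per_step
epochs = train_tokens / CORPUS_TOKENS

# Every gpt2 token id fits in an unsigned 16-bit integer, so the cache
# costs 2 bytes per token.
assert GPT2_VOCAB <= 2**16
cache_gb = CORPUS_TOKENS * 2 / 1e9

print(f"tokens/step  = {tokens_per_step}")
print(f"train tokens = {train_tokens / 1e9:.2f}B")
print(f"epochs       = {epochs:.2f}")
print(f"u16 cache    = {cache_gb:.2f} GB")
```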

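**New data pipeline** (`scripts/fineweb_to_txt.py`): streams the `text` column out of the FineWeb-edu parquet
shards **one row group at a time**, appends one `<|endoftext|>` after each document (gpt2 id 50256, the
document boundary for Corpus), and writes a single UTF-8 `.txt`; `Corpus::load_cached` then tokenizes it once
with gpt2 BPE into the `.u16.bin` cache (so BPE is never repeated). The whole pipeline **adds only this one
script**; Corpus, the tokenizer, and the trainer all reuse the existing v0–v5 code. That is exactly the
engineering boundary of "only change the data source".
(The redundant 10.45GB `.txt` is deleted after training and can be regenerated from the script plus the parquet
shards; the 4.51GB `.u16.bin` cache stays on dash5.)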
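
A minimal sketch of the streaming step, for orientation only: the real `scripts/fineweb_to_txt.py` is not
reproduced in this document, so the function names `iter_parquet_texts` and `write_corpus` are hypothetical,
`pyarrow` is an assumed dependency, and writing the marker with no extra newline is an assumption about the
output layout.

```python
from typing import Iterable, Iterator

EOT = "<|endoftext|>"  # gpt2 id 50256, the document boundary


def iter_parquet_texts(path: str, column: str = "text") -> Iterator[str]:
    """Yield documents from one parquet shard, one row group at a time."""
    import pyarrow.parquet as pq  # assumed dependency, imported lazily

    shard = pq.ParquetFile(path)
    for i in range(shard.num_row_groups):
        # Only the requested column of the current row group is in memory.
        table = shard.read_row_group(i, columns=[column])
        for text in table.column(column).to_pylist():
            if text:  # skip null or empty rows
                yield text


def write_corpus(texts: Iterable[str], out_path: str) -> int:
    """Write each document followed by one EOT marker; return the doc count."""
    n_docs = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for text in texts:
            out.write(text)
            out.write(EOT)  # assumed layout: marker directly after the text
            n_docs += 1
    return n_docs
```

To process several shards, chain the per-shard iterators, for example
`write_corpus(itertools.chain.from_iterable(iter_parquet_texts(p) for p in shard_paths), out_path)`, where
`shard_paths` and `out_path` are placeholders. Reading by row group keeps peak memory bounded by one row group
rather than by a whole shard.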

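**The tokenizer is deliberately unchanged**: gpt2 BPE is not necessarily the best vocabulary for real web text
(KI-4, "large vocab table, small vocab", is still on the ledger), but if v6 also swapped the tokenizer, the
change in val could no longer be cleanly attributed to the "data source". **Keep gpt2 → isolate the data
variable**; a tokenizer change is left for a later version to do on its own.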

## Architecture

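v6 = a tiny Qwen3 **byte-for-byte isomorphic to v4/v5** (RoPE + RMSNorm + per-head QK-norm + SwiGLU +
separate, untied lm_head, MHA). **Deliberately, not one dimension changes**, so that the "data source" is the
sole variable under test.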

| Dimension | v4/v5 | v6 |
|------|-------|----|
| dim (= heads·head_dim) | 768 | **768 (same)** |
| n_layers | 18 | **18 (same)** |
| n_heads / head_dim | 24 / 32 | **24 / 32 (same)** |
| ffn_hidden (SwiGLU) | 2048 | 2048 (same) |
| vocab | 50257 | 50257 (same) |
| **core params** | 127,432,704 (≈127.43M) | **127,432,704 (same)** |
| **total params** | 204,627,456 (≈204.63M) | **204,627,456 (same)** |

config.json is word-for-word identical to v4/v5 (the exported **201 tensors** have exactly the same shapes).
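
As a quick consistency check on the table, the gap between total and core parameters should equal the token
embedding plus the separate lm_head, each a vocab × dim matrix. The tensor-count breakdown below (11 tensors
per layer plus 3 global ones) is an assumption based on the usual HF Qwen3 layout, not something this document
states.

```python
DIM, N_LAYERS, VOCAB = 768, 18, 50_257
CORE_PARAMS = 127_432_704
TOTAL_PARAMS = 204_627_456

# Embedding and untied lm_head are both vocab x dim matrices.
embed_plus_head = 2 * VOCAB * DIM
assert CORE_PARAMS + embed_plus_head == TOTAL_PARAMS

# Assumed per-layer tensors (HF Qwen3 naming): q/k/v/o projections,
# q_norm and k_norm, gate/up/down projections, and two layer norms.
PER_LAYER_TENSORS = 4 + 2 + 3 + 2
# Assumed global tensors: token embedding, final norm, lm_head.
GLOBAL_TENSORS = 3
n_tensors = N_LAYERS * PER_LAYER_TENSORS + GLOBAL_TENSORS

print(f"embedding + lm_head = {embed_plus_head:,}")
print(f"share of total      = {embed_plus_head / TOTAL_PARAMS:.1%}")
print(f"tensor count        = {n_tensors}")
```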

## Trainer: 8-GPU DDP bf16 (same as v5)

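Reuses the v5 training stack with no changes:

- **fp32 master weights + AdamW/clip/DDP all in fp32**; linears go through `cublasGemmEx` (16BF / fp32 accum)
  and activations are stored in bf16; norm/softmax/rope/CE stay in fp32. bf16 is what sustains the per-rank
  batch 32 / global 256 sweet spot.
- **8 GPUs, thread-per-GPU**: device gradients are all-reduced to their mean, then each rank runs its local
  GpuAdamW step; parameters are bit-identical across ranks (T8/T11).
- Steady state of **~218,000 tok/s** throughout, wall-clock **~1.9h** to train 2.29B tokens.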
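
The reason parameters stay identical across ranks can be shown with a toy CPU simulation: if every rank starts
from the same state and applies the same averaged gradient through the same deterministic update, no
divergence can appear. This is a sketch in plain Python, not the GPU trainer; the betas and eps are assumed
values (the document only says they follow the xtrain-optim defaults), and the gradients are random numbers.

```python
import math
import random


def all_reduce_mean(per_rank_grads):
    """Return the element-wise mean of the ranks' gradients."""
    world = len(per_rank_grads)
    return [sum(g) / world for g in zip(*per_rank_grads)]


def adamw_step(p, g, m, v, t, lr=6e-4, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """One AdamW update on flat lists; betas and eps are assumed values."""
    for i in range(len(p)):
        m[i] = b1 * m[i] + (1 - b1) * g[i]
        v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i]
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        # Decoupled weight decay is applied alongside the Adam direction.
        p[i] -= lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p[i])


WORLD, N = 8, 4
rng = random.Random(0)
init = [rng.uniform(-1, 1) for _ in range(N)]
# Every rank starts from the same parameters and zeroed optimizer state.
params = [list(init) for _ in range(WORLD)]
m_state = [[0.0] * N for _ in range(WORLD)]
v_state = [[0.0] * N for _ in range(WORLD)]

for step in range(1, 4):
    # Each rank sees a different micro-batch, hence a different local gradient.
    local = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(WORLD)]
    mean_grad = all_reduce_mean(local)  # identical on every rank afterwards
    for r in range(WORLD):
        adamw_step(params[r], mean_grad, m_state[r], v_state[r], step)

print(all(p == params[0] for p in params))
```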

## Hyperparameters

| Item | Value | Notes |
|----|----|----|
| optimizer | Hand-written AdamW (step on GPU) | wd=0.1, β/eps use the xtrain-optim defaults |
| LR schedule | Linear warmup → cosine decay | max_lr **6e-4** → min_lr **6e-5** (same as v1–v5) |
| warmup | ~1750 steps (steps/20) | lr peaks at 6e-4 around step 1750, then cosine-decays to 6e-5 at the final step |
| grad clip | global-norm 1.0 | gnorm holds steady from ~0.4 after warmup |
| steps | **35000** | ~1.9h @ 8 GPUs |
| global batch | **256** (per-rank 32 × world 8) | bf16 sweet spot (same as v5) |
| seq_len | **256** | same as v2–v5 |
| tokens/step | 256×256 = 65536 | total training tokens ≈ **2.29B** (~1.02 epoch) |
| world size | **8** (RTX 5090, sm_120) | |
| Precision | **bf16 mixed precision** (fp32 master) | T12/KI-2; the xserv export is also BF16 |
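
The schedule row can be sketched as a small function. The endpoints (6e-4 peak near step 1750, 6e-5 at the
final step) come from the table; the exact shape is an assumption, in particular that warmup starts from zero
and that the cosine reaches min_lr exactly on the last 0-based step. The name `lr_at` is illustrative.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
TOTAL_STEPS = 35_000
WARMUP_STEPS = TOTAL_STEPS // 20  # 1750


def lr_at(step: int) -> float:
    """Learning rate at a 0-based step under the assumed schedule shape."""
    if step < WARMUP_STEPS:
        # Linear ramp that reaches MAX_LR exactly at step WARMUP_STEPS.
        return MAX_LR * step / WARMUP_STEPS
    # Cosine decay from MAX_LR down to MIN_LR at the final step.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - 1 - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for s in (0, 875, 1750, 17_500, 34_999):
    print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```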

## Results

- **train loss**: start **11.0273** → end **3.1442** (smooth decline throughout)
- **best / final val loss (FineWeb-edu held-out 1M tokens, step 34999)**: **3.0652**
- FineWeb val curve (sampled; **monotonically decreasing with no plateau**, unlike the jitter at the end of v5):

| step | 499 | 1999 | 3999 | 5999 | 7999 | 9999 | 13999 | 17999 | 21999 | 25999 | 29999 | 33999 | **34999** |
|------|-----|------|------|------|------|------|-------|-------|-------|-------|-------|-------|-------|
| val  | 5.6987 | 4.0900 | 3.6906 | 3.5501 | 3.4681 | 3.4057 | 3.3089 | 3.2383 | 3.1779 | 3.1268 | 3.0907 | 3.0683 | **3.0652** |

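**Note the shape of the curve**: v6's FineWeb val is **still decreasing monotonically at the final step**
(3.0683 → 3.0652, with none of v5's jitter inside a ~0.004 band). For real web text, a far more
information-dense corpus, 1.02 epoch / 2.29B tokens is **nowhere near the data ceiling**: further training
(more epochs or a larger model) still has clear room to lower val. This contrasts sharply with v5 flattening
out at 5.33 epochs on TinyStories: **changing the corpus raised the ceiling.**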

### ⚠️ Why 3.07 cannot be compared with v5's 1.11

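| | v5 | v6 |
|---|---|---|
| val corpus | TinyStories (synthetic children's stories, low entropy) | FineWeb-edu (real web text, high entropy) |
| best val | **1.1102** | **3.0652** |
| Comparability | ← **not comparable** → | the two held-out sets come from entirely different distributions |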

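**3.07 > 1.11 ≠ v6 is worse than v5.** A collection of children's stories with controlled vocabulary and
formulaic sentences can naturally be pushed to a very low val; real web text has considerably higher intrinsic
entropy, and ~3.0 for a same-size model is the **expected value**. To judge v6, look at the two sections below
(transfer + sample quality), not at this non-comparable number.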

## Transfer eval: v6 on the TinyStories held-out set

Run `--eval-ckpt` (bf16) with the v6 ckpt on **exactly the same TinyStories 1M held-out set as v5**, to
quantify what "training on general data only" costs on the original TinyStories distribution:

| Model | TinyStories 1M val | Notes |
|------|--------------------|------|
| v5 (native TinyStories) | **1.1102** | trained directly on TinyStories |
| **v6 (FineWeb-edu) → TinyStories** | **2.7546** | has never seen TinyStories; pure transfer |

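**v6 scores 2.75 on TinyStories, +1.64 nats above v5's 1.11.** This is expected and also meaningful: v6 was
**never trained on TinyStories**; it spent its capacity on the real web distribution and is, if anything,
unfamiliar with highly formulaic children's stories of the "the little girl named Lily" kind. In other words,
**training on general data only has a clear cost on a narrow distribution (TinyStories)**: v6 traded
TinyStories-specific fluency for competence on general web text. That is what "switching axes" should look
like: not a free lunch, but a deliberate, directed trade-off.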

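> This transfer result also answers an implicit question: v6's 2.75 is far better than a random model (~11),
> which shows that the general language ability learned from FineWeb-edu **does transfer** to TinyStories
> (English grammar, common words, and narrative structure are all shared); it just lacks v5's specialization.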
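
The "~11" reference point and the size of the transfer gap can be made concrete with a short calculation. It
assumes the reported losses are mean per-token cross-entropy in nats with the gpt2 vocabulary of 50257.
Perplexities on different corpora remain as non-comparable as the losses they are derived from; only the two
TinyStories rows share a yardstick.

```python
import math

VOCAB = 50_257
losses = {
    "v5 on TinyStories": 1.1102,
    "v6 on FineWeb-edu": 3.0652,
    "v6 on TinyStories (transfer)": 2.7546,
}

# A model that spreads probability uniformly over the vocabulary has
# cross-entropy ln(V); this is the "random model" reference point.
uniform_loss = math.log(VOCAB)
print(f"uniform baseline: {uniform_loss:.2f} nats")

for name, loss in losses.items():
    # Perplexity is exp(mean per-token cross-entropy in nats).
    print(f"{name}: loss {loss:.4f} -> perplexity {math.exp(loss):.1f}")

transfer_gap = losses["v6 on TinyStories (transfer)"] - losses["v5 on TinyStories"]
print(f"transfer gap: {transfer_gap:.2f} nats, x{math.exp(transfer_gap):.1f} perplexity")
```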

## Sampling comparison: v5 (TinyStories) vs v6 (FineWeb-edu), same general prompts

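The two models use the **same arch, same xserv serving, same greedy decoding, same prompts**; the only
difference is the training corpus. Feed them **general/expository** prompts (not "Once upon a time") and watch
where the language goes: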

| prompt | **v5 (TinyStories)** | **v6 (FineWeb-edu)** |
|--------|----------------------|------------------------|
| `The history of` | the castle was very interesting… **The little girl was so excited to explore the castle.** She ran around… | the United States is **a fascinating one. It is a country that has been shaped by the experiences of its people. From the founding of the United States to the present day…** |
| `In science,` | **the little girl learned about different kinds of plants and animals.** She was so excited to learn more. | the term science is used to refer to the study of the physical world. **Science is a broad field that encompasses biology, chemistry, physics, and engineering.** |
| `The most important` | thing was that **the little girl was safe.** She was so happy that she had been able to help the bird. `<\|endoftext\|>` | thing is to have a good understanding of **the different types of data and how they are used… it can help you make decisions about your business.** |
| `The United States` | of the world was a very special place. **Everyone was happy and they all looked forward to the future.** `<\|endoftext\|>` | has a long history… **the process of assimilation** (minor factual hallucination, but the register is historical exposition) |
| `Water is` | **not good for you… Lily and Ben nodded. They drank water and felt better. They learned their lesson.** | a natural resource that is used to produce energy. **It is a renewable resource that can be used to generate electricity.** |

**The conclusion is straightforward**:

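- **v5 cannot escape children's-story mode.** It bends every general prompt back into a TinyStories narrative:
  "the little girl", "Lily and Ben", "learned their lesson", closing with `<|endoftext|>`. It can **only**
  write children's stories because it has **only ever seen** children's stories.
- **v6 writes real expository English**: history, scientific disciplines, data and business decisions,
  renewable resources. The register is clearly **educational web text**, which is exactly the FineWeb-edu
  distribution. The small model still tends to repeat itself (especially under greedy decoding, e.g.
  "decisions about your business" is repeated once) and shows minor factual hallucination ("Rockefeller coined
  Americanization"), but the **variety and breadth of language** is something v5 entirely lacks.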

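That answers the question this version set out to ask: **a richer, more realistic corpus → richer, more
general language**. The cost is specialized fluency on the narrow distribution (quantified by the transfer
score of 2.75); the gain is going from "can write only one genre" to "can write real-world expository text".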

## xserv validation

Export HF Qwen3 safetensors (name mapping + 2D weight transpose [in,out]→[out,in] + BF16, see T9 `docs/08`;
**201 tensors**, config.json word-for-word identical to v4/v5) into the registry, then load with `xserv-cli`
and generate greedily:

```
$ xserv-cli ~/projects/tiny-models/v6-fineweb-edu-dim768 --max-tokens 50
Model: qwen3, layers=18, hidden=768, heads=24/24 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
xserv> The history of the United States is a fascinating one. It is a country that has been shaped by
       the experiences of its people. From the founding of the United States to the present day, there
       are many stories that have shaped the country.
xserv> In science, the term science is used to refer to the study of the physical world. Science is a
       broad field that encompasses a wide range of disciplines, including biology, chemistry, physics,
       and engineering.
xserv> Water is a natural resource that is used to produce energy. It is a renewable resource that can
       be used to generate electricity.
```

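**token-match**: v6 **trains in bf16** (fp32 master), so the weights already converge within the bf16 numeric
range; once exported as BF16 to xserv, the numeric paths on both sides agree, the same closed loop as v4/v5
(v5 matched token-for-token on 3/3 prompts). xserv loads qwen3 layers=18 hidden=768 with 201 tensors, uses the
KV cache, and generates greedily; the loop closes.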
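
A token-match check of this kind boils down to comparing two greedy token sequences per prompt. The helper
below is a hypothetical sketch, not the project's actual tooling: the function names are made up and the token
ids are placeholder values, not real model output.

```python
from typing import Optional, Sequence


def first_mismatch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token sequences differ, else None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one sequence is a strict prefix
    return None


def token_match_report(pairs) -> str:
    """Count prompts whose trainer-side and server-side tokens agree exactly."""
    matched = sum(first_mismatch(a, b) is None for a, b in pairs)
    return f"{matched}/{len(pairs)}"


# Placeholder token ids for illustration only; not real model output.
pairs = [
    ([464, 2106, 286], [464, 2106, 286]),
    ([818, 3783, 11], [818, 3783, 11]),
    ([19184, 318], [19184, 318]),
]
print(token_match_report(pairs))
```

Reporting the first mismatching index, rather than a plain pass/fail, makes it easier to tell an early numeric
divergence from a late one when a pair does not match.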

## Comparison with v5 and the v7 proposal

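v5 delivered the **data ceiling** conclusion (TinyStories saturates at dim768), and v6 followed through on
"switch axes: broaden the corpus". The result is a **qualitative change in the kind of language** (children's
stories → real expository text), and the FineWeb val still decreasing monotonically at the final step means
**the ceiling is higher under the new corpus and 1.02 epoch is far from reaching it**. The levers for v7,
ranked by payoff: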

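1. **More/better data (first choice)**: v6 trained for only 1.02 epoch and val is still decreasing
   monotonically. **Feeding more FineWeb-edu to the same arch** (2–3 epochs, or more 10BT shards) will almost
   certainly keep lowering val, and is currently the cheapest and most certain gain.
2. **Data mixing (second choice, addresses the transfer regression)**: v6 exposed that "general-only data
   hurts the narrow distribution" (TinyStories transfer 1.11→2.75). To get **coherence + breadth** together, a
   **TinyStories + FineWeb-edu mixed** corpus is an option, but it serves "no regression", not "lower general
   val", so its priority depends on the goal.
3. **Larger model (dim1024+, needs KI-3)**: v6 shows that the real corpus still carries more information than
   the 127M-core has absorbed (val not saturated) → a larger model can extract more from the same corpus. But
   activation memory grows at dim1024+, so **KI-3 activation recomputation** has to come first (T12 already
   lists it as the next memory lever after bf16). It is the most expensive option; leave it until "the cheap
   data lever has been exhausted".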

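**My call: v7 takes option 1 first (more FineWeb-edu epochs).** v6's curve clearly says "this corpus has not
been fed enough"; before touching model size (expensive, needs KI-3), filling up the data axis that is already
laid out is the most cost-effective next step.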