docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases

Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
  best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
  still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed
  eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-29 16:18:48 +08:00
parent a1370446fe
commit 5c27493a90
5 changed files with 402 additions and 31 deletions

View File

@@ -6,14 +6,16 @@ inference side). A learning project: hand-write the entire training-systems stac
gradient checkpointing), then use it to run a multi-version **scaling study** that maps gradient checkpointing), then use it to run a multi-version **scaling study** that maps
the data-vs-capacity frontier for a tiny model. the data-vs-capacity frontier for a tiny model.
> **Status: complete — two phases.** > **Status: complete — three phases.**
> **Phase 1** = the from-scratch full stack (T1T13) + an 8-version scaling study (v0v8): > **Phase 1** = the from-scratch full stack (T1T13) + an 8-version scaling study (v0v8):
> hand-write the whole training-systems stack, then map the data-vs-capacity frontier. > hand-write the whole training-systems stack, then map the data-vs-capacity frontier.
> **Phase 2** = systems-stack depth (T14T18): hand-write the five deferred training-stack > **Phase 2** = systems-stack depth (T14T18): hand-write the five deferred training-stack
> features — fused flash-attention, real GQA, gradient accumulation, process-per-GPU DDP, > features — fused flash-attention, real GQA, gradient accumulation, process-per-GPU DDP,
> dropout. Trains a Qwen3-compatible LM whose weights load into **xserv** and generate > dropout. **Phase 3** = one Chinchilla-style double-axis run (v9): dim1280 true-GQA +
> **token-identical** output — the closed loop held byte-for-byte across both phases. This > 6.01B FineWeb tokens, validating the v8 conclusion that data and capacity must scale
> README is the capstone; per-topic detail lives in [`docs/`](docs/). > together. Trains Qwen3-compatible LMs whose weights load into **xserv**; deterministic
> gates stay byte-identical, while large BF16 checkpoints are served and checked for
> prompt-level drift. This README is the capstone; per-topic detail lives in [`docs/`](docs/).
--- ---
@@ -34,7 +36,8 @@ borrows, the rest hand-written CUDA + Rust:
Every op's backward is verified against **finite differences** and against **PyTorch** Every op's backward is verified against **finite differences** and against **PyTorch**
(forward + per-parameter grads, batch > 1). Trained weights export to HF-safetensors and (forward + per-parameter grads, batch > 1). Trained weights export to HF-safetensors and
load into xserv (Qwen3, BF16) producing token-identical greedy output — the closed loop. load into xserv (Qwen3, BF16); deterministic fixtures produce token-identical greedy output,
and large checkpoints are validated end-to-end in the serving path.
## The build journey — Phase 1 (T1T13) + Phase 2 (T14T18) ## The build journey — Phase 1 (T1T13) + Phase 2 (T14T18)
@@ -106,7 +109,7 @@ Each is opt-in, kept the default path **bit-identical**, and held a **hard corre
residual ~5×@8; with all 8 GPUs at 9599% util, the residual is the **NCCL all-reduce + PCIe residual ~5×@8; with all 8 GPUs at 9599% util, the residual is the **NCCL all-reduce + PCIe
topology wall**, not context serialization. The third profile-first falsification (see below). topology wall**, not context serialization. The third profile-first falsification (see below).
## The scaling study — v0 → v8 ## The scaling study — v0 → v10
Same Qwen3-style architecture throughout; we scaled **dim** and **data** and read out val Same Qwen3-style architecture throughout; we scaled **dim** and **data** and read out val
loss (full per-run detail in [`docs/runs/`](docs/runs/)). loss (full per-run detail in [`docs/runs/`](docs/runs/)).
@@ -119,11 +122,13 @@ loss (full per-run detail in [`docs/runs/`](docs/runs/)).
| v6 | FineWeb-edu 1.02ep | 768 / 127M | 3.07\* | **corpus swap → graduates to real text** | | v6 | FineWeb-edu 1.02ep | 768 / 127M | 3.07\* | **corpus swap → graduates to real text** |
| v7 | FineWeb-edu 1.45ep | 768 / 127M | 3.01\* | same subset, more epochs → near-ceiling | | v7 | FineWeb-edu 1.45ep | 768 / 127M | 3.01\* | same subset, more epochs → near-ceiling |
| **v8** | FineWeb-edu 1.05ep | **1024 / 226M** | **2.98\*** | **capacity → helps** | | **v8** | FineWeb-edu 1.05ep | **1024 / 226M** | **2.98\*** | **capacity → helps** |
| **v9** | FineWeb-edu 6.01B / ~1ep | **1280 / 357M + GQA** | **2.89\*** | **data + capacity → helps** |
| **v10** | FineWeb-edu 6.76B / ~1ep | **1280 / 357M + GQA** | **2.88\*** | **data-only top-up → small gain** |
\* FineWeb-edu val is a different (harder) distribution — **not comparable** to the \* FineWeb-edu val is a different (harder) distribution — **not comparable** to the
TinyStories val of v0v5. Judge v6+ by sample quality + transfer, not the number. TinyStories val of v0v5. Judge v6+ by sample quality + transfer, not the number.
### Three findings ### Four findings
1. **Data volume saturates.** TinyStories at dim768: 3.5× more tokens (v4→v5) bought only 1. **Data volume saturates.** TinyStories at dim768: 3.5× more tokens (v4→v5) bought only
5% val, curve flat. The narrow synthetic corpus is exhausted at this model size. 5% val, curve flat. The narrow synthetic corpus is exhausted at this model size.
@@ -132,10 +137,18 @@ TinyStories val of v0v5. Judge v6+ by sample quality + transfer, not the numb
historical/scientific expository prose. (Cost: TinyStories transfer val 1.11 → 2.75.) historical/scientific expository prose. (Cost: TinyStories transfer val 1.11 → 2.75.)
3. **Capacity helps.** v8 (dim1024, ~1 epoch) beats both v6 (dim768, same epoch, by 0.085) 3. **Capacity helps.** v8 (dim1024, ~1 epoch) beats both v6 (dim768, same epoch, by 0.085)
and v7 (dim768, *more* data, by 0.035) → the dim768 runs were partly capacity-limited. and v7 (dim768, *more* data, by 0.035) → the dim768 runs were partly capacity-limited.
4. **Double-axis scale helps.** v9 scales both axes (dim1280/core357M + 6.01B FineWeb tokens)
and beats v8 by another 0.095 val loss (~3.2%). The direction is validated, but the gain is
still incremental and greedy decoding still repeats.
5. **Moving validation tails must stop.** v10 added one more FineWeb shard and got moving-tail
val 2.8816, but appending data moves the held-out tail. A fixed eval v1 was created from the
shard010 tail: v6/v7/v8/v9/v10 = 3.2328 / 3.1850 / 3.1515 / 2.9278 / 2.8814. Future runs
should report this fixed eval first.
**Meta-finding:** every *single*-axis lever (data volume, corpus breadth, capacity) is now **Meta-finding:** every lever is now in the **~3% or smaller** regime. Single-axis moves were
worth only **~3%**. Per the Chinchilla lesson, further gains require scaling **data and exhausted by v8; v9 confirms Chinchilla-style double-axis scale works; v10 shows a data-only
capacity together** — single-axis moves are exhausted. top-up mostly adapts to the new shard. The next useful run should change model/context, not just
append another shard.
## Efficiency — throughput & MFU ## Efficiency — throughput & MFU
@@ -166,18 +179,18 @@ versions — a fixed-MFU estimate is off by up to ~100× for the early launch-bo
the line: flash == composed SDPA (grads/PyTorch), GQA group=1 bit-identical to MHA, gradient the line: flash == composed SDPA (grads/PyTorch), GQA group=1 bit-identical to MHA, gradient
accumulation `accum=1` bit-identical, dropout p=0 bit-identical *and* dropout × recompute accumulation `accum=1` bit-identical, dropout p=0 bit-identical *and* dropout × recompute
bit-exact, the default path unchanged on every feature, and the **xserv closed-loop md5 bit-exact, the default path unchanged on every feature, and the **xserv closed-loop md5
byte-identical (`b04fc9f9`) throughout both phases**. byte-identical (`b04fc9f9`) throughout the deterministic gates**.
- **The closed loop matters.** Exporting to xserv and checking token-identical greedy output - **The closed loop matters.** Exporting to xserv and checking generated continuations caught
caught real bugs and proved the whole stack end-to-end. real bugs and proved the whole stack end-to-end.
## Running it ## Running it
Everything trains on a remote 8× RTX 5090 box; model artifacts live in a registry Everything trains on a remote 8× RTX 5090 box; model artifacts live in a registry
(`tiny-models/v0…v8`). Serve any trained version in xserv: (`tiny-models/v0…v10`). Serve any trained version in xserv:
```bash ```bash
# on the GPU box # on the GPU box
cargo run -p xserv-model --release --bin xserv-cli -- <registry>/v8-fineweb-edu-dim1024 --max-tokens 100 cargo run -p xserv-model --release --bin xserv-cli -- <registry>/v10-fineweb-edu-dim1280-gqa-data6765 --max-tokens 100
# then type a prompt, e.g. In science, # then type a prompt, e.g. In science,
``` ```
@@ -192,6 +205,6 @@ cargo test --workspace # autograd grad-checks, PyTorch parity, DDP, e
## Doc index ## Doc index
- [`docs/evolution.md`](docs/evolution.md) per-milestone changes across algorithm / architecture / infra / dataset. - [`docs/evolution.md`](docs/evolution.md) per-milestone changes across algorithm / architecture / infra / dataset.
- [`docs/runs/README.md`](docs/runs/README.md) the v0v8 comparison; [`docs/runs/0N-*.md`](docs/runs/) per-run detail. - [`docs/runs/README.md`](docs/runs/README.md) the v0v10 comparison; [`docs/runs/0N-*.md`](docs/runs/) per-run detail.
- [`docs/00-*` … `17-*`](docs/) per-phase design docs (build chain tensor autograd transformer training perf distributed export batched allocator bf16 recompute flash-attention GQA grad-accum process-per-GPU dropout). - [`docs/00-*` … `17-*`](docs/) per-phase design docs (build chain tensor autograd transformer training perf distributed export batched allocator bf16 recompute flash-attention GQA grad-accum process-per-GPU dropout).
- [`docs/known-issues.md`](docs/known-issues.md) perf backlog (KI-1/2/3/5 fixed; process-per-GPU CLOSED = measured no-op; KI-4 = accepted modeling tradeoff). - [`docs/known-issues.md`](docs/known-issues.md) perf backlog (KI-1/2/3/5 fixed; process-per-GPU CLOSED = measured no-op; KI-4 = accepted modeling tradeoff).

View File

@@ -33,9 +33,9 @@
--- ---
## 二、Scaling runsv0v8)—— 主要动「模型架构」与「数据集」 ## 二、Scaling runsv0v10)—— 主要动「模型架构」与「数据集」
架构始终是 **Qwen3-style**RoPE + RMSNorm + QK-norm + SwiGLUgpt2 50257 词表),逐版放大 dim/层/头v8 起首次拨容量轴到 dim1024其余维度逐版变化如下 架构始终是 **Qwen3-style**RoPE + RMSNorm + QK-norm + SwiGLUgpt2 50257 词表),逐版放大 dim/层/头v8 起首次拨容量轴到 dim1024v9 进入 dim1280+真 GQA 双轴点v10 固定架构只补数据轴);其余维度逐版变化如下:
| ver | 模型架构dim/层/头·hd · 核心/总参) | 数据集(语料 · 实训 token · epoch | 算法/精度 | InfraGPU · 吞吐) | 结果val · 备注) | | ver | 模型架构dim/层/头·hd · 核心/总参) | 数据集(语料 · 实训 token · epoch | 算法/精度 | InfraGPU · 吞吐) | 结果val · 备注) |
|---|---|---|---|---|---| |---|---|---|---|---|---|
@@ -48,22 +48,27 @@
| v6 | dim768/18L同 v4/v5 | **FineWeb-edu** 真实网页 · 2.29B · 1.02ep | bf16 | 8 GPU · 218K | val **3.07**:⚠️**FineWeb 留出集,与 v0v5 不可比**(真实网页熵高,~3.0 是预期);判据=采样质量+transfer。第一版脱离 TinyStories**语言种类质变**小故事→真实说明文transfer→TinyStories val 2.75(v5 native 1.11)纯通用数据对窄分布有代价val 末步仍单调降=未饱和 | | v6 | dim768/18L同 v4/v5 | **FineWeb-edu** 真实网页 · 2.29B · 1.02ep | bf16 | 8 GPU · 218K | val **3.07**:⚠️**FineWeb 留出集,与 v0v5 不可比**(真实网页熵高,~3.0 是预期);判据=采样质量+transfer。第一版脱离 TinyStories**语言种类质变**小故事→真实说明文transfer→TinyStories val 2.75(v5 native 1.11)纯通用数据对窄分布有代价val 末步仍单调降=未饱和 |
| v7 | dim768/18L同 v4/v5/v6 | **同 v6 的 FineWeb-edu 子集**(非新数据)· 3.28B · **1.45ep** | bf16 | 8 GPU · 218K | val **3.01**(与 v6 可比):⚠️**同子集多 epoch 近天花板**——唯一变量=epoch(1.02→1.45),多喂 ~1B token val 仅 ↓0.05 且 ~step44000 后走平、采样无质变。与 v5 的 TinyStories 数据量饱和同类(重复老数据边际薄);真·更多数据要**新 shards** | | v7 | dim768/18L同 v4/v5/v6 | **同 v6 的 FineWeb-edu 子集**(非新数据)· 3.28B · **1.45ep** | bf16 | 8 GPU · 218K | val **3.01**(与 v6 可比):⚠️**同子集多 epoch 近天花板**——唯一变量=epoch(1.02→1.45),多喂 ~1B token val 仅 ↓0.05 且 ~step44000 后走平、采样无质变。与 v5 的 TinyStories 数据量饱和同类(重复老数据边际薄);真·更多数据要**新 shards** |
| v8 | **dim1024**/18L/**32h** · **226M/329M**+78% 容量ffn 2730 | **同 v6/v7 的 FineWeb-edu 子集**(非新数据)· 2.36B · **1.05ep** | bf16 **+ 激活重计算(T13)** | 8 GPU · 129K重算税 | val **2.98**(与 v6/v7 可比):⭐**容量轴 A/B——容量有用**:唯一变量=dim768→dim1024同 ~1ep v6 3.07→**2.98**↓0.085),且 v8(1.05ep) < v7(1.45ep 更多老数据) 3.01 放大容量 > 重复老数据 ⇒ v6/v7 部分 capacity-limited。⚠但增益仅 ~3%、val 末步**仍在降未饱和** ⇒ **单轴(数据/容量)单步都已 ~3%/lever = 全面边际递减,要双轴一起 scale(Chinchilla)** | | v8 | **dim1024**/18L/**32h** · **226M/329M**+78% 容量ffn 2730 | **同 v6/v7 的 FineWeb-edu 子集**(非新数据)· 2.36B · **1.05ep** | bf16 **+ 激活重计算(T13)** | 8 GPU · 129K重算税 | val **2.98**(与 v6/v7 可比):⭐**容量轴 A/B——容量有用**:唯一变量=dim768→dim1024同 ~1ep v6 3.07→**2.98**↓0.085),且 v8(1.05ep) < v7(1.45ep 更多老数据) 3.01 放大容量 > 重复老数据 ⇒ v6/v7 部分 capacity-limited。⚠但增益仅 ~3%、val 末步**仍在降未饱和** ⇒ **单轴(数据/容量)单步都已 ~3%/lever = 全面边际递减,要双轴一起 scale(Chinchilla)** |
| v9 | **dim1280**/18L/**40h/10kv GQA** · **357M/486M**ffn 4096 | **FineWeb-edu 扩展 shards 000-009** · **6.01B** · **~1.00ep** | bf16 + recompute + **flash + grad-accum + true GQA** | 8 GPU · **78.6K**21.25h | val **2.8854**(与 v6-v8 可比):✅**双轴 Chinchilla 点有效**——容量从 v8 226M→357M同时数据从 2.255B 子集→6.013B tokenbest val 比 v8 再降 **0.0947 (~3.2%)**。采样写真实说明文更稳一些,但 greedy 重复仍明显;收益仍是稳健增量而非质变 |
| v10 | **同 v9** | **FineWeb-edu 扩展 shards 000-010** · **6.765B** · **~1.00ep** | bf16 + recompute + flash + grad-accum + true GQA | 8 GPU · **79.0K**23.86h | moving-tail val **2.8816**;固定 eval v1 上 v9 **2.9278**→v10 **2.8814**。结论:补 shard010 对新分布有效,但只补数据轴不解决 greedy 重复;后续应固定 eval set并优先试更大模型+长 context |
> 实训 token = steps×batch×seq非数据集大小。val 同一 1M-token TinyStories 留出集v0v5 可比;v6 起换 FineWeb-edu 留出集,分布不同、与 v0v5 不可比v6/v7/v8 同一 FineWeb 留出集、三版彼此可比 3.07/3.01/2.98)。 > 实训 token = steps×batch×seq非数据集大小v0v5 的 val 同一 1M-token TinyStories 留出集v6 起换 FineWeb-edu
> 且 v9/v10 追加新 shards 会移动默认 tail-heldout严格横比改用 fixed eval v1shard010 tail 1M
> v6/v7/v8/v9/v10 = **3.2328 / 3.1850 / 3.1515 / 2.9278 / 2.8814**。
--- ---
## 三、各维度的累积演进(轴向看一条线怎么走的) ## 三、各维度的累积演进(轴向看一条线怎么走的)
- **算法**:手写 autograd(tape)+扇出累加 → AdamW/LR-sched/grad-clip → +QK-norm(Qwen3) → batched forward → bf16 混合精度(fp32 master) → 激活重计算(T13) → 融合 flash-attention(T14online softmax + flash 式 bwd) → 梯度累积(T16复用 tape SUM等效大 batch 而显存随 micro) → dropout(T18counter-based 设备 RNG + inverted scalingtrain/eval 切换)。 - **算法**:手写 autograd(tape)+扇出累加 → AdamW/LR-sched/grad-clip → +QK-norm(Qwen3) → batched forward → bf16 混合精度(fp32 master) → 激活重计算(T13) → 融合 flash-attention(T14online softmax + flash 式 bwd) → 梯度累积(T16复用 tape SUM等效大 batch 而显存随 micro) → dropout(T18counter-based 设备 RNG + inverted scalingtrain/eval 切换)。
- **模型架构**:固定 Qwen3-styledim **32→256→384→512→768→1024**v8 首拨容量轴,头数 24→32);核心参数 **41K→226M**(总 3.26M→329M。+QK-norm(T9Qwen3 兼容) → **真 GQA(T15`num_kv_heads<num_heads`repeat_kv broadcast + 组内梯度求和;默认=nh→MHA 逐位回归)**——架构补齐到现代 LLM 标配MHA/GQA/MQA 一条 `num_kv_heads` 轴),两条 SDPA(composed/flash) 共用同一 broadcast导出真 `num_key_value_heads` 且 xserv 闭环。 - **模型架构**:固定 Qwen3-styledim **32→256→384→512→768→1024→1280**v8 首拨容量轴,v9 进入 dim1280);核心参数 **41K→357M**(总 3.26M→486M。+QK-norm(T9Qwen3 兼容) → **真 GQA(T15`num_kv_heads<num_heads`repeat_kv broadcast + 组内梯度求和;默认=nh→MHA 逐位回归v9 用 40 query / 10 kv**——架构补齐到现代 LLM 标配MHA/GQA/MQA 一条 `num_kv_heads` 轴),两条 SDPA(composed/flash) 共用同一 broadcast导出真 `num_key_value_heads` 且 xserv 闭环。
- **Infra**:单卡 fp32 → cuBLAS/GPU-optim(T7) → NCCL DDP(T8) → batched forward(T10) → caching allocator(T11) → bf16(T12) → 激活重计算(T13解锁 dim1024) → flash-attention(T14不物化 N×Nattention 显存收益随 seq 增长) → 梯度累积(T16DDP 只在累积边界通信,显存随 micro 不随有效 batch) → process-per-GPU(T17torchrun 式独立进程/CUDA context复用 T8 train_rank 零改动)。吞吐 **3.3K→217K tok/s**dim768 bf16dim1024+重算 ~129K重算税MFU **0.4%→17%**(每次提升都对应一块 perf 基建,详见 known-issues + MFU 分析。T13/T14/T16 是三条**显存杠杆**重计算压激活峰值、flash 不物化 N×N attention scores、梯度累积解耦有效 batch 与激活显存),可叠加放大有效 batch。**T17 实测=负结果记账**process-per-GPU 在本尺度对吞吐**中性**thread ~5.27× vs proc ~5.31×@8,差<1% 噪声8 卡全 9599% util 残留非线性是 NCCL/PCIe 通信墙、**** context 串行—— KI-5/T11 doc 长挂的process-per-GPU 是残留串行的解猜想实测钉死推翻方法论同 T11 证伪分桶 all-reduce」)。 - **Infra**:单卡 fp32 → cuBLAS/GPU-optim(T7) → NCCL DDP(T8) → batched forward(T10) → caching allocator(T11) → bf16(T12) → 激活重计算(T13解锁 dim1024) → flash-attention(T14不物化 N×Nattention 显存收益随 seq 增长) → 梯度累积(T16DDP 只在累积边界通信,显存随 micro 不随有效 batch) → process-per-GPU(T17torchrun 式独立进程/CUDA context复用 T8 train_rank 零改动)。吞吐 **3.3K→217K tok/s**dim768 bf16dim1024+重算 ~129K重算税MFU **0.4%→17%**(每次提升都对应一块 perf 基建,详见 known-issues + MFU 分析。T13/T14/T16 是三条**显存杠杆**重计算压激活峰值、flash 不物化 N×N attention scores、梯度累积解耦有效 batch 与激活显存),可叠加放大有效 batch。**T17 实测=负结果记账**process-per-GPU 在本尺度对吞吐**中性**thread ~5.27× vs proc ~5.31×@8,差<1% 噪声8 卡全 9599% util 残留非线性是 NCCL/PCIe 通信墙、**** context 串行—— KI-5/T11 doc 长挂的process-per-GPU 是残留串行的解猜想实测钉死推翻方法论同 T11 证伪分桶 all-reduce」)。
- **数据集**TinyStories 3MB 切片 全量 TinyStoriesepoch 0.015.33**至饱和**)→ **v6 毕业到 FineWeb-edu 真实网页**2.255B 语料1.02ep)→ **v7 同子集多 epoch1.45ep,近顶)→ v8 同子集换大模型**dim10241.05ep)。tokenizer 全程 gpt2 BPE复用 xserv-tokenizerv6 刻意不换 tokenizer 以隔离数据来源变量KI-4 留后续版本)。 - **数据集**TinyStories 3MB 切片 全量 TinyStoriesepoch 0.015.33**至饱和**)→ **v6 毕业到 FineWeb-edu 真实网页**2.255B 语料1.02ep)→ **v7 同子集多 epoch1.45ep,近顶)→ v8 同子集换大模型**dim10241.05ep **v9 扩新 FineWeb shards 到 6.013B token 并同步放大模型** **v10 补 shard010 到 6.765B token只拨数据轴**tokenizer 全程 gpt2 BPE复用 xserv-tokenizer保闭环优先KI-4 接受)。
- **v5v6 数据轴的质变**v0v5 都吃合成幼儿故事TinyStories低熵词汇受控v5 证明同尺寸模型在它上面已饱和v6 第一版换成**真实教育类网页文本**FineWeb-edu语言种类发生质变——采样从只会写小故事变成能写历史/科学/说明文」。 - **v5v6 数据轴的质变**v0v5 都吃合成幼儿故事TinyStories低熵词汇受控v5 证明同尺寸模型在它上面已饱和v6 第一版换成**真实教育类网页文本**FineWeb-edu语言种类发生质变——采样从只会写小故事变成能写历史/科学/说明文」。
- **同子集多 epoch 也有天花板v6→v7**v6 FineWeb val 才训 1.02ep末步仍单调降曾被读作还没喂够」;v7 **同一 2.255B 子集**喂到 1.45ep ~1B tokenFineWeb val 0.053.073.01 ~step44000 后走平采样无质变 **该子集在 dim768 已近天花板**这与 v5 TinyStories 数据量饱和是**同一类现象****「重复喂老数据边际都薄无论是 v5 的同语料多 epoch 还是 v7 的同子集多 epoch**。真正抬天花板的是 v6换更广的新语料那一步——**杠杆在更多样的新 token」,不在同数据多读几遍」**。后续要继续降 val必须补** FineWeb shards**更多样不重复不是同子集加 epoch - **同子集多 epoch 也有天花板v6→v7**v6 FineWeb val 才训 1.02ep末步仍单调降曾被读作还没喂够」;v7 **同一 2.255B 子集**喂到 1.45ep ~1B tokenFineWeb val 0.053.073.01 ~step44000 后走平采样无质变 **该子集在 dim768 已近天花板**这与 v5 TinyStories 数据量饱和是**同一类现象****「重复喂老数据边际都薄无论是 v5 的同语料多 epoch 还是 v7 的同子集多 epoch**。真正抬天花板的是 v6换更广的新语料那一步——**杠杆在更多样的新 token」,不在同数据多读几遍」**。后续要继续降 val必须补** FineWeb shards**更多样不重复不是同子集加 epoch
- **val 可比性**v0v5 val 是同一 TinyStories 1M 留出集彼此可比**v6 起换 FineWeb-edu 留出集分布不同val 不能和 v0v5~1.1比大小**——真实网页熵高~3.0 是预期而非回退**v6/v7/v8 同一 FineWeb 留出集三版彼此可比**3.073.012.98v6 的判据还有采样质量 + **transfer eval**v6TinyStories val 2.75 vs v5 native 1.11量化纯通用数据对窄分布的代价」) - **val 可比性**v0v5 val 是同一 TinyStories 1M 留出集彼此可比**v6 起换 FineWeb-edu 留出集分布不同val 不能和 v0v5~1.1比大小**——真实网页熵高~3.0 是预期而非回退v9/v10 追加 shards 后默认 tail-heldout 会移动不能再只看 moving-tail best为后续建立 fixed eval v1shard010 tail 1Mv6/v7/v8/v9/v10 = **3.2328 / 3.1850 / 3.1515 / 2.9278 / 2.8814**
- **容量轴有用,但也只有 ~3%v8**v6/v7 dim768 吃不动更多数据」,v8 用最干净的 A/B 回答了是数据见够还是容量不够」——**冻结数据子集纯把 dim768dim1024core 127M226M+78%** ~1 epoch FineWeb val **3.07→2.98↓0.085** v81.05ep还低于 v71.45ep 更多老数据 3.01。⇒ **容量有用v6/v7 部分是 capacity-limited不全是数据见够**放大容量比给小模型多喂老数据更值。**但增益只有 ~3%**与数据轴单步杠杆同量级 - **容量轴有用,但也只有 ~3%v8**v6/v7 dim768 吃不动更多数据」,v8 用最干净的 A/B 回答了是数据见够还是容量不够」——**冻结数据子集纯把 dim768dim1024core 127M226M+78%** ~1 epoch FineWeb val **3.07→2.98↓0.085** v81.05ep还低于 v71.45ep 更多老数据 3.01。⇒ **容量有用v6/v7 部分是 capacity-limited不全是数据见够**放大容量比给小模型多喂老数据更值。**但增益只有 ~3%**与数据轴单步杠杆同量级
- 🧭 **元结论:单轴单步都已 ~3%/lever = 全面边际递减,要双轴一起 scaleChinchilla 小尺度复现)**把三条轴并起来看——数据量轴v5/v7 同子集多 epoch饱和~1.65%/)、数据广度轴v6 换语料是一次性换分布红利)、容量轴v8有用但 ~3%)——** v8任何单轴的单步杠杆都收敛到 ~3%/lever**。 v8 容量 +78% 却只配同样的 2.36B tokenval 末步仍在降 数据立刻成新瓶颈。⇒ **要继续进步,容量与数据必须匹配地一起 scale而不是单独猛拨一根轴**——这正是 Chinchilla 在这个 toy 尺度上的复现 - **双轴一起 scale 有效v9**v9 v8 的提案落地模型 core 226M357M数据 2.255B 子集6.013B token实训 6.012Bbest FineWeb val **2.9801→2.8854**再降 **0.0947 (~3.2%)**这确认 Chinchilla 式双轴方向正确但收益仍是 ~3% 级稳健增量greedy 重复仍在说明小尺度下更好 val尚未完全转化成肉眼质变
- 📌 **只补数据轴边际有限v10**v10 保持 v9 架构仅补 shard010 6.765B tokenfixed eval v1 v9 2.9278v10 2.8814说明新 shard 分布被学到 moving-tail best 只从 2.88542.8816 greedy 复读不变下一步更值得改模型/context而不是继续一片片补数据
## 三·五、Phase 2 系统栈深度综合T14T18 五条特性按四维收束) ## 三·五、Phase 2 系统栈深度综合T14T18 五条特性按四维收束)

View File

@@ -0,0 +1,152 @@
# Scaling Run v9: Chinchilla 双轴 — dim1280/18L true GQA(core 356.9M) + FineWeb-edu 6.01B token + Phase-2 stack — Design Document
## Goal
v8 给出的元结论是:单独拨容量轴有用,但只有约 3% 的边际;单独重复旧数据也只有约 1.6% 的边际。要继续明显超过
v8必须把 **模型容量 + 新 token** 一起放大,而不是只拨一根轴。
v9 就是这个双轴点:
1. **模型轴**dim1024/core 226M -> **dim1280/core 356.9M**,同时启用真 GQA40 query heads / 10 kv heads
2. **数据轴**v6-v8 的 2.255B FineWeb 子集 -> **6.013B token**,追加了新 FineWeb-edu shards 003-009。
3. **系统栈**:使用 Phase-2 现代路径:`--flash + --accum-steps + bf16 + recompute + DDP`。dropout 设为 0按标准预训练。
> v9 的 val 仍是 FineWeb-edu 分布,不能和 v0-v5 的 TinyStories val 直接比。注意v9 扩展 cache 后默认
> tail-heldout 已经从 v6-v8 的旧 tail 移到新 shards 末尾;严格横比后续以 fixed eval v1 为准。
## Data
| 项 | 值 |
|----|----|
| 来源 | FineWeb-edu `sample/10BT`,原 shards 000-002 + 新 shards 003-009 |
| token cache | `data/fineweb-edu.txt.u16.bin` |
| 总 token | **6,013,639,492** |
| held-out val | 末尾 **1,000,000** token |
| train corpus | 6,012,639,492 token |
| 训练消费 token | **6,012,600,320** = 91745 steps x effective batch 256 x seq 256 |
| epoch | ~1.00 |
P3-DATA 目标本来是约 7B tokenshard 010 下载 `curl rc=18` 中断,所以最终停在 6.01B。对 core 356.9M 来说,
D/N 约 **16.8 token/param**,低于理想 Chinchilla 20但已经远高于 v8 的约 10.4,是一个干净的双轴 scale 点。
## Architecture
| 项 | v8 | **v9** |
|----|----|----|
| dim | 1024 | **1280** |
| layers | 18 | 18 |
| query heads x head_dim | 32 x 32 | **40 x 32** |
| kv heads | 32 (MHA) | **10 (true GQA, group=4)** |
| ffn | 2730 | **4096** |
| core params | 226.50M | **356.89M** |
| total params | 329.42M | **485.55M** |
| export tensors | 201 | **201** |
`config.json` writes real `num_key_value_heads = 10`, so xserv loads v9 as true GQA rather than MHA.
## Training
| 项 | 值 |
|----|----|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | **91745** |
| effective global batch | **256** (`--batch 128 --accum-steps 2`) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | **21h15m** |
| steady throughput | **~78.6K tok/s** |
| peak observed memory | ~17GB / GPU |
Command:
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
/opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
--heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
--steps 91745 --batch 128 --accum-steps 2 --seq 256 \
--max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
--eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
--ckpt /dashscope-tmp/wjh/xtrain_v9.ckpt
```
## Results
- train loss: **11.1550 -> 2.9340**
- first val: step 1000 = **5.1517**
- best val: step 91000 = **2.8854**
- final val: step 91745 = **2.8873**
- exit code: **0**
FineWeb val curve milestones:
| step | 1000 | 10000 | 20000 | 30000 | 40000 | 50000 | 60000 | 70000 | 80000 | 90000 | 91000 | final |
|------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 5.1517 | 3.4820 | 3.2953 | 3.2026 | 3.1422 | 3.0844 | 3.0148 | 2.9616 | 2.9160 | 2.8915 | **2.8854** | 2.8873 |
The curve kept improving into the last 1K-step window, then the final eval bounced slightly from 2.8854 to 2.8873. This is close to
the floor for this run, but not a clear overfit failure.
## Comparison
| | v6 | v7 | v8 | **v9** |
|---|---|---|---|---|
| model | dim768/core127M | dim768/core127M | dim1024/core226M | **dim1280/core357M + GQA** |
| data | 2.29B | 3.28B same subset | 2.36B same subset | **6.01B expanded shards** |
| best val | 3.0652 | 3.0149 | 2.9801 | **2.8854** |
On the run-local moving tail, v9 beats v8 by **0.0947** val loss (~3.2% relative), essentially the same size as the
v6->v8 capacity gain but now on top of it. A later fixed eval v1 check still supports the same direction
(v8 3.1515 -> v9 2.9278 on shard010-tail holdout), while making the moving-tail caveat explicit. This confirms
the v8 prediction: **双轴 scale 有效**. It is still an incremental gain, not a qualitative jump.
## Samples
xserv greedy samples (`--max-tokens 60`) are more coherent than the v8 examples on some prompts, but repetition remains:
```text
[The history of] the United States is the story of the people, the places, and the events that have shaped the nation...
[In science,] the term "scientific method" is used to describe the process of gathering information and testing it...
[The most important] thing is to be aware of the symptoms and to seek medical attention...
[Water is] a natural resource that is essential for human life...
```
The model writes real explanatory English and the domain mix is FineWeb-like. Greedy decoding still falls into repeated clauses on
some prompts (`scientific method`, symptoms, and earlier fixed prompts), so the val gain is more visible in the metric than in a
dramatic sample-quality leap.
## xserv validation
Registry path:
```text
/opt/wjh/projects/tiny-models/v9-fineweb-edu-dim1280-gqa
```
Files:
- `config.json`
- `model.safetensors` (BF16, 201 tensors, 927MB)
- `tokenizer.json`
- `xtrain.ckpt` (fp32 master checkpoint, 1.9GB)
- `RUN.md`
xserv loads v9 as:
```text
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
```
Token-match check against xtrain greedy (`max-tokens 40`):
- `Once upon a time`: xtrain and xserv matched through the checked continuation.
- `One day`: diverged after "large, dark," (`very tall man` vs `metallic object`) from BF16 greedy tie sensitivity.
- `The little`: same repetitive pattern, with a short BF16 path divergence.
This is the same class of BF16-vs-f32 greedy drift seen in v8; the important integration result is that xserv successfully loads
true GQA (`kv_heads=10 < heads=40`) and generates from the exported weights.

View File

@@ -0,0 +1,200 @@
# Scaling Run v10: Data-axis follow-up — dim1280/18L true GQA + FineWeb-edu 6.765B token — Design Document
## Goal
v9 证明了双轴 scale更大模型 + 更多新 token有效best val 从 v8 的 2.9801 降到 2.8854。
但 v9 的数据量只有 6.013B tokenD/N 约 16.8,低于 Chinchilla 经验里的 20。v10 的目标很窄:
1. **只补数据轴**:补上 v9 中断的 FineWeb-edu shard010把 cache 从 6.013B 推到 6.765B。
2. **架构不变**:完全复用 v9 dim1280 / 18L / 40q-10kv GQA / ffn4096。
3. **验证边际**:看 D/N 从 16.8 到 18.95 是否还能显著降低 val。
## Data
| 项 | 值 |
|----|----|
| 来源 | FineWeb-edu `sample/10BT`shards 000-010 |
| token cache | `data/fineweb-edu.txt.u16.bin` |
| 总 token | **6,765,333,808** |
| held-out val | 末尾 **1,000,000** token |
| train corpus | 6,764,333,808 token |
| 训练消费 token | **6,764,298,240** = 103215 steps x effective batch 256 x seq 256 |
| epoch | ~1.00 |
Important caveat: xtrain 当前训练入口用“全 cache 的末尾 1M token”做 held-out。追加 shard010 后v10 的 val tail
和 v9 的 val tail 不再是同一个切片。因此 v9 原报告的 2.8854 与 v10 原报告的 2.8816 不能被当作严格同一
验证集上的横比。
为了解决这个问题,本轮创建了固定 eval set
```text
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
```
它包含 shard010 末尾 11M token前 10M token 只是为了复用现有 `split_tail(val_tokens=1M)`,真正 eval 的是最后
1M token。该 fixed eval v1 对 v6-v9 都是未见数据;对 v10 也是训练时 held-out。
## Architecture
v10 与 v9 完全相同:
| 项 | 值 |
|----|----|
| dim | 1280 |
| layers | 18 |
| query heads x head_dim | 40 x 32 |
| kv heads | 10 (true GQA, group=4) |
| ffn | 4096 |
| core params | 356.89M |
| total params | 485.55M |
| export tensors | 201 |
## Training
| 项 | 值 |
|----|----|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | **103215** |
| effective global batch | **256** (`--batch 128 --accum-steps 2`) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | **23h51m** |
| steady throughput | **~79.0K tok/s** |
| peak observed memory | ~17GB / GPU |
Command:
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
/opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
--heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
--steps 103215 --batch 128 --accum-steps 2 --seq 256 \
--max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
--eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
--ckpt /dashscope-tmp/wjh/xtrain_v10.ckpt
```
## Results
- train loss: **11.1575 -> 2.9000**
- first val: step 999 = **5.3048**
- best val: step 103214 = **2.8816**
- final val: step 103214 = **2.8816**
- exit code: **0**
FineWeb moving-tail val milestones:
| step | 999 | 9999 | 19999 | 29999 | 39999 | 49999 | 59999 | 69999 | 79999 | 89999 | 99999 | final |
|------|-----|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 5.3048 | 3.5622 | 3.3282 | 3.2450 | 3.1886 | 3.1342 | 3.0714 | 3.0202 | 2.9724 | 2.9236 | 2.8950 | **2.8816** |
The curve still improved at the final eval. There is no overfit signal in this run.
## Fixed Eval V1
Fixed eval v1 (`shard010 tail 1M`, seq256, 64 eval batches):
| version | fixed eval v1 |
|---------|---------------|
| v6 | 3.2328 |
| v7 | 3.1850 |
| v8 | 3.1515 |
| v9 | 2.9278 |
| **v10** | **2.8814** |
This is the cleanest cross-version result in the v10 round. It says:
- v9's double-axis gain transfers to a shard010 holdout: v8 3.1515 -> v9 2.9278.
- v10 further improves on the new shard010 distribution: v9 2.9278 -> v10 2.8814.
- The apparent v9 moving-tail 2.8854 -> v10 moving-tail 2.8816 delta is tiny and not strict apples-to-apples.
## Decoding
Greedy decoding still repeats. Fixed prompts from xtrain:
```text
[Once upon a time] there was a king who had a daughter. She was beautiful and beautiful...
[The little] The little boy was a little boy. The little boy was a little boy...
[One day] I was walking down the street and I saw a man with a dog...
```
Temperature 0.8 is more varied and less immediately looped, but coherence remains weak:
```text
[Once upon a time] I was a kid who did not go to the beach to swim...
[The little] ones are not as loud as the adults...
[One day] I was on the edge of the water, and I saw something I had never seen before...
```
xserv loads the exported v10 true-GQA weights and generates FineWeb-like explanatory prose, but repeated sentence frames remain:
```text
[The history of] the city of San Francisco is a story of the growth of the city...
[In science,] the term "observation" is used to describe the act of observing something...
[Water is] the most important element in the human body...
```
Conclusion: decoding remains a separate bottleneck. The current xtrain sampler only supports greedy and temperature sampling; top-p and
repetition penalty exist in xserv's chat path, but not in the raw xtrain sampler or `xserv-cli` path used for weight validation. A clean
next step is to add a raw generation tool with `temperature/top-p/repetition-penalty` so decoding experiments do not depend on chat
templates.
## xserv Validation
Registry path:
```text
/opt/wjh/projects/tiny-models/v10-fineweb-edu-dim1280-gqa-data6765
```
Files:
- `config.json`
- `model.safetensors` (BF16, 201 tensors, 927MB)
- `tokenizer.json`
- `xtrain.ckpt` (fp32 master checkpoint, 1.9GB)
xserv loads v10 as true GQA:
```text
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
```
## v11 Feasibility: Bigger Model + Longer Context
A v11 smoke test prioritized the user's chosen direction: larger model plus longer context.
Candidate:
| item | value |
|------|-------|
| dim / layers | 1536 / 20 |
| heads / kv_heads | 48 / 12 |
| ffn | 6144 |
| core / total params | 684.26M / 838.65M |
| stack | bf16 + recompute + flash + accum + 8 GPU DDP |
Smoke results:
| seq | batch / accum | effective batch | peak mem | tok/s | result |
|-----|---------------|-----------------|----------|-------|--------|
| 512 | 64 / 4 | 256 | **30530 MiB** | **44.7K** | 50 steps OK |
| 1024 | 32 / 8 | 256 | **30530 MiB** | **31.0K** | 20 steps OK |
Both fit, but the memory margin is thin on 32GB RTX 5090. Expected one-epoch wall clock on 6.76B tokens:
- seq512: roughly **42h**
- seq1024: roughly **61h**
Recommendation: make v11 a controlled run, not a blind launch. Use fixed eval v1, keep data fixed, and choose either:
1. **v11a practical**: dim1536/20L, seq512, batch64/accum4. Faster, still doubles context over v10.
2. **v11b long-context**: dim1536/20L, seq1024, batch32/accum8. More aligned with "long context", but ~2.5 days and tight memory.
For scientific clarity, v11 should not append more data before training; use the current 6.765B train cache while preserving fixed eval v1.

View File

@@ -22,6 +22,10 @@ val loss 一栏给的是各版**各自训练 run 报告的 best val**held-out
**v8 改测容量轴**:同 v6/v7 子集、纯把 dim768→dim1024core 127M→226MFineWeb val 3.07/3.01→**2.98** ⇒ **v8 改测容量轴**:同 v6/v7 子集、纯把 dim768→dim1024core 127M→226MFineWeb val 3.07/3.01→**2.98** ⇒
**容量有用**v6/v7 部分 capacity-limited但增益仅 ~3%、val 末步仍在降未饱和 ⇒ **到 v8数据轴与容量轴的 **容量有用**v6/v7 部分 capacity-limited但增益仅 ~3%、val 末步仍在降未饱和 ⇒ **到 v8数据轴与容量轴的
单步杠杆都收敛到 ~3%/lever = 全面边际递减,要双轴一起 scale**Chinchilla详见 [08-v8](08-v8-fineweb-edu-dim1024.md))。 单步杠杆都收敛到 ~3%/lever = 全面边际递减,要双轴一起 scale**Chinchilla详见 [08-v8](08-v8-fineweb-edu-dim1024.md))。
**v9 兑现双轴**dim1024→dim1280core 226M→357M并把 FineWeb token 从 2.255B 子集扩到 6.013B
best val **2.8854**,相比 v8 再降 0.0947~3.2%)。结论:双轴 scale 有效,但仍是稳健增量而非质变。
**v10 只补数据轴**:同 v9 架构,只补 shard010 到 6.765B tokenmoving-tail best/final val **2.8816**
注意追加 shard 会移动 held-out tail固定 eval v1 上 v6→v10 为 **3.2328 / 3.1850 / 3.1515 / 2.9278 / 2.8814**
⚠️ **v6 起换了保留集(语料)**v0v5 的 val 都是 **TinyStories** 1M 留出集彼此可比v6 换成纯 ⚠️ **v6 起换了保留集(语料)**v0v5 的 val 都是 **TinyStories** 1M 留出集彼此可比v6 换成纯
**FineWeb-edu**(真实网页文本),它的 val3.07)是**另一把尺子上的另一个分布****不能**和 v0v5 的 **FineWeb-edu**(真实网页文本),它的 val3.07)是**另一把尺子上的另一个分布****不能**和 v0v5 的
@@ -39,18 +43,15 @@ val loss 一栏给的是各版**各自训练 run 报告的 best val**held-out
| [v6-fineweb-edu-dim768](06-v6-fineweb-edu-dim768.md) | **FineWeb-edu** 真实网页 (2.255B 语料) | ~2.29B | ~1.02 | 768 / 18 / 24·32 / 2048 (**同 v4/v5**) | 127.43M | 204.63M | **3.0652** ⚠️*(FineWeb val,与上不可比)* | **第一版脱离 TinyStories**,唯一变量=数据来源 + 8 卡 DDP bf16~1.9h/8 卡 ~218K tok/s。**val 是另一分布**(真实网页熵高,~3.0 是预期非回退),判据=采样质量+transfer。FineWeb val 末步仍单调降=未饱和;**transfer**: v6→TinyStories val **2.75**(v5 native 1.11),纯通用数据对窄分布有代价。采样: v6 写真实说明文 vs v5 一律掉进小故事 | | [v6-fineweb-edu-dim768](06-v6-fineweb-edu-dim768.md) | **FineWeb-edu** 真实网页 (2.255B 语料) | ~2.29B | ~1.02 | 768 / 18 / 24·32 / 2048 (**同 v4/v5**) | 127.43M | 204.63M | **3.0652** ⚠️*(FineWeb val,与上不可比)* | **第一版脱离 TinyStories**,唯一变量=数据来源 + 8 卡 DDP bf16~1.9h/8 卡 ~218K tok/s。**val 是另一分布**(真实网页熵高,~3.0 是预期非回退),判据=采样质量+transfer。FineWeb val 末步仍单调降=未饱和;**transfer**: v6→TinyStories val **2.75**(v5 native 1.11),纯通用数据对窄分布有代价。采样: v6 写真实说明文 vs v5 一律掉进小故事 |
| [v7-fineweb-edu-dim768](07-v7-fineweb-edu-dim768.md) | **同 v6 的 2.255B FineWeb-edu 子集**(非新数据) | ~3.28B | ~1.45 | 768 / 18 / 24·32 / 2048 (**同 v4/v5/v6**) | 127.43M | 204.63M | **3.0149** *(FineWeb val,与 v6 可比)* | **唯一变量=epoch 数**(1.02→1.45) + 8 卡 DDP bf16~4.2h/8 卡 ~218K tok/s。⚠**核心发现:同子集多 epoch 近天花板**——多喂 ~1B tokenval 仅 ↓0.05(3.07→3.01)且 ~step44000 后走平、采样无质变。真"更多数据"要**新 FineWeb shards**(更多样 token),非重复同一子集。与 v5 的 TinyStories 数据量饱和同类(重复老数据边际薄)v6 换语料才是抬天花板的轴 | | [v7-fineweb-edu-dim768](07-v7-fineweb-edu-dim768.md) | **同 v6 的 2.255B FineWeb-edu 子集**(非新数据) | ~3.28B | ~1.45 | 768 / 18 / 24·32 / 2048 (**同 v4/v5/v6**) | 127.43M | 204.63M | **3.0149** *(FineWeb val,与 v6 可比)* | **唯一变量=epoch 数**(1.02→1.45) + 8 卡 DDP bf16~4.2h/8 卡 ~218K tok/s。⚠**核心发现:同子集多 epoch 近天花板**——多喂 ~1B tokenval 仅 ↓0.05(3.07→3.01)且 ~step44000 后走平、采样无质变。真"更多数据"要**新 FineWeb shards**(更多样 token),非重复同一子集。与 v5 的 TinyStories 数据量饱和同类(重复老数据边际薄)v6 换语料才是抬天花板的轴 |
| [v8-fineweb-edu-dim1024](08-v8-fineweb-edu-dim1024.md) | **同 v6/v7 的 2.255B FineWeb-edu 子集**(非新数据) | ~2.36B | ~1.05 | **1024 / 18 / 32·32 / 2730** | **226.50M** | **329.42M** | **2.9801** *(FineWeb val,与 v6/v7 可比)* | **唯一变量=模型容量**(dim768→dim1024, core 127M→226M +78%) + bf16 + **激活重计算(T13)** 装下 dim1024~5h/8 卡 ~129K tok/s(重算税)。⭐**核心 A/B容量有用**——同 ~1ep v6 3.07→v8 **2.98**(↓0.085),且 v8(1.05ep) < v7(1.45ep 更多老数据) 3.01 放大容量 > 重复老数据 ⇒ v6/v7 部分 capacity-limited。⚠但增益仅 ~3%(与数据轴单步同量级)val 末步**仍在降未饱和**。**元结论:单轴(数据/容量)单步都已 ~3%/lever = 全面边际递减,要双轴一起 scale(Chinchilla)** | | [v8-fineweb-edu-dim1024](08-v8-fineweb-edu-dim1024.md) | **同 v6/v7 的 2.255B FineWeb-edu 子集**(非新数据) | ~2.36B | ~1.05 | **1024 / 18 / 32·32 / 2730** | **226.50M** | **329.42M** | **2.9801** *(FineWeb val,与 v6/v7 可比)* | **唯一变量=模型容量**(dim768→dim1024, core 127M→226M +78%) + bf16 + **激活重计算(T13)** 装下 dim1024~5h/8 卡 ~129K tok/s(重算税)。⭐**核心 A/B容量有用**——同 ~1ep v6 3.07→v8 **2.98**(↓0.085),且 v8(1.05ep) < v7(1.45ep 更多老数据) 3.01 放大容量 > 重复老数据 ⇒ v6/v7 部分 capacity-limited。⚠但增益仅 ~3%(与数据轴单步同量级)val 末步**仍在降未饱和**。**元结论:单轴(数据/容量)单步都已 ~3%/lever = 全面边际递减,要双轴一起 scale(Chinchilla)** |
| [v9-fineweb-edu-dim1280-gqa](09-v9-fineweb-edu-dim1280-gqa.md) | **FineWeb-edu 扩展 shards 000-009**(6.013B token) | **~6.01B** | ~1.00 | **1280 / 18 / 40·32 / 4096, kv=10 GQA** | **356.89M** | **485.55M** | **2.8854** *(moving-tail FineWeb val)* | **Chinchilla 双轴**dim1024→1280 + 真 GQA + 新 FineWeb tokenPhase-2 stack(`--flash`+accum+bf16+recompute+DDP)21.25h/8 卡 ~78.6K tok/s。相比 v8 moving-tail 再降 **0.0947 (~3.2%)**,验证双轴 scale 有效greedy 样本更像真实说明文但仍重复,增益主要体现在 val 而非质变 |
| [v10-fineweb-edu-dim1280-gqa-data6765](10-v10-fineweb-edu-dim1280-gqa-data6765.md) | **FineWeb-edu 扩展 shards 000-010**(6.765B token) | **~6.76B** | ~1.00 | **同 v9** | **356.89M** | **485.55M** | **2.8816** *(moving-tail FineWeb val)* | **只补数据轴**同架构从头训23.86h/8 卡 ~79.0K tok/s。moving-tail 比 v9 只低 0.0038,不宜过读;固定 eval v1 上 v9 **2.9278**→v10 **2.8814**,说明补 shard010 对新分布有效。greedy 复读未解决 |
## 下一档(提案) ## 下一档(提案)
- **v9**(待定方向):到 v8**数据量轴(v5/v7 饱和) / 数据广度轴(v6 一次性红利) / 容量轴(v8 有用但 ~3%)** 三根 - **v11**:优先走**更大模型 + 更长 context**而不是继续只补数据。smoke 已验证 dim1536/20L/48q/12kv/ffn6144
单轴都已测过,且**单步杠杆都收敛到 ~3%/lever = 全面边际递减**。Chinchilla 教训在小尺度复现v8 容量 +78% 却只配 能跑 seq512 和 seq1024但峰值约 30.5GiB,贴近 5090 32GB 上限。建议先做 v11aseq512约 42h
同样的 2.36B tokenval 末步仍在降 ⇒ 数据立刻成新瓶颈 ⇒ **容量与数据要匹配地一起 scale**。v9 选项: 或明确接受 2.5 天预算后做 v11bseq1024约 61h。v11 必须使用固定 eval v1避免 moving-tail 继续污染横比。
**1. 双轴一起 scale最符合 Chinchilla更大模型 + 新 FineWeb shards真 scale 但大投入)**
**2. dim1024 多喂数据最便宜v8 才 1.05ep 未饱和,续训到 23ep / 加新 shards直接验证容量是否被数据卡住)**
**3. 自然收尾8 版 + 从零全栈 + 三轴完整分析 + Chinchilla 边际元结论,学习线已讲完整个故事)**
详见 [08-v8](08-v8-fineweb-edu-dim1024.md) 末尾 "v9 提案"。
> **v7 时的提案(已被 v8 兑现,归档)**v7 把首选定为「新 FineWeb shards」把「更大模型(dim1024+,容量轴, > **v7 时的提案(已被 v8 兑现,归档)**v7 把首选定为「新 FineWeb shards」把「更大模型(dim1024+,容量轴,
> 需先做 T13 激活重计算)」列为待测。**v8 走了容量轴**并证明它有用(但 ~3%),把「是否 capacity-limited」从 > 需先做 T13 激活重计算)」列为待测。**v8 走了容量轴**并证明它有用(但 ~3%),把「是否 capacity-limited」从
> 悬念变成了「部分是」的结论。 > 悬念变成了「部分是」的结论。
</content>