docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71)
Scaling run v2 design doc + comparison-table update.

v2 = dim384/12L/12h SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4 RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear improvement over v1 2.58 and v0 3.80.

Exported to xserv (135 BF16 tensors) and archived in the dash5 registry; xserv greedy decoding token-matches xtrain on 2/3 fixed prompts (the 3rd diverges late under BF16 drift).

Records the DDP weak-scaling caveat (global batch too small → all-reduce dominates) and links docs/known-issues KI-1; the v3 proposal applies KI-1's fix (much larger global batch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -11,15 +11,18 @@ validated as servable in xserv format.

## Comparison table

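The val loss column is computed on the **same held-out 1M tokens** (tail slice of v1 train) by running `bin/train --eval-ckpt`
on each of the two checkpoints separately: same metric, fair comparison.
The val loss column gives **the best val reported by each version's own training run** (held-out 1M tokens, tail slice of the full train set).
Note: v0/v1 trained with seq128 and v2 with seq256, so the eval windows differ → on the same held-out set with the same eval settings (seq256/64batch),
re-evaluation gives v1=2.6756→v2=2.0418 (0.634 lower, apples-to-apples); the best-val figures in the table below point the same way.
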
| Version | Data | Architecture (dim/L/heads·hd/ffn) | Core params | Total params | val loss | Notes |
|---|---|---|---|---|---|---|
| [v0-baseline](../../docs/05-training-loop.md) | TinyStories valid 3MB slice (~720K tok) | 32 / 4 / 2·16 / 64 | ~41K | 3.26M | **3.8050** | Too small to be usable; sampling gets stuck in a "mommy's mommy's mommy" loop |
| [v1-tinystories-dim256](01-v1-tinystories-dim256.md) | TinyStories **full train** (468.3M tok, u16 cache) | 256 / 8 / 8·32 / 1024 | 8.39M | 34.13M | **2.5847** | Full data + dim256/8L; val 1.22 lower than v0, samples read as coherent full stories; ~25.9min on a single GPU |
| [v2-tinystories-dim384](02-v2-tinystories-dim384.md) | TinyStories full (reuses v1 cache, trained on ~36.9M tok) | 384 / 12 / 12·32 / 1536 | 28.32M | 66.92M | **1.7055** | dim384/12L + **4-GPU DDP**; val 0.88 lower than v1, longer plots; ~2.8h on 4 GPUs. ⚠️ For DDP weak scaling see [KI-1](../known-issues.md) |

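The core and total parameter counts in the table can be roughly reproduced from the architecture column. The sketch below is only a sanity check under assumptions that are not stated in this doc: a bias-free pre-norm transformer with four attention projections, a three-matrix SwiGLU MLP, two norm gain vectors per layer plus a final norm, and untied input/output embeddings over a 50257-token vocabulary. The helper names `core_params` and `total_params` are hypothetical.

```python
def core_params(dim: int, n_layers: int, ffn: int) -> int:
    """Non-embedding weights for an ASSUMED bias-free pre-norm layout."""
    attn = 4 * dim * dim   # Wq, Wk, Wv, Wo
    mlp = 3 * dim * ffn    # SwiGLU: gate, up, down projections
    norms = 2 * dim        # two norm gain vectors per layer
    return n_layers * (attn + mlp + norms) + dim  # + final norm gain


def total_params(dim: int, n_layers: int, ffn: int,
                 vocab: int = 50257, tied: bool = False) -> int:
    """Core weights plus embeddings; vocab size and untied embeddings are assumptions."""
    emb = vocab * dim * (1 if tied else 2)
    return core_params(dim, n_layers, ffn) + emb


if __name__ == "__main__":
    # (name, dim, layers, ffn) taken from the architecture column of the table.
    for name, dim, n_layers, ffn in [("v0", 32, 4, 64),
                                     ("v1", 256, 8, 1024),
                                     ("v2", 384, 12, 1536)]:
        core = core_params(dim, n_layers, ffn)
        total = total_params(dim, n_layers, ffn)
        print(f"{name}: core={core / 1e6:.2f}M total={total / 1e6:.2f}M")
```

Under these assumptions the v0 and v2 rows match the table to two decimals, while the v1 total comes out about 0.01M below the listed 34.13M, so the real layout may differ slightly from this sketch.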
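## Next tier (proposal)

- **v2** (pending dispatch): see "v2 proposal" at the end of `01-v1-*.md`.
- **v3** (pending dispatch): see "v3 proposal" at the end of `02-v2-*.md`: first fix KI-1 (increase the global batch to restore DDP scaling),
  then scale up to dim512/16L (~75M core) with more steps, and move to a broader corpus once TinyStories nears its ceiling.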
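The KI-1 reasoning (a small global batch lets all-reduce dominate) can be illustrated with back-of-the-envelope arithmetic. The sketch below derives the v2 global batch from the reported 4500 steps and ~36.9M tokens at seq256, then estimates gradient traffic per trained token. It assumes fp32 gradients, a ring all-reduce of every parameter on every step, and no gradient accumulation; none of these are confirmed by this doc, and the function names are hypothetical.

```python
def ring_allreduce_bytes_per_gpu(n_params: int, bytes_per_param: int,
                                 world_size: int) -> float:
    """Bytes each rank sends in one ring all-reduce of the full gradient."""
    return 2 * (world_size - 1) / world_size * n_params * bytes_per_param


def comm_bytes_per_token(global_batch_seqs: float, seq_len: int, n_params: int,
                         bytes_per_param: int = 4, world_size: int = 4) -> float:
    """Gradient bytes sent per GPU, divided by tokens that GPU processes per step."""
    tokens_per_gpu = global_batch_seqs * seq_len / world_size
    sent = ring_allreduce_bytes_per_gpu(n_params, bytes_per_param, world_size)
    return sent / tokens_per_gpu


if __name__ == "__main__":
    tokens_per_step = 36.9e6 / 4500         # from the reported run totals
    global_seqs = tokens_per_step / 256     # v2 trains at seq256
    print(f"tokens/step ~ {tokens_per_step:.0f}, global batch ~ {global_seqs:.1f} seqs")

    n_params = 66_920_000                   # v2 total params from the table
    for scale in (1, 4, 16):                # hypothetical global-batch multipliers
        b = comm_bytes_per_token(global_seqs * scale, 256, n_params)
        print(f"batch x{scale}: ~{b / 1e3:.1f} KB of gradient traffic per token")
```

The all-reduce volume per step depends only on model size and GPU count, so traffic per token falls in inverse proportion to the global batch. That matches the direction of the v3 proposal, although real step times also depend on interconnect bandwidth and on how much communication overlaps with compute.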