xtrain

Author	SHA1	Message	Date
Gahow Wang	f26db882e5	Merge t16-grad-accum into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> # Conflicts: # README.md # docs/evolution.md	2026-06-18 00:37:11 +08:00
Gahow Wang	8bd7db16e1	docs: T16 grad-accum results — evolution row + README build-journey dash5-verified gate numbers: accum=N bit-close to N× big batch (loss 8.5e-8 / grad 3.8e-5), accum=1 bit-identical (0.0), DDP+accum matches single-GPU (5.7e-7), memory flat (same effective batch 64: 27.7GB big → 7.2GB accum, −74%), xserv closed loop md5-identical + token-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:52:32 +08:00
Gahow Wang	9064ced4c2	docs: T14 flash-attention results + evolution/README rows Fill in the design doc's measured results (grad-check, flash==composed, PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to evolution.md (Algorithm/Infra) and the README build-journey table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:34:10 +08:00
Gahow Wang	511f35d40c	docs: run v8 — dim1024 capacity helps (val 2.98) v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute. 8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h. Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 / v8 2.9801. Capacity helps — v8 (1.05ep) beats v6 at the same ~1ep by 0.085 AND beats v7 (smaller model, 1.45ep more old data) by 0.035 ⇒ v6/v7 were partly capacity-limited, scaling capacity > repeating old data. But the gain is only ~3% (same magnitude as the data-axis single-step lever), and v8's val was still descending at the end (not saturated). Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6, capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress, scale capacity AND data together (Chinchilla, reproduced at toy scale). - docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples - docs/runs/README.md: +v8 row, v9 proposal - docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 15:12:01 +08:00
Gahow Wang	9c557f0609	docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01) v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256), trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's 1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to registry v7-fineweb-edu-dim768, serves in xserv (coherent expository English, ~v6 quality). Key finding: more epochs of the SAME subset gave only ~0.05 val drop and the curve flattened (~step 44000) with no sampling quality gain → the 2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's TinyStories data-volume saturation: repeating old data has thin margins; true further gains need FRESH shards (more diverse tokens), as v6's corpus-swap (which raised the ceiling) showed. Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis ceiling note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 03:55:47 +08:00
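Gahow Wang	b4bb426d48	docs: run v6 — FineWeb-edu graduation (val 3.07, new distribution) First version to move off TinyStories: pure FineWeb-edu real web text (2.255B-token corpus), same architecture as v4/v5 (dim768/18L, core 127.43M), 8-GPU DDP bf16, 2.29B tok / 1.02 ep, ~1.9h @ 218K tok/s. train 11.03→3.14, best/final FineWeb val 3.0652. Methodology: FineWeb val (3.07) is not comparable to the TinyStories val (~1.1) of v0–v5: real web text has higher entropy, so ~3.0 is expected and not a regression; the criteria are sampling quality + transfer eval. - Adds docs/runs/06-v6-fineweb-edu-dim768.md: data pipeline (scripts/fineweb_to_txt.py) / architecture (same as v4/v5, isolating the data variable) / hyperparameters / results (val decreases monotonically with no plateau = not saturated) / methodology notes / transfer eval (v6→TinyStories val 2.75 vs v5 native 1.11; pure general data has a cost on a narrow distribution) / v5-vs-v6 same-prompt sampling comparison (v6 writes real expository prose vs v5 always falling into short stories) - README comparison table: add v6 row (val labeled with its own distribution) + axis-change note + v7 proposal - evolution.md scaling table: v6 row finalized + data-axis TinyStories→FineWeb-edu graduation note Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 22:21:43 +08:00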
Gahow Wang	88bec270af	docs: evolution overview — per-milestone changes across algorithm/arch/infra/dataset axes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:30:52 +08:00

7 Commits
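
The T16 commit (8bd7db16e1) reports that accum=N is bit-close to one N× larger batch and that accum=1 is bit-identical. The property behind those gate numbers can be sketched in a few lines of plain Python. This is an illustrative toy, not xtrain's code: the 1-D linear model and the names `mse_grad` and `accumulated_grad` are assumptions made for the example, and it assumes equal-sized micro-batches with a mean-reduced loss.

```python
import random


def mse_grad(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for the toy model y ~ w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accum_steps):
    """Sum per-micro-batch mean gradients, each scaled by 1/accum_steps.

    Hypothetical helper: assumes the batch splits into equal micro-batches.
    """
    assert len(xs) % accum_steps == 0
    micro = len(xs) // accum_steps
    total = 0.0
    for k in range(accum_steps):
        lo, hi = k * micro, (k + 1) * micro
        # Scaling by 1/accum_steps makes the sum equal the big-batch mean.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]  # effective batch of 64
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5

big = mse_grad(w, xs, ys)                             # one big batch
acc = accumulated_grad(w, xs, ys, accum_steps=4)      # 4 micro-batches of 16
same = accumulated_grad(w, xs, ys, accum_steps=1)     # degenerate case

print("accum=4 vs big batch:", abs(big - acc))
print("accum=1 vs big batch:", abs(big - same))
```

With accum=1 the two code paths perform the same arithmetic, so the difference is exactly zero. With accum=4 the summation order changes, so the results agree only up to floating-point rounding, which is the same kind of gap the commit reports at a larger scale.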