xtrain

Author	SHA1	Message	Date
Gahow Wang	7a1fba95b5	docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check - run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens, 81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2% return confirms the 1B base is now data-limited. - run 13: three SFT stages from the v12 base (synthetic / anchor / real-mix-repair). The pipeline works and produces a chat-shaped model that follows the format and stops, but none of the variants is a stable high-quality chat model — bottleneck is SFT data quality + selection signal (val loss decouples from generation quality), not infra. - scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:19:12 +08:00
Gahow Wang	5c27493a90	docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases Add per-run design+result docs for the two Chinchilla-axis runs that were done but never committed: - v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale, best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain still incremental, greedy repetition remains. - v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814. Extend the comparison tables in docs/runs/README.md and docs/evolution.md to v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No code changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:18:48 +08:00
Gahow Wang	511f35d40c	docs: run v8 — dim1024 capacity helps (val 2.98) v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute. 8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h. Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 / v8 2.9801. Capacity helps — v8 (1.05ep) beats v6 at the same ~1ep by 0.085 AND beats v7 (smaller model, 1.45ep more old data) by 0.035 ⇒ v6/v7 were partly capacity-limited, scaling capacity > repeating old data. But the gain is only ~3% (same magnitude as the data-axis single-step lever), and v8's val was still descending at the end (not saturated). Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6, capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress, scale capacity AND data together (Chinchilla, reproduced at toy scale). - docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples - docs/runs/README.md: +v8 row, v9 proposal - docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 15:12:01 +08:00
Gahow Wang	9c557f0609	docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01) v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256), trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's 1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to registry v7-fineweb-edu-dim768, serves in xserv (coherent expository English, ~v6 quality). Key finding: more epochs of the SAME subset gave only ~0.05 val drop and the curve flattened (~step 44000) with no sampling quality gain → the 2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's TinyStories data-volume saturation: repeating old data has thin margins; true further gains need FRESH shards (more diverse tokens), as v6's corpus-swap (which raised the ceiling) showed. Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis ceiling note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 03:55:47 +08:00
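Gahow Wang	b4bb426d48	docs: run v6 - FineWeb-edu graduation (val 3.07, new distribution) First run off TinyStories: pure FineWeb-edu real web text (2.255B-token corpus), same architecture as v4/v5 (dim768/18L, core 127.43M), 8-GPU DDP bf16, 2.29B tok / 1.02 ep, ~1.9h @218K tok/s. train 11.03→3.14, best/final FineWeb val 3.0652. Methodology: FineWeb val (3.07) is not comparable with the TinyStories val (~1.1) of v0–v5. Real web text has higher entropy, so ~3.0 is expected and not a regression; the criteria are sampling quality + transfer eval. - Add docs/runs/06-v6-fineweb-edu-dim768.md: data pipeline (scripts/fineweb_to_txt.py) / architecture (same as v4/v5, isolating the data variable) / hyperparameters / results (val decreases monotonically with no plateau = not saturated) / methodology notes / transfer eval (v6→TinyStories val 2.75 vs v5 native 1.11; pure general data has a cost on a narrow distribution) / v5-vs-v6 same-prompt sampling comparison (v6 writes real expository text vs v5 always falling into short stories) - README comparison table: add v6 row (val labeled with its own distribution) + axis-change note + v7 proposal - evolution.md: finalize the v6 row of the scaling table + note on the data-axis graduation TinyStories→FineWeb-edu Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 22:21:43 +08:00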
Gahow Wang	579365f4a0	docs: run v5 - TinyStories saturation at dim768 (val 1.11) Design doc 05-v5-tinystories-dim768.md (in Chinese, xserv style): data 2.49B tok / 5.33 ep, same architecture as v4 (cleanly isolating the data variable), bf16 8-GPU global 256, train 11.07→1.06, best val 1.1102. Key finding, the "data ceiling": v4 (1.54 ep) 1.169 → v5 (5.33 ep) 1.110 is only a 5% drop and val flattened at the end ⇒ TinyStories is near saturation at dim768/127M-core, so v6 should change axis (larger model / broader corpus, not more TinyStories). xserv BF16 serving matches token-for-token on 3/3 prompts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 17:56:25 +08:00
Gahow Wang	8a1e29543b	run: v5 archive + export (dim768, bf16, 5.33ep, val 1.11) Add a v5 row to the v0–v5 comparison table plus tokens-trained / epoch columns, making TinyStories data saturation visible (v4→v5: same arch, ×3.5 data, val down only 5% and flat at the end). Next-rung proposal changed to a v6 axis change. Exported 201 tensors + RUN.md, archived to dash5 registry v5-tinystories-dim768 (checkpoint/safetensors not committed to the repo). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 17:56:25 +08:00
Gahow Wang	ff79fee3c5	docs: run v4 — TinyStories, dim768, val 1.17 Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep / arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128 per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8 DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU / v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32 batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table to v0/v1/v2/v3/v4 and the next-rung proposal to v5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 13:14:37 +08:00
Gahow Wang	a78502e0f0	docs: run v3 — TinyStories, dim512, val 1.30 Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup 1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40 best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 03:37:45 +08:00
Gahow Wang	64b2a8c09e	run: v3 archive + export (dim512, single-GPU batched, val 1.30) v3 trained (30000 steps × batch 32 × seq 256 = 245.8M tok, ~0.53 epoch), single-GPU batched via T10 (~26K tok/s, ~2.65h). Archived to registry ~/projects/tiny-models/v3-tinystories-dim512/ (xtrain.ckpt + config.json + model.safetensors BF16 179 tensors + tokenizer.json + RUN.md) and served in xserv (loads 16L/dim512 qwen3, 2/3 prompts token-match xtrain greedy; 3rd diverges on BF16 drift as in v1/v2). best/final val 1.3027 (beats ~1.4 target). val ladder on the same held-out 1M-token set: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30. T10 (batched forward) validated at scale (KI-1 root cause = launch-bound, not all-reduce); single-GPU avoids KI-5. Update docs/runs/README.md comparison table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 03:37:36 +08:00
Gahow Wang	bf679f6f1f	docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71) Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4 RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58 and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts (3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat (global batch too small → all-reduce dominates) → links docs/known-issues KI-1; v3 proposal applies KI-1's fix (much larger global batch). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 22:38:31 +08:00
Gahow Wang	264660527f	docs: run v1 — TinyStories full, dim256 docs/runs/01-v1-tinystories-dim256.md + docs/runs/README.md comparison table. v1: full TinyStories train (468.3M tok, u16-cached) + dim256/8L (core 8.39M). Same-held-out-set val loss v0 3.8050 → v1 2.5847 (−1.22); v1 samples coherent stories vs v0's "mommy's mommy's mommy" loop; exports + serves token-identical in xserv. Single RTX 5090, ~25.9 min, ~3310 tok/s. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:09:46 +08:00
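
Several messages above quote a token budget next to steps, batch size and sequence length (v3: 30000 steps × batch 32 × seq 256; v4: 22000 steps, global batch 128, seq 256). The sketch below reproduces that arithmetic as a sanity check. The function names are hypothetical and not taken from the xtrain code. The epoch estimate assumes the 468.3M-token TinyStories train size quoted in the v1 message applies to v3 and v4 as well, so it is only approximate.

```python
# Sanity-check sketch for the token budgets quoted in the v3/v4 commit messages.
# Function names are illustrative; they are NOT part of xtrain.

# Assumption: full TinyStories train size, taken from the v1 message (468.3M tok).
TINYSTORIES_TRAIN_TOKENS = 468.3e6


def global_batch(per_rank_batch: int, world_size: int) -> int:
    """Global batch under data parallelism: every rank contributes per_rank_batch sequences."""
    return per_rank_batch * world_size


def tokens_trained(steps: int, batch: int, seq_len: int) -> int:
    """Tokens consumed by a run: each step processes `batch` sequences of `seq_len` tokens."""
    return steps * batch * seq_len


def epochs(tokens: float, corpus_tokens: float) -> float:
    """Approximate number of passes over a corpus implied by a token count."""
    return tokens / corpus_tokens


if __name__ == "__main__":
    runs = {
        "v3": (30_000, global_batch(32, 1), 256),  # single GPU, batch 32
        "v4": (22_000, global_batch(16, 8), 256),  # per-rank 16 on 8 GPUs
    }
    for name, (steps, batch, seq_len) in runs.items():
        tok = tokens_trained(steps, batch, seq_len)
        ep = epochs(tok, TINYSTORIES_TRAIN_TOKENS)
        print(f"{name}: {tok / 1e6:.1f}M tok, ~{ep:.2f} ep")
```

The token counts come out at 245.76M and 720.896M, in line with the 245.8M and 720.9M figures in the log. The epoch values depend on the assumed corpus size and may differ from the logged ones in the last digit.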

12 Commits
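
The v3 and v4 messages describe the learning-rate schedule only as "lr 6e-4->6e-5 warmup N + cosine". A minimal sketch of one common reading follows: linear warmup to the peak, then cosine decay to the floor at the final step. The linear warmup shape, the function name `lr_at`, and the exact step indexing are assumptions; the log does not state them and xtrain may differ.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          lr_max: float = 6e-4, lr_min: float = 6e-5) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumed shape (not confirmed by the log):
      - steps [0, warmup_steps): ramp linearly up to lr_max
      - steps [warmup_steps, total_steps]: cosine decay from lr_max down to lr_min
    """
    if step < warmup_steps:
        # step + 1 so the first step already has a non-zero learning rate
        return lr_max * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # v4 settings from the log: 22000 steps, warmup 1100, lr 6e-4 -> 6e-5
    for s in (0, 1099, 1100, 11550, 22000):
        print(s, f"{lr_at(s, total_steps=22000, warmup_steps=1100):.3e}")
```

With the v4 numbers this schedule reaches the 6e-4 peak at the end of warmup and the 6e-5 floor at step 22000, matching the endpoints given in the log. Everything between those endpoints follows from the assumed shape.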