docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases

Add per-run design+result docs for the two Chinchilla-axis runs that were done but never committed:

- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale, best moving-tail val 2.8854 (~3.2% below v8). Direction validated, gain still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:18:48 +08:00
parent a1370446fe
commit 5c27493a90
5 changed files with 402 additions and 31 deletions
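
The fixed eval v1 figures quoted in the commit message can be turned into per-step deltas with a few lines of arithmetic. This is a minimal sketch, not part of the commit: the helper name `step_deltas` and the dictionary layout are hypothetical, and the loss values are copied verbatim from the message above.

```python
# Fixed eval v1 val losses, copied from the commit message (v6 -> v10).
fixed_eval = {
    "v6": 3.2328,
    "v7": 3.1850,
    "v8": 3.1515,
    "v9": 2.9278,
    "v10": 2.8814,
}


def step_deltas(losses):
    """Return (prev, cur, absolute drop, relative drop in %) for consecutive runs.

    Hypothetical helper for illustration; relies on dict insertion order.
    """
    names = list(losses)
    rows = []
    for prev, cur in zip(names, names[1:]):
        drop = losses[prev] - losses[cur]
        rows.append((prev, cur, drop, 100.0 * drop / losses[prev]))
    return rows


deltas = step_deltas(fixed_eval)
for prev, cur, drop, pct in deltas:
    print(f"{prev} -> {cur}: -{drop:.4f} ({pct:.2f}%)")
```

Because every run is scored on the same held-out tokens, these deltas are comparable across versions, which the moving-tail numbers are not.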
--- a/README.md
+++ b/README.md
@@ -6,14 +6,16 @@ inference side). A learning project: hand-write the entire training-systems stac
 gradient checkpointing), then use it to run a multi-version **scaling study** that maps
 the data-vs-capacity frontier for a tiny model.

-> **Status: complete — two phases.**
+> **Status: complete — three phases.**
 > **Phase 1** = the from-scratch full stack (T1–T13) + an 8-version scaling study (v0–v8):
 > hand-write the whole training-systems stack, then map the data-vs-capacity frontier.
 > **Phase 2** = systems-stack depth (T14–T18): hand-write the five deferred training-stack
 > features — fused flash-attention, real GQA, gradient accumulation, process-per-GPU DDP,
-> dropout. Trains a Qwen3-compatible LM whose weights load into **xserv** and generate
-> **token-identical** output — the closed loop held byte-for-byte across both phases. This
-> README is the capstone; per-topic detail lives in [`docs/`](docs/).
+> dropout. **Phase 3** = one Chinchilla-style double-axis run (v9): dim1280 true-GQA +
+> 6.01B FineWeb tokens, validating the v8 conclusion that data and capacity must scale
+> together. Trains Qwen3-compatible LMs whose weights load into **xserv**; deterministic
+> gates stay byte-identical, while large BF16 checkpoints are served and checked for
+> prompt-level drift. This README is the capstone; per-topic detail lives in [`docs/`](docs/).

 ---

@@ -34,7 +36,8 @@ borrows, the rest hand-written CUDA + Rust:

 Every op's backward is verified against **finite differences** and against **PyTorch**
 (forward + per-parameter grads, batch > 1). Trained weights export to HF-safetensors and
-load into xserv (Qwen3, BF16) producing token-identical greedy output — the closed loop.
+load into xserv (Qwen3, BF16); deterministic fixtures produce token-identical greedy output,
+and large checkpoints are validated end-to-end in the serving path.

 ## The build journey — Phase 1 (T1–T13) + Phase 2 (T14–T18)

@@ -106,7 +109,7 @@ Each is opt-in, kept the default path **bit-identical**, and held a **hard corre
  residual ~5×@8; with all 8 GPUs at 95–99% util, the residual is the **NCCL all-reduce + PCIe
  topology wall**, not context serialization. The third profile-first falsification (see below).

-## The scaling study — v0 → v8
+## The scaling study — v0 → v10

 Same Qwen3-style architecture throughout; we scaled **dim** and **data** and read out val
 loss (full per-run detail in [`docs/runs/`](docs/runs/)).
@@ -119,11 +122,13 @@ loss (full per-run detail in [`docs/runs/`](docs/runs/)).
 | v6 | FineWeb-edu 1.02ep | 768 / 127M | 3.07\* | **corpus swap → graduates to real text** |
 | v7 | FineWeb-edu 1.45ep | 768 / 127M | 3.01\* | same subset, more epochs → near-ceiling |
 | **v8** | FineWeb-edu 1.05ep | **1024 / 226M** | **2.98\*** | **capacity → helps** |
+| **v9** | FineWeb-edu 6.01B / ~1ep | **1280 / 357M + GQA** | **2.89\*** | **data + capacity → helps** |
+| **v10** | FineWeb-edu 6.76B / ~1ep | **1280 / 357M + GQA** | **2.88\*** | **data-only top-up → small gain** |

 \* FineWeb-edu val is a different (harder) distribution — **not comparable** to the
 TinyStories val of v0–v5. Judge v6+ by sample quality + transfer, not the number.

-### Three findings
+### Five findings

 1. **Data volume saturates.** TinyStories at dim768: 3.5× more tokens (v4→v5) bought only
   −5% val, curve flat. The narrow synthetic corpus is exhausted at this model size.
@@ -132,10 +137,18 @@ TinyStories val of v0–v5. Judge v6+ by sample quality + transfer, not the numb
   historical/scientific expository prose. (Cost: TinyStories transfer val 1.11 → 2.75.)
 3. **Capacity helps.** v8 (dim1024, ~1 epoch) beats both v6 (dim768, same epoch, by 0.085)
   and v7 (dim768, *more* data, by 0.035) → the dim768 runs were partly capacity-limited.
+4. **Double-axis scale helps.** v9 scales both axes (dim1280/core357M + 6.01B FineWeb tokens)
+   and beats v8 by another 0.095 val loss (~3.2%). The direction is validated, but the gain is
+   still incremental and greedy decoding still repeats.
+5. **Moving validation tails must stop.** v10 added one more FineWeb shard and got moving-tail
+   val 2.8816, but appending data moves the held-out tail. A fixed eval v1 was created from the
+   shard010 tail: v6/v7/v8/v9/v10 = 3.2328 / 3.1850 / 3.1515 / 2.9278 / 2.8814. Future runs
+   should report this fixed eval first.

-**Meta-finding:** every *single*-axis lever (data volume, corpus breadth, capacity) is now
-worth only **~3%**. Per the Chinchilla lesson, further gains require scaling **data and
-capacity together** — single-axis moves are exhausted.
+**Meta-finding:** every lever is now in the **~3% or smaller** regime. Single-axis moves were
+exhausted by v8; v9 confirms Chinchilla-style double-axis scale works; v10 shows a data-only
+top-up mostly adapts to the new shard. The next useful run should change model/context, not just
+append another shard.

 ## Efficiency — throughput & MFU

@@ -166,18 +179,18 @@ versions — a fixed-MFU estimate is off by up to ~100× for the early launch-bo
  the line: flash == composed SDPA (grads/PyTorch), GQA group=1 bit-identical to MHA, gradient
  accumulation `accum=1` bit-identical, dropout p=0 bit-identical *and* dropout × recompute
  bit-exact, the default path unchanged on every feature, and the **xserv closed-loop md5
-  byte-identical (`b04fc9f9`) throughout both phases**.
- **The closed loop matters.** Exporting to xserv and checking token-identical greedy output
-  caught real bugs and proved the whole stack end-to-end.
+  byte-identical (`b04fc9f9`) throughout the deterministic gates**.
+- **The closed loop matters.** Exporting to xserv and checking generated continuations caught
+  real bugs and proved the whole stack end-to-end.

 ## Running it

 Everything trains on a remote 8× RTX 5090 box; model artifacts live in a registry
-(`tiny-models/v0…v8`). Serve any trained version in xserv:
+(`tiny-models/v0…v10`). Serve any trained version in xserv:

 ```bash
 # on the GPU box
-cargo run -p xserv-model --release --bin xserv-cli -- <registry>/v8-fineweb-edu-dim1024 --max-tokens 100
+cargo run -p xserv-model --release --bin xserv-cli -- <registry>/v10-fineweb-edu-dim1280-gqa-data6765 --max-tokens 100
 # then type a prompt, e.g.  In science,
 ```

@@ -192,6 +205,6 @@ cargo test --workspace            # autograd grad-checks, PyTorch parity, DDP, e
 ## Doc index

 - [`docs/evolution.md`](docs/evolution.md) — per-milestone changes across algorithm / architecture / infra / dataset.
- [`docs/runs/README.md`](docs/runs/README.md) — the v0–v8 comparison; [`docs/runs/0N-*.md`](docs/runs/) — per-run detail.
+- [`docs/runs/README.md`](docs/runs/README.md) — the v0–v10 comparison; [`docs/runs/0N-*.md`](docs/runs/) — per-run detail.
 - [`docs/00-*` … `17-*`](docs/) — per-phase design docs (build chain → tensor → autograd → transformer → training → perf → distributed → export → batched → allocator → bf16 → recompute → flash-attention → GQA → grad-accum → process-per-GPU → dropout).
 - [`docs/known-issues.md`](docs/known-issues.md) — perf backlog (KI-1/2/3/5 fixed; process-per-GPU CLOSED = measured no-op; KI-4 = accepted modeling tradeoff).
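
Finding 5 in the README diff above says that appending data moves the held-out tail. The toy sketch below illustrates that effect under stated assumptions: integers stand in for tokens, and `moving_tail_split` and `fixed_split` are hypothetical helpers, not functions from this repository. The real fixed eval v1 is described only as being taken from the shard010 tail; the pinned index range here is an illustrative stand-in.

```python
def moving_tail_split(tokens, n_val):
    """Hold out the last n_val tokens of whatever the corpus currently is."""
    return tokens[:-n_val], tokens[-n_val:]


def fixed_split(tokens, val_start, val_end):
    """Hold out a pinned index range; everything else is training data."""
    return tokens[:val_start] + tokens[val_end:], tokens[val_start:val_end]


# Toy corpus (hypothetical data): integers stand in for tokens.
corpus_a = list(range(100))                   # before the data top-up
corpus_b = corpus_a + list(range(100, 120))   # after appending one more shard

# Moving tail: the held-out set changes when the corpus grows,
# and the old held-out tokens end up in the new training set.
_, tail_a = moving_tail_split(corpus_a, 10)
train_b_moving, tail_b = moving_tail_split(corpus_b, 10)
old_val_now_trained_on = set(tail_a) <= set(train_b_moving)

# Fixed split: the held-out set is identical before and after the top-up.
_, fixed_a = fixed_split(corpus_a, 90, 100)
train_b, fixed_b = fixed_split(corpus_b, 90, 100)

print("moving tail unchanged:", tail_a == tail_b)
print("old tail leaked into training:", old_val_now_trained_on)
print("fixed eval unchanged:", fixed_a == fixed_b)
```

In this toy setup the pinned range stays out of the training set after the top-up, which is the property that makes a fixed eval comparable across runs.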