docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases

Add per-run design+result docs for the two Chinchilla-axis runs that were
done but never committed:
- v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale,
  best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain
  still incremental, greedy repetition remains.
- v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed
  eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814.

Extend the comparison tables in docs/runs/README.md and docs/evolution.md to
v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No
code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-29 16:18:48 +08:00
parent a1370446fe
commit 5c27493a90
5 changed files with 402 additions and 31 deletions

View File

@@ -6,14 +6,16 @@ inference side). A learning project: hand-write the entire training-systems stac
gradient checkpointing), then use it to run a multi-version **scaling study** that maps
the data-vs-capacity frontier for a tiny model.
> **Status: complete — two phases.**
> **Status: complete — three phases.**
> **Phase 1** = the from-scratch full stack (T1T13) + an 8-version scaling study (v0v8):
> hand-write the whole training-systems stack, then map the data-vs-capacity frontier.
> **Phase 2** = systems-stack depth (T14T18): hand-write the five deferred training-stack
> features — fused flash-attention, real GQA, gradient accumulation, process-per-GPU DDP,
> dropout. Trains a Qwen3-compatible LM whose weights load into **xserv** and generate
> **token-identical** output — the closed loop held byte-for-byte across both phases. This
> README is the capstone; per-topic detail lives in [`docs/`](docs/).
> dropout. **Phase 3** = one Chinchilla-style double-axis run (v9): dim1280 true-GQA +
> 6.01B FineWeb tokens, validating the v8 conclusion that data and capacity must scale
> together. Trains Qwen3-compatible LMs whose weights load into **xserv**; deterministic
> gates stay byte-identical, while large BF16 checkpoints are served and checked for
> prompt-level drift. This README is the capstone; per-topic detail lives in [`docs/`](docs/).
---
@@ -34,7 +36,8 @@ borrows, the rest hand-written CUDA + Rust:
Every op's backward is verified against **finite differences** and against **PyTorch**
(forward + per-parameter grads, batch > 1). Trained weights export to HF-safetensors and
load into xserv (Qwen3, BF16) producing token-identical greedy output — the closed loop.
load into xserv (Qwen3, BF16); deterministic fixtures produce token-identical greedy output,
and large checkpoints are validated end-to-end in the serving path.
## The build journey — Phase 1 (T1T13) + Phase 2 (T14T18)
@@ -106,7 +109,7 @@ Each is opt-in, kept the default path **bit-identical**, and held a **hard corre
residual ~5×@8; with all 8 GPUs at 9599% util, the residual is the **NCCL all-reduce + PCIe
topology wall**, not context serialization. The third profile-first falsification (see below).
## The scaling study — v0 → v8
## The scaling study — v0 → v10
Same Qwen3-style architecture throughout; we scaled **dim** and **data** and read out val
loss (full per-run detail in [`docs/runs/`](docs/runs/)).
@@ -119,11 +122,13 @@ loss (full per-run detail in [`docs/runs/`](docs/runs/)).
| v6 | FineWeb-edu 1.02ep | 768 / 127M | 3.07\* | **corpus swap → graduates to real text** |
| v7 | FineWeb-edu 1.45ep | 768 / 127M | 3.01\* | same subset, more epochs → near-ceiling |
| **v8** | FineWeb-edu 1.05ep | **1024 / 226M** | **2.98\*** | **capacity → helps** |
| **v9** | FineWeb-edu 6.01B / ~1ep | **1280 / 357M + GQA** | **2.89\*** | **data + capacity → helps** |
| **v10** | FineWeb-edu 6.76B / ~1ep | **1280 / 357M + GQA** | **2.88\*** | **data-only top-up → small gain** |
\* FineWeb-edu val is a different (harder) distribution — **not comparable** to the
TinyStories val of v0v5. Judge v6+ by sample quality + transfer, not the number.
### Three findings
### Four findings
1. **Data volume saturates.** TinyStories at dim768: 3.5× more tokens (v4→v5) bought only
5% val, curve flat. The narrow synthetic corpus is exhausted at this model size.
@@ -132,10 +137,18 @@ TinyStories val of v0v5. Judge v6+ by sample quality + transfer, not the numb
historical/scientific expository prose. (Cost: TinyStories transfer val 1.11 → 2.75.)
3. **Capacity helps.** v8 (dim1024, ~1 epoch) beats both v6 (dim768, same epoch, by 0.085)
and v7 (dim768, *more* data, by 0.035) → the dim768 runs were partly capacity-limited.
4. **Double-axis scale helps.** v9 scales both axes (dim1280/core357M + 6.01B FineWeb tokens)
and beats v8 by another 0.095 val loss (~3.2%). The direction is validated, but the gain is
still incremental and greedy decoding still repeats.
5. **Moving validation tails must stop.** v10 added one more FineWeb shard and got moving-tail
val 2.8816, but appending data moves the held-out tail. A fixed eval v1 was created from the
shard010 tail: v6/v7/v8/v9/v10 = 3.2328 / 3.1850 / 3.1515 / 2.9278 / 2.8814. Future runs
should report this fixed eval first.
**Meta-finding:** every *single*-axis lever (data volume, corpus breadth, capacity) is now
worth only **~3%**. Per the Chinchilla lesson, further gains require scaling **data and
capacity together** — single-axis moves are exhausted.
**Meta-finding:** every lever is now in the **~3% or smaller** regime. Single-axis moves were
exhausted by v8; v9 confirms Chinchilla-style double-axis scale works; v10 shows a data-only
top-up mostly adapts to the new shard. The next useful run should change model/context, not just
append another shard.
## Efficiency — throughput & MFU
@@ -166,18 +179,18 @@ versions — a fixed-MFU estimate is off by up to ~100× for the early launch-bo
the line: flash == composed SDPA (grads/PyTorch), GQA group=1 bit-identical to MHA, gradient
accumulation `accum=1` bit-identical, dropout p=0 bit-identical *and* dropout × recompute
bit-exact, the default path unchanged on every feature, and the **xserv closed-loop md5
byte-identical (`b04fc9f9`) throughout both phases**.
- **The closed loop matters.** Exporting to xserv and checking token-identical greedy output
caught real bugs and proved the whole stack end-to-end.
byte-identical (`b04fc9f9`) throughout the deterministic gates**.
- **The closed loop matters.** Exporting to xserv and checking generated continuations caught
real bugs and proved the whole stack end-to-end.
## Running it
Everything trains on a remote 8× RTX 5090 box; model artifacts live in a registry
(`tiny-models/v0…v8`). Serve any trained version in xserv:
(`tiny-models/v0…v10`). Serve any trained version in xserv:
```bash
# on the GPU box
cargo run -p xserv-model --release --bin xserv-cli -- <registry>/v8-fineweb-edu-dim1024 --max-tokens 100
cargo run -p xserv-model --release --bin xserv-cli -- <registry>/v10-fineweb-edu-dim1280-gqa-data6765 --max-tokens 100
# then type a prompt, e.g. In science,
```
@@ -192,6 +205,6 @@ cargo test --workspace # autograd grad-checks, PyTorch parity, DDP, e
## Doc index
- [`docs/evolution.md`](docs/evolution.md) per-milestone changes across algorithm / architecture / infra / dataset.
- [`docs/runs/README.md`](docs/runs/README.md) the v0v8 comparison; [`docs/runs/0N-*.md`](docs/runs/) per-run detail.
- [`docs/runs/README.md`](docs/runs/README.md) the v0v10 comparison; [`docs/runs/0N-*.md`](docs/runs/) per-run detail.
- [`docs/00-*` … `17-*`](docs/) per-phase design docs (build chain tensor autograd transformer training perf distributed export batched allocator bf16 recompute flash-attention GQA grad-accum process-per-GPU dropout).
- [`docs/known-issues.md`](docs/known-issues.md) perf backlog (KI-1/2/3/5 fixed; process-per-GPU CLOSED = measured no-op; KI-4 = accepted modeling tradeoff).