xtrain/docs/runs/09-v9-fineweb-edu-dim1280-gqa.md

# Scaling Run v9: Chinchilla 双轴 — dim1280/18L true GQA(core 356.9M) + FineWeb-edu 6.01B token + Phase-2 stack — Design Document

## Goal

v8 给出的元结论是：单独拨容量轴有用，但只有约 3% 的边际；单独重复旧数据也只有约 1.6% 的边际。要继续明显超过
v8，必须把 **模型容量 + 新 token** 一起放大，而不是只拨一根轴。

v9 就是这个双轴点：

1. **模型轴**：dim1024/core 226M -> **dim1280/core 356.9M**，同时启用真 GQA（40 query heads / 10 kv heads）。
2. **数据轴**：v6-v8 的 2.255B FineWeb 子集 -> **6.013B token**，追加了新 FineWeb-edu shards 003-009。
3. **系统栈**：使用 Phase-2 现代路径：`--flash + --accum-steps + bf16 + recompute + DDP`。dropout 设为 0，按标准预训练。

> v9 的 val 仍是 FineWeb-edu 分布，不能和 v0-v5 的 TinyStories val 直接比。注意：v9 扩展 cache 后默认
> tail-heldout 已经从 v6-v8 的旧 tail 移到新 shards 末尾；严格横比后续以 fixed eval v1 为准。

## Data

| 项 | 值 |
|----|----|
| 来源 | FineWeb-edu `sample/10BT`，原 shards 000-002 + 新 shards 003-009 |
| token cache | `data/fineweb-edu.txt.u16.bin` |
| 总 token | **6,013,639,492** |
| held-out val | 末尾 **1,000,000** token |
| train corpus | 6,012,639,492 token |
| 训练消费 token | **6,012,600,320** = 91745 steps x effective batch 256 x seq 256 |
| epoch | ~1.00 |

P3-DATA 目标本来是约 7B token；shard 010 下载 `curl rc=18` 中断，所以最终停在 6.01B。对 core 356.9M 来说，
D/N 约 **16.8 token/param**，低于理想 Chinchilla 20，但已经远高于 v8 的约 10.4，是一个干净的双轴 scale 点。

## Architecture

| 项 | v8 | **v9** |
|----|----|----|
| dim | 1024 | **1280** |
| layers | 18 | 18 |
| query heads x head_dim | 32 x 32 | **40 x 32** |
| kv heads | 32 (MHA) | **10 (true GQA, group=4)** |
| ffn | 2730 | **4096** |
| core params | 226.50M | **356.89M** |
| total params | 329.42M | **485.55M** |
| export tensors | 201 | **201** |

`config.json` writes real `num_key_value_heads = 10`, so xserv loads v9 as true GQA rather than MHA.

## Training

| 项 | 值 |
|----|----|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | **91745** |
| effective global batch | **256** (`--batch 128 --accum-steps 2`) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | **21h15m** |
| steady throughput | **~78.6K tok/s** |
| peak observed memory | ~17GB / GPU |

Command:

```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
  /opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
  --heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
  --steps 91745 --batch 128 --accum-steps 2 --seq 256 \
  --max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
  --eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
  --ckpt /dashscope-tmp/wjh/xtrain_v9.ckpt
```

## Results

- train loss: **11.1550 -> 2.9340**
- first val: step 1000 = **5.1517**
- best val: step 91000 = **2.8854**
- final val: step 91745 = **2.8873**
- exit code: **0**

FineWeb val curve milestones:

| step | 1000 | 10000 | 20000 | 30000 | 40000 | 50000 | 60000 | 70000 | 80000 | 90000 | 91000 | final |
|------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 5.1517 | 3.4820 | 3.2953 | 3.2026 | 3.1422 | 3.0844 | 3.0148 | 2.9616 | 2.9160 | 2.8915 | **2.8854** | 2.8873 |

The curve kept improving into the last 1K-step window, then the final eval bounced slightly from 2.8854 to 2.8873. This is close to
the floor for this run, but not a clear overfit failure.

## Comparison

| | v6 | v7 | v8 | **v9** |
|---|---|---|---|---|
| model | dim768/core127M | dim768/core127M | dim1024/core226M | **dim1280/core357M + GQA** |
| data | 2.29B | 3.28B same subset | 2.36B same subset | **6.01B expanded shards** |
| best val | 3.0652 | 3.0149 | 2.9801 | **2.8854** |

On the run-local moving tail, v9 beats v8 by **0.0947** val loss (~3.2% relative), essentially the same size as the
v6->v8 capacity gain but now on top of it. A later fixed eval v1 check still supports the same direction
(v8 3.1515 -> v9 2.9278 on shard010-tail holdout), while making the moving-tail caveat explicit. This confirms
the v8 prediction: **双轴 scale 有效**. It is still an incremental gain, not a qualitative jump.

## Samples

xserv greedy samples (`--max-tokens 60`) are more coherent than the v8 examples on some prompts, but repetition remains:

```text
[The history of] the United States is the story of the people, the places, and the events that have shaped the nation...
[In science,] the term "scientific method" is used to describe the process of gathering information and testing it...
[The most important] thing is to be aware of the symptoms and to seek medical attention...
[Water is] a natural resource that is essential for human life...
```

The model writes real explanatory English and the domain mix is FineWeb-like. Greedy decoding still falls into repeated clauses on
some prompts (`scientific method`, symptoms, and earlier fixed prompts), so the val gain is more visible in the metric than in a
dramatic sample-quality leap.

## xserv validation

Registry path:

```text
/opt/wjh/projects/tiny-models/v9-fineweb-edu-dim1280-gqa
```

Files:

- `config.json`
- `model.safetensors` (BF16, 201 tensors, 927MB)
- `tokenizer.json`
- `xtrain.ckpt` (fp32 master checkpoint, 1.9GB)
- `RUN.md`

xserv loads v9 as:

```text
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
```

Token-match check against xtrain greedy (`max-tokens 40`):

- `Once upon a time`: xtrain and xserv matched through the checked continuation.
- `One day`: diverged after "large, dark," (`very tall man` vs `metallic object`) from BF16 greedy tie sensitivity.
- `The little`: same repetitive pattern, with a short BF16 path divergence.

This is the same class of BF16-vs-f32 greedy drift seen in v8; the important integration result is that xserv successfully loads
true GQA (`kv_heads=10 < heads=40`) and generates from the exported weights.