xtrain/docs/runs/10-v10-fineweb-edu-dim1280-gqa-data6765.md

# Scaling Run v10: Data-axis follow-up — dim1280/18L true GQA + FineWeb-edu 6.765B token — Design Document

## Goal

v9 证明了双轴 scale（更大模型 + 更多新 token）有效：best val 从 v8 的 2.9801 降到 2.8854。
但 v9 的数据量只有 6.013B token，D/N 约 16.8，低于 Chinchilla 经验里的 20。v10 的目标很窄：

1. **只补数据轴**：补上 v9 中断的 FineWeb-edu shard010，把 cache 从 6.013B 推到 6.765B。
2. **架构不变**：完全复用 v9 dim1280 / 18L / 40q-10kv GQA / ffn4096。
3. **验证边际**：看 D/N 从 16.8 到 18.95 是否还能显著降低 val。

## Data

| 项 | 值 |
|----|----|
| 来源 | FineWeb-edu `sample/10BT`，shards 000-010 |
| token cache | `data/fineweb-edu.txt.u16.bin` |
| 总 token | **6,765,333,808** |
| held-out val | 末尾 **1,000,000** token |
| train corpus | 6,764,333,808 token |
| 训练消费 token | **6,764,298,240** = 103215 steps x effective batch 256 x seq 256 |
| epoch | ~1.00 |

Important caveat: xtrain 当前训练入口用“全 cache 的末尾 1M token”做 held-out。追加 shard010 后，v10 的 val tail
和 v9 的 val tail 不再是同一个切片。因此 v9 原报告的 2.8854 与 v10 原报告的 2.8816 不能被当作严格同一
验证集上的横比。

为了解决这个问题，本轮创建了固定 eval set：

```text
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
```

它包含 shard010 末尾 11M token；前 10M token 只是为了复用现有 `split_tail(val_tokens=1M)`，真正 eval 的是最后
1M token。该 fixed eval v1 对 v6-v9 都是未见数据；对 v10 也是训练时 held-out。

## Architecture

v10 与 v9 完全相同：

| 项 | 值 |
|----|----|
| dim | 1280 |
| layers | 18 |
| query heads x head_dim | 40 x 32 |
| kv heads | 10 (true GQA, group=4) |
| ffn | 4096 |
| core params | 356.89M |
| total params | 485.55M |
| export tensors | 201 |

## Training

| 项 | 值 |
|----|----|
| optimizer | hand-written AdamW, wd=0.1 |
| schedule | warmup -> cosine, max_lr 6e-4 -> min_lr 6e-5 |
| grad clip | global norm 1.0 |
| steps | **103215** |
| effective global batch | **256** (`--batch 128 --accum-steps 2`) |
| seq_len | 256 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | activation recompute + flash-attention + gradient accumulation |
| world size | 8 x RTX 5090 |
| wall clock | **23h51m** |
| steady throughput | **~79.0K tok/s** |
| peak observed memory | ~17GB / GPU |

Command:

```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 cargo run -p xtrain-distributed --release --bin train_ddp -- \
  /opt/wjh/models/gpt2/tokenizer.json data/fineweb-edu.txt \
  --heads 40 --head-dim 32 --kv-heads 10 --layers 18 --ffn 4096 \
  --steps 103215 --batch 128 --accum-steps 2 --seq 256 \
  --max-lr 6e-4 --min-lr 6e-5 --val-tokens 1000000 --eval-every 1000 \
  --eval-batches 64 --bf16 --recompute --flash --dropout 0.0 \
  --ckpt /dashscope-tmp/wjh/xtrain_v10.ckpt
```

## Results

- train loss: **11.1575 -> 2.9000**
- first val: step 999 = **5.3048**
- best val: step 103214 = **2.8816**
- final val: step 103214 = **2.8816**
- exit code: **0**

FineWeb moving-tail val milestones:

| step | 999 | 9999 | 19999 | 29999 | 39999 | 49999 | 59999 | 69999 | 79999 | 89999 | 99999 | final |
|------|-----|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| val | 5.3048 | 3.5622 | 3.3282 | 3.2450 | 3.1886 | 3.1342 | 3.0714 | 3.0202 | 2.9724 | 2.9236 | 2.8950 | **2.8816** |

The curve still improved at the final eval. There is no overfit signal in this run.

## Fixed Eval V1

Fixed eval v1 (`shard010 tail 1M`, seq256, 64 eval batches):

| version | fixed eval v1 |
|---------|---------------|
| v6 | 3.2328 |
| v7 | 3.1850 |
| v8 | 3.1515 |
| v9 | 2.9278 |
| **v10** | **2.8814** |

This is the cleanest cross-version result in the v10 round. It says:

- v9's double-axis gain transfers to a shard010 holdout: v8 3.1515 -> v9 2.9278.
- v10 further improves on the new shard010 distribution: v9 2.9278 -> v10 2.8814.
- The apparent v9 moving-tail 2.8854 -> v10 moving-tail 2.8816 delta is tiny and not strict apples-to-apples.

## Decoding

Greedy decoding still repeats. Fixed prompts from xtrain:

```text
[Once upon a time] there was a king who had a daughter. She was beautiful and beautiful...
[The little] The little boy was a little boy. The little boy was a little boy...
[One day] I was walking down the street and I saw a man with a dog...
```

Temperature 0.8 is more varied and less immediately looped, but coherence remains weak:

```text
[Once upon a time] I was a kid who did not go to the beach to swim...
[The little] ones are not as loud as the adults...
[One day] I was on the edge of the water, and I saw something I had never seen before...
```

xserv loads the exported v10 true-GQA weights and generates FineWeb-like explanatory prose, but repeated sentence frames remain:

```text
[The history of] the city of San Francisco is a story of the growth of the city...
[In science,] the term "observation" is used to describe the act of observing something...
[Water is] the most important element in the human body...
```

Conclusion: decoding remains a separate bottleneck. The current xtrain sampler only supports greedy and temperature sampling; top-p and
repetition penalty exist in xserv's chat path, but not in the raw xtrain sampler or `xserv-cli` path used for weight validation. A clean
next step is to add a raw generation tool with `temperature/top-p/repetition-penalty` so decoding experiments do not depend on chat
templates.

## xserv Validation

Registry path:

```text
/opt/wjh/projects/tiny-models/v10-fineweb-edu-dim1280-gqa-data6765
```

Files:

- `config.json`
- `model.safetensors` (BF16, 201 tensors, 927MB)
- `tokenizer.json`
- `xtrain.ckpt` (fp32 master checkpoint, 1.9GB)

xserv loads v10 as true GQA:

```text
Model: qwen3, layers=18, hidden=1280, heads=40/10 kv, vocab=50257
Loaded 201 tensors
Ready (KV cache, dtype=bf16).
```

## v11 Feasibility: Bigger Model + Longer Context

A v11 smoke test prioritized the user's chosen direction: larger model plus longer context.

Candidate:

| item | value |
|------|-------|
| dim / layers | 1536 / 20 |
| heads / kv_heads | 48 / 12 |
| ffn | 6144 |
| core / total params | 684.26M / 838.65M |
| stack | bf16 + recompute + flash + accum + 8 GPU DDP |

Smoke results:

| seq | batch / accum | effective batch | peak mem | tok/s | result |
|-----|---------------|-----------------|----------|-------|--------|
| 512 | 64 / 4 | 256 | **30530 MiB** | **44.7K** | 50 steps OK |
| 1024 | 32 / 8 | 256 | **30530 MiB** | **31.0K** | 20 steps OK |

Both fit, but the memory margin is thin on 32GB RTX 5090. Expected one-epoch wall clock on 6.76B tokens:

- seq512: roughly **42h**
- seq1024: roughly **61h**

Recommendation: make v11 a controlled run, not a blind launch. Use fixed eval v1, keep data fixed, and choose either:

1. **v11a practical**: dim1536/20L, seq512, batch64/accum4. Faster, still doubles context over v10.
2. **v11b long-context**: dim1536/20L, seq1024, batch32/accum8. More aligned with "long context", but ~2.5 days and tight memory.

For scientific clarity, v11 should not append more data before training; use the current 6.765B train cache while preserving fixed eval v1.