- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens, 81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2% return confirms the 1B base is now data-limited. - run 13: three SFT stages from the v12 base (synthetic / anchor / real-mix-repair). The pipeline works and produces a chat-shaped model that follows the format and stops, but none of the variants is a stable high-quality chat model — bottleneck is SFT data quality + selection signal (val loss decouples from generation quality), not infra. - scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
9.0 KiB
Scaling Run v12: 1B-class long-context base → chat-alpha-v2 — Design Document
Goal
v11 proved that a larger dim1536/20L model can train at seq1024 on dash5, and it improved the fixed-eval-data-v1, long-context (seq1024) score to 2.7467. It also proved the current bottleneck: greedy generation still repeats, and broad real-data SFT on top of v11 regressed chat quality despite lower SFT validation loss.
v12 therefore separates the next phase into two gates:
- Base gate: train a stronger English base model around 1B total params with
seq1024, using the existing FineWeb-edu 6.765B-token cache and fixed eval data v1. - Chat gate: only after the base gate is healthy, run assistant-only English SFT and judge it with fixed prompt generation, not SFT loss alone.
Success means the model is serviceable enough for a small chat-alpha: stable base loss, lower fixed eval than v11, less repetitive fixed generation, and SFT that improves instruction behavior without destroying arithmetic/refusal/debug prompts.
Baseline: What v11 Taught Us
| item | v11 |
|---|---|
| arch | dim1536 / 20L / 48q-12kv GQA / ffn6144 |
| params | 684.26M core / 838.65M total |
| data | FineWeb-edu 6.765B token, 1 epoch |
| context | seq1024 |
| throughput | ~30.96K tok/s on 8 x RTX 5090 |
| fixed eval data v1, seq1024 | 2.7467 |
| issue | greedy repetition remains; direct real SFT regressed generation quality |
SFT result from v11:
| model | train result | generation result |
|---|---|---|
v11-chat-alpha-sft-v2-anchor |
synthetic assistant-only anchor | current best narrow chat-alpha |
v11-chat-alpha-real-sft-v1 |
SFT val 1.4272 | bad hallucination, math failure |
v11-chat-alpha-real-mix-v1 |
SFT val 2.0543 | better than direct real-SFT, still worse than anchor |
Conclusion: SFT data quality matters, but v11's base is still too weak for broad real SFT to become a general chat model.
Architecture
v12 target: slightly above 1B total params while staying close to the proven v11 shape and keeping GQA group size 4.
| item | value |
|---|---|
| dim | 1664 |
| layers | 22 |
| query heads x head_dim | 52 x 32 |
| kv heads | 13 |
| GQA group | 4 |
| ffn | 6656 |
| core params | 883.4M |
| embed + lm_head | 167.3M |
| total params | 1.0506B |
Why this shape:
- It is a controlled step from v11 rather than a new architecture family.
52/13preserves true GQA with group 4.- Total params are near the requested 1B target.
dim1664is less aggressive thandim1792/22Land has a better chance to fitseq1024on 32GB 5090s.
Data
Base pretraining stays English-oriented and uses the current token cache. Pass the .txt stem to xtrain; Corpus::load_cached appends .u16.bin internally.
/opt/wjh/projects/xtrain/data/fineweb-edu.txt
cache = /opt/wjh/projects/xtrain/data/fineweb-edu.txt.u16.bin
tokens = 6,765,333,808
Training uses the last 1M tokens as moving-tail validation. Every cross-version v12 claim must also run fixed eval data v1 with the long-context seq1024 setting, matching the v11 eval_v11_seq1024.log score of 2.7467. This is distinct from the older v10 table that used the same fixed eval data with seq256.
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt
cache = /dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
No new FineWeb shards are added in this phase. The experiment is model/context scale, not another data-axis change.
Training Plan
Primary v12 run:
| item | value |
|---|---|
| world | 8 x RTX 5090 on dash5 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | recompute + flash + grad accumulation |
| seq | 1024 |
| micro global batch | 16 (2 sequences/rank) |
| accum | 15 |
| effective global batch | 240 |
| tokens/step | 245,760 |
| full steps | 27,524 |
| max_lr → min_lr | 4e-4 → 4e-5 |
| eval | moving-tail 1M every 500 steps; fixed eval data v1 at seq1024 after checkpoints |
| smoke throughput | ~24.5K tok/s |
| estimated full wall clock | ~76-78h |
The reduced micro-batch is intentional: v11 seq1024 with global batch 32 already sat near the 5090 memory limit. v12 has larger weights; an initial batch24/accum10 smoke OOMed after step 0, while batch16/accum15 passed a 10-step smoke at ~29.4GB/GPU and preserved the same 245,760 tokens/step.
Command wrapper:
scripts/run_v12_phase.sh start-pilot
scripts/run_v12_phase.sh start-full
scripts/run_v12_phase.sh status
scripts/run_v12_phase.sh eval-fixed
scripts/run_v12_phase.sh export
scripts/run_v12_phase.sh sample
Gates
Gate 0: build and smoke
Run:
scripts/run_v12_phase.sh smoke
Pass criteria:
- no CUDA OOM
- no NaN loss
- first 30 steps decrease from initialization
- peak memory leaves enough margin for eval
Gate 1: pilot
Run:
scripts/run_v12_phase.sh start-pilot
Default pilot is 300 steps with held-out eval every 100 steps.
Pass criteria:
- train loss decreases smoothly
- grad norm does not spike persistently
- moving-tail eval is finite and improving
- checkpoint can be reloaded by
eval-fixed
Gate 2: full base
Run only after the pilot passes:
scripts/run_v12_phase.sh start-full
Pass criteria:
- fixed eval data v1 at
seq1024beats v11's 2.7467 - generation samples improve or at least do not regress on repetition
- checkpoint exports and xserv loads the true GQA config
Gate 3: chat-alpha SFT
After a healthy v12 base:
- Use assistant-only SFT (
--sft-tsv) with English-only data. - Start from narrow anchors first, then mix in Smol-SmolTalk.
- Judge with fixed generation prompts before calling it useful.
The primary high-quality source remains HuggingFaceTB/smol-smoltalk filtered to English single-turn examples, with local anchors preserved to keep deterministic behavior.
Evaluation
Base metrics:
- moving-tail val during training
- fixed eval data v1 at
seq1024 - xtrain fixed prompt samples from
scripts/chat_alpha_fixed_prompts.txt - xserv exported-model smoke
Chat metrics:
- fixed prompt answers for SFT explanation, SFT data provenance, arithmetic, refusal, repetition-debug checklist, summary, and simple code generation
- compare against
v11-chat-alpha-sft-v2-anchor - reject models that lower SFT validation loss but hallucinate more in fixed prompts
Artifacts
Expected paths:
/dashscope-tmp/wjh/xtrain_v12/
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12_pilot.ckpt
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
Results
Gate 0/1: smoke + pilot
batch24/accum10smoke OOMed after step 0.batch16/accum15smoke passed 10 steps: train loss 11.2347 -> 7.9459, ~24.5K tok/s, ~29.4GB/GPU.- 300-step pilot passed: train loss 11.2296 -> 5.4832, val 6.5810 -> 5.9642 -> 5.5888, exit code 0.
- Pilot checkpoint reload matched final val: fixed eval data v1 at seq1024 = 5.5891.
- Fixed chat prompts still repeat heavily, as expected for a 300-step base; use them as a regression baseline, not as chat quality.
Gate 2: full base
Full run completed on dash5:
| item | result |
|---|---|
| wall clock | 81h01m |
| throughput | ~24.55K tok/s |
| train loss | 11.2294 -> 2.6696 |
| moving-tail best val | 2.7411 |
| moving-tail final val | 2.7412 |
| fixed eval data v1, seq1024 reload | 2.7410 |
| exit code | 0 |
Validation milestones:
| step | 499 | 999 | 1499 | 1999 | 2499 | 21999 | 23999 | 25999 | 26999 | 27499 | final |
|---|---|---|---|---|---|---|---|---|---|---|---|
| val | 5.3029 | 4.4079 | 3.9287 | 3.6964 | 3.5555 | 2.7805 | 2.7637 | 2.7468 | 2.7443 | 2.7411 | 2.7412 |
Compared with v11's fixed eval data v1 at seq1024 (2.7467), v12 reaches 2.7410 after reload. This is a real but very small gain (~0.006 absolute), despite the parameter increase from 838.65M to 1.0506B total and the slower 24.55K tok/s throughput. The result says the larger 1B-class base is viable and marginally better, but this scale step did not produce a qualitative base-model jump.
Generation:
- Raw FineWeb-style prompts are better than the pilot checkpoint and can produce plausible explanatory prose.
- Greedy repetition remains visible, especially on story-like prompts.
- Chat prompts are not reliable without SFT: SFT data provenance is hallucinated, arithmetic still fails, and the model repeats template-like text.
- xserv loads the export correctly as true GQA:
layers=22, hidden=1664, heads=52/13 kv.
Exported model:
/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
Files:
config.jsonmodel.safetensors(2.0GB)tokenizer.jsonxtrain.ckpt(4.0GB)
Conclusion: v12 passes the base gate and is a better SFT starting point than v11 by metric, but the gain is narrow. The next step should be assistant-only chat SFT from v12 with conservative anchors first, then a small Smol-SmolTalk mix. Do not expect the base checkpoint itself to serve as a usable chat model.