# Scaling Run v12: 1B-class long-context base → chat-alpha-v2 (Design Document)

## Goal

v11 proved that a larger `dim1536/20L` model can train at `seq1024` on dash5, and it improved the fixed-eval-data-v1, long-context (`seq1024`) score to **2.7467**. It also exposed the current bottleneck: greedy generation still repeats, and broad real-data SFT on top of v11 regressed chat quality despite lower SFT validation loss.

v12 therefore separates the next phase into two gates:

1. **Base gate**: train a stronger English base model around 1B total params with `seq1024`, using the existing FineWeb-edu 6.765B-token cache and fixed eval data v1.
2. **Chat gate**: only after the base gate is healthy, run assistant-only English SFT and judge it with fixed prompt generation, not SFT loss alone.

Success means the model is serviceable enough for a small chat-alpha: stable base loss, lower fixed eval than v11, less repetitive fixed generation, and SFT that improves instruction behavior without destroying arithmetic/refusal/debug prompts.

## Baseline: What v11 Taught Us

| item | v11 |
|------|-----|
| arch | dim1536 / 20L / 48q-12kv GQA / ffn6144 |
| params | 684.26M core / 838.65M total |
| data | FineWeb-edu 6.765B tokens, 1 epoch |
| context | seq1024 |
| throughput | ~30.96K tok/s on 8 x RTX 5090 |
| fixed eval data v1, seq1024 | **2.7467** |
| issue | greedy repetition remains; direct real SFT regressed generation quality |

SFT result from v11:

| model | train result | generation result |
|-------|--------------|-------------------|
| `v11-chat-alpha-sft-v2-anchor` | synthetic assistant-only anchor | current best narrow chat-alpha |
| `v11-chat-alpha-real-sft-v1` | SFT val 1.4272 | bad hallucination, math failure |
| `v11-chat-alpha-real-mix-v1` | SFT val 2.0543 | better than direct real-SFT, still worse than anchor |

Conclusion: SFT data quality matters, but v11's base is still too weak for broad real SFT to become a general chat model.

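
The v11 parameter counts in the baseline table can be reproduced with a short back-of-the-envelope calculation. The sketch below is not taken from xtrain; `count_params` is a hypothetical helper, and it assumes no bias terms, a gated FFN with three weight matrices, two weight-only norms per layer plus a final norm, an untied embedding and lm_head, a head dimension of 32, and a 50,257-entry vocabulary. None of these are stated in this document; they are simply the assumptions under which the arithmetic lands on the reported figures.

```python
def count_params(dim, layers, kv_heads, ffn, head_dim=32, vocab=50257):
    """Return (core, embed_plus_head) parameter counts for a GQA decoder.

    Assumed layout (not confirmed by the design document): bias-free linear
    layers, a three-matrix gated FFN, weight-only norms, untied embedding
    and lm_head, and a 50,257-entry vocabulary.
    """
    kv_dim = kv_heads * head_dim
    attn = 2 * dim * dim + 2 * dim * kv_dim  # Q and O projections, then K and V
    mlp = 3 * dim * ffn                      # gate, up, and down projections
    norms = 2 * dim                          # one norm before attention, one before the FFN
    core = layers * (attn + mlp + norms) + dim  # trailing `dim` is the final norm
    embed_plus_head = 2 * vocab * dim
    return core, embed_plus_head


# v11 shape: dim1536 / 20L / 48q-12kv GQA / ffn6144
core, emb = count_params(dim=1536, layers=20, kv_heads=12, ffn=6144)
print(f"core={core / 1e6:.2f}M total={(core + emb) / 1e6:.2f}M")
# prints: core=684.26M total=838.65M
```

Under the same assumptions, calling `count_params(dim=1664, layers=22, kv_heads=13, ffn=6656)` gives about 883.35M core and 1.0506B total, which agrees with the v12 architecture table.
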
## Architecture

v12 target: slightly above 1B total params while staying close to the proven v11 shape and keeping GQA group size 4.

| item | value |
|------|-------|
| dim | **1664** |
| layers | **22** |
| query heads x head_dim | **52 x 32** |
| kv heads | **13** |
| GQA group | 4 |
| ffn | **6656** |
| core params | **883.4M** |
| embed + lm_head | **167.3M** |
| total params | **1.0506B** |

Why this shape:

- It is a controlled step from v11 rather than a new architecture family.
- `52/13` preserves true GQA with group 4.
- Total params are near the requested 1B target.
- `dim1664` is less aggressive than `dim1792/22L` and has a better chance to fit `seq1024` on 32GB 5090s.

## Data

Base pretraining stays English-oriented and uses the current token cache. Pass the `.txt` stem to xtrain; `Corpus::load_cached` appends `.u16.bin` internally.

```text
/opt/wjh/projects/xtrain/data/fineweb-edu.txt
cache = /opt/wjh/projects/xtrain/data/fineweb-edu.txt.u16.bin
tokens = 6,765,333,808
```

Training uses the last 1M tokens as moving-tail validation. Every cross-version v12 claim must also run fixed eval data v1 with the long-context `seq1024` setting, matching the v11 `eval_v11_seq1024.log` score of **2.7467**. This is distinct from the older v10 table that used the same fixed eval data with `seq256`.

```text
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt
cache = /dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
```

No new FineWeb shards are added in this phase. The experiment is model/context scale, not another data-axis change.

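
As a cross-check on how the token cache translates into a step budget, the sketch below combines the token count and 1M-token validation tail from this section with the batch settings and smoke throughput listed in the Training Plan. `step_budget` is a hypothetical helper, not part of xtrain, and it assumes a single epoch over the non-validation tokens with any partial final step dropped.

```python
TOTAL_TOKENS = 6_765_333_808  # FineWeb-edu token cache size
VAL_TAIL = 1_000_000          # last 1M tokens held out as moving-tail validation
SEQ = 1024                    # context length
MICRO_GLOBAL_BATCH = 16       # sequences per micro step across all 8 ranks
ACCUM = 15                    # gradient accumulation steps
SMOKE_TOK_PER_S = 24_500      # ~24.5K tok/s from the smoke run


def step_budget(total_tokens, val_tail, seq, micro_batch, accum):
    """Return (tokens_per_step, full_steps) for one epoch.

    Assumption: the trainer consumes only the non-validation tokens and
    drops the partial step at the end of the epoch.
    """
    tokens_per_step = seq * micro_batch * accum
    full_steps = (total_tokens - val_tail) // tokens_per_step
    return tokens_per_step, full_steps


tokens_per_step, full_steps = step_budget(
    TOTAL_TOKENS, VAL_TAIL, SEQ, MICRO_GLOBAL_BATCH, ACCUM
)
# Pure training time only; eval passes and checkpoint writes are not included.
hours = tokens_per_step * full_steps / SMOKE_TOK_PER_S / 3600

print(tokens_per_step, full_steps, round(hours, 1))
# prints: 245760 27524 76.7
```

The estimate excludes evaluation and checkpointing time, so it should be read as a lower bound on wall clock rather than a prediction.
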
## Training Plan

Primary v12 run:

| item | value |
|------|-------|
| world | 8 x RTX 5090 on dash5 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | recompute + flash + grad accumulation |
| seq | **1024** |
| micro global batch | **16** (2 sequences/rank) |
| accum | **15** |
| effective global batch | **240** |
| tokens/step | **245,760** |
| full steps | **27,524** |
| max_lr → min_lr | **4e-4 → 4e-5** |
| eval | moving-tail 1M every 500 steps; fixed eval data v1 at seq1024 after checkpoints |
| smoke throughput | **~24.5K tok/s** |
| estimated full wall clock | **~76-78h** |

The reduced micro-batch is intentional: v11 `seq1024` with global batch 32 already sat near the 5090 memory limit. v12 has larger weights; an initial `batch24/accum10` smoke OOMed after step 0, while `batch16/accum15` passed a 10-step smoke at ~29.4GB/GPU and preserved the same 245,760 tokens/step.

Command wrapper:

```sh
scripts/run_v12_phase.sh start-pilot
scripts/run_v12_phase.sh start-full
scripts/run_v12_phase.sh status
scripts/run_v12_phase.sh eval-fixed
scripts/run_v12_phase.sh export
scripts/run_v12_phase.sh sample
```

## Gates

### Gate 0: build and smoke

Run:

```sh
scripts/run_v12_phase.sh smoke
```

Pass criteria:

- no CUDA OOM
- no NaN loss
- first 30 steps decrease from initialization
- peak memory leaves enough margin for eval

### Gate 1: pilot

Run:

```sh
scripts/run_v12_phase.sh start-pilot
```

Default pilot is 300 steps with held-out eval every 100 steps.


Pass criteria:

- train loss decreases smoothly
- grad norm does not spike persistently
- moving-tail eval is finite and improving
- checkpoint can be reloaded by `eval-fixed`

### Gate 2: full base

Run only after the pilot passes:

```sh
scripts/run_v12_phase.sh start-full
```

Pass criteria:

- fixed eval data v1 at `seq1024` beats v11's **2.7467**
- generation samples improve or at least do not regress on repetition
- checkpoint exports and xserv loads the true GQA config

### Gate 3: chat-alpha SFT

After a healthy v12 base:

1. Use assistant-only SFT (`--sft-tsv`) with English-only data.
2. Start from narrow anchors first, then mix in Smol-SmolTalk.
3. Judge with fixed generation prompts before calling it useful.

The primary high-quality source remains `HuggingFaceTB/smol-smoltalk` filtered to English single-turn examples, with local anchors preserved to keep deterministic behavior.

## Evaluation

Base metrics:

- moving-tail val during training
- fixed eval data v1 at `seq1024`
- xtrain fixed prompt samples from `scripts/chat_alpha_fixed_prompts.txt`
- xserv exported-model smoke

Chat metrics:

- fixed prompt answers for SFT explanation, SFT data provenance, arithmetic, refusal, repetition-debug checklist, summary, and simple code generation
- compare against `v11-chat-alpha-sft-v2-anchor`
- reject models that lower SFT validation loss but hallucinate more in fixed prompts

## Artifacts

Expected paths:

```text
/dashscope-tmp/wjh/xtrain_v12/
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12_pilot.ckpt
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
```

## Results

### Gate 0/1: smoke + pilot

- `batch24/accum10` smoke OOMed after step 0.
- `batch16/accum15` smoke passed 10 steps: train loss **11.2347 -> 7.9459**, ~24.5K tok/s, ~29.4GB/GPU.
- 300-step pilot passed: train loss **11.2296 -> 5.4832**, val **6.5810 -> 5.9642 -> 5.5888**, exit code 0.
- Pilot checkpoint reload matched final val: fixed eval data v1 at seq1024 = **5.5891**.
- Fixed chat prompts still repeat heavily, as expected for a 300-step base; use them as a regression baseline, not as chat quality.

### Gate 2: full base

Full run completed on dash5:

| item | result |
|------|--------|
| wall clock | **81h01m** |
| throughput | **~24.55K tok/s** |
| train loss | **11.2294 -> 2.6696** |
| moving-tail best val | **2.7411** |
| moving-tail final val | **2.7412** |
| fixed eval data v1, seq1024 reload | **2.7410** |
| exit code | **0** |

Validation milestones:

| step | 499 | 999 | 1499 | 1999 | 2499 | 21999 | 23999 | 25999 | 26999 | 27499 | final |
|------|-----|-----|------|------|------|-------|-------|-------|-------|-------|-------|
| val | 5.3029 | 4.4079 | 3.9287 | 3.6964 | 3.5555 | 2.7805 | 2.7637 | 2.7468 | 2.7443 | **2.7411** | 2.7412 |

Compared with v11's fixed eval data v1 at seq1024 (**2.7467**), v12 reaches **2.7410** after reload. This is a real but very small gain (~0.006 absolute), despite the parameter increase from 838.65M to 1.0506B total and the slower 24.55K tok/s throughput. The result says the larger 1B-class base is viable and marginally better, but this scale step did not produce a qualitative base-model jump.

Generation:

- Raw FineWeb-style prompts are better than the pilot checkpoint and can produce plausible explanatory prose.
- Greedy repetition remains visible, especially on story-like prompts.
- Chat prompts are not reliable without SFT: SFT data provenance is hallucinated, arithmetic still fails, and the model repeats template-like text.
- xserv loads the export correctly as true GQA: `layers=22, hidden=1664, heads=52/13 kv`.

Exported model:

```text
/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
```

Files:

- `config.json`
- `model.safetensors` (2.0GB)
- `tokenizer.json`
- `xtrain.ckpt` (4.0GB)

Conclusion: v12 passes the base gate and is a better SFT starting point than v11 by metric, but the gain is narrow.
The next step should be assistant-only chat SFT from v12 with conservative anchors first, then a small Smol-SmolTalk mix. Do not expect the base checkpoint itself to serve as a usable chat model.
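
To put the Gate 2 fixed-eval comparison in perspective, the loss difference can be converted into a relative perplexity change. The sketch below assumes that both reported scores are mean per-token cross-entropy in nats, which this document does not state explicitly; `relative_perplexity_drop` is a hypothetical helper written for illustration.

```python
import math

V11_FIXED_EVAL = 2.7467  # v11, fixed eval data v1, seq1024
V12_FIXED_EVAL = 2.7410  # v12, fixed eval data v1, seq1024 reload


def relative_perplexity_drop(loss_old, loss_new):
    """Fractional perplexity reduction implied by a change in mean loss.

    Assumes both losses are mean per-token cross-entropy in nats, so that
    perplexity = exp(loss) and the ratio is exp(loss_new - loss_old).
    """
    return 1.0 - math.exp(loss_new - loss_old)


drop = relative_perplexity_drop(V11_FIXED_EVAL, V12_FIXED_EVAL)
print(f"loss delta = {V11_FIXED_EVAL - V12_FIXED_EVAL:.4f}")
print(f"perplexity reduction = {drop:.2%}")
```

Under that assumption the v12 gain corresponds to roughly a 0.6% perplexity reduction, which is consistent with reading the result as marginal rather than as a qualitative jump.
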