docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check

- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb-edu tokens, 81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467: a real but marginal gain. v11->v12 is a capacity-only step on fixed data, so the ~0.2% return suggests the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor / real-mix-repair). The pipeline works and produces a chat-shaped model that follows the format and stops, but none of the variants is a stable high-quality chat model. The bottleneck is SFT data quality + selection signal (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:19:12 +08:00
parent fbf4ac2917
commit 7a1fba95b5
4 changed files with 770 additions and 0 deletions
--- a/docs/runs/12-v12-1b-longctx-chat-alpha.md
+++ b/docs/runs/12-v12-1b-longctx-chat-alpha.md
@@ -0,0 +1,251 @@
+# Scaling Run v12: 1B-class long-context base → chat-alpha-v2 — Design Document
+
+## Goal
+
+v11 showed that a larger `dim1536/20L` model can train at `seq1024` on dash5, improving the long-context (`seq1024`) score on fixed eval data v1 to **2.7467**. It also exposed the current bottlenecks: greedy generation still repeats, and broad real-data SFT on top of v11 regressed chat quality despite lower SFT validation loss.
+
+v12 therefore separates the next phase into two gates:
+
+1. **Base gate**: train a stronger English base model around 1B total params with `seq1024`, using the existing FineWeb-edu 6.765B-token cache and fixed eval data v1.
+2. **Chat gate**: only after the base gate is healthy, run assistant-only English SFT and judge it with fixed prompt generation, not SFT loss alone.
+
+Success means the model is serviceable enough for a small chat-alpha: stable base loss, lower fixed eval than v11, less repetitive fixed generation, and SFT that improves instruction behavior without destroying arithmetic/refusal/debug prompts.
+
+## Baseline: What v11 Taught Us
+
+| item | v11 |
+|------|-----|
+| arch | dim1536 / 20L / 48q-12kv GQA / ffn6144 |
+| params | 684.26M core / 838.65M total |
+| data | FineWeb-edu 6.765B tokens, 1 epoch |
+| context | seq1024 |
+| throughput | ~30.96K tok/s on 8 x RTX 5090 |
+| fixed eval data v1, seq1024 | **2.7467** |
+| issue | greedy repetition remains; direct real SFT regressed generation quality |
+
+SFT results from v11:
+
+| model | train result | generation result |
+|-------|--------------|-------------------|
+| `v11-chat-alpha-sft-v2-anchor` | synthetic assistant-only anchor | current best narrow chat-alpha |
+| `v11-chat-alpha-real-sft-v1` | SFT val 1.4272 | bad hallucination, math failure |
+| `v11-chat-alpha-real-mix-v1` | SFT val 2.0543 | better than direct real-SFT, still worse than anchor |
+
+Conclusion: SFT data quality matters, but v11's base is still too weak for broad real SFT to become a general chat model.
+
+## Architecture
+
+v12 target: slightly above 1B total params while staying close to the proven v11 shape and keeping GQA group size 4.
+
+| item | value |
+|------|-------|
+| dim | **1664** |
+| layers | **22** |
+| query heads x head_dim | **52 x 32** |
+| kv heads | **13** |
+| GQA group | 4 |
+| ffn | **6656** |
+| core params | **883.4M** |
+| embed + lm_head | **167.3M** |
+| total params | **1.0506B** |
+
+Why this shape:
+
+- It is a controlled step from v11 rather than a new architecture family.
+- `52/13` preserves true GQA with group 4.
+- Total params are near the requested 1B target.
+- `dim1664` is less aggressive than `dim1792/22L` and has a better chance to fit `seq1024` on 32GB 5090s.
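The parameter figures in the table can be reproduced with a rough count. This is a sketch, not xtrain's own accounting: it assumes bias-free projections, a SwiGLU-style FFN with three matrices, two norm weight vectors per layer plus one final norm, untied embedding and lm_head, and a 50,257-entry vocabulary. None of these is stated in this document; they are inferred because they reproduce both the v11 and v12 totals.

```python
# Rough parameter count for a decoder with grouped-query attention.
# ASSUMPTIONS (inferred, not stated in the run doc): no biases, SwiGLU FFN
# (gate/up/down), 2 norm vectors per layer + 1 final norm, untied
# embedding + lm_head, vocab = 50,257.

def count_params(dim, layers, q_heads, kv_heads, head_dim, ffn, vocab=50_257):
    q_dim = q_heads * head_dim            # total query width
    kv_dim = kv_heads * head_dim          # total key/value width (smaller under GQA)
    attn = dim * q_dim + 2 * dim * kv_dim + q_dim * dim   # Wq, Wk, Wv, Wo
    mlp = 3 * dim * ffn                   # gate, up, down projections
    norms = 2 * dim                       # pre-attention and pre-FFN norm weights
    core = layers * (attn + mlp + norms) + dim            # + final norm
    embed = 2 * vocab * dim               # input embedding + lm_head
    return core, embed, core + embed

v12 = count_params(dim=1664, layers=22, q_heads=52, kv_heads=13, head_dim=32, ffn=6656)
v11 = count_params(dim=1536, layers=20, q_heads=48, kv_heads=12, head_dim=32, ffn=6144)

for name, (core, embed, total) in (("v11", v11), ("v12", v12)):
    print(f"{name}: core={core / 1e6:.2f}M embed+lm_head={embed / 1e6:.2f}M total={total / 1e9:.4f}B")
```

Under these assumptions v12 comes out at 883.35M core and 1.0506B total, and v11 at 684.26M core and 838.65M total, in line with both tables.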
+
+## Data
+
+Base pretraining stays English-oriented and uses the current token cache. Pass the `.txt` stem to xtrain; `Corpus::load_cached` appends `.u16.bin` internally.
+
+```text
+/opt/wjh/projects/xtrain/data/fineweb-edu.txt
+cache = /opt/wjh/projects/xtrain/data/fineweb-edu.txt.u16.bin
+tokens = 6,765,333,808
+```
+
+Training uses the last 1M tokens as moving-tail validation. Every cross-version v12 claim must also run fixed eval data v1 with the long-context `seq1024` setting, the same setting that produced the v11 score of **2.7467** in `eval_v11_seq1024.log`. This is distinct from the older v10 table, which used the same fixed eval data with `seq256`.
+
+```text
+/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt
+cache = /dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
+```
+
+No new FineWeb shards are added in this phase. The experiment varies model scale at the v11 context length (`seq1024`); it is not another data-axis change.
+
+## Training Plan
+
+Primary v12 run:
+
+| item | value |
+|------|-------|
+| world | 8 x RTX 5090 on dash5 |
+| precision | bf16 mixed precision, fp32 master |
+| memory stack | recompute + flash + grad accumulation |
+| seq | **1024** |
+| micro global batch | **16** (2 sequences/rank) |
+| accum | **15** |
+| effective global batch | **240** |
+| tokens/step | **245,760** |
+| full steps | **27,524** |
+| max_lr → min_lr | **4e-4 → 4e-5** |
+| eval | moving-tail 1M every 500 steps; fixed eval data v1 at seq1024 after checkpoints |
+| smoke throughput | **~24.5K tok/s** |
+| estimated full wall clock | **~76-78h** |
+
+The reduced micro-batch is intentional: v11 `seq1024` with global batch 32 already sat near the 5090 memory limit. v12 has larger weights; an initial `batch24/accum10` smoke OOMed after step 0, while `batch16/accum15` passed a 10-step smoke at ~29.4GB/GPU and preserved the same 245,760 tokens/step.
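The batch arithmetic in the table can be checked in a few lines. A minimal sketch, assuming the step count is the floor of (total tokens minus the 1M validation tail) divided by tokens per step; that rule is an inference that happens to reproduce the 27,524 figure, not something read from xtrain. The wall-clock estimate ignores eval and checkpoint overhead.

```python
# Batch / step / wall-clock arithmetic for the v12 training plan.
# ASSUMPTION: full steps = floor((total tokens - 1M val tail) / tokens per step).

SEQ = 1024
TOTAL_TOKENS = 6_765_333_808
VAL_TAIL = 1_000_000

def plan(micro_global_batch, accum, tok_per_s):
    effective_batch = micro_global_batch * accum      # sequences per optimizer step
    tokens_per_step = effective_batch * SEQ
    steps = (TOTAL_TOKENS - VAL_TAIL) // tokens_per_step
    hours = steps * tokens_per_step / tok_per_s / 3600  # pure training time
    return effective_batch, tokens_per_step, steps, hours

chosen = plan(micro_global_batch=16, accum=15, tok_per_s=24_500)
oomed = plan(micro_global_batch=24, accum=10, tok_per_s=24_500)

print("batch16/accum15:", chosen)
print("batch24/accum10:", oomed)
```

Both settings give 240 sequences and 245,760 tokens per step, so switching to `batch16/accum15` changes only peak memory, not the optimization schedule. At the smoke throughput the sketch lands inside the ~76-78h estimate.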
+
+Command wrapper:
+
+```sh
+scripts/run_v12_phase.sh smoke
+scripts/run_v12_phase.sh start-pilot
+scripts/run_v12_phase.sh start-full
+scripts/run_v12_phase.sh status
+scripts/run_v12_phase.sh eval-fixed
+scripts/run_v12_phase.sh export
+scripts/run_v12_phase.sh sample
+```
+
+## Gates
+
+### Gate 0: build and smoke
+
+Run:
+
+```sh
+scripts/run_v12_phase.sh smoke
+```
+
+Pass criteria:
+
+- no CUDA OOM
+- no NaN loss
+- train loss decreases from its initialization value over the first 30 steps
+- peak memory leaves enough margin for eval
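The loss and memory criteria above can be turned into a mechanical check. This is a hypothetical helper, not part of `scripts/run_v12_phase.sh`: the function name, the 32GB card size default, and the 2GB minimum margin are illustrative assumptions. OOM itself is assumed to be detected from the process exit status, not here.

```python
import math

# Hypothetical Gate 0 check over values parsed from a smoke log.
# ASSUMPTIONS: 32 GB per GPU and a 2 GB minimum eval margin are illustrative.

def smoke_gate(losses, peak_mem_gb, *, gpu_mem_gb=32.0, min_margin_gb=2.0):
    """Return one pass/fail flag per loss or memory criterion."""
    finite = all(math.isfinite(x) for x in losses)            # no NaN/inf loss
    decreasing = len(losses) >= 2 and losses[-1] < losses[0]   # loss fell from init
    margin_ok = (gpu_mem_gb - peak_mem_gb) >= min_margin_gb    # headroom for eval
    return {"finite": finite, "decreasing": decreasing, "memory_margin": margin_ok}

# First and last train loss and peak memory from the batch16/accum15 smoke.
result = smoke_gate([11.2347, 7.9459], peak_mem_gb=29.4)
print(result, "PASS" if all(result.values()) else "FAIL")
```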
+
+### Gate 1: pilot
+
+Run:
+
+```sh
+scripts/run_v12_phase.sh start-pilot
+```
+
+Default pilot is 300 steps with held-out eval every 100 steps.
+
+Pass criteria:
+
+- train loss decreases smoothly
+- grad norm does not spike persistently
+- moving-tail eval is finite and improving
+- checkpoint can be reloaded by `eval-fixed`
+
+### Gate 2: full base
+
+Run only after the pilot passes:
+
+```sh
+scripts/run_v12_phase.sh start-full
+```
+
+Pass criteria:
+
+- fixed eval data v1 at `seq1024` beats v11's **2.7467**
+- generation samples improve or at least do not regress on repetition
+- checkpoint exports and xserv loads the true GQA config
+
+### Gate 3: chat-alpha SFT
+
+After a healthy v12 base:
+
+1. Use assistant-only SFT (`--sft-tsv`) with English-only data.
+2. Start from narrow anchors first, then mix in Smol-SmolTalk.
+3. Judge with fixed generation prompts before calling it useful.
+
+The primary high-quality source remains `HuggingFaceTB/smol-smoltalk` filtered to English single-turn examples, with local anchors preserved to keep deterministic behavior.
+
+## Evaluation
+
+Base metrics:
+
+- moving-tail val during training
+- fixed eval data v1 at `seq1024`
+- xtrain fixed prompt samples from `scripts/chat_alpha_fixed_prompts.txt`
+- xserv exported-model smoke
+
+Chat metrics:
+
+- fixed prompt answers for SFT explanation, SFT data provenance, arithmetic, refusal, repetition-debug checklist, summary, and simple code generation
+- compare against `v11-chat-alpha-sft-v2-anchor`
+- reject models that lower SFT validation loss but hallucinate more in fixed prompts
+
+## Artifacts
+
+Expected paths:
+
+```text
+/dashscope-tmp/wjh/xtrain_v12/
+/dashscope-tmp/wjh/xtrain_v12/xtrain_v12_pilot.ckpt
+/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
+/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
+```
+
+## Results
+
+### Gate 0/1: smoke + pilot
+
+- `batch24/accum10` smoke OOMed after step 0.
+- `batch16/accum15` smoke passed 10 steps: train loss **11.2347 -> 7.9459**, ~24.5K tok/s, ~29.4GB/GPU.
+- 300-step pilot passed: train loss **11.2296 -> 5.4832**, val **6.5810 -> 5.9642 -> 5.5888**, exit code 0.
+- Pilot checkpoint reload was consistent with the final pilot val (**5.5888**): fixed eval data v1 at seq1024 = **5.5891**.
+- Fixed chat prompts still repeat heavily, as expected for a 300-step base; use them as a regression baseline, not as a measure of chat quality.
+
+### Gate 2: full base
+
+Full run completed on dash5:
+
+| item | result |
+|------|--------|
+| wall clock | **81h01m** |
+| throughput | **~24.55K tok/s** |
+| train loss | **11.2294 -> 2.6696** |
+| moving-tail best val | **2.7411** |
+| moving-tail final val | **2.7412** |
+| fixed eval data v1, seq1024 reload | **2.7410** |
+| exit code | **0** |
+
+Validation milestones:
+
+| step | 499 | 999 | 1499 | 1999 | 2499 | 21999 | 23999 | 25999 | 26999 | 27499 | final |
+|------|-----|-----|------|------|------|-------|-------|-------|-------|-------|-------|
+| val | 5.3029 | 4.4079 | 3.9287 | 3.6964 | 3.5555 | 2.7805 | 2.7637 | 2.7468 | 2.7443 | **2.7411** | 2.7412 |
+
+Compared with v11's fixed eval data v1 at seq1024 (**2.7467**), v12 reaches **2.7410** after reload. This is a real but very small gain
+(~0.006 absolute, ~0.2% relative), despite the ~25% parameter increase from 838.65M to 1.0506B total and the slower 24.55K tok/s throughput.
+The result says the larger 1B-class base is viable and marginally better, but this scale step did not produce a qualitative base-model jump.
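To put the v11 to v12 delta in proportion, the numbers work out as follows. A small sketch; the perplexity ratio assumes the reported eval value is mean cross-entropy in nats per token, which this document does not state.

```python
import math

# v11 vs v12 on fixed eval data v1 at seq1024 (values from the tables above).
v11_loss, v12_loss = 2.7467, 2.7410
v11_params, v12_params = 838.65e6, 1.0506e9

abs_gain = v11_loss - v12_loss            # absolute loss improvement
rel_gain = abs_gain / v11_loss            # relative improvement
param_ratio = v12_params / v11_params     # capacity increase

# ASSUMPTION: loss is mean nats/token, so perplexity = exp(loss).
ppl_ratio = math.exp(v12_loss - v11_loss)

print(f"abs gain    = {abs_gain:.4f}")
print(f"rel gain    = {rel_gain:.2%}")
print(f"param ratio = {param_ratio:.3f}x")
print(f"ppl ratio   = {ppl_ratio:.4f}")
```

Roughly a quarter more parameters buys about a fifth of a percent of eval loss, which is the basis for reading this step as close to saturated on the current data.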
+
+Generation:
+
+- Raw FineWeb-style prompts are better than the pilot checkpoint and can produce plausible explanatory prose.
+- Greedy repetition remains visible, especially on story-like prompts.
+- Chat prompts are not reliable without SFT: SFT data provenance is hallucinated, arithmetic still fails, and the model repeats template-like text.
+- xserv loads the export correctly as true GQA: `layers=22, hidden=1664, heads=52/13 kv`.
+
+Exported model:
+
+```text
+/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
+```
+
+Files:
+
+- `config.json`
+- `model.safetensors` (2.0GB)
+- `tokenizer.json`
+- `xtrain.ckpt` (4.0GB)
+
+Conclusion: v12 passes the base gate and is a better SFT starting point than v11 by metric, but the gain is narrow. The next step should be
+assistant-only chat SFT from v12 with conservative anchors first, then a small Smol-SmolTalk mix. Do not expect the base checkpoint itself to
+serve as a usable chat model.
--- a/docs/runs/13-v12-chat-sft-quality.md
+++ b/docs/runs/13-v12-chat-sft-quality.md
@@ -0,0 +1,180 @@
+# v12 Chat SFT Quality Check
+
+Date: 2026-06-29
+
+## Goal
+
+Turn the completed v12 1.05B base checkpoint into a usable chat-alpha model with
+SFT, then judge whether it is stable enough to call a high-quality chat model.
+
+Base checkpoint:
+
+```text
+/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
+```
+
+Architecture:
+
+```text
+dim=1664 layers=22 heads=52 kv_heads=13 head_dim=32 ffn=6656
+total params=1.0506B
+```
+
+## Stage A: Synthetic SFT
+
+Data:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_alpha_v2/chat_alpha_v2_sft.tsv
+211,257 examples, about 14.96M SFT tokens
+```
+
+Run:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_alpha_v2/chat_alpha_v12_v2.ckpt
+```
+
+Metrics:
+
+```text
+train loss: 3.5730 -> 0.0426
+eval: step39 0.1078, step79 0.0582, step119 0.0466, step159 0.0423,
+      step199 0.0403, step239 0.0390, step279 0.0389, step319 0.0378
+best/final val loss: 0.0378
+```
+
+Export:
+
+```text
+/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2
+```
+
+Quality notes:
+
+- Learns the User/Assistant format and usually stops correctly.
+- Too narrow and template-heavy.
+- Fails basic math and code prompts in fixed greedy evaluation.
+
+## Stage B: Anchor SFT
+
+Data:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_alpha_v2_anchor/chat_alpha_v2_anchor.tsv
+32,020 examples, about 1.73M SFT tokens
+```
+
+Run:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
+```
+
+Metrics:
+
+```text
+train loss: 1.7777 -> 0.1165
+eval: step19 0.3447, step39 0.1449, step59 0.1217, step79 0.1158
+best/final val loss: 0.1158
+```
+
+Export:
+
+```text
+/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2-anchor
+```
+
+Generation artifacts:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_xserv_greedy.txt
+/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_diagnostic_greedy.txt
+```
+
+Quality notes:
+
+- Better project-context answers and summaries than synthetic-only.
+- Still unreliable on basic multiplication, yes/no facts, translation, and code.
+- Overuses "cannot verify" style answers outside appropriate uncertainty cases.
+
+## Stage C: Real-Mix Repair
+
+Data:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_real_mix_v1/smol_smoltalk_real_mix.tsv
+96,287 examples, about 25.3M SFT tokens
+```
+
+Run:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/chat_alpha_v12_real_mix_repair.ckpt
+```
+
+Training setup:
+
+```text
+init=/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
+steps=200
+seq=512
+batch=32
+accum=8
+effective batch=256
+lr=1e-6 -> 2e-7
+```
+
+Metrics:
+
+```text
+train loss: 2.7391 -> 2.0384
+eval: step49 2.1964, step99 2.0383, step149 1.9801, step199 1.9570
+best/final val loss: 1.9570
+```
+
+Export:
+
+```text
+/opt/wjh/projects/tiny-models/v12-chat-alpha-real-mix-repair
+```
+
+Generation artifacts:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_xserv_greedy.txt
+/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy.txt
+/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy_reppenalty1.txt
+```
+
+Quality notes:
+
+- Loss improved cleanly and the model kept chat formatting.
+- Fixed prompt math `17% of 240` improved in the standard suite.
+- General diagnostic math still fails, e.g. `12 * 13`.
+- Code generation remains unusable for simple Python function prompts.
+- Some outputs contain corrupted or off-topic fragments.
+- Reducing repeat penalty from 1.15 to 1.0 did not fix the failures.
+
+## Verdict
+
+The SFT pipeline works, and v12 can be turned into a chat-shaped model that follows
+the prompt format and stops correctly. However, none of the three SFT variants is a
+stable high-quality chat model yet.
+
+The limiting issue is no longer infrastructure. It is data and objective quality:
+the current synthetic/anchor data is too narrow, while the current real-mix data
+adds breadth but also noisy or low-quality behavior. Validation loss alone is not a
+sufficient selection signal for chat quality.
+
+## Recommended Next Step
+
+Build a smaller, more carefully verified SFT curriculum before another large run:
+
+1. Keep the anchor data, but reduce over-refusal templates.
+2. Add verified small instruction sets for math, code, translation, summarization,
+   and closed-book common facts.
+3. Add an automatic fixed-prompt eval harness that scores exact-match math, simple
+   code syntax, refusal appropriateness, stop-token behavior, and corruption.
+4. Train a short curriculum from the v12 base or v12 anchor checkpoint, then pick
+   by generation eval rather than SFT loss alone.
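The harness in step 3 could start from scorers like the ones below. This is a sketch under stated assumptions, not an existing xtrain or xserv tool: every function name is hypothetical, the `User:` turn marker is a guess at the chat template, and refusal appropriateness is left out because it needs labeled prompts. The syntax check is necessary but not sufficient, since a bare word also parses as valid Python.

```python
import ast
import re

# Hypothetical fixed-prompt scorers; names and the "User:" marker are assumptions.

def score_math(answer: str, expected: float, tol: float = 1e-6) -> bool:
    """Exact-match math: the last number in the answer must equal `expected`."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return bool(nums) and abs(float(nums[-1]) - expected) <= tol

def score_code_syntax(answer: str) -> bool:
    """Simple code syntax: the answer must parse as Python source."""
    try:
        ast.parse(answer)
    except (SyntaxError, ValueError):
        return False
    return True

def score_stop(answer: str, next_turn_marker: str = "User:") -> bool:
    """Stop behavior: the generation must not run on into a new user turn."""
    return next_turn_marker not in answer

def score_corruption(answer: str) -> bool:
    """Corruption: no replacement characters or stray control characters."""
    return "\ufffd" not in answer and not re.search(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", answer)

def pass_rate(flags) -> float:
    """Fraction of checks passed; intended for ranking checkpoints."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0

# Prompts taken from the quality notes; the answers are made-up model outputs.
checks = [
    score_math("12 * 13 = 156.", expected=156),
    score_math("17% of 240 is 40.8", expected=40.8),
    score_code_syntax("def add(a, b):\n    return a + b\n"),
    score_stop("The result is 156."),
    score_corruption("The result is 156."),
]
print(f"pass rate = {pass_rate(checks):.2f}")
```

Ranking checkpoints by `pass_rate` over the fixed prompt set, with SFT val loss only as a tiebreaker, would implement the selection rule in step 4.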