
Gahow Wang 7a1fba95b5 docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check

- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens,
  81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but
  marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2%
  return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor /
  real-mix-repair). The pipeline works and produces a chat-shaped model that
  follows the format and stops, but none of the variants is a stable
  high-quality chat model — bottleneck is SFT data quality + selection signal
  (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-29 16:19:12 +08:00

9.0 KiB


Scaling Run v12: 1B-class long-context base → chat-alpha-v2 — Design Document

Goal

v11 showed that a larger dim1536/20L model can train at seq1024 on dash5, and it improved the long-context (seq1024) score on fixed eval data v1 to 2.7467. It also exposed the current bottleneck: greedy generation still repeats, and broad real-data SFT on top of v11 regressed chat quality despite reaching a lower SFT validation loss.

v12 therefore separates the next phase into two gates:

Base gate: train a stronger English base model around 1B total params with seq1024, using the existing FineWeb-edu 6.765B-token cache and fixed eval data v1.
Chat gate: only after the base gate is healthy, run assistant-only English SFT and judge it with fixed prompt generation, not SFT loss alone.

Success means the model is serviceable enough for a small chat-alpha: stable base loss, lower fixed eval than v11, less repetitive fixed generation, and SFT that improves instruction behavior without destroying arithmetic/refusal/debug prompts.

Baseline: What v11 Taught Us

item	v11
arch	dim1536 / 20L / 48q-12kv GQA / ffn6144
params	684.26M core / 838.65M total
data	FineWeb-edu 6.765B tokens, 1 epoch
context	seq1024
throughput	~30.96K tok/s on 8 x RTX 5090
fixed eval data v1, seq1024	2.7467
issue	greedy repetition remains; direct real SFT regressed generation quality

SFT result from v11:

model	train result	generation result
`v11-chat-alpha-sft-v2-anchor`	synthetic assistant-only anchor	current best narrow chat-alpha
`v11-chat-alpha-real-sft-v1`	SFT val 1.4272	bad hallucination, math failure
`v11-chat-alpha-real-mix-v1`	SFT val 2.0543	better than direct real-SFT, still worse than anchor

Conclusion: SFT data quality matters, but v11's base is still too weak for broad real-data SFT to produce a general chat model.

Architecture

v12 target: slightly above 1B total params while staying close to the proven v11 shape and keeping GQA group size 4.

item	value
dim	1664
layers	22
query heads x head_dim	52 x 32
kv heads	13
GQA group	4
ffn	6656
core params	883.4M
embed + lm_head	167.3M
total params	1.0506B

Why this shape:

It is a controlled step from v11 rather than a new architecture family.
52/13 preserves true GQA with group 4.
Total params are near the requested 1B target.
dim1664 is less aggressive than dim1792/22L and has a better chance to fit seq1024 on 32GB 5090s.
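The parameter counts in the table can be sanity-checked with a short sketch. It is not taken from xtrain: it assumes bias-free attention and MLP weights, a three-matrix gated FFN (gate, up, down), two norm weight vectors per layer plus a final norm, untied embedding and lm_head, and a vocabulary of 50,257 tokens. These assumptions were chosen only because they reproduce the documented v11 and v12 totals; the function name count_params is hypothetical.

```python
def count_params(dim, layers, q_heads, kv_heads, ffn, vocab=50257):
    """Count parameters under the stated assumptions (not read from xtrain)."""
    head_dim = dim // q_heads
    kv_dim = kv_heads * head_dim                 # GQA: K and V are narrower than Q
    attn = 2 * dim * dim + 2 * dim * kv_dim      # Wq and Wo, then Wk and Wv
    mlp = 3 * dim * ffn                          # assumed gated FFN: gate, up, down
    norms = 2 * dim                              # assumed two norm weights per layer
    core = layers * (attn + mlp + norms) + dim   # plus one final norm
    embed = 2 * vocab * dim                      # assumed untied embedding + lm_head
    return core, embed, core + embed


v11 = count_params(dim=1536, layers=20, q_heads=48, kv_heads=12, ffn=6144)
v12 = count_params(dim=1664, layers=22, q_heads=52, kv_heads=13, ffn=6656)

for name, (core, embed, total) in (("v11", v11), ("v12", v12)):
    print(f"{name}: core {core / 1e6:.2f}M, embed+lm_head {embed / 1e6:.2f}M, "
          f"total {total / 1e6:.2f}M")
```

Under these assumptions the sketch gives 883.35M core, 167.26M embed + lm_head, and 1050.61M total for v12, and 684.26M core / 838.65M total for v11, which agrees with both tables after rounding.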

Data

Base pretraining stays English-oriented and uses the current token cache. Pass the .txt stem to xtrain; Corpus::load_cached appends .u16.bin internally.

/opt/wjh/projects/xtrain/data/fineweb-edu.txt
cache = /opt/wjh/projects/xtrain/data/fineweb-edu.txt.u16.bin
tokens = 6,765,333,808

Training uses the last 1M tokens as moving-tail validation. Every cross-version v12 claim must also be backed by fixed eval data v1 at the long-context seq1024 setting, the same setting that produced v11's score of 2.7467 in eval_v11_seq1024.log. This is distinct from the older v10 table, which used the same fixed eval data at seq256.

/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt
cache = /dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin

No new FineWeb shards are added in this phase. The experiment is model/context scale, not another data-axis change.

Training Plan

Primary v12 run:

item	value
world	8 x RTX 5090 on dash5
precision	bf16 mixed precision, fp32 master
memory stack	recompute + flash + grad accumulation
seq	1024
micro global batch	16 (2 sequences/rank)
accum	15
effective global batch	240
tokens/step	245,760
full steps	27,524
max_lr → min_lr	4e-4 → 4e-5
eval	moving-tail 1M every 500 steps; fixed eval data v1 at seq1024 after checkpoints
smoke throughput	~24.5K tok/s
estimated full wall clock	~76-78h

The reduced micro-batch is intentional: v11 seq1024 with global batch 32 already sat near the 5090 memory limit. v12 has larger weights; an initial batch24/accum10 smoke OOMed after step 0, while batch16/accum15 passed a 10-step smoke at ~29.4GB/GPU and preserved the same 245,760 tokens/step.
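The batch arithmetic in this plan can be checked with a short sketch. It assumes the moving-tail split removes exactly 1,000,000 tokens before steps are counted and that a trailing partial step is dropped; both are assumptions that happen to reproduce the planned 27,524 steps, not confirmed xtrain behavior. The function name plan is hypothetical, and the wall-clock figure covers training tokens only, not eval or checkpoint time.

```python
SEQ = 1024
TOKENS = 6_765_333_808
VAL_TAIL = 1_000_000  # assumption: "1M" means exactly 10**6 tokens


def plan(micro_global_batch, accum, tok_per_s):
    """Return (effective batch, tokens/step, full steps, training hours)."""
    effective = micro_global_batch * accum
    tokens_per_step = effective * SEQ
    # Assumption: steps are counted on the training split, partial step dropped.
    steps = (TOKENS - VAL_TAIL) // tokens_per_step
    hours = steps * tokens_per_step / tok_per_s / 3600
    return effective, tokens_per_step, steps, hours


effective, tokens_per_step, steps, hours = plan(16, 15, 24_500)
print(effective, tokens_per_step, steps, round(hours, 1))

# The OOMing batch24/accum10 configuration targets the same tokens/step.
print(plan(24, 10, 24_500)[1])
```

With batch16/accum15 at the ~24.5K tok/s smoke throughput this gives 240 sequences and 245,760 tokens per step, 27,524 steps, and a little under 77h of pure training time, which sits inside the ~76-78h estimate.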

Command wrapper:

scripts/run_v12_phase.sh smoke
scripts/run_v12_phase.sh start-pilot
scripts/run_v12_phase.sh start-full
scripts/run_v12_phase.sh status
scripts/run_v12_phase.sh eval-fixed
scripts/run_v12_phase.sh export
scripts/run_v12_phase.sh sample

Gates

Gate 0: build and smoke

Run:

scripts/run_v12_phase.sh smoke

Pass criteria:

no CUDA OOM
no NaN loss
loss decreases from initialization over the first 30 steps
peak memory leaves enough margin for eval
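The pass criteria above could be encoded as a small check over parsed smoke output. This is a sketch only: the function and argument names are hypothetical, it does not parse real xtrain logs, and the required eval margin is left as an argument because this document does not fix a number for it.

```python
import math


def smoke_passes(losses, peak_mem_gb, capacity_gb, eval_margin_gb, oom=False):
    """Apply the Gate 0 criteria to already-parsed smoke results (hypothetical helper)."""
    if oom:
        return False                      # CUDA OOM fails the gate outright
    if any(not math.isfinite(x) for x in losses):
        return False                      # NaN or inf loss
    if len(losses) < 2 or losses[-1] >= losses[0]:
        return False                      # loss must fall from initialization
    # Peak memory must leave the caller-chosen headroom for eval.
    return peak_mem_gb + eval_margin_gb <= capacity_gb
```

A stricter variant could require a monotone or smoothed decrease rather than comparing only the first and last loss.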

Gate 1: pilot

Run:

scripts/run_v12_phase.sh start-pilot

Default pilot is 300 steps with held-out eval every 100 steps.

Pass criteria:

train loss decreases smoothly
grad norm does not spike persistently
moving-tail eval is finite and improving
checkpoint can be reloaded by eval-fixed

Gate 2: full base

Run only after the pilot passes:

scripts/run_v12_phase.sh start-full

Pass criteria:

fixed eval data v1 at seq1024 beats v11's 2.7467
generation samples improve or at least do not regress on repetition
checkpoint exports and xserv loads the true GQA config

Gate 3: chat-alpha SFT

After a healthy v12 base:

Use assistant-only SFT (--sft-tsv) with English-only data.
Start from narrow anchors first, then mix in Smol-SmolTalk.
Judge with fixed generation prompts before calling it useful.

The primary high-quality source remains HuggingFaceTB/smol-smoltalk filtered to English single-turn examples, with local anchors preserved to keep deterministic behavior.

Evaluation

Base metrics:

moving-tail val during training
fixed eval data v1 at seq1024
xtrain fixed prompt samples from scripts/chat_alpha_fixed_prompts.txt
xserv exported-model smoke

Chat metrics:

fixed prompt answers for SFT explanation, SFT data provenance, arithmetic, refusal, repetition-debug checklist, summary, and simple code generation
compare against v11-chat-alpha-sft-v2-anchor
reject models that lower SFT validation loss but hallucinate more in fixed prompts
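The rejection rule in the last line can be written as a selection function. This is a minimal sketch, assuming fixed-prompt quality has been reduced to a manual tally of failed prompts per model; that tally, the ChatCandidate type, and the accept function are all hypothetical and not part of the existing scripts.

```python
from dataclasses import dataclass


@dataclass
class ChatCandidate:
    name: str
    sft_val_loss: float
    fixed_prompt_failures: int  # hypothetical manual tally over the fixed prompt set


def accept(candidate, baseline):
    """Prefer fixed-prompt behavior; SFT val loss only breaks ties."""
    if candidate.fixed_prompt_failures > baseline.fixed_prompt_failures:
        return False  # lower val loss does not excuse worse generations
    if candidate.fixed_prompt_failures < baseline.fixed_prompt_failures:
        return True
    return candidate.sft_val_loss < baseline.sft_val_loss
```

The ordering is deliberate: validation loss is consulted only when the fixed-prompt tally is tied, which matches the v11 finding that the two signals can move in opposite directions.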

Artifacts

Expected paths:

/dashscope-tmp/wjh/xtrain_v12/
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12_pilot.ckpt
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx

Results

Gate 0/1: smoke + pilot

batch24/accum10 smoke OOMed after step 0.
batch16/accum15 smoke passed 10 steps: train loss 11.2347 -> 7.9459, ~24.5K tok/s, ~29.4GB/GPU.
300-step pilot passed: train loss 11.2296 -> 5.4832, val 6.5810 -> 5.9642 -> 5.5888, exit code 0.
Pilot checkpoint reload matched final val (5.5888): fixed eval data v1 at seq1024 = 5.5891.
Fixed chat prompts still repeat heavily, as expected for a 300-step base; use them as a regression baseline, not as a measure of chat quality.

Gate 2: full base

Full run completed on dash5:

item	result
wall clock	81h01m
throughput	~24.55K tok/s
train loss	11.2294 -> 2.6696
moving-tail best val	2.7411
moving-tail final val	2.7412
fixed eval data v1, seq1024 reload	2.7410
exit code	0

Validation milestones:

step	499	999	1499	1999	2499	21999	23999	25999	26999	27499	final
val	5.3029	4.4079	3.9287	3.6964	3.5555	2.7805	2.7637	2.7468	2.7443	2.7411	2.7412

Compared with v11's fixed eval data v1 at seq1024 (2.7467), v12 reaches 2.7410 after reload. This is a real but very small gain (~0.006 absolute), despite the parameter increase from 838.65M to 1.0506B total and the slower 24.55K tok/s throughput. The result says the larger 1B-class base is viable and marginally better, but this scale step did not produce a qualitative base-model jump.
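The size of this step can be put in relative terms with a few lines of arithmetic on the numbers reported in this document. The perplexity ratio assumes the reported loss is mean cross-entropy in nats per token, which is an assumption about how xtrain reports loss.

```python
import math

V11_LOSS, V12_LOSS = 2.7467, 2.7410        # fixed eval data v1, seq1024
V11_PARAMS, V12_PARAMS = 838.65e6, 1.0506e9
V11_TOKS, V12_TOKS = 30.96e3, 24.55e3      # tok/s on 8 x RTX 5090

abs_gain = V11_LOSS - V12_LOSS
rel_gain = abs_gain / V11_LOSS
ppl_ratio = math.exp(V12_LOSS - V11_LOSS)  # assumes loss is nats per token
param_growth = V12_PARAMS / V11_PARAMS
throughput_ratio = V12_TOKS / V11_TOKS

print(f"loss gain: {abs_gain:.4f} absolute, {rel_gain:.2%} relative")
print(f"perplexity ratio v12/v11: {ppl_ratio:.4f}")
print(f"params x{param_growth:.2f}, throughput x{throughput_ratio:.2f}")
```

In round numbers, about 25% more parameters and about 21% lower throughput bought roughly 0.2% lower loss.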

Generation:

Raw FineWeb-style prompts are better than the pilot checkpoint and can produce plausible explanatory prose.
Greedy repetition remains visible, especially on story-like prompts.
Chat prompts are not reliable without SFT: SFT data provenance is hallucinated, arithmetic still fails, and the model repeats template-like text.
xserv loads the export correctly as true GQA: layers=22, hidden=1664, heads=52/13 kv.
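Repetition is judged here by reading samples. One possible way to put a number on it, so that later checkpoints can be compared against this baseline, is the fraction of repeated n-grams in each fixed-prompt sample. This is a sketch of that generic metric, not the method used for the results above; the function name and the whitespace tokenization in the usage line are assumptions.

```python
def repeated_ngram_fraction(tokens, n=4):
    """Fraction of n-grams that duplicate another n-gram in the same sample."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sample shorter than n tokens
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Usage on a generated sample, split on whitespace for simplicity.
sample = "the cat sat on the mat and the cat sat on the mat again"
print(round(repeated_ngram_fraction(sample.split(), n=4), 3))
```

A value of 0.0 means every n-gram is unique; values approaching 1.0 indicate the looping behavior described above. Model token ids could be used instead of whitespace tokens without changing the function.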

Exported model:

/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx

Files:

config.json
model.safetensors (2.0GB)
tokenizer.json
xtrain.ckpt (4.0GB)

Conclusion: v12 passes the base gate and is a better SFT starting point than v11 by metric, but the gain is narrow. The next step should be assistant-only chat SFT from v12 with conservative anchors first, then a small Smol-SmolTalk mix. Do not expect the base checkpoint itself to serve as a usable chat model.
