xtrain/docs/runs/12-v12-1b-longctx-chat-alpha.md
Gahow Wang 7a1fba95b5 docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check
- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens,
  81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but
  marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2%
  return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor /
  real-mix-repair). The pipeline works and produces a chat-shaped model that
  follows the format and stops, but none of the variants is a stable
  high-quality chat model — bottleneck is SFT data quality + selection signal
  (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 16:19:12 +08:00


Scaling Run v12: 1B-class long-context base → chat-alpha-v2 — Design Document

Goal

v11 proved that a larger dim1536/20L model can train at seq1024 on dash5, and it improved the fixed eval data v1 score at seq1024 to 2.7467. It also exposed the current bottleneck: greedy generation still repeats, and broad real-data SFT on top of v11 regressed chat quality despite a lower SFT validation loss.

v12 therefore separates the next phase into two gates:

  1. Base gate: train a stronger English base model around 1B total params with seq1024, using the existing FineWeb-edu 6.765B-token cache and fixed eval data v1.
  2. Chat gate: only after the base gate is healthy, run assistant-only English SFT and judge it with fixed prompt generation, not SFT loss alone.

Success means the model is serviceable enough for a small chat-alpha: stable base loss, lower fixed eval than v11, less repetitive fixed generation, and SFT that improves instruction behavior without destroying arithmetic/refusal/debug prompts.

Baseline: What v11 Taught Us

| item | v11 |
| --- | --- |
| arch | dim1536 / 20L / 48q-12kv GQA / ffn6144 |
| params | 684.26M core / 838.65M total |
| data | FineWeb-edu 6.765B tokens, 1 epoch |
| context | seq1024 |
| throughput | ~30.96K tok/s on 8 x RTX 5090 |
| fixed eval data v1, seq1024 | 2.7467 |
| issue | greedy repetition remains; direct real SFT regressed generation quality |

SFT result from v11:

| model | train result | generation result |
| --- | --- | --- |
| v11-chat-alpha-sft-v2-anchor | synthetic assistant-only anchor | current best narrow chat-alpha |
| v11-chat-alpha-real-sft-v1 | SFT val 1.4272 | bad hallucination, math failure |
| v11-chat-alpha-real-mix-v1 | SFT val 2.0543 | better than direct real-SFT, still worse than anchor |

Conclusion: SFT data quality matters, but v11's base is still too weak for broad real SFT to become a general chat model.

Architecture

v12 target: slightly above 1B total params while staying close to the proven v11 shape and keeping GQA group size 4.

| item | value |
| --- | --- |
| dim | 1664 |
| layers | 22 |
| query heads x head_dim | 52 x 32 |
| kv heads | 13 |
| GQA group | 4 |
| ffn | 6656 |
| core params | 883.4M |
| embed + lm_head | 167.3M |
| total params | 1.0506B |

Why this shape:

  • It is a controlled step from v11 rather than a new architecture family.
  • 52/13 preserves true GQA with group 4.
  • Total params are near the requested 1B target.
  • dim1664 is less aggressive than dim1792/22L and has a better chance to fit seq1024 on 32GB 5090s.
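The parameter figures in the table can be sanity-checked with a short calculation. This is a sketch under assumptions the document does not state: bias-free attention, a three-matrix gated FFN, two norm vectors per layer plus a final norm, untied embedding and lm_head, and a 50,257-entry vocabulary. The function name `param_counts` is hypothetical. Under these assumptions the arithmetic reproduces the reported v11 and v12 figures to rounding, which suggests, but does not prove, that this is the actual layout.

```python
# Assumed layout (not confirmed by the document): no biases, gated 3-matrix FFN,
# two norm weight vectors per layer, one final norm, untied embed and lm_head.
VOCAB = 50_257  # assumption: GPT-2-sized vocabulary


def param_counts(dim, layers, q_heads, kv_heads, ffn, vocab=VOCAB):
    """Return (core, embed_plus_lm_head, total) parameter counts."""
    head_dim = dim // q_heads
    kv_dim = kv_heads * head_dim                 # GQA: K/V projections are narrower
    attn = 2 * dim * dim + 2 * dim * kv_dim      # Wq and Wo, then Wk and Wv
    mlp = 3 * dim * ffn                          # gate, up, down projections
    norms = 2 * dim                              # per-layer norm weights
    core = layers * (attn + mlp + norms) + dim   # plus the final norm
    embed = 2 * vocab * dim                      # embedding + lm_head, untied
    return core, embed, core + embed


configs = {
    "v11": (1536, 20, 48, 12, 6144),
    "v12": (1664, 22, 52, 13, 6656),
}
for name, cfg in configs.items():
    core, embed, total = param_counts(*cfg)
    print(f"{name}: core {core / 1e6:.2f}M, "
          f"embed+lm_head {embed / 1e6:.2f}M, total {total / 1e9:.4f}B")
```

The GQA group size falls out of the same inputs: 52 query heads over 13 kv heads gives group 4, the same ratio as v11's 48 over 12.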

Data

Base pretraining stays English-oriented and uses the current token cache. Pass the .txt stem to xtrain; Corpus::load_cached appends .u16.bin internally.

/opt/wjh/projects/xtrain/data/fineweb-edu.txt
cache = /opt/wjh/projects/xtrain/data/fineweb-edu.txt.u16.bin
tokens = 6,765,333,808

Training uses the last 1M tokens as moving-tail validation. Every cross-version v12 claim must also run fixed eval data v1 with the long-context seq1024 setting, matching the v11 eval_v11_seq1024.log score of 2.7467. This is distinct from the older v10 table that used the same fixed eval data with seq256.

/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt
cache = /dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
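For orientation, the sketch below shows how a `.u16.bin` cache like the ones above could be inspected outside xtrain. It assumes the cache is a headerless array of 16-bit token ids in native byte order; the document does not specify the on-disk layout, and the helper names `cached_token_count` and `read_tail_tokens` are hypothetical, not xtrain APIs.

```python
import os
from array import array

BYTES_PER_TOKEN = 2  # assumption: raw 16-bit token ids with no file header


def cached_token_count(cache_path):
    """Token count implied by the file size of a headerless u16 cache."""
    size = os.path.getsize(cache_path)
    if size % BYTES_PER_TOKEN:
        raise ValueError("cache size is not a multiple of 2 bytes")
    return size // BYTES_PER_TOKEN


def read_tail_tokens(cache_path, n_tokens):
    """Read the last n_tokens ids, e.g. a moving-tail validation slice."""
    total = cached_token_count(cache_path)
    n = min(n_tokens, total)
    tail = array("H")  # unsigned 16-bit items, native byte order
    with open(cache_path, "rb") as f:
        f.seek((total - n) * BYTES_PER_TOKEN)
        tail.fromfile(f, n)
    return tail
```

If the layout assumption holds, the 6,765,333,808-token training cache would occupy 13,530,667,616 bytes, which is a quick way to confirm that a copied cache is complete.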

No new FineWeb shards are added in this phase. The experiment is model/context scale, not another data-axis change.

Training Plan

Primary v12 run:

| item | value |
| --- | --- |
| world | 8 x RTX 5090 on dash5 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | recompute + flash + grad accumulation |
| seq | 1024 |
| micro global batch | 16 (2 sequences/rank) |
| accum | 15 |
| effective global batch | 240 |
| tokens/step | 245,760 |
| full steps | 27,524 |
| max_lr → min_lr | 4e-4 → 4e-5 |
| eval | moving-tail 1M every 500 steps; fixed eval data v1 at seq1024 after checkpoints |
| smoke throughput | ~24.5K tok/s |
| estimated full wall clock | ~76-78h |

The reduced micro-batch is intentional: v11 seq1024 with global batch 32 already sat near the 5090 memory limit. v12 has larger weights; an initial batch24/accum10 smoke OOMed after step 0, while batch16/accum15 passed a 10-step smoke at ~29.4GB/GPU and preserved the same 245,760 tokens/step.
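The batch arithmetic behind these numbers can be checked directly. The sketch below assumes that "1M" validation tokens means exactly 1,000,000 and that the step count is the floor of one epoch over the remaining tokens; both are assumptions, although they do reproduce the planned 27,524 steps.

```python
SEQ_LEN = 1024
WORLD = 8
TOTAL_TOKENS = 6_765_333_808
VAL_TAIL = 1_000_000  # assumption: "1M" means exactly 1,000,000 tokens


def tokens_per_step(global_micro_batch, accum, seq_len=SEQ_LEN):
    """Tokens consumed by one optimizer step."""
    return global_micro_batch * accum * seq_len


# The OOMing config and the passing config target the same effective batch.
assert tokens_per_step(24, 10) == tokens_per_step(16, 15) == 245_760

per_rank = 16 // WORLD                                   # sequences per GPU per micro-step
step_tokens = tokens_per_step(16, 15)
steps = (TOTAL_TOKENS - VAL_TAIL) // step_tokens         # one epoch, rounded down
hours = steps * step_tokens / 24_500 / 3600              # at the ~24.5K tok/s smoke rate

print(per_rank, steps, round(hours, 1))
```

At the smoke throughput this gives roughly 76.7 hours of pure training time, which sits inside the ~76-78h estimate before any eval or checkpoint overhead.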

Command wrapper:

scripts/run_v12_phase.sh smoke
scripts/run_v12_phase.sh start-pilot
scripts/run_v12_phase.sh start-full
scripts/run_v12_phase.sh status
scripts/run_v12_phase.sh eval-fixed
scripts/run_v12_phase.sh export
scripts/run_v12_phase.sh sample

Gates

Gate 0: build and smoke

Run:

scripts/run_v12_phase.sh smoke

Pass criteria:

  • no CUDA OOM
  • no NaN loss
  • first 30 steps decrease from initialization
  • peak memory leaves enough margin for eval

Gate 1: pilot

Run:

scripts/run_v12_phase.sh start-pilot

Default pilot is 300 steps with held-out eval every 100 steps.

Pass criteria:

  • train loss decreases smoothly
  • grad norm does not spike persistently
  • moving-tail eval is finite and improving
  • checkpoint can be reloaded by eval-fixed
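The pilot criteria above are stated qualitatively. One way to make them mechanical is sketched below; the function names, the grad-norm threshold, and the window length are all hypothetical choices for illustration, not values taken from xtrain or from this run.

```python
import math


def eval_is_finite_and_improving(vals):
    """True if every eval loss is finite and each one is lower than the previous."""
    if len(vals) < 2 or not all(math.isfinite(v) for v in vals):
        return False
    return all(later < earlier for earlier, later in zip(vals, vals[1:]))


def has_persistent_spike(grad_norms, threshold, window=3):
    """True if grad norm is non-finite or above threshold for `window` steps in a row."""
    run = 0
    for g in grad_norms:
        run = run + 1 if (not math.isfinite(g) or g > threshold) else 0
        if run >= window:
            return True
    return False
```

A single-step spike resets the counter and does not fail the check, which matches the "does not spike persistently" wording rather than forbidding any spike at all.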

Gate 2: full base

Run only after the pilot passes:

scripts/run_v12_phase.sh start-full

Pass criteria:

  • fixed eval data v1 at seq1024 beats v11's 2.7467
  • generation samples improve or at least do not regress on repetition
  • checkpoint exports and xserv loads the true GQA config
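The repetition criterion is easier to track across checkpoints with a number attached to it. The sketch below computes a simple repeated n-gram fraction over whitespace tokens. It is an illustrative metric, not the one used in this run, and the name `repeated_ngram_fraction` and the default `n=4` are assumptions.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=4):
    """Fraction of whitespace-token n-grams that duplicate an earlier n-gram."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)
```

Running this on the same fixed prompts for the v11 and v12 samples would give a like-for-like comparison: a value near 0 means little verbatim looping, and a value approaching 1 means the sample is dominated by a repeated phrase.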

Gate 3: chat-alpha SFT

After a healthy v12 base:

  1. Use assistant-only SFT (--sft-tsv) with English-only data.
  2. Start from narrow anchors first, then mix in Smol-SmolTalk.
  3. Judge with fixed generation prompts before calling it useful.

The primary high-quality source remains HuggingFaceTB/smol-smoltalk filtered to English single-turn examples, with local anchors preserved to keep deterministic behavior.

Evaluation

Base metrics:

  • moving-tail val during training
  • fixed eval data v1 at seq1024
  • xtrain fixed prompt samples from scripts/chat_alpha_fixed_prompts.txt
  • xserv exported-model smoke

Chat metrics:

  • fixed prompt answers for SFT explanation, SFT data provenance, arithmetic, refusal, repetition-debug checklist, summary, and simple code generation
  • compare against v11-chat-alpha-sft-v2-anchor
  • reject models that lower SFT validation loss but hallucinate more in fixed prompts

Artifacts

Expected paths:

/dashscope-tmp/wjh/xtrain_v12/
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12_pilot.ckpt
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx

Results

Gate 0/1: smoke + pilot

  • batch24/accum10 smoke OOMed after step 0.
  • batch16/accum15 smoke passed 10 steps: train loss 11.2347 -> 7.9459, ~24.5K tok/s, ~29.4GB/GPU.
  • 300-step pilot passed: train loss 11.2296 -> 5.4832, val 6.5810 -> 5.9642 -> 5.5888, exit code 0.
  • Pilot checkpoint reload agreed with the final moving-tail val (5.5888) to within 0.0003: fixed eval data v1 at seq1024 = 5.5891.
  • Fixed chat prompts still repeat heavily, as expected for a 300-step base; use them as a regression baseline, not as a measure of chat quality.

Gate 2: full base

Full run completed on dash5:

| item | result |
| --- | --- |
| wall clock | 81h01m |
| throughput | ~24.55K tok/s |
| train loss | 11.2294 -> 2.6696 |
| moving-tail best val | 2.7411 |
| moving-tail final val | 2.7412 |
| fixed eval data v1, seq1024 reload | 2.7410 |
| exit code | 0 |

Validation milestones:

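| step | 499 | 999 | 1499 | 1999 | 2499 | 21999 | 23999 | 25999 | 26999 | 27499 | final |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| val | 5.3029 | 4.4079 | 3.9287 | 3.6964 | 3.5555 | 2.7805 | 2.7637 | 2.7468 | 2.7443 | 2.7411 | 2.7412 |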

Compared with v11's fixed eval data v1 at seq1024 (2.7467), v12 reaches 2.7410 after reload. This is a real but very small gain (~0.006 absolute), despite the parameter increase from 838.65M to 1.0506B total and the slower 24.55K tok/s throughput. The result says the larger 1B-class base is viable and marginally better, but this scale step did not produce a qualitative base-model jump.
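To put the trade-off in relative terms, the reported figures can be compared directly. The snippet below only does arithmetic on numbers already given in this document; the dictionary layout and the helper `rel_change` are illustrative.

```python
# Figures as reported in this document for v11 and v12.
V11 = {"eval": 2.7467, "params": 838.65e6, "tok_s": 30.96e3}
V12 = {"eval": 2.7410, "params": 1.0506e9, "tok_s": 24.55e3}


def rel_change(new, old):
    """Relative change from old to new, as a fraction."""
    return (new - old) / old


eval_delta = V12["eval"] - V11["eval"]
print(f"eval: {eval_delta:+.4f} absolute, {rel_change(V12['eval'], V11['eval']):+.2%}")
print(f"params: {rel_change(V12['params'], V11['params']):+.1%}")
print(f"throughput: {rel_change(V12['tok_s'], V11['tok_s']):+.1%}")
```

In other words, about 25% more parameters and about 21% lower throughput bought a roughly 0.2% lower fixed eval loss on the same data.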

Generation:

  • Completions of raw FineWeb-style prompts are better than at the pilot checkpoint and can read as plausible explanatory prose.
  • Greedy repetition remains visible, especially on story-like prompts.
  • Chat prompts are not reliable without SFT: SFT data provenance is hallucinated, arithmetic still fails, and the model repeats template-like text.
  • xserv loads the export correctly as true GQA: layers=22, hidden=1664, heads=52/13 kv.

Exported model:

/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx

Files:

  • config.json
  • model.safetensors (2.0GB)
  • tokenizer.json
  • xtrain.ckpt (4.0GB)

Conclusion: v12 passes the base gate and is a better SFT starting point than v11 by metric, but the gain is narrow. The next step should be assistant-only chat SFT from v12 with conservative anchors first, then a small Smol-SmolTalk mix. Do not expect the base checkpoint itself to serve as a usable chat model.