Gahow Wang 7a1fba95b5 docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check

- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens,
  81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but
  marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2%
  return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor /
  real-mix-repair). The pipeline works and produces a chat-shaped model that
  follows the format and stops, but none of the variants is a stable
  high-quality chat model — bottleneck is SFT data quality + selection signal
  (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-29 16:19:12 +08:00

4.4 KiB

v12 Chat SFT Quality Check

Date: 2026-06-29

Goal

Turn the completed v12 1.05B base checkpoint into a usable chat-alpha model with SFT, then judge whether it is stable enough to call a high-quality chat model.

Base checkpoint:

/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt

Architecture:

dim=1664 layers=22 heads=52 kv_heads=13 head_dim=32 ffn=6656
total params=1.0506B
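
The shape figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch, not the project's own parameter counter: the bias-free projections and the gated three-matrix FFN are assumptions that the source does not state, and norm weights are ignored.

```python
# Shape figures copied from the architecture line above.
dim, layers, heads, kv_heads, head_dim, ffn = 1664, 22, 52, 13, 32, 6656

assert heads * head_dim == dim   # 52 * 32 = 1664
assert heads % kv_heads == 0     # query heads split evenly into KV groups
group_size = heads // kv_heads   # query heads sharing one KV head
ffn_ratio = ffn / dim            # FFN width relative to model width

# ASSUMPTION (not in the source): no biases, and a gated FFN with three
# weight matrices (gate, up, down). Norm weights are ignored.
kv_dim = kv_heads * head_dim
attn_params = 2 * dim * dim + 2 * dim * kv_dim   # Q and O, plus K and V
ffn_params = 3 * dim * ffn
block_params = layers * (attn_params + ffn_params)

# Under those assumptions, the rest of the reported total would be
# embeddings, the output head, and norms.
reported_total = 1.0506e9
remainder = reported_total - block_params

print(f"GQA group size: {group_size}, FFN ratio: {ffn_ratio}")
print(f"transformer blocks: {block_params / 1e9:.4f}B")
print(f"remainder: {remainder / 1e9:.4f}B")
```

If the FFN is not gated, or the projections carry biases, the block total changes and the remainder should be re-derived.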

Stage A: Synthetic SFT

Data:

/dashscope-tmp/wjh/xtrain_sft_alpha_v2/chat_alpha_v2_sft.tsv
211,257 examples, about 14.96M SFT tokens

Run:

/dashscope-tmp/wjh/xtrain_sft_v12_alpha_v2/chat_alpha_v12_v2.ckpt

Metrics:

train loss: 3.5730 -> 0.0426
eval: step39 0.1078, step79 0.0582, step119 0.0466, step159 0.0423,
      step199 0.0403, step239 0.0390, step279 0.0389, step319 0.0378
best/final val loss: 0.0378
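
As a quick check on the curve above, the sketch below computes the improvement between consecutive evals and confirms that the final eval is also the best one. It uses only the numbers listed in the metrics; nothing else is assumed.

```python
# Stage A eval losses copied from the metrics above, keyed by step.
evals = {39: 0.1078, 79: 0.0582, 119: 0.0466, 159: 0.0423,
         199: 0.0403, 239: 0.0390, 279: 0.0389, 319: 0.0378}

steps = sorted(evals)
# Drop in loss between consecutive evals (positive means the loss fell).
gains = [round(evals[a] - evals[b], 4) for a, b in zip(steps, steps[1:])]
best_step = min(evals, key=evals.get)

for step, gain in zip(steps[1:], gains):
    print(f"step{step}: loss fell by {gain:.4f}")
print(f"best eval at step{best_step}, final eval at step{steps[-1]}")
```

On these numbers each of the last four intervals gains 0.0020 or less, so the curve is close to flat by the end of the run.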

Export:

/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2

Quality notes:

Learns the User/Assistant format and usually stops correctly.
Too narrow and template-heavy.
Fails basic math and code prompts in fixed greedy evaluation.

Stage B: Anchor SFT

Data:

/dashscope-tmp/wjh/xtrain_sft_alpha_v2_anchor/chat_alpha_v2_anchor.tsv
32,020 examples, about 1.73M SFT tokens

Run:

/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt

Metrics:

train loss: 1.7777 -> 0.1165
eval: step19 0.3447, step39 0.1449, step59 0.1217, step79 0.1158
best/final val loss: 0.1158

Export:

/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2-anchor

Generation artifacts:

/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_xserv_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_diagnostic_greedy.txt

Quality notes:

Gives better project-context answers and summaries than the synthetic-only Stage A model.
Still unreliable on basic multiplication, yes/no facts, translation, and code.
Overuses "cannot verify"-style answers in cases where uncertainty is not warranted.

Stage C: Real-Mix Repair

Data:

/dashscope-tmp/wjh/xtrain_sft_real_mix_v1/smol_smoltalk_real_mix.tsv
96,287 examples, about 25.3M SFT tokens
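
To put the three datasets on a common scale, the sketch below divides each reported token count by its example count, using the Data lines of Stages A, B, and C. The token counts are the approximate figures given in this report. Whether "SFT tokens" counts only loss-bearing response tokens is not stated, so the ratios are indicative only.

```python
# (examples, approximate SFT tokens) copied from the Data lines of each stage.
datasets = {
    "A synthetic": (211_257, 14.96e6),
    "B anchor": (32_020, 1.73e6),
    "C real-mix": (96_287, 25.3e6),
}

def tokens_per_example(examples: int, tokens: float) -> float:
    """Average number of SFT tokens per example."""
    return tokens / examples

averages = {name: tokens_per_example(*row) for name, row in datasets.items()}
for name, avg in averages.items():
    print(f"{name:12s} {avg:7.1f} SFT tokens/example")
```

On these figures the real-mix examples average several times more SFT tokens than the synthetic or anchor examples, which is one concrete sense in which Stage C trains on a different distribution.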

Run:

/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/chat_alpha_v12_real_mix_repair.ckpt

Training setup:

init=/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
steps=200
seq=512
batch=32
accum=8
effective batch=256
lr=1e-6 -> 2e-7
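
The setup above implies a small training budget relative to the dataset, which the sketch below makes explicit. It gives two rough bounds under assumptions the source does not confirm: every position in every sequence is a counted token (no padding), and, for the example-based view, one example per sequence (no packing).

```python
# Values copied from the training setup above.
batch, accum, seq, steps = 32, 8, 512, 200

effective_batch = batch * accum              # sequences per optimizer step
positions_per_step = effective_batch * seq   # token positions per step
total_positions = positions_per_step * steps
total_sequences = effective_batch * steps

# Dataset size copied from the Stage C data line (approximate).
dataset_tokens = 25.3e6
dataset_examples = 96_287

# ASSUMPTION: no padding, so every position is a counted token (upper bound).
epochs_by_tokens = total_positions / dataset_tokens
# ASSUMPTION: one example per sequence, no packing.
epochs_by_examples = total_sequences / dataset_examples

print(f"effective batch: {effective_batch}")
print(f"token positions seen: {total_positions:,}")
print(f"epochs (token view, upper bound): {epochs_by_tokens:.2f}")
print(f"epochs (example view): {epochs_by_examples:.2f}")
```

Either way the run covers about one pass over the data or less, at a very low learning rate, so it is a light repair pass rather than a full fine-tune.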

Metrics:

train loss: 2.7391 -> 2.0384
eval: step49 2.1964, step99 2.0383, step149 1.9801, step199 1.9570
best/final val loss: 1.9570

Export:

/opt/wjh/projects/tiny-models/v12-chat-alpha-real-mix-repair

Generation artifacts:

/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_xserv_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy_reppenalty1.txt

Quality notes:

Loss improved cleanly and the model kept chat formatting.
The fixed-prompt math item "17% of 240" improved in the standard suite.
General diagnostic math still fails, e.g. 12 * 13.
Code generation remains unusable for simple Python function prompts.
Some outputs contain corrupted or off-topic fragments.
Reducing repeat penalty from 1.15 to 1.0 did not fix the failures.
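
For context on the last note, the sketch below shows the common repetition-penalty rule, in which logits of tokens already generated are divided by the penalty if positive and multiplied by it if negative. The source does not say how the decoder used here applies its penalty, so this is an illustration of the usual technique, with made-up logits.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of logits with already generated tokens made less likely."""
    out = list(logits)
    for tok in set(seen_token_ids):
        # For penalty > 1, dividing a positive logit or multiplying a
        # negative one both lower that token's score.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def greedy_pick(logits):
    """Index of the largest logit, as in greedy decoding."""
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical logits for a four-token vocabulary; tokens 0 and 2 were
# already generated.
logits = [2.0, 1.9, -0.5, 0.1]
seen = [0, 2]

with_penalty = greedy_pick(apply_repetition_penalty(logits, seen, 1.15))
without_penalty = greedy_pick(apply_repetition_penalty(logits, seen, 1.0))
print(with_penalty, without_penalty)
```

Under this rule a penalty of 1.0 leaves the logits unchanged, so the 1.0 run would be plain greedy decoding. Failures that persist there point at the model rather than at the decoding settings.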

Verdict

The SFT pipeline works, and v12 can be turned into a chat-shaped model that follows the prompt format and stops correctly. However, none of the three SFT variants is a stable high-quality chat model yet.

The limiting issue is no longer infrastructure. It is data and objective quality: the current synthetic/anchor data is too narrow, while the current real-mix data adds breadth but also noisy or low-quality behavior. Validation loss alone is not a sufficient selection signal for chat quality.

Recommended Next Step

Build a smaller, higher-precision SFT curriculum before another large run:

Keep the anchor data, but reduce over-refusal templates.
Add verified small instruction sets for math, code, translation, summarization, and closed-book common facts.
Add an automatic fixed-prompt eval harness that scores exact-match math, simple code syntax, refusal appropriateness, stop-token behavior, and corruption.
Train a short curriculum from the v12 base or v12 anchor checkpoint, then select the checkpoint by generation eval rather than by SFT loss alone.
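
The harness described in the list above could start from something like the following. It is a minimal sketch under stated assumptions, not the project's eval code: the `Case` structure, the refusal phrases, the `User:` turn marker, and the loop heuristic are all hypothetical choices made for illustration. The code check only tests whether the raw output parses as Python.

```python
import ast
import re
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: illustrative marker strings, not taken from the project.
REFUSAL_MARKERS = ("cannot verify", "can't verify", "i cannot", "i can't")
NEXT_TURN_MARKER = "User:"

@dataclass
class Case:
    prompt: str
    kind: str                        # "math", "code", or "chat"
    expected: Optional[str] = None   # exact answer string for math cases
    should_refuse: bool = False

def last_number(text: str) -> Optional[str]:
    """Last integer or decimal in the text, with thousands commas removed."""
    found = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return found[-1] if found else None

def parses_as_python(text: str) -> bool:
    """True if the non-empty text is syntactically valid Python."""
    if not text.strip():
        return False
    try:
        ast.parse(text)
        return True
    except (SyntaxError, ValueError):
        return False

def looks_corrupted(text: str) -> bool:
    """Flag replacement characters or a 4-word window repeated 3+ times."""
    if "\ufffd" in text:
        return True
    words = text.split()
    windows = [" ".join(words[i:i + 4]) for i in range(len(words) - 3)]
    return any(windows.count(w) >= 3 for w in set(windows))

def score(case: Case, output: str, hit_stop_token: bool) -> dict:
    """Score one greedy generation against one fixed prompt."""
    refused = any(m in output.lower() for m in REFUSAL_MARKERS)
    result = {
        "stopped": hit_stop_token and NEXT_TURN_MARKER not in output,
        "refusal_ok": refused == case.should_refuse,
        "clean": not looks_corrupted(output),
    }
    if case.kind == "math":
        result["task_ok"] = last_number(output) == case.expected
    elif case.kind == "code":
        result["task_ok"] = parses_as_python(output)
    return result

case = Case(prompt="What is 12 * 13?", kind="math", expected="156")
print(score(case, "12 * 13 = 156", hit_stop_token=True))
```

Averaging these per-case flags over a fixed prompt file gives one score per checkpoint, which can then sit next to validation loss when choosing between checkpoints.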
