- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens, 81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467, a real but marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2% return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor / real-mix-repair). The pipeline works and produces a chat-shaped model that follows the format and stops, but none of the variants is a stable high-quality chat model. The bottleneck is SFT data quality + selection signal (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# v12 Chat SFT Quality Check

Date: 2026-06-29

## Goal

Turn the completed v12 1.05B base checkpoint into a usable chat-alpha model with
SFT, then judge whether it is stable enough to call a high-quality chat model.

Base checkpoint:

```text
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
```

Architecture:

```text
dim=1664 layers=22 heads=52 kv_heads=13 head_dim=32 ffn=6656
total params=1.0506B
```
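
As a quick sanity check on the architecture line, the sketch below verifies that the head counts are consistent with grouped-query attention and estimates how many parameters sit in the transformer blocks. It rests on assumptions that are not stated in this report: bias-free projections and a gated FFN with three `dim x ffn` matrices. The embedding, output-head, and norm parameters are not broken out here, so they only appear as a remainder against the stated total.

```python
# Figures copied from the architecture block.
dim, layers, heads, kv_heads, head_dim, ffn = 1664, 22, 52, 13, 32, 6656
stated_total = 1.0506e9

# GQA consistency checks.
assert heads * head_dim == dim
assert heads % kv_heads == 0
group_size = heads // kv_heads   # query heads sharing one KV head
kv_width = kv_heads * head_dim   # output width of the K and V projections

# ASSUMPTION: bias-free Q, K, V, and O projections.
attn_params = dim * dim + 2 * dim * kv_width + dim * dim
# ASSUMPTION: gated FFN with three dim x ffn matrices (SwiGLU-style).
ffn_params = 3 * dim * ffn
block_params = layers * (attn_params + ffn_params)

# Under these assumptions, the rest would be embeddings, output head, and norms.
remainder = stated_total - block_params

print(f"group_size={group_size} kv_width={kv_width}")
print(f"block params ~ {block_params / 1e9:.4f}B")
print(f"remainder    ~ {remainder / 1e6:.1f}M")
```

With 52 query heads and 13 KV heads, each KV head serves 4 query heads, so the K and V projections are a quarter of the width of the Q projection.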

## Stage A: Synthetic SFT

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_alpha_v2/chat_alpha_v2_sft.tsv
211,257 examples, about 14.96M SFT tokens
```

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_alpha_v2/chat_alpha_v12_v2.ckpt
```

Metrics:

```text
train loss: 3.5730 -> 0.0426
eval: step39 0.1078, step79 0.0582, step119 0.0466, step159 0.0423,
      step199 0.0403, step239 0.0390, step279 0.0389, step319 0.0378
best/final val loss: 0.0378
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2
```

Quality notes:

- Learns the User/Assistant format and usually stops correctly.
- Too narrow and template-heavy.
- Fails basic math and code prompts in fixed greedy evaluation.

## Stage B: Anchor SFT

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_alpha_v2_anchor/chat_alpha_v2_anchor.tsv
32,020 examples, about 1.73M SFT tokens
```

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
```

Metrics:

```text
train loss: 1.7777 -> 0.1165
eval: step19 0.3447, step39 0.1449, step59 0.1217, step79 0.1158
best/final val loss: 0.1158
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2-anchor
```

Generation artifacts:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_xserv_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_diagnostic_greedy.txt
```

Quality notes:

- Better project-context answers and summaries than synthetic-only.
- Still unreliable on basic multiplication, yes/no facts, translation, and code.
- Overuses "cannot verify" style answers outside appropriate uncertainty cases.

## Stage C: Real-Mix Repair

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_real_mix_v1/smol_smoltalk_real_mix.tsv
96,287 examples, about 25.3M SFT tokens
```
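
Dividing the reported SFT token counts by the example counts gives a rough average length per example, which is one quick way to compare the shape of the three datasets. The figures below are the ones reported for Stages A, B, and C; the token counts are stated as approximate, so the averages are ballpark values only.

```python
# (examples, approximate SFT tokens) as reported for each stage's dataset.
datasets = {
    "synthetic": (211_257, 14.96e6),
    "anchor": (32_020, 1.73e6),
    "real_mix": (96_287, 25.3e6),
}

# Average SFT tokens per example for each dataset.
avg_tokens = {name: tokens / n for name, (n, tokens) in datasets.items()}

for name, avg in avg_tokens.items():
    print(f"{name:10s} ~{avg:6.1f} SFT tokens/example")
```

On these figures the real-mix examples average several times the length of the synthetic and anchor examples.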

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/chat_alpha_v12_real_mix_repair.ckpt
```

Training setup:

```text
init=/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
steps=200
seq=512
batch=32
accum=8
effective batch=256
lr=1e-6 -> 2e-7
```
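
The arithmetic behind this setup can be checked in a few lines. The sketch below recomputes the effective batch and an upper bound on the token positions processed in 200 steps, then compares that bound with the roughly 25.3M SFT tokens in the real-mix data. Whether sequences are packed or padded is not stated here, so the result is only an upper bound on how much of the data the run could have covered.

```python
steps, seq, batch, accum = 200, 512, 32, 8
lr_start, lr_end = 1e-6, 2e-7
dataset_tokens = 25.3e6  # approximate SFT tokens in the real-mix data

effective_batch = batch * accum                 # sequences per optimizer step
max_tokens_per_step = effective_batch * seq     # token positions per step
max_tokens_total = steps * max_tokens_per_step  # upper bound over the run

# Upper bound on passes over the data; padding and masking only lower it.
max_passes = max_tokens_total / dataset_tokens

print(f"effective batch      = {effective_batch}")
print(f"max tokens per step  = {max_tokens_per_step:,}")
print(f"max tokens in run    = {max_tokens_total:,}")
print(f"max passes over data = {max_passes:.2f}")
print(f"final lr / start lr  = {lr_end / lr_start:.2f}")
```

So the repair run amounts to about one pass over the real-mix data at most, at a learning rate that decays to a fifth of its starting value.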

Metrics:

```text
train loss: 2.7391 -> 2.0384
eval: step49 2.1964, step99 2.0383, step149 1.9801, step199 1.9570
best/final val loss: 1.9570
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-real-mix-repair
```

Generation artifacts:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_xserv_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy_reppenalty1.txt
```

Quality notes:

- Loss improved cleanly and the model kept chat formatting.
- The fixed-prompt math item `17% of 240` improved in the standard suite.
- General diagnostic math still fails, e.g. `12 * 13`.
- Code generation remains unusable for simple Python function prompts.
- Some outputs contain corrupted or off-topic fragments.
- Reducing the repeat penalty from 1.15 to 1.0 did not fix the failures.
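
For context on the repeat-penalty comparison, here is a minimal sketch of the common CTRL-style repetition penalty. This is an assumption about the formulation; the serving stack's actual implementation is not described in this report. Under this formulation a penalty of 1.0 leaves the logits untouched, so the `reppenalty1` run would be plain greedy decoding. The function name and the toy logits are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of logits with already-generated tokens penalized."""
    out = list(logits)
    for tok in set(generated_ids):
        score = out[tok]
        # Positive logits shrink; negative logits become more negative.
        out[tok] = score / penalty if score > 0 else score * penalty
    return out


def argmax(values):
    """Index of the largest value (the greedy choice)."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 1.9, -0.5, 0.3]  # toy values, not taken from the model
history = [0, 2]                # token ids that were already generated

penalized = apply_repetition_penalty(logits, history, 1.15)
unchanged = apply_repetition_penalty(logits, history, 1.0)

print(argmax(logits), argmax(penalized), argmax(unchanged))
```

In the toy case the 1.15 penalty flips the greedy choice away from a repeated token, while 1.0 changes nothing. That the failures persist at both settings suggests they are not a decoding artifact.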

## Verdict

The SFT pipeline works, and v12 can be turned into a chat-shaped model that follows
the prompt format and stops correctly. However, none of the three SFT variants is a
stable high-quality chat model yet.

The limiting issue is no longer infrastructure. It is data and objective quality:
the current synthetic/anchor data is too narrow, while the current real-mix data
adds breadth but also noisy or low-quality behavior. Validation loss alone is not a
sufficient selection signal for chat quality.

## Recommended Next Step

Build a smaller, higher-precision SFT curriculum before another large run:

1. Keep the anchor data, but reduce over-refusal templates.
2. Add verified small instruction sets for math, code, translation, summarization,
   and closed-book common facts.
3. Add an automatic fixed-prompt eval harness that scores exact-match math, simple
   code syntax, refusal appropriateness, stop-token behavior, and corruption.
4. Train a short curriculum from the v12 base or v12 anchor checkpoint, then pick
   by generation eval rather than SFT loss alone.

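
The harness in step 3 could start as small as the sketch below. Everything in it is illustrative: the function names, the refusal phrases, the `User:`/`Assistant:` turn markers, the toy cases, and the canned responses are assumptions, and `generate` stands in for whatever callable wraps the exported model. The stop check is only a proxy that flags responses running on into another turn, and the corruption check is a simple heuristic.

```python
import ast
import re

FENCE = "`" * 3  # markdown code fence, built indirectly so it embeds cleanly
REFUSAL_MARKERS = ("cannot verify", "can't verify", "i cannot", "i can't")  # assumed
TURN_MARKER = re.compile(r"^(User|Assistant):", re.MULTILINE)  # assumed chat format


def math_ok(response, expected):
    """Exact match between the last number in the response and the expected value."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return bool(numbers) and float(numbers[-1]) == float(expected)


def code_ok(response):
    """True if the response (or its first fenced block) parses and defines a function."""
    match = re.search(FENCE + r"(?:python)?\n(.*?)" + FENCE, response, re.DOTALL)
    source = match.group(1) if match else response
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return any(isinstance(node, ast.FunctionDef) for node in ast.walk(tree))


def refusal_ok(response, should_refuse):
    """True if the model refused exactly when a refusal was appropriate."""
    refused = any(marker in response.lower() for marker in REFUSAL_MARKERS)
    return refused == should_refuse


def stop_ok(response):
    """True if the response does not run on into another chat turn."""
    return TURN_MARKER.search(response) is None


def clean_ok(response):
    """Heuristic corruption check: no replacement chars, control chars, or long runs."""
    if "\ufffd" in response:
        return False
    if any(ord(ch) < 32 and ch not in "\n\t" for ch in response):
        return False
    return re.search(r"(.)\1{9,}", response) is None


def evaluate(cases, generate):
    """Return the pass rate per check over a fixed list of prompt cases."""
    passed, total = {}, {}
    for case in cases:
        response = generate(case["prompt"])
        checks = {"stop": stop_ok(response), "clean": clean_ok(response)}
        if case["kind"] == "math":
            checks["math"] = math_ok(response, case["expected"])
        elif case["kind"] == "code":
            checks["code"] = code_ok(response)
        elif case["kind"] == "refusal":
            checks["refusal"] = refusal_ok(response, case["should_refuse"])
        for name, ok in checks.items():
            total[name] = total.get(name, 0) + 1
            passed[name] = passed.get(name, 0) + int(ok)
    return {name: passed[name] / total[name] for name in total}


# Toy cases and canned responses, only to show the harness running.
cases = [
    {"kind": "math", "prompt": "What is 12 * 13?", "expected": 156},
    {"kind": "math", "prompt": "What is 17% of 240?", "expected": 40.8},
    {"kind": "code", "prompt": "Write a Python function that adds two numbers."},
    {"kind": "refusal", "prompt": "What is the capital of France?", "should_refuse": False},
]
canned = {
    "What is 12 * 13?": "12 * 13 = 146",
    "What is 17% of 240?": "17% of 240 is 40.8",
    "Write a Python function that adds two numbers.": "def add(a, b):\n    return a + b",
    "What is the capital of France?": "I cannot verify that.",
}
scores = evaluate(cases, canned.get)
print(scores)
```

For step 4, the same `evaluate` call could be run on each candidate checkpoint's outputs, with the choice made on these pass rates rather than on the lowest validation loss.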