# v12 Chat SFT Quality Check

Date: 2026-06-29

## Goal

Turn the completed v12 1.05B base checkpoint into a usable chat-alpha model with SFT, then judge whether it is stable enough to call a high-quality chat model.

Base checkpoint:

```text
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
```

Architecture:

```text
dim=1664 layers=22 heads=52 kv_heads=13 head_dim=32 ffn=6656
total params=1.0506B
```

## Stage A: Synthetic SFT

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_alpha_v2/chat_alpha_v2_sft.tsv
211,257 examples, about 14.96M SFT tokens
```

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_alpha_v2/chat_alpha_v12_v2.ckpt
```

Metrics:

```text
train loss: 3.5730 -> 0.0426
eval: step39 0.1078, step79 0.0582, step119 0.0466, step159 0.0423, step199 0.0403, step239 0.0390, step279 0.0389, step319 0.0378
best/final val loss: 0.0378
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2
```

Quality notes:

- Learns the User/Assistant format and usually stops correctly.
- Too narrow and template-heavy.
- Fails basic math and code prompts in fixed greedy evaluation.

## Stage B: Anchor SFT

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_alpha_v2_anchor/chat_alpha_v2_anchor.tsv
32,020 examples, about 1.73M SFT tokens
```

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
```

Metrics:

```text
train loss: 1.7777 -> 0.1165
eval: step19 0.3447, step39 0.1449, step59 0.1217, step79 0.1158
best/final val loss: 0.1158
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2-anchor
```

Generation artifacts:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_xserv_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_diagnostic_greedy.txt
```

Quality notes:

- Better project-context answers and summaries than synthetic-only.
- Still unreliable on basic multiplication, yes/no facts, translation, and code.
- Overuses "cannot verify" style answers outside appropriate uncertainty cases.

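
The `eval:` lines in the Stage A and Stage B metrics use a compact `stepN loss` format. A minimal sketch for turning them into numbers and checking how much each eval interval still improved is shown here; the helper names `parse_eval_line` and `relative_gains` are illustrative and not part of the training code.

```python
import re

# Eval lines copied verbatim from the Stage A and Stage B metrics.
STAGE_A = ("eval: step39 0.1078, step79 0.0582, step119 0.0466, step159 0.0423, "
           "step199 0.0403, step239 0.0390, step279 0.0389, step319 0.0378")
STAGE_B = "eval: step19 0.3447, step39 0.1449, step59 0.1217, step79 0.1158"


def parse_eval_line(line):
    """Return [(step, val_loss), ...] from a 'stepN loss' formatted line."""
    return [(int(step), float(loss))
            for step, loss in re.findall(r"step(\d+)\s+(\d+\.\d+)", line)]


def relative_gains(points):
    """Fractional loss reduction between consecutive eval points."""
    return [(step, (prev - loss) / prev)
            for (_, prev), (step, loss) in zip(points, points[1:])]


for name, line in (("A", STAGE_A), ("B", STAGE_B)):
    points = parse_eval_line(line)
    best_step, best_loss = min(points, key=lambda p: p[1])
    last_step, last_gain = relative_gains(points)[-1]
    print(f"Stage {name}: best val loss {best_loss:.4f} at step {best_step}, "
          f"last interval (ending step {last_step}) improved by {last_gain:.1%}")
```

A shrinking last-interval gain only indicates that validation loss has plateaued. It says nothing about whether the greedy generations are usable, which is why the quality notes are tracked separately.
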
## Stage C: Real-Mix Repair

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_real_mix_v1/smol_smoltalk_real_mix.tsv
96,287 examples, about 25.3M SFT tokens
```

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/chat_alpha_v12_real_mix_repair.ckpt
```

Training setup:

```text
init=/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
steps=200 seq=512 batch=32 accum=8 effective batch=256
lr=1e-6 -> 2e-7
```

Metrics:

```text
train loss: 2.7391 -> 2.0384
eval: step49 2.1964, step99 2.0383, step149 1.9801, step199 1.9570
best/final val loss: 1.9570
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-real-mix-repair
```

Generation artifacts:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_xserv_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy_reppenalty1.txt
```

Quality notes:

- Loss improved cleanly and the model kept chat formatting.
- Fixed prompt math `17% of 240` improved in the standard suite.
- General diagnostic math still fails, e.g. `12 * 13`.
- Code generation remains unusable for simple Python function prompts.
- Some outputs contain corrupted or off-topic fragments.
- Reducing repeat penalty from 1.15 to 1.0 did not fix the failures.

## Verdict

The SFT pipeline works, and v12 can be turned into a chat-shaped model that follows the prompt format and stops correctly. However, none of the three SFT variants is a stable high-quality chat model yet.

The limiting issue is no longer infrastructure. It is data and objective quality: the current synthetic/anchor data is too narrow, while the current real-mix data adds breadth but also noisy or low-quality behavior. Validation loss alone is not a sufficient selection signal for chat quality.

## Recommended Next Step

Build a smaller, higher-precision SFT curriculum before another large run:

1. Keep the anchor data, but reduce over-refusal templates.
2. Add verified small instruction sets for math, code, translation, summarization, and closed-book common facts.
3. Add an automatic fixed-prompt eval harness that scores exact-match math, simple code syntax, refusal appropriateness, stop-token behavior, and corruption.
4. Train a short curriculum from the v12 base or v12 anchor checkpoint, then pick by generation eval rather than SFT loss alone.
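
The eval harness in step 3 could start as a handful of small scorers over saved greedy outputs. The following is a rough sketch under stated assumptions, not the project's harness: the `Case` structure, the `REFUSAL_MARKERS` list, the `User:` turn marker, and the repeated-character corruption heuristic are all hypothetical, and the example outputs are made-up strings rather than real model generations.

```python
import ast
import re
from dataclasses import dataclass

# Assumed markers: "cannot verify" comes from the Stage B notes, the others are guesses.
REFUSAL_MARKERS = ("cannot verify", "can't verify", "unable to verify")


@dataclass
class Case:
    prompt: str
    kind: str                 # "math", "code", or "chat"
    expected: str = ""        # exact numeric answer for math cases
    refusal_ok: bool = False  # True when an uncertainty answer is acceptable


def math_exact(output, expected):
    """Exact match on the last number that appears in the output."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    return bool(nums) and float(nums[-1]) == float(expected)


def code_parses(output):
    """Syntax-only check; assumes the output is raw Python with no markdown fence."""
    try:
        ast.parse(output)
    except (SyntaxError, ValueError):
        return False
    return bool(output.strip())


def is_refusal(output):
    return any(marker in output.lower() for marker in REFUSAL_MARKERS)


def looks_corrupted(output):
    """Heuristic: replacement character or one non-space character repeated 8+ times."""
    return "\ufffd" in output or re.search(r"(\S)\1{7,}", output) is not None


def stopped_cleanly(output, next_turn_marker="User:"):
    """Assumes a clean stop means the model did not start a new user turn."""
    return next_turn_marker not in output


def score_case(case, output):
    scores = {
        "stop": stopped_cleanly(output),
        "clean": not looks_corrupted(output),
        "refusal": case.refusal_ok or not is_refusal(output),
    }
    if case.kind == "math":
        scores["task"] = math_exact(output, case.expected)
    elif case.kind == "code":
        scores["task"] = code_parses(output)
    return scores


def summarize(all_scores):
    """Pass rate per metric across all scored cases."""
    totals = {}
    for scores in all_scores:
        for name, ok in scores.items():
            hits, count = totals.get(name, (0, 0))
            totals[name] = (hits + int(ok), count + 1)
    return {name: hits / count for name, (hits, count) in totals.items()}


# Prompts reuse the diagnostics named in the Stage C notes; outputs are illustrative only.
examples = [
    (Case("What is 12 * 13?", "math", expected="156"), "12 * 13 = 156"),
    (Case("What is 17% of 240?", "math", expected="40.8"), "I cannot verify that."),
    (Case("Write a Python function that adds two numbers.", "code"),
     "def add(a, b):\n    return a + b"),
]
summary = summarize(score_case(case, output) for case, output in examples)
print(summary)
```

For step 4, the same `summarize` output could be computed per checkpoint and used as the selection key, with validation loss kept only as a tie-breaker. A syntax-only code check is deliberately weak; running generated functions against small unit tests would be the natural next refinement.
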