docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check

- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens, 81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467, a real but marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2% return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor / real-mix-repair). The pipeline works and produces a chat-shaped model that follows the format and stops, but none of the variants is a stable high-quality chat model. The bottleneck is SFT data quality + selection signal (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
251
docs/runs/12-v12-1b-longctx-chat-alpha.md
Normal file
@@ -0,0 +1,251 @@
# Scaling Run v12: 1B-class long-context base → chat-alpha-v2 (Design Document)

## Goal

v11 proved that a larger `dim1536/20L` model can train at `seq1024` on dash5, and it improved the fixed-eval-data-v1, long-context (`seq1024`) score to **2.7467**. It also exposed the current bottleneck: greedy generation still repeats, and broad real-data SFT on top of v11 regressed chat quality despite lower SFT validation loss.

v12 therefore separates the next phase into two gates:

1. **Base gate**: train a stronger English base model around 1B total params with `seq1024`, using the existing FineWeb-edu 6.765B-token cache and fixed eval data v1.
2. **Chat gate**: only after the base gate is healthy, run assistant-only English SFT and judge it with fixed prompt generation, not SFT loss alone.

Success means the model is serviceable enough for a small chat-alpha: stable base loss, lower fixed-eval loss than v11, less repetitive fixed-prompt generation, and SFT that improves instruction behavior without destroying arithmetic/refusal/debug prompts.

## Baseline: What v11 Taught Us

| item | v11 |
|------|-----|
| arch | dim1536 / 20L / 48q-12kv GQA / ffn6144 |
| params | 684.26M core / 838.65M total |
| data | FineWeb-edu 6.765B tokens, 1 epoch |
| context | seq1024 |
| throughput | ~30.96K tok/s on 8 x RTX 5090 |
| fixed eval data v1, seq1024 | **2.7467** |
| issue | greedy repetition remains; direct real SFT regressed generation quality |

SFT result from v11:

| model | train result | generation result |
|-------|--------------|-------------------|
| `v11-chat-alpha-sft-v2-anchor` | synthetic assistant-only anchor | current best narrow chat-alpha |
| `v11-chat-alpha-real-sft-v1` | SFT val 1.4272 | bad hallucination, math failure |
| `v11-chat-alpha-real-mix-v1` | SFT val 2.0543 | better than direct real-SFT, still worse than anchor |

Conclusion: SFT data quality matters, but v11's base is still too weak for broad real SFT to become a general chat model.

## Architecture

v12 target: slightly above 1B total params while staying close to the proven v11 shape and keeping GQA group size 4.

| item | value |
|------|-------|
| dim | **1664** |
| layers | **22** |
| query heads x head_dim | **52 x 32** |
| kv heads | **13** |
| GQA group | 4 |
| ffn | **6656** |
| core params | **883.4M** |
| embed + lm_head | **167.3M** |
| total params | **1.0506B** |

Why this shape:

- It is a controlled step from v11 rather than a new architecture family.
- `52/13` preserves true GQA with group 4.
- Total params are near the requested 1B target.
- `dim1664` is less aggressive than `dim1792/22L` and has a better chance to fit `seq1024` on 32GB 5090s.

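
The parameter totals in the architecture table can be reproduced with a short calculation. The sketch below assumes details the table does not state: a gated three-matrix FFN, bias-free linear layers, two norm vectors per layer plus one final norm, untied embedding and `lm_head`, and a GPT-2 vocabulary of 50,257 tokens. `count_params` is an illustrative helper, not an xtrain function. Under these assumptions the counts agree with the v11 and v12 figures quoted in this document.

```python
def count_params(dim, layers, heads, kv_heads, head_dim, ffn, vocab=50257):
    """Return (core, embed_plus_head, total) parameter counts.

    Assumptions (not stated in the source): gated 3-matrix FFN, no biases,
    two norm vectors per layer plus one final norm, untied embed/lm_head.
    """
    q_out = heads * head_dim        # query / output projection width
    kv_out = kv_heads * head_dim    # shared key / value width under GQA
    attn = 2 * dim * q_out + 2 * dim * kv_out   # Wq, Wo and Wk, Wv
    mlp = 3 * dim * ffn                          # gate, up, down
    norms = 2 * dim                              # two norm vectors per layer
    core = layers * (attn + mlp + norms) + dim   # plus the final norm
    embed = 2 * vocab * dim                      # embedding + lm_head
    return core, embed, core + embed


v11 = count_params(dim=1536, layers=20, heads=48, kv_heads=12, head_dim=32, ffn=6144)
v12 = count_params(dim=1664, layers=22, heads=52, kv_heads=13, head_dim=32, ffn=6656)
for name, (core, embed, total) in (("v11", v11), ("v12", v12)):
    print(f"{name}: core {core / 1e6:.2f}M, "
          f"embed+head {embed / 1e6:.2f}M, total {total / 1e6:.2f}M")
```

If the real model ties embeddings, uses biases, or uses a two-matrix FFN, the totals will differ, so treat the agreement as a consistency check rather than a specification.
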
## Data

Base pretraining stays English-oriented and uses the current token cache. Pass the `.txt` stem to xtrain; `Corpus::load_cached` appends `.u16.bin` internally.

```text
/opt/wjh/projects/xtrain/data/fineweb-edu.txt
cache = /opt/wjh/projects/xtrain/data/fineweb-edu.txt.u16.bin
tokens = 6,765,333,808
```

Training uses the last 1M tokens as moving-tail validation. Every cross-version v12 claim must also run fixed eval data v1 with the long-context `seq1024` setting, so that it is directly comparable with the v11 score of **2.7467** recorded in `eval_v11_seq1024.log`. This is distinct from the older v10 table, which used the same fixed eval data with `seq256`.

```text
/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt
cache = /dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
```

No new FineWeb shards are added in this phase. The experiment is model/context scale, not another data-axis change.

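
The stem-versus-cache convention above is easy to get wrong when passing paths by hand. The following is a minimal sketch of the naming rule plus a size-based sanity check. It assumes the cache is a flat array of 16-bit token IDs, which the `.u16.bin` suffix suggests but the source does not state; `cache_path` and `cached_token_count` are hypothetical helper names.

```python
import os


def cache_path(stem: str) -> str:
    """Mirror the documented rule: the cache is the text stem plus `.u16.bin`."""
    return stem if stem.endswith(".u16.bin") else stem + ".u16.bin"


def cached_token_count(stem: str) -> int:
    """Token count implied by file size, assuming 2 bytes per token (an assumption)."""
    size = os.path.getsize(cache_path(stem))
    if size % 2:
        raise ValueError("cache size is not a multiple of 2 bytes")
    return size // 2


# Naming rule only; this line does not touch the filesystem.
print(cache_path("data/fineweb-edu.txt"))
```

If the layout assumption holds, `cached_token_count` on the FineWeb-edu stem should report the 6,765,333,808 tokens listed above.
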
## Training Plan

Primary v12 run:

| item | value |
|------|-------|
| world | 8 x RTX 5090 on dash5 |
| precision | bf16 mixed precision, fp32 master |
| memory stack | recompute + flash + grad accumulation |
| seq | **1024** |
| micro global batch | **16** (2 sequences/rank) |
| accum | **15** |
| effective global batch | **240** |
| tokens/step | **245,760** |
| full steps | **27,524** |
| max_lr → min_lr | **4e-4 → 4e-5** |
| eval | moving-tail 1M every 500 steps; fixed eval data v1 at seq1024 after checkpoints |
| smoke throughput | **~24.5K tok/s** |
| estimated full wall clock | **~76-78h** |

The reduced micro-batch is intentional: v11 `seq1024` with global batch 32 already sat near the 5090 memory limit. v12 has larger weights; an initial `batch24/accum10` smoke OOMed after step 0, while `batch16/accum15` passed a 10-step smoke at ~29.4GB/GPU and preserved the same 245,760 tokens/step.

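
The batch, step, and wall-clock rows of the table follow from a few multiplications. The sketch below reproduces them from the numbers in this section. Subtracting the 1M validation tokens before dividing is an assumption about how `full steps` was derived; it happens to give exactly 27,524, but the source does not spell out the rule.

```python
WORLD = 8                       # GPUs
CORPUS_TOKENS = 6_765_333_808   # FineWeb-edu token cache
VAL_TOKENS = 1_000_000          # moving-tail validation (assumed excluded from training)
SEQ, BATCH, ACCUM = 1024, 16, 15
SMOKE_TOK_PER_S = 24_500        # measured smoke throughput

seqs_per_rank = BATCH // WORLD
effective_batch = BATCH * ACCUM
tokens_per_step = effective_batch * SEQ
full_steps = (CORPUS_TOKENS - VAL_TOKENS) // tokens_per_step
hours = full_steps * tokens_per_step / SMOKE_TOK_PER_S / 3600

# The OOMing and the passing configuration feed the same number of sequences per step.
same_effective_batch = (24 * 10 == BATCH * ACCUM)

print(f"{seqs_per_rank} seq/rank, effective batch {effective_batch}, "
      f"{tokens_per_step:,} tokens/step")
print(f"{full_steps:,} steps, about {hours:.1f} h at {SMOKE_TOK_PER_S:,} tok/s")
```

The estimate covers training steps only; periodic evaluation and checkpointing add to the wall clock.
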
Command wrapper:

```sh
scripts/run_v12_phase.sh start-pilot
scripts/run_v12_phase.sh start-full
scripts/run_v12_phase.sh status
scripts/run_v12_phase.sh eval-fixed
scripts/run_v12_phase.sh export
scripts/run_v12_phase.sh sample
```

## Gates

### Gate 0: build and smoke

Run:

```sh
scripts/run_v12_phase.sh smoke
```

Pass criteria:

- no CUDA OOM
- no NaN loss
- train loss decreases from its initial value over the first 30 steps
- peak memory leaves enough margin for eval

### Gate 1: pilot

Run:

```sh
scripts/run_v12_phase.sh start-pilot
```

Default pilot is 300 steps with held-out eval every 100 steps.

Pass criteria:

- train loss decreases smoothly
- grad norm does not spike persistently
- moving-tail eval is finite and improving
- checkpoint can be reloaded by `eval-fixed`

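
The first three pilot criteria can be checked mechanically from logged values. The sketch below is one possible reading of them, not the project's actual gate: `pilot_gate`, `spike_factor`, and `max_run` are hypothetical, and "persistent spike" is interpreted as more than `max_run` consecutive steps above `spike_factor` times the median gradient norm. The loss values are the pilot results reported later in this document; the gradient norms are made up for illustration.

```python
import math
from statistics import median


def pilot_gate(train_losses, val_losses, grad_norms, spike_factor=10.0, max_run=3):
    """Return pass/fail flags for the pilot criteria (thresholds are illustrative)."""
    finite = all(math.isfinite(x) for x in list(train_losses) + list(val_losses))
    loss_down = train_losses[-1] < train_losses[0]
    val_improving = all(b < a for a, b in zip(val_losses, val_losses[1:]))
    # Longest run of consecutive steps whose grad norm exceeds the spike limit.
    limit = spike_factor * median(grad_norms)
    run = longest = 0
    for g in grad_norms:
        run = run + 1 if g > limit else 0
        longest = max(longest, run)
    return {
        "finite": finite,
        "train_loss_down": loss_down,
        "val_improving": val_improving,
        "no_persistent_spike": longest <= max_run,
    }


report = pilot_gate(
    train_losses=[11.2296, 5.4832],        # first and last pilot train loss
    val_losses=[6.5810, 5.9642, 5.5888],   # pilot evals at 100-step intervals
    grad_norms=[1.2, 0.9, 1.1, 14.0, 1.0], # hypothetical values
)
print(report)
```

The fourth criterion, checkpoint reload through `eval-fixed`, still needs an actual run and is not covered here.
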
### Gate 2: full base

Run only after the pilot passes:

```sh
scripts/run_v12_phase.sh start-full
```

Pass criteria:

- fixed eval data v1 at `seq1024` beats v11's **2.7467**
- generation samples improve or at least do not regress on repetition
- checkpoint exports and xserv loads the true GQA config

### Gate 3: chat-alpha SFT

After a healthy v12 base:

1. Use assistant-only SFT (`--sft-tsv`) with English-only data.
2. Start from narrow anchors first, then mix in Smol-SmolTalk.
3. Judge with fixed generation prompts before calling it useful.

The primary high-quality source remains `HuggingFaceTB/smol-smoltalk` filtered to English single-turn examples, with local anchors preserved to keep deterministic behavior.

## Evaluation

Base metrics:

- moving-tail val during training
- fixed eval data v1 at `seq1024`
- xtrain fixed prompt samples from `scripts/chat_alpha_fixed_prompts.txt`
- xserv exported-model smoke

Chat metrics:

- fixed prompt answers for SFT explanation, SFT data provenance, arithmetic, refusal, repetition-debug checklist, summary, and simple code generation
- compare against `v11-chat-alpha-sft-v2-anchor`
- reject models that lower SFT validation loss but hallucinate more in fixed prompts

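
The fixed prompt file stores one escaped prompt per line, with a literal `\n` separating the `User:` and `Assistant:` turns. A minimal sketch of loading such a file in Python follows. Skipping blank lines and `#` comment lines is an assumption about the format; the real `greedy_sample` binary may treat them differently, and `decode_prompt` and `load_prompts` are hypothetical names.

```python
def decode_prompt(line: str) -> str:
    """Turn the two-character sequence backslash-n into a real newline."""
    return line.replace("\\n", "\n")


def load_prompts(path: str) -> list:
    """Read one escaped prompt per line, skipping blanks and `#` comments (assumed)."""
    prompts = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.rstrip("\n")
            if not line.strip() or line.startswith("#"):
                continue
            prompts.append(decode_prompt(line))
    return prompts


# One of the fixed prompts, decoded without touching the filesystem.
print(repr(decode_prompt("User: What is 17% of 240?\\nAssistant:")))
```

Keeping prompts escaped on a single line makes the file easy to diff and guarantees that every checkpoint is sampled with byte-identical inputs.
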
## Artifacts

Expected paths:

```text
/dashscope-tmp/wjh/xtrain_v12/
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12_pilot.ckpt
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
```

## Results

### Gate 0/1: smoke + pilot

- `batch24/accum10` smoke OOMed after step 0.
- `batch16/accum15` smoke passed 10 steps: train loss **11.2347 -> 7.9459**, ~24.5K tok/s, ~29.4GB/GPU.
- 300-step pilot passed: train loss **11.2296 -> 5.4832**, val **6.5810 -> 5.9642 -> 5.5888**, exit code 0.
- Pilot checkpoint reload closely matched the final val: fixed eval data v1 at seq1024 = **5.5891**.
- Fixed chat prompts still repeat heavily, as expected for a 300-step base; use them as a regression baseline, not as a measure of chat quality.

### Gate 2: full base

Full run completed on dash5:

| item | result |
|------|--------|
| wall clock | **81h01m** |
| throughput | **~24.55K tok/s** |
| train loss | **11.2294 -> 2.6696** |
| moving-tail best val | **2.7411** |
| moving-tail final val | **2.7412** |
| fixed eval data v1, seq1024 reload | **2.7410** |
| exit code | **0** |

Validation milestones:

| step | 499 | 999 | 1499 | 1999 | 2499 | 21999 | 23999 | 25999 | 26999 | 27499 | final |
|------|-----|-----|------|------|------|-------|-------|-------|-------|-------|-------|
| val | 5.3029 | 4.4079 | 3.9287 | 3.6964 | 3.5555 | 2.7805 | 2.7637 | 2.7468 | 2.7443 | **2.7411** | 2.7412 |

Compared with v11's fixed eval data v1 at seq1024 (**2.7467**), v12 reaches **2.7410** after reload. This is a real but very small gain
(~0.006 absolute), despite the parameter increase from 838.65M to 1.0506B total and the slower 24.55K tok/s throughput. The result says the
larger 1B-class base is viable and marginally better, but this scale step did not produce a qualitative base-model jump.

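
The size of the v11 to v12 gain is easier to judge next to the parameter growth that paid for it. The sketch below works through that arithmetic from the figures above. It assumes the reported eval numbers are mean cross-entropy in nats per token, so that `exp(loss)` is a perplexity; the source does not state the unit.

```python
import math

V11_LOSS, V12_LOSS = 2.7467, 2.7410
V11_PARAMS, V12_PARAMS = 838.65e6, 1.0506e9

abs_gain = V11_LOSS - V12_LOSS
rel_gain = abs_gain / V11_LOSS
param_growth = V12_PARAMS / V11_PARAMS - 1

# Perplexity is only meaningful if the loss is in nats per token (assumption).
ppl_v11, ppl_v12 = math.exp(V11_LOSS), math.exp(V12_LOSS)

print(f"absolute gain {abs_gain:.4f}, relative gain {rel_gain:.2%}")
print(f"parameter growth {param_growth:.1%}")
print(f"perplexity {ppl_v11:.2f} -> {ppl_v12:.2f}")
```

Roughly a quarter more parameters bought about a fifth of a percent of loss on the same data, which is the quantitative basis for calling this step marginal.
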
Generation:

- Raw FineWeb-style prompts are better than the pilot checkpoint and can produce plausible explanatory prose.
- Greedy repetition remains visible, especially on story-like prompts.
- Chat prompts are not reliable without SFT: SFT data provenance is hallucinated, arithmetic still fails, and the model repeats template-like text.
- xserv loads the export correctly as true GQA: `layers=22, hidden=1664, heads=52/13 kv`.

Exported model:

```text
/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
```

Files:

- `config.json`
- `model.safetensors` (2.0GB)
- `tokenizer.json`
- `xtrain.ckpt` (4.0GB)

Conclusion: v12 passes the base gate and is a better SFT starting point than v11 by metric, but the gain is narrow. The next step should be
assistant-only chat SFT from v12 with conservative anchors first, then a small Smol-SmolTalk mix. Do not expect the base checkpoint itself to
serve as a usable chat model.
180
docs/runs/13-v12-chat-sft-quality.md
Normal file
@@ -0,0 +1,180 @@
# v12 Chat SFT Quality Check

Date: 2026-06-29

## Goal

Turn the completed v12 1.05B base checkpoint into a usable chat-alpha model with
SFT, then judge whether it is stable enough to call a high-quality chat model.

Base checkpoint:

```text
/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
```

Architecture:

```text
dim=1664 layers=22 heads=52 kv_heads=13 head_dim=32 ffn=6656
total params=1.0506B
```

## Stage A: Synthetic SFT

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_alpha_v2/chat_alpha_v2_sft.tsv
211,257 examples, about 14.96M SFT tokens
```

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_alpha_v2/chat_alpha_v12_v2.ckpt
```

Metrics:

```text
train loss: 3.5730 -> 0.0426
eval: step39 0.1078, step79 0.0582, step119 0.0466, step159 0.0423,
step199 0.0403, step239 0.0390, step279 0.0389, step319 0.0378
best/final val loss: 0.0378
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2
```

Quality notes:

- Learns the User/Assistant format and usually stops correctly.
- Too narrow and template-heavy.
- Fails basic math and code prompts in fixed greedy evaluation.

## Stage B: Anchor SFT

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_alpha_v2_anchor/chat_alpha_v2_anchor.tsv
32,020 examples, about 1.73M SFT tokens
```

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
```

Metrics:

```text
train loss: 1.7777 -> 0.1165
eval: step19 0.3447, step39 0.1449, step59 0.1217, step79 0.1158
best/final val loss: 0.1158
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2-anchor
```

Generation artifacts:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_xserv_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_diagnostic_greedy.txt
```

Quality notes:

- Better project-context answers and summaries than synthetic-only.
- Still unreliable on basic multiplication, yes/no facts, translation, and code.
- Overuses "cannot verify" style answers outside appropriate uncertainty cases.

## Stage C: Real-Mix Repair

Data:

```text
/dashscope-tmp/wjh/xtrain_sft_real_mix_v1/smol_smoltalk_real_mix.tsv
96,287 examples, about 25.3M SFT tokens
```

Run:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/chat_alpha_v12_real_mix_repair.ckpt
```

Training setup:

```text
init=/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
steps=200
seq=512
batch=32
accum=8
effective batch=256
lr=1e-6 -> 2e-7
```

Metrics:

```text
train loss: 2.7391 -> 2.0384
eval: step49 2.1964, step99 2.0383, step149 1.9801, step199 1.9570
best/final val loss: 1.9570
```

Export:

```text
/opt/wjh/projects/tiny-models/v12-chat-alpha-real-mix-repair
```

Generation artifacts:

```text
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_xserv_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy.txt
/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy_reppenalty1.txt
```

Quality notes:

- Loss improved cleanly and the model kept chat formatting.
- Fixed prompt math `17% of 240` improved in the standard suite.
- General diagnostic math still fails, e.g. `12 * 13`.
- Code generation remains unusable for simple Python function prompts.
- Some outputs contain corrupted or off-topic fragments.
- Reducing repeat penalty from 1.15 to 1.0 did not fix the failures.

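
The Stage C training setup implies a data budget that is worth checking against the dataset size. The sketch below computes two bounds from the listed settings. Whether SFT sequences are packed or hold one example each is not stated in the source, so both readings are shown and neither should be taken as what the run actually did.

```python
SEQ, BATCH, ACCUM, STEPS = 512, 32, 8, 200
DATA_EXAMPLES = 96_287
DATA_TOKENS = 25.3e6          # "about 25.3M SFT tokens"

effective_batch = BATCH * ACCUM                 # sequences per optimizer step
sequences_seen = effective_batch * STEPS
max_positions = sequences_seen * SEQ            # upper bound; padding lowers it

# Reading 1 (packed sequences): token positions relative to dataset tokens.
packed_passes = max_positions / DATA_TOKENS
# Reading 2 (one example per sequence): fraction of examples visited.
unpacked_coverage = sequences_seen / DATA_EXAMPLES

print(f"effective batch {effective_batch}, {sequences_seen:,} sequences, "
      f"at most {max_positions / 1e6:.1f}M token positions")
print(f"packed: about {packed_passes:.2f} passes; "
      f"unpacked: about {unpacked_coverage:.0%} of examples")
```

Under either reading the repair stage sees the real-mix data at most about once, at a very low learning rate, so it acts as a light adjustment of the anchor checkpoint rather than a full retrain.
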
## Verdict

The SFT pipeline works, and v12 can be turned into a chat-shaped model that follows
the prompt format and stops correctly. However, none of the three SFT variants is a
stable high-quality chat model yet.

The limiting issue is no longer infrastructure. It is data and objective quality:
the current synthetic/anchor data is too narrow, while the current real-mix data
adds breadth but also noisy or low-quality behavior. Validation loss alone is not a
sufficient selection signal for chat quality.

## Recommended Next Step

Build a smaller, higher-precision SFT curriculum before another large run:

1. Keep the anchor data, but reduce over-refusal templates.
2. Add verified small instruction sets for math, code, translation, summarization,
   and closed-book common facts.
3. Add an automatic fixed-prompt eval harness that scores exact-match math, simple
   code syntax, refusal appropriateness, stop-token behavior, and corruption.
4. Train a short curriculum from the v12 base or v12 anchor checkpoint, then pick
   by generation eval rather than SFT loss alone.
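
The harness in step 3 could start from a few small, deterministic scorers. The sketch below is one possible shape, not an existing xtrain tool: every function name is hypothetical, the sample answers are invented for illustration, and the heuristics (last number in the answer, word trigram repeats, turn markers as a stop check) are assumptions chosen for simplicity. Refusal appropriateness and corruption need labeled expectations and are left out.

```python
import ast
import re
from collections import Counter


def score_math(answer: str, expected: float, tol: float = 1e-6) -> bool:
    """Exact-match check on the last number that appears in the answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return bool(nums) and abs(float(nums[-1]) - expected) <= tol


def python_syntax_ok(code: str) -> bool:
    """True if the text parses as Python source; says nothing about correctness."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an earlier one (0.0 = none)."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return sum(c - 1 for c in Counter(grams).values()) / len(grams)


def stopped_cleanly(answer: str) -> bool:
    """Heuristic: the model did not begin another turn inside its answer."""
    return "User:" not in answer and "Assistant:" not in answer


checks = {
    "math_17pct_of_240": score_math("17% of 240 is 40.8.", 40.8),
    "math_12_times_13": score_math("12 * 13 = 146", 156),
    "code_syntax": python_syntax_ok("def larger(a, b):\n    return a if a > b else b\n"),
    "low_repetition": repetition_ratio("the model the model the model the model") < 0.3,
    "clean_stop": stopped_cleanly("It is 40.8."),
}
print(checks)
```

Scoring each checkpoint with the same fixed prompts and the same scorers gives a selection signal that does not depend on SFT validation loss.
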
10
scripts/chat_alpha_fixed_prompts.txt
Normal file
@@ -0,0 +1,10 @@
# One escaped prompt per line. `greedy_sample` decodes literal \n before tokenizing.
User: Explain supervised fine-tuning to a junior engineer.\nAssistant:
User: What high-quality SFT data are we using now?\nAssistant:
User: What training data did chat-alpha-v1 use?\nAssistant:
User: What is 17% of 240?\nAssistant:
User: I found that my small language model repeats the same phrase during generation. What should I inspect first?\nAssistant:
User: Summarize this passage in one sentence: A team trained a base model, then continued with chat examples at a low learning rate. Validation loss improved, but they still need real prompt tests before calling it useful.\nAssistant:
User: Who will win the world championship in 2099?\nAssistant:
User: Give a compact checklist before launching an SFT run.\nAssistant:
User: Write a Python function that returns the larger of two numbers.\nAssistant:
329
scripts/run_v12_phase.sh
Executable file
@@ -0,0 +1,329 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT="${XTRAIN_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"
cd "$ROOT"

export PATH="/usr/local/cuda/bin:/opt/wjh/.cargo/bin:$PATH"

strip_token_cache_suffix() {
  local path="$1"
  if [[ "$path" == *.u16.bin ]]; then
    printf '%s\n' "${path%.u16.bin}"
  else
    printf '%s\n' "$path"
  fi
}

RUN_DIR="${RUN_DIR:-/dashscope-tmp/wjh/xtrain_v12}"
TOKENIZER="${TOKENIZER:-/opt/wjh/models/gpt2/tokenizer.json}"
CORPUS="${CORPUS:-data/fineweb-edu.txt}"
FIXED_EVAL="${FIXED_EVAL:-/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt}"
EXPORT_DIR="${EXPORT_DIR:-/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx}"
CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}"
TMUX_SESSION="${TMUX_SESSION:-xtrain_v12}"

HEADS="${HEADS:-52}"
HEAD_DIM="${HEAD_DIM:-32}"
KV_HEADS="${KV_HEADS:-13}"
LAYERS="${LAYERS:-22}"
FFN="${FFN:-6656}"
SEQ="${SEQ:-1024}"
BATCH="${BATCH:-16}"
ACCUM="${ACCUM:-15}"
MAX_LR="${MAX_LR:-4e-4}"
MIN_LR="${MIN_LR:-4e-5}"
VAL_TOKENS="${VAL_TOKENS:-1000000}"
EVAL_BATCHES="${EVAL_BATCHES:-64}"
FIXED_EVAL_SEQ="${FIXED_EVAL_SEQ:-1024}"
FIXED_EVAL_BATCHES="${FIXED_EVAL_BATCHES:-64}"
PILOT_STEPS="${PILOT_STEPS:-300}"
FULL_STEPS="${FULL_STEPS:-27524}"
PILOT_EVAL_EVERY="${PILOT_EVAL_EVERY:-100}"
FULL_EVAL_EVERY="${FULL_EVAL_EVERY:-500}"

CORPUS="$(strip_token_cache_suffix "$CORPUS")"
FIXED_EVAL="$(strip_token_cache_suffix "$FIXED_EVAL")"

ARCH_ARGS=(
  --heads "$HEADS"
  --head-dim "$HEAD_DIM"
  --kv-heads "$KV_HEADS"
  --layers "$LAYERS"
  --ffn "$FFN"
)

usage() {
  cat <<'EOF'
usage: scripts/run_v12_phase.sh ACTION

Actions:
  build        Build xtrain train/export/sample binaries.
  smoke        Run a short no-checkpoint v12 seq1024 smoke test in foreground.
  pilot        Run a 300-step v12 pilot with held-out eval and checkpoint.
  full         Run the full one-epoch v12 base training job.
  eval-fixed   Evaluate a checkpoint on fixed eval v1.
  sample       Run xtrain greedy_sample on fixed chat-alpha prompts.
  export       Export a checkpoint to xserv/tiny-models format.
  status       Print one progress snapshot from RUN_DIR/full.log or pilot.log.
  monitor      Show a refreshing progress dashboard until interrupted.
  start-pilot  Start pilot + monitor in tmux sessions.
  start-full   Start full train + monitor in tmux sessions.

Environment overrides:
  RUN_DIR, TOKENIZER, CORPUS, FIXED_EVAL, EXPORT_DIR, CUDA_VISIBLE_DEVICES
  HEADS, HEAD_DIM, KV_HEADS, LAYERS, FFN, SEQ, BATCH, ACCUM
  MAX_LR, MIN_LR, PILOT_STEPS, FULL_STEPS, FIXED_EVAL_SEQ
EOF
}

build() {
  cargo build --release -p xtrain-distributed --bin train_ddp
  cargo build --release -p xtrain-train --bin train --bin export_safetensors --bin greedy_sample
}

write_meta() {
  local kind="$1"
  mkdir -p "$RUN_DIR"
  {
    echo "run=$kind"
    echo "created_utc=$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
    echo "arch=heads${HEADS}_hd${HEAD_DIM}_kv${KV_HEADS}_layers${LAYERS}_ffn${FFN}"
    echo "seq=$SEQ"
    echo "batch=$BATCH"
    echo "accum=$ACCUM"
    echo "effective_batch=$((BATCH * ACCUM))"
    echo "tokens_per_step=$((BATCH * ACCUM * SEQ))"
    echo "max_lr=$MAX_LR"
    echo "min_lr=$MIN_LR"
    echo "corpus=$CORPUS"
    echo "fixed_eval=$FIXED_EVAL"
    echo "fixed_eval_seq=$FIXED_EVAL_SEQ"
  } > "$RUN_DIR/META.txt"
}

write_env_file() {
  mkdir -p "$RUN_DIR"
  local env_file="$RUN_DIR/env.sh"
  : > "$env_file"
  local names=(
    XTRAIN_ROOT RUN_DIR TOKENIZER CORPUS FIXED_EVAL EXPORT_DIR CUDA_VISIBLE_DEVICES
    TMUX_SESSION HEADS HEAD_DIM KV_HEADS LAYERS FFN SEQ BATCH ACCUM MAX_LR MIN_LR
    VAL_TOKENS EVAL_BATCHES FIXED_EVAL_SEQ FIXED_EVAL_BATCHES PILOT_STEPS
    FULL_STEPS PILOT_EVAL_EVERY FULL_EVAL_EVERY
  )
  for name in "${names[@]}"; do
    if [[ "$name" == "XTRAIN_ROOT" ]]; then
      printf 'export XTRAIN_ROOT=%q\n' "$ROOT" >> "$env_file"
    else
      printf 'export %s=%q\n' "$name" "${!name}" >> "$env_file"
    fi
  done
}

run_train() {
  local kind="$1"
  local steps="$2"
  local eval_every="$3"
  local ckpt="$4"
  local log="$RUN_DIR/${kind}.log"
  write_meta "$kind"
  echo "$steps" > "$RUN_DIR/${kind}.steps"
  echo "$((BATCH * ACCUM * SEQ))" > "$RUN_DIR/${kind}.tokens_per_step"
  {
    echo "RUN_NAME=xtrain_v12_${kind}"
    echo "RUN_START_ISO=$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
    echo "RUN_START_EPOCH=$(date +%s)"
    echo "CKPT=$ckpt"
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    echo "TOTAL_STEPS=$steps"
    echo "TOKENS_PER_STEP=$((BATCH * ACCUM * SEQ))"
    set -x
    set +e
    if [[ -n "$ckpt" ]]; then
      CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" target/release/train_ddp \
        "$TOKENIZER" "$CORPUS" \
        "${ARCH_ARGS[@]}" \
        --steps "$steps" --batch "$BATCH" --accum-steps "$ACCUM" --seq "$SEQ" \
        --max-lr "$MAX_LR" --min-lr "$MIN_LR" \
        --val-tokens "$VAL_TOKENS" --eval-every "$eval_every" --eval-batches "$EVAL_BATCHES" \
        --bf16 --recompute --flash --dropout 0.0 \
        --ckpt "$ckpt"
      rc=$?
    else
      CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" target/release/train_ddp \
        "$TOKENIZER" "$CORPUS" \
        "${ARCH_ARGS[@]}" \
        --steps "$steps" --batch "$BATCH" --accum-steps "$ACCUM" --seq "$SEQ" \
        --max-lr "$MAX_LR" --min-lr "$MIN_LR" \
        --val-tokens 0 --eval-every 0 --eval-batches "$EVAL_BATCHES" \
        --bf16 --recompute --flash --dropout 0.0
      rc=$?
    fi
    set -e
    set +x
    echo "RUN_END_ISO=$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
    echo "RUN_EXIT_CODE=$rc"
    exit "$rc"
  } 2>&1 | tee "$log"
}

checkpoint_path() {
  local preferred="$RUN_DIR/xtrain_v12.ckpt"
  local pilot="$RUN_DIR/xtrain_v12_pilot.ckpt"
  if [[ -n "${CKPT:-}" ]]; then
    echo "$CKPT"
  elif [[ -f "$preferred" ]]; then
    echo "$preferred"
  else
    echo "$pilot"
  fi
}

eval_fixed() {
  local ckpt
  ckpt="$(checkpoint_path)"
  target/release/train \
    "$TOKENIZER" "$FIXED_EVAL" \
    "${ARCH_ARGS[@]}" \
    --seq "$FIXED_EVAL_SEQ" --batch 1 --steps 1 \
    --val-tokens "$VAL_TOKENS" --eval-batches "$FIXED_EVAL_BATCHES" \
    --bf16 --recompute --flash \
    --eval-ckpt "$ckpt" \
    2>&1 | tee "$RUN_DIR/eval_fixed.log"
}

sample_fixed() {
  local ckpt
  ckpt="$(checkpoint_path)"
  target/release/greedy_sample \
    "$ckpt" "$TOKENIZER" \
    "${ARCH_ARGS[@]}" \
    --max-tokens "${MAX_TOKENS:-120}" \
    --temperature "${TEMPERATURE:-0}" \
    --prompts-file "${PROMPTS_FILE:-scripts/chat_alpha_fixed_prompts.txt}" \
    2>&1 | tee "$RUN_DIR/sample_fixed.log"
}

export_model() {
  local ckpt
  ckpt="$(checkpoint_path)"
  rm -rf "$EXPORT_DIR"
  target/release/export_safetensors \
    "$ckpt" "$TOKENIZER" "$EXPORT_DIR" \
    "${ARCH_ARGS[@]}"
  cp "$ckpt" "$EXPORT_DIR/xtrain.ckpt"
  echo "$EXPORT_DIR" | tee "$RUN_DIR/export_path.txt"
}

progress_once() {
  local log="${1:-$RUN_DIR/full.log}"
  [[ -f "$log" ]] || log="$RUN_DIR/pilot.log"
  python3 - "$log" <<'PY'
import os, re, sys, time
log = sys.argv[1]
text = open(log, errors="ignore").read() if os.path.exists(log) else ""
steps = re.findall(r"\[rank0\] step\s+(\d+)/(\d+): loss\s+(\S+) lr\s+(\S+) gnorm\s+(\S+) \((\S+) tok/s global", text)
evals = re.findall(r"eval @ step\s+(\d+): val loss\s+(\S+)( \(best\))?", text)
start = re.search(r"RUN_START_EPOCH=(\d+)", text)
tokens_per_step = re.search(r"TOKENS_PER_STEP=(\d+)", text)
tokens_per_step = int(tokens_per_step.group(1)) if tokens_per_step else 245760
exit_code = re.search(r"RUN_EXIT_CODE=(\d+)", text)
warnings = re.findall(r"(?i)(nan|inf|oom|out of memory|panic|error)", text)
print("xtrain v12 |", time.strftime("%Y-%m-%d %H:%M:%S %Z"), "| log:", log)
if warnings:
    print("WARNING: suspicious log tokens:", ", ".join(sorted(set(w.lower() for w in warnings))[:8]))
if not steps:
    print("waiting for first rank0 step")
else:
    s, total, loss, lr, gnorm, tps = steps[-1]
    done = int(s) + 1
    total = int(total)
    pct = min(100.0, done * 100.0 / total)
    width = 44
    fill = int(width * pct / 100.0)
    bar = "#" * fill + "." * (width - fill)
    try:
        tpsf = float(tps)
    except ValueError:
        tpsf = 0.0
    elapsed = time.time() - int(start.group(1)) if start else None
    eta = (total - done) * tokens_per_step / tpsf if tpsf > 0 else None
    def fmt(sec):
        if sec is None:
            return "n/a"
        sec = int(max(0, sec))
        h, r = divmod(sec, 3600)
        m, s = divmod(r, 60)
        return f"{h:02d}:{m:02d}:{s:02d}"
    print(f"[{bar}] {pct:6.2f}%")
    print(f"step {done}/{total} | loss {loss} | lr {lr} | gnorm {gnorm}")
    print(f"speed {tpsf:,.0f} tok/s | elapsed {fmt(elapsed)} | ETA {fmt(eta)}")
if evals:
    s, v, best = evals[-1]
    best_vals = []
    for _, vv, mark in evals:
        if not mark:
            continue
        try:
            best_vals.append(float(vv))
        except ValueError:
            pass
    best_txt = f"best {min(best_vals):.4f}" if best_vals else "best n/a"
    try:
        val_txt = f"{float(v):.4f}"
    except ValueError:
        val_txt = v
    print(f"eval step {int(s)+1}: val {val_txt} {best.strip()} | {best_txt}")
else:
    print("eval: waiting")
if exit_code:
    print("FINISHED exit code", exit_code.group(1))
PY
  echo
  nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits \
    | awk -F, '{printf "gpu%s %sMiB %s%% ", $1, $2, $3} NR%4==0{print ""} END{print ""}'
  df -h /dashscope-tmp | awk 'NR==2{print "Disk: "$4" free ("$5" used)"}'
}

monitor() {
  while true; do
    clear
    progress_once
    sleep "${MONITOR_INTERVAL:-30}"
  done
}

start_tmux() {
  local kind="$1"
  local session="$TMUX_SESSION"
  if tmux has-session -t "=${session}" 2>/dev/null; then
    echo "tmux session already exists: $session"
    echo "attach: tmux attach -t $session"
    exit 1
  fi
  write_env_file
  tmux new-session -d -s "$session" "bash -lc 'source \"$RUN_DIR/env.sh\" && cd \"$ROOT\" && scripts/run_v12_phase.sh $kind'"
  if ! tmux has-session -t "=${session}_mon" 2>/dev/null; then
    tmux new-session -d -s "${session}_mon" "bash -lc 'source \"$RUN_DIR/env.sh\" && cd \"$ROOT\" && scripts/run_v12_phase.sh monitor'"
  fi
  echo "started $kind in tmux: $session"
  echo "monitor: tmux attach -t ${session}_mon"
}

action="${1:-}"
case "$action" in
  build) build ;;
  smoke) build; run_train smoke "${SMOKE_STEPS:-30}" 0 "" ;;
  pilot) build; run_train pilot "$PILOT_STEPS" "$PILOT_EVAL_EVERY" "$RUN_DIR/xtrain_v12_pilot.ckpt" ;;
  full) build; run_train full "$FULL_STEPS" "$FULL_EVAL_EVERY" "$RUN_DIR/xtrain_v12.ckpt" ;;
  eval-fixed) build; eval_fixed ;;
  sample) build; sample_fixed ;;
  export) build; export_model ;;
  status) progress_once ;;
  monitor) monitor ;;
  start-pilot) start_tmux pilot ;;
  start-full) start_tmux full ;;
  ""|-h|--help|help) usage ;;
  *) echo "unknown action: $action" >&2; usage >&2; exit 2 ;;
esac