From 7a1fba95b51e219c80fcbf3fbbf391467baffdee Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Mon, 29 Jun 2026 16:19:12 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20v12=20=E2=80=94=201.05B=20long-ctx=20ba?=
 =?UTF-8?q?se=20+=20chat-alpha=20SFT=20quality=20check?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens,
  81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but
  marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2%
  return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor /
  real-mix-repair). The pipeline works and produces a chat-shaped model that
  follows the format and stops, but none of the variants is a stable
  high-quality chat model — bottleneck is SFT data quality + selection signal
  (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8
---
 docs/runs/12-v12-1b-longctx-chat-alpha.md | 251 +++++++++++++++++
 docs/runs/13-v12-chat-sft-quality.md      | 180 ++++++++++++
 scripts/chat_alpha_fixed_prompts.txt      |  10 +
 scripts/run_v12_phase.sh                  | 329 ++++++++++++++++++++++
 4 files changed, 770 insertions(+)
 create mode 100644 docs/runs/12-v12-1b-longctx-chat-alpha.md
 create mode 100644 docs/runs/13-v12-chat-sft-quality.md
 create mode 100644 scripts/chat_alpha_fixed_prompts.txt
 create mode 100755 scripts/run_v12_phase.sh

diff --git a/docs/runs/12-v12-1b-longctx-chat-alpha.md b/docs/runs/12-v12-1b-longctx-chat-alpha.md
new file mode 100644
index 0000000..4d3da09
--- /dev/null
+++ b/docs/runs/12-v12-1b-longctx-chat-alpha.md
@@ -0,0 +1,251 @@
+# Scaling Run v12: 1B-class long-context base → chat-alpha-v2 — Design Document
+
+## Goal
+
+v11 proved that a larger `dim1536/20L` model can train at `seq1024` on dash5, and it improved the fixed-eval-data-v1, long-context (`seq1024`) score to **2.7467**. It also proved the current bottleneck: greedy generation still repeats, and broad real-data SFT on top of v11 regressed chat quality despite lower SFT validation loss.
+
+v12 therefore separates the next phase into two gates:
+
+1. **Base gate**: train a stronger English base model around 1B total params with `seq1024`, using the existing FineWeb-edu 6.765B-token cache and fixed eval data v1.
+2. **Chat gate**: only after the base gate is healthy, run assistant-only English SFT and judge it with fixed prompt generation, not SFT loss alone.
+
+Success means the model is serviceable enough for a small chat-alpha: stable base loss, lower fixed eval than v11, less repetitive fixed generation, and SFT that improves instruction behavior without destroying arithmetic/refusal/debug prompts.
+
+## Baseline: What v11 Taught Us
+
+| item | v11 |
+|------|-----|
+| arch | dim1536 / 20L / 48q-12kv GQA / ffn6144 |
+| params | 684.26M core / 838.65M total |
+| data | FineWeb-edu 6.765B token, 1 epoch |
+| context | seq1024 |
+| throughput | ~30.96K tok/s on 8 x RTX 5090 |
+| fixed eval data v1, seq1024 | **2.7467** |
+| issue | greedy repetition remains; direct real SFT regressed generation quality |
+
+SFT result from v11:
+
+| model | train result | generation result |
+|-------|--------------|-------------------|
+| `v11-chat-alpha-sft-v2-anchor` | synthetic assistant-only anchor | current best narrow chat-alpha |
+| `v11-chat-alpha-real-sft-v1` | SFT val 1.4272 | bad hallucination, math failure |
+| `v11-chat-alpha-real-mix-v1` | SFT val 2.0543 | better than direct real-SFT, still worse than anchor |
+
+Conclusion: SFT data quality matters, but v11's base is still too weak for broad real SFT to become a general chat model.
+
+## Architecture
+
+v12 target: slightly above 1B total params while staying close to the proven v11 shape and keeping GQA group size 4.
+
+| item | value |
+|------|-------|
+| dim | **1664** |
+| layers | **22** |
+| query heads x head_dim | **52 x 32** |
+| kv heads | **13** |
+| GQA group | 4 |
+| ffn | **6656** |
+| core params | **883.4M** |
+| embed + lm_head | **167.3M** |
+| total params | **1.0506B** |
+
+Why this shape:
+
+- It is a controlled step from v11 rather than a new architecture family.
+- `52/13` preserves true GQA with group 4.
+- Total params are near the requested 1B target.
+- `dim1664` is less aggressive than `dim1792/22L` and has a better chance to fit `seq1024` on 32GB 5090s.
+
+## Data
+
+Base pretraining stays English-oriented and uses the current token cache. Pass the `.txt` stem to xtrain; `Corpus::load_cached` appends `.u16.bin` internally.
+
+```text
+/opt/wjh/projects/xtrain/data/fineweb-edu.txt
+cache = /opt/wjh/projects/xtrain/data/fineweb-edu.txt.u16.bin
+tokens = 6,765,333,808
+```
+
+Training uses the last 1M tokens as moving-tail validation. Every cross-version v12 claim must also run fixed eval data v1 with the long-context `seq1024` setting, matching the v11 `eval_v11_seq1024.log` score of **2.7467**. This is distinct from the older v10 table that used the same fixed eval data with `seq256`.
+
+```text
+/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt
+cache = /dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt.u16.bin
+```
+
+No new FineWeb shards are added in this phase. The experiment is model/context scale, not another data-axis change.
+
+## Training Plan
+
+Primary v12 run:
+
+| item | value |
+|------|-------|
+| world | 8 x RTX 5090 on dash5 |
+| precision | bf16 mixed precision, fp32 master |
+| memory stack | recompute + flash + grad accumulation |
+| seq | **1024** |
+| micro global batch | **16** (2 sequences/rank) |
+| accum | **15** |
+| effective global batch | **240** |
+| tokens/step | **245,760** |
+| full steps | **27,524** |
+| max_lr → min_lr | **4e-4 → 4e-5** |
+| eval | moving-tail 1M every 500 steps; fixed eval data v1 at seq1024 after checkpoints |
+| smoke throughput | **~24.5K tok/s** |
+| estimated full wall clock | **~76-78h** |
+
+The reduced micro-batch is intentional: v11 `seq1024` with global batch 32 already sat near the 5090 memory limit. v12 has larger weights; an initial `batch24/accum10` smoke OOMed after step 0, while `batch16/accum15` passed a 10-step smoke at ~29.4GB/GPU and preserved the same 245,760 tokens/step.
+
+Command wrapper:
+
+```sh
+scripts/run_v12_phase.sh start-pilot
+scripts/run_v12_phase.sh start-full
+scripts/run_v12_phase.sh status
+scripts/run_v12_phase.sh eval-fixed
+scripts/run_v12_phase.sh export
+scripts/run_v12_phase.sh sample
+```
+
+## Gates
+
+### Gate 0: build and smoke
+
+Run:
+
+```sh
+scripts/run_v12_phase.sh smoke
+```
+
+Pass criteria:
+
+- no CUDA OOM
+- no NaN loss
+- first 30 steps decrease from initialization
+- peak memory leaves enough margin for eval
+
+### Gate 1: pilot
+
+Run:
+
+```sh
+scripts/run_v12_phase.sh start-pilot
+```
+
+Default pilot is 300 steps with held-out eval every 100 steps.
+
+Pass criteria:
+
+- train loss decreases smoothly
+- grad norm does not spike persistently
+- moving-tail eval is finite and improving
+- checkpoint can be reloaded by `eval-fixed`
+
+### Gate 2: full base
+
+Run only after the pilot passes:
+
+```sh
+scripts/run_v12_phase.sh start-full
+```
+
+Pass criteria:
+
+- fixed eval data v1 at `seq1024` beats v11's **2.7467**
+- generation samples improve or at least do not regress on repetition
+- checkpoint exports and xserv loads the true GQA config
+
+### Gate 3: chat-alpha SFT
+
+After a healthy v12 base:
+
+1. Use assistant-only SFT (`--sft-tsv`) with English-only data.
+2. Start from narrow anchors first, then mix in Smol-SmolTalk.
+3. Judge with fixed generation prompts before calling it useful.
+
+The primary high-quality source remains `HuggingFaceTB/smol-smoltalk` filtered to English single-turn examples, with local anchors preserved to keep deterministic behavior.
+
+## Evaluation
+
+Base metrics:
+
+- moving-tail val during training
+- fixed eval data v1 at `seq1024`
+- xtrain fixed prompt samples from `scripts/chat_alpha_fixed_prompts.txt`
+- xserv exported-model smoke
+
+Chat metrics:
+
+- fixed prompt answers for SFT explanation, SFT data provenance, arithmetic, refusal, repetition-debug checklist, summary, and simple code generation
+- compare against `v11-chat-alpha-sft-v2-anchor`
+- reject models that lower SFT validation loss but hallucinate more in fixed prompts
+
+## Artifacts
+
+Expected paths:
+
+```text
+/dashscope-tmp/wjh/xtrain_v12/
+/dashscope-tmp/wjh/xtrain_v12/xtrain_v12_pilot.ckpt
+/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
+/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
+```
+
+## Results
+
+### Gate 0/1: smoke + pilot
+
+- `batch24/accum10` smoke OOMed after step 0.
+- `batch16/accum15` smoke passed 10 steps: train loss **11.2347 -> 7.9459**, ~24.5K tok/s, ~29.4GB/GPU.
+- 300-step pilot passed: train loss **11.2296 -> 5.4832**, val **6.5810 -> 5.9642 -> 5.5888**, exit code 0.
+- Pilot checkpoint reload matched final val: fixed eval data v1 at seq1024 = **5.5891**.
+- Fixed chat prompts still repeat heavily, as expected for a 300-step base; use them as a regression baseline, not as chat quality.
+
+### Gate 2: full base
+
+Full run completed on dash5:
+
+| item | result |
+|------|--------|
+| wall clock | **81h01m** |
+| throughput | **~24.55K tok/s** |
+| train loss | **11.2294 -> 2.6696** |
+| moving-tail best val | **2.7411** |
+| moving-tail final val | **2.7412** |
+| fixed eval data v1, seq1024 reload | **2.7410** |
+| exit code | **0** |
+
+Validation milestones:
+
+| step | 499 | 999 | 1499 | 1999 | 2499 | 21999 | 23999 | 25999 | 26999 | 27499 | final |
+|------|-----|-----|------|------|------|-------|-------|-------|-------|-------|-------|
+| val | 5.3029 | 4.4079 | 3.9287 | 3.6964 | 3.5555 | 2.7805 | 2.7637 | 2.7468 | 2.7443 | **2.7411** | 2.7412 |
+
+Compared with v11's fixed eval data v1 at seq1024 (**2.7467**), v12 reaches **2.7410** after reload. This is a real but very small gain
+(~0.006 absolute), despite the parameter increase from 838.65M to 1.0506B total and the slower 24.55K tok/s throughput. The result says the
+larger 1B-class base is viable and marginally better, but this scale step did not produce a qualitative base-model jump.
+
+Generation:
+
+- Raw FineWeb-style prompts are better than the pilot checkpoint and can produce plausible explanatory prose.
+- Greedy repetition remains visible, especially on story-like prompts.
+- Chat prompts are not reliable without SFT: SFT data provenance is hallucinated, arithmetic still fails, and the model repeats template-like text.
+- xserv loads the export correctly as true GQA: `layers=22, hidden=1664, heads=52/13 kv`.
+
+Exported model:
+
+```text
+/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx
+```
+
+Files:
+
+- `config.json`
+- `model.safetensors` (2.0GB)
+- `tokenizer.json`
+- `xtrain.ckpt` (4.0GB)
+
+Conclusion: v12 passes the base gate and is a better SFT starting point than v11 by metric, but the gain is narrow. The next step should be
+assistant-only chat SFT from v12 with conservative anchors first, then a small Smol-SmolTalk mix. Do not expect the base checkpoint itself to
+serve as a usable chat model.
diff --git a/docs/runs/13-v12-chat-sft-quality.md b/docs/runs/13-v12-chat-sft-quality.md
new file mode 100644
index 0000000..9af781f
--- /dev/null
+++ b/docs/runs/13-v12-chat-sft-quality.md
@@ -0,0 +1,180 @@
+# v12 Chat SFT Quality Check
+
+Date: 2026-06-29
+
+## Goal
+
+Turn the completed v12 1.05B base checkpoint into a usable chat-alpha model with
+SFT, then judge whether it is stable enough to call a high-quality chat model.
+
+Base checkpoint:
+
+```text
+/dashscope-tmp/wjh/xtrain_v12/xtrain_v12.ckpt
+```
+
+Architecture:
+
+```text
+dim=1664 layers=22 heads=52 kv_heads=13 head_dim=32 ffn=6656
+total params=1.0506B
+```
+
+## Stage A: Synthetic SFT
+
+Data:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_alpha_v2/chat_alpha_v2_sft.tsv
+211,257 examples, about 14.96M SFT tokens
+```
+
+Run:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_alpha_v2/chat_alpha_v12_v2.ckpt
+```
+
+Metrics:
+
+```text
+train loss: 3.5730 -> 0.0426
+eval: step39 0.1078, step79 0.0582, step119 0.0466, step159 0.0423,
+      step199 0.0403, step239 0.0390, step279 0.0389, step319 0.0378
+best/final val loss: 0.0378
+```
+
+Export:
+
+```text
+/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2
+```
+
+Quality notes:
+
+- Learns the User/Assistant format and usually stops correctly.
+- Too narrow and template-heavy.
+- Fails basic math and code prompts in fixed greedy evaluation.
+
+## Stage B: Anchor SFT
+
+Data:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_alpha_v2_anchor/chat_alpha_v2_anchor.tsv
+32,020 examples, about 1.73M SFT tokens
+```
+
+Run:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
+```
+
+Metrics:
+
+```text
+train loss: 1.7777 -> 0.1165
+eval: step19 0.3447, step39 0.1449, step59 0.1217, step79 0.1158
+best/final val loss: 0.1158
+```
+
+Export:
+
+```text
+/opt/wjh/projects/tiny-models/v12-chat-alpha-sft-v2-anchor
+```
+
+Generation artifacts:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_xserv_greedy.txt
+/dashscope-tmp/wjh/xtrain_sft_v12_anchor/generation/anchor_diagnostic_greedy.txt
+```
+
+Quality notes:
+
+- Better project-context answers and summaries than synthetic-only.
+- Still unreliable on basic multiplication, yes/no facts, translation, and code.
+- Overuses "cannot verify" style answers outside appropriate uncertainty cases.
+
+## Stage C: Real-Mix Repair
+
+Data:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_real_mix_v1/smol_smoltalk_real_mix.tsv
+96,287 examples, about 25.3M SFT tokens
+```
+
+Run:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/chat_alpha_v12_real_mix_repair.ckpt
+```
+
+Training setup:
+
+```text
+init=/dashscope-tmp/wjh/xtrain_sft_v12_anchor/chat_alpha_v12_anchor.ckpt
+steps=200
+seq=512
+batch=32
+accum=8
+effective batch=256
+lr=1e-6 -> 2e-7
+```
+
+Metrics:
+
+```text
+train loss: 2.7391 -> 2.0384
+eval: step49 2.1964, step99 2.0383, step149 1.9801, step199 1.9570
+best/final val loss: 1.9570
+```
+
+Export:
+
+```text
+/opt/wjh/projects/tiny-models/v12-chat-alpha-real-mix-repair
+```
+
+Generation artifacts:
+
+```text
+/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_xserv_greedy.txt
+/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy.txt
+/dashscope-tmp/wjh/xtrain_sft_v12_real_mix_repair/generation/real_mix_repair_diagnostic_greedy_reppenalty1.txt
+```
+
+Quality notes:
+
+- Loss improved cleanly and the model kept chat formatting.
+- Fixed prompt math `17% of 240` improved in the standard suite.
+- General diagnostic math still fails, e.g. `12 * 13`.
+- Code generation remains unusable for simple Python function prompts.
+- Some outputs contain corrupted or off-topic fragments.
+- Reducing repeat penalty from 1.15 to 1.0 did not fix the failures.
+
+## Verdict
+
+The SFT pipeline works, and v12 can be turned into a chat-shaped model that follows
+the prompt format and stops correctly. However, none of the three SFT variants is a
+stable high-quality chat model yet.
+
+The limiting issue is no longer infrastructure. It is data and objective quality:
+the current synthetic/anchor data is too narrow, while the current real-mix data
+adds breadth but also noisy or low-quality behavior. Validation loss alone is not a
+sufficient selection signal for chat quality.
+
+## Recommended Next Step
+
+Build a smaller, higher-precision SFT curriculum before another large run:
+
+1. Keep the anchor data, but reduce over-refusal templates.
+2. Add verified small instruction sets for math, code, translation, summarization,
+   and closed-book common facts.
+3. Add an automatic fixed-prompt eval harness that scores exact-match math, simple
+   code syntax, refusal appropriateness, stop-token behavior, and corruption.
+4. Train a short curriculum from the v12 base or v12 anchor checkpoint, then pick
+   by generation eval rather than SFT loss alone.
diff --git a/scripts/chat_alpha_fixed_prompts.txt b/scripts/chat_alpha_fixed_prompts.txt
new file mode 100644
index 0000000..cddd787
--- /dev/null
+++ b/scripts/chat_alpha_fixed_prompts.txt
@@ -0,0 +1,10 @@
+# One escaped prompt per line. `greedy_sample` decodes literal \n before tokenizing.
+User: Explain supervised fine-tuning to a junior engineer.\nAssistant:
+User: What high-quality SFT data are we using now?\nAssistant:
+User: What training data did chat-alpha-v1 use?\nAssistant:
+User: What is 17% of 240?\nAssistant:
+User: I found that my small language model repeats the same phrase during generation. What should I inspect first?\nAssistant:
+User: Summarize this passage in one sentence: A team trained a base model, then continued with chat examples at a low learning rate. Validation loss improved, but they still need real prompt tests before calling it useful.\nAssistant:
+User: Who will win the world championship in 2099?\nAssistant:
+User: Give a compact checklist before launching an SFT run.\nAssistant:
+User: Write a Python function that returns the larger of two numbers.\nAssistant:
diff --git a/scripts/run_v12_phase.sh b/scripts/run_v12_phase.sh
new file mode 100755
index 0000000..f0499f2
--- /dev/null
+++ b/scripts/run_v12_phase.sh
@@ -0,0 +1,329 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+ROOT="${XTRAIN_ROOT:-$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)}"
+cd "$ROOT"
+
+export PATH="/usr/local/cuda/bin:/opt/wjh/.cargo/bin:$PATH"
+
+strip_token_cache_suffix() {
+  local path="$1"
+  if [[ "$path" == *.u16.bin ]]; then
+    printf '%s\n' "${path%.u16.bin}"
+  else
+    printf '%s\n' "$path"
+  fi
+}
+
+RUN_DIR="${RUN_DIR:-/dashscope-tmp/wjh/xtrain_v12}"
+TOKENIZER="${TOKENIZER:-/opt/wjh/models/gpt2/tokenizer.json}"
+CORPUS="${CORPUS:-data/fineweb-edu.txt}"
+FIXED_EVAL="${FIXED_EVAL:-/dashscope-tmp/wjh/xtrain_fixed_eval_v1/fineweb-fixed-eval-v1.txt}"
+EXPORT_DIR="${EXPORT_DIR:-/opt/wjh/projects/tiny-models/v12-fineweb-edu-1b-longctx}"
+CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6,7}"
+TMUX_SESSION="${TMUX_SESSION:-xtrain_v12}"
+
+HEADS="${HEADS:-52}"
+HEAD_DIM="${HEAD_DIM:-32}"
+KV_HEADS="${KV_HEADS:-13}"
+LAYERS="${LAYERS:-22}"
+FFN="${FFN:-6656}"
+SEQ="${SEQ:-1024}"
+BATCH="${BATCH:-16}"
+ACCUM="${ACCUM:-15}"
+MAX_LR="${MAX_LR:-4e-4}"
+MIN_LR="${MIN_LR:-4e-5}"
+VAL_TOKENS="${VAL_TOKENS:-1000000}"
+EVAL_BATCHES="${EVAL_BATCHES:-64}"
+FIXED_EVAL_SEQ="${FIXED_EVAL_SEQ:-1024}"
+FIXED_EVAL_BATCHES="${FIXED_EVAL_BATCHES:-64}"
+PILOT_STEPS="${PILOT_STEPS:-300}"
+FULL_STEPS="${FULL_STEPS:-27524}"
+PILOT_EVAL_EVERY="${PILOT_EVAL_EVERY:-100}"
+FULL_EVAL_EVERY="${FULL_EVAL_EVERY:-500}"
+
+CORPUS="$(strip_token_cache_suffix "$CORPUS")"
+FIXED_EVAL="$(strip_token_cache_suffix "$FIXED_EVAL")"
+
+ARCH_ARGS=(
+  --heads "$HEADS"
+  --head-dim "$HEAD_DIM"
+  --kv-heads "$KV_HEADS"
+  --layers "$LAYERS"
+  --ffn "$FFN"
+)
+
+usage() {
+  cat <<'EOF'
+usage: scripts/run_v12_phase.sh ACTION
+
+Actions:
+  build        Build xtrain train/export/sample binaries.
+  smoke        Run a short no-checkpoint v12 seq1024 smoke test in foreground.
+  pilot        Run a 300-step v12 pilot with held-out eval and checkpoint.
+  full         Run the full one-epoch v12 base training job.
+  eval-fixed   Evaluate a checkpoint on fixed eval v1.
+  sample       Run xtrain greedy_sample on fixed chat-alpha prompts.
+  export       Export a checkpoint to xserv/tiny-models format.
+  status       Print one progress snapshot from RUN_DIR/full.log or pilot.log.
+  monitor      Show a refreshing progress dashboard until interrupted.
+  start-pilot  Start pilot + monitor in tmux sessions.
+  start-full   Start full train + monitor in tmux sessions.
+
+Environment overrides:
+  RUN_DIR, TOKENIZER, CORPUS, FIXED_EVAL, EXPORT_DIR, CUDA_VISIBLE_DEVICES
+  HEADS, HEAD_DIM, KV_HEADS, LAYERS, FFN, SEQ, BATCH, ACCUM
+  MAX_LR, MIN_LR, PILOT_STEPS, FULL_STEPS, FIXED_EVAL_SEQ
+EOF
+}
+
+build() {
+  cargo build --release -p xtrain-distributed --bin train_ddp
+  cargo build --release -p xtrain-train --bin train --bin export_safetensors --bin greedy_sample
+}
+
+write_meta() {
+  local kind="$1"
+  mkdir -p "$RUN_DIR"
+  {
+    echo "run=$kind"
+    echo "created_utc=$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
+    echo "arch=heads${HEADS}_hd${HEAD_DIM}_kv${KV_HEADS}_layers${LAYERS}_ffn${FFN}"
+    echo "seq=$SEQ"
+    echo "batch=$BATCH"
+    echo "accum=$ACCUM"
+    echo "effective_batch=$((BATCH * ACCUM))"
+    echo "tokens_per_step=$((BATCH * ACCUM * SEQ))"
+    echo "max_lr=$MAX_LR"
+    echo "min_lr=$MIN_LR"
+    echo "corpus=$CORPUS"
+    echo "fixed_eval=$FIXED_EVAL"
+    echo "fixed_eval_seq=$FIXED_EVAL_SEQ"
+  } > "$RUN_DIR/META.txt"
+}
+
+write_env_file() {
+  mkdir -p "$RUN_DIR"
+  local env_file="$RUN_DIR/env.sh"
+  : > "$env_file"
+  local names=(
+    XTRAIN_ROOT RUN_DIR TOKENIZER CORPUS FIXED_EVAL EXPORT_DIR CUDA_VISIBLE_DEVICES
+    TMUX_SESSION HEADS HEAD_DIM KV_HEADS LAYERS FFN SEQ BATCH ACCUM MAX_LR MIN_LR
+    VAL_TOKENS EVAL_BATCHES FIXED_EVAL_SEQ FIXED_EVAL_BATCHES PILOT_STEPS
+    FULL_STEPS PILOT_EVAL_EVERY FULL_EVAL_EVERY
+  )
+  for name in "${names[@]}"; do
+    if [[ "$name" == "XTRAIN_ROOT" ]]; then
+      printf 'export XTRAIN_ROOT=%q\n' "$ROOT" >> "$env_file"
+    else
+      printf 'export %s=%q\n' "$name" "${!name}" >> "$env_file"
+    fi
+  done
+}
+
+run_train() {
+  local kind="$1"
+  local steps="$2"
+  local eval_every="$3"
+  local ckpt="$4"
+  local log="$RUN_DIR/${kind}.log"
+  write_meta "$kind"
+  echo "$steps" > "$RUN_DIR/${kind}.steps"
+  echo "$((BATCH * ACCUM * SEQ))" > "$RUN_DIR/${kind}.tokens_per_step"
+  {
+    echo "RUN_NAME=xtrain_v12_${kind}"
+    echo "RUN_START_ISO=$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
+    echo "RUN_START_EPOCH=$(date +%s)"
+    echo "CKPT=$ckpt"
+    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
+    echo "TOTAL_STEPS=$steps"
+    echo "TOKENS_PER_STEP=$((BATCH * ACCUM * SEQ))"
+    set -x
+    set +e
+    if [[ -n "$ckpt" ]]; then
+      CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" target/release/train_ddp \
+        "$TOKENIZER" "$CORPUS" \
+        "${ARCH_ARGS[@]}" \
+        --steps "$steps" --batch "$BATCH" --accum-steps "$ACCUM" --seq "$SEQ" \
+        --max-lr "$MAX_LR" --min-lr "$MIN_LR" \
+        --val-tokens "$VAL_TOKENS" --eval-every "$eval_every" --eval-batches "$EVAL_BATCHES" \
+        --bf16 --recompute --flash --dropout 0.0 \
+        --ckpt "$ckpt"
+      rc=$?
+    else
+      CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" target/release/train_ddp \
+        "$TOKENIZER" "$CORPUS" \
+        "${ARCH_ARGS[@]}" \
+        --steps "$steps" --batch "$BATCH" --accum-steps "$ACCUM" --seq "$SEQ" \
+        --max-lr "$MAX_LR" --min-lr "$MIN_LR" \
+        --val-tokens 0 --eval-every 0 --eval-batches "$EVAL_BATCHES" \
+        --bf16 --recompute --flash --dropout 0.0
+      rc=$?
+    fi
+    set -e
+    set +x
+    echo "RUN_END_ISO=$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
+    echo "RUN_EXIT_CODE=$rc"
+    exit "$rc"
+  } 2>&1 | tee "$log"
+}
+
+checkpoint_path() {
+  local preferred="$RUN_DIR/xtrain_v12.ckpt"
+  local pilot="$RUN_DIR/xtrain_v12_pilot.ckpt"
+  if [[ -n "${CKPT:-}" ]]; then
+    echo "$CKPT"
+  elif [[ -f "$preferred" ]]; then
+    echo "$preferred"
+  else
+    echo "$pilot"
+  fi
+}
+
+eval_fixed() {
+  local ckpt
+  ckpt="$(checkpoint_path)"
+  target/release/train \
+    "$TOKENIZER" "$FIXED_EVAL" \
+    "${ARCH_ARGS[@]}" \
+    --seq "$FIXED_EVAL_SEQ" --batch 1 --steps 1 \
+    --val-tokens "$VAL_TOKENS" --eval-batches "$FIXED_EVAL_BATCHES" \
+    --bf16 --recompute --flash \
+    --eval-ckpt "$ckpt" \
+    2>&1 | tee "$RUN_DIR/eval_fixed.log"
+}
+
+sample_fixed() {
+  local ckpt
+  ckpt="$(checkpoint_path)"
+  target/release/greedy_sample \
+    "$ckpt" "$TOKENIZER" \
+    "${ARCH_ARGS[@]}" \
+    --max-tokens "${MAX_TOKENS:-120}" \
+    --temperature "${TEMPERATURE:-0}" \
+    --prompts-file "${PROMPTS_FILE:-scripts/chat_alpha_fixed_prompts.txt}" \
+    2>&1 | tee "$RUN_DIR/sample_fixed.log"
+}
+
+export_model() {
+  local ckpt
+  ckpt="$(checkpoint_path)"
+  rm -rf "$EXPORT_DIR"
+  target/release/export_safetensors \
+    "$ckpt" "$TOKENIZER" "$EXPORT_DIR" \
+    "${ARCH_ARGS[@]}"
+  cp "$ckpt" "$EXPORT_DIR/xtrain.ckpt"
+  echo "$EXPORT_DIR" | tee "$RUN_DIR/export_path.txt"
+}
+
+progress_once() {
+  local log="${1:-$RUN_DIR/full.log}"
+  [[ -f "$log" ]] || log="$RUN_DIR/pilot.log"
+  python3 - "$log" <<'PY'
+import os, re, sys, time
+log = sys.argv[1]
+text = open(log, errors="ignore").read() if os.path.exists(log) else ""
+steps = re.findall(r"\[rank0\] step\s+(\d+)/(\d+): loss\s+(\S+) lr\s+(\S+) gnorm\s+(\S+) \((\S+) tok/s global", text)
+evals = re.findall(r"eval @ step\s+(\d+): val loss\s+(\S+)( \(best\))?", text)
+start = re.search(r"RUN_START_EPOCH=(\d+)", text)
+tokens_per_step = re.search(r"TOKENS_PER_STEP=(\d+)", text)
+tokens_per_step = int(tokens_per_step.group(1)) if tokens_per_step else 245760
+exit_code = re.search(r"RUN_EXIT_CODE=(\d+)", text)
+warnings = re.findall(r"(?i)(nan|inf|oom|out of memory|panic|error)", text)
+print("xtrain v12 |", time.strftime("%Y-%m-%d %H:%M:%S %Z"), "| log:", log)
+if warnings:
+    print("WARNING: suspicious log tokens:", ", ".join(sorted(set(w.lower() for w in warnings))[:8]))
+if not steps:
+    print("waiting for first rank0 step")
+else:
+    s, total, loss, lr, gnorm, tps = steps[-1]
+    done = int(s) + 1
+    total = int(total)
+    pct = min(100.0, done * 100.0 / total)
+    width = 44
+    fill = int(width * pct / 100.0)
+    bar = "#" * fill + "." * (width - fill)
+    try:
+        tpsf = float(tps)
+    except ValueError:
+        tpsf = 0.0
+    elapsed = time.time() - int(start.group(1)) if start else None
+    eta = (total - done) * tokens_per_step / tpsf if tpsf > 0 else None
+    def fmt(sec):
+        if sec is None:
+            return "n/a"
+        sec = int(max(0, sec))
+        h, r = divmod(sec, 3600)
+        m, s = divmod(r, 60)
+        return f"{h:02d}:{m:02d}:{s:02d}"
+    print(f"[{bar}] {pct:6.2f}%")
+    print(f"step {done}/{total} | loss {loss} | lr {lr} | gnorm {gnorm}")
+    print(f"speed {tpsf:,.0f} tok/s | elapsed {fmt(elapsed)} | ETA {fmt(eta)}")
+if evals:
+    s, v, best = evals[-1]
+    best_vals = []
+    for _, vv, mark in evals:
+        if not mark:
+            continue
+        try:
+            best_vals.append(float(vv))
+        except ValueError:
+            pass
+    best_txt = f"best {min(best_vals):.4f}" if best_vals else "best n/a"
+    try:
+        val_txt = f"{float(v):.4f}"
+    except ValueError:
+        val_txt = v
+    print(f"eval step {int(s)+1}: val {val_txt} {best.strip()} | {best_txt}")
+else:
+    print("eval: waiting")
+if exit_code:
+    print("FINISHED exit code", exit_code.group(1))
+PY
+  echo
+  nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits \
+    | awk -F, '{printf "gpu%s %sMiB %s%% ", $1, $2, $3} NR%4==0{print ""} END{print ""}'
+  df -h /dashscope-tmp | awk 'NR==2{print "Disk: "$4" free ("$5" used)"}'
+}
+
+monitor() {
+  while true; do
+    clear
+    progress_once
+    sleep "${MONITOR_INTERVAL:-30}"
+  done
+}
+
+start_tmux() {
+  local kind="$1"
+  local session="$TMUX_SESSION"
+  if tmux has-session -t "=${session}" 2>/dev/null; then
+    echo "tmux session already exists: $session"
+    echo "attach: tmux attach -t $session"
+    exit 1
+  fi
+  write_env_file
+  tmux new-session -d -s "$session" "bash -lc 'source \"$RUN_DIR/env.sh\" && cd \"$ROOT\" && scripts/run_v12_phase.sh $kind'"
+  if ! tmux has-session -t "=${session}_mon" 2>/dev/null; then
+    tmux new-session -d -s "${session}_mon" "bash -lc 'source \"$RUN_DIR/env.sh\" && cd \"$ROOT\" && scripts/run_v12_phase.sh monitor'"
+  fi
+  echo "started $kind in tmux: $session"
+  echo "monitor: tmux attach -t ${session}_mon"
+}
+
+action="${1:-}"
+case "$action" in
+  build) build ;;
+  smoke) build; run_train smoke "${SMOKE_STEPS:-30}" 0 "" ;;
+  pilot) build; run_train pilot "$PILOT_STEPS" "$PILOT_EVAL_EVERY" "$RUN_DIR/xtrain_v12_pilot.ckpt" ;;
+  full) build; run_train full "$FULL_STEPS" "$FULL_EVAL_EVERY" "$RUN_DIR/xtrain_v12.ckpt" ;;
+  eval-fixed) build; eval_fixed ;;
+  sample) build; sample_fixed ;;
+  export) build; export_model ;;
+  status) progress_once ;;
+  monitor) monitor ;;
+  start-pilot) start_tmux pilot ;;
+  start-full) start_tmux full ;;
+  ""|-h|--help|help) usage ;;
+  *) echo "unknown action: $action" >&2; usage >&2; exit 2 ;;
+esac
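
Review note (not part of the patch): the Training Plan numbers in `docs/runs/12-v12-1b-longctx-chat-alpha.md` and the script defaults can be cross-checked with a few lines of arithmetic. This is a minimal sketch under one assumption that the patch does not confirm: that `FULL_STEPS` is the floor of (corpus tokens minus `VAL_TOKENS`) divided by tokens per step. The function names are hypothetical and are not part of xtrain.

```python
# Hypothetical helpers for checking the v12 training-plan arithmetic.
# Inputs come from the patch; the step formula is an assumption.

def tokens_per_step(batch: int, accum: int, seq: int) -> int:
    """Global micro-batch x accumulation steps x sequence length."""
    return batch * accum * seq


def full_epoch_steps(total_tokens: int, val_tokens: int, per_step: int) -> int:
    """Whole optimizer steps that fit in one epoch after holding out the tail."""
    return (total_tokens - val_tokens) // per_step


def wall_clock_hours(steps: int, per_step: int, tok_per_s: float) -> float:
    """Pure training time at a steady throughput, ignoring eval overhead."""
    return steps * per_step / tok_per_s / 3600.0


per_step = tokens_per_step(batch=16, accum=15, seq=1024)
steps = full_epoch_steps(6_765_333_808, 1_000_000, per_step)
hours = wall_clock_hours(steps, per_step, tok_per_s=24_500)

print(f"tokens/step = {per_step}")
print(f"full steps  = {steps}")
print(f"est. hours  = {hours:.1f}")
```

Under that assumption the sketch reproduces 245,760 tokens/step and 27,524 steps, and the throughput-only estimate lands inside the documented ~76-78h window. The measured 81h01m presumably includes periodic eval and checkpoint time that this estimate leaves out.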