- run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens, 81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467: a real but marginal gain. v11->v12 is a capacity-only step on fixed data, so the ~0.2% return confirms the 1B base is now data-limited.
- run 13: three SFT stages from the v12 base (synthetic / anchor / real-mix-repair). The pipeline works and produces a chat-shaped model that follows the format and stops, but none of the variants is a stable high-quality chat model. The bottleneck is SFT data quality + selection signal (val loss decouples from generation quality), not infra.
- scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
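
The ~0.2% figure quoted for run 12 can be re-derived from the two eval losses. A minimal sketch, assuming the percentage is the relative drop in fixed-eval loss from v11 to v12 (`relative_gain` is a hypothetical helper, not part of the repo):

```python
def relative_gain(old_loss: float, new_loss: float) -> float:
    """Fractional reduction in loss going from old_loss to new_loss."""
    return (old_loss - new_loss) / old_loss


# Eval losses taken from the commit message: v11 = 2.7467, v12 = 2.7410.
gain = relative_gain(2.7467, 2.7410)
print(f"{gain:.2%}")  # 0.21%
```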
11 lines
896 B
Plaintext
# One escaped prompt per line. `greedy_sample` decodes literal \n before tokenizing.
User: Explain supervised fine-tuning to a junior engineer.\nAssistant:
User: What high-quality SFT data are we using now?\nAssistant:
User: What training data did chat-alpha-v1 use?\nAssistant:
User: What is 17% of 240?\nAssistant:
User: I found that my small language model repeats the same phrase during generation. What should I inspect first?\nAssistant:
User: Summarize this passage in one sentence: A team trained a base model, then continued with chat examples at a low learning rate. Validation loss improved, but they still need real prompt tests before calling it useful.\nAssistant:
User: Who will win the world championship in 2099?\nAssistant:
User: Give a compact checklist before launching an SFT run.\nAssistant:
User: Write a Python function that returns the larger of two numbers.\nAssistant:
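
The file's header comment describes the format: one escaped prompt per line, with the literal `\n` decoded before tokenizing. A minimal sketch of a reader for that format follows. `load_prompts` is a hypothetical helper. The real decoding happens inside `greedy_sample`, whose implementation is not shown here. Skipping `#` lines and blank lines, and decoding only the `\n` escape, are assumptions.

```python
def load_prompts(text: str) -> list[str]:
    """Parse a fixed-prompt file: one escaped prompt per line.

    Assumptions (not confirmed by the repo): lines starting with '#' are
    comments, blank lines are ignored, and only the two-character
    sequence backslash + 'n' is decoded into a real newline.
    """
    prompts = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        prompts.append(stripped.replace("\\n", "\n"))
    return prompts


# Small in-memory sample in the same shape as chat_alpha_fixed_prompts.txt.
sample = (
    "# One escaped prompt per line.\n"
    "User: What is 17% of 240?\\nAssistant:\n"
    "User: Give a compact checklist before launching an SFT run.\\nAssistant:\n"
)
prompts = load_prompts(sample)
print(len(prompts))      # 2
print(repr(prompts[0]))  # 'User: What is 17% of 240?\nAssistant:'
```

Each decoded prompt ends with `Assistant:` on its own line, so generation starts at the assistant turn.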