xtrain

Files

Gahow Wang fbf4ac2917 sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval

Enable assistant-only supervised fine-tuning and a fixed chat-prompt eval path
used by the v12 SFT runs:

- cross_entropy ignores negative targets (-100 ignore-index), normalizing by
  valid rows instead of all rows; CUDA fwd/bwd skip t<0 (ops.rs, nn.cu).
- Corpus gains optional labels + load_sft_tsv_cached: two-column TSV is
  formatted as 'User: .. \nAssistant:' + answer + <|endoftext|>, prompt tokens
  masked to -100 while answer+EOS are supervised; i32 label cache alongside the
  u16 token cache; sample() retries windows that are fully masked; eval uses
  target_window so masking applies to val loss too (data.rs, train_loop.rs).
- train + train_ddp: --sft-tsv selects the TSV loader, --init-ckpt continues
  training from a base checkpoint.
- greedy_sample: --prompts-file/--prompt/--temperature for fixed chat-prompt
  generation eval.

Test fixtures updated for the new Corpus.labels field; dropout.rs carries
incidental rustfmt. Not rebuilt locally (no CUDA toolchain on this checkout);
correctness rests on the documented v12 base+SFT runs on the GPU box.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-29 16:19:02 +08:00

batched.rs

model: batched forward [B,S]

2026-06-16 00:44:25 +08:00

bf16.rs

test: bf16 test reads f32-cast logits (forward now returns bf16)

2026-06-16 14:29:24 +08:00

dropout.rs

sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval

2026-06-29 16:19:02 +08:00

flash.rs

test: flash==composed bf16 uses robust mean/p99 metric (repo convention)