xtrain

Author	SHA1	Message	Date
Gahow Wang	6465a2d5ce	test: T21-for-proc, clear ENV_DROPOUT across tests to sever ordering coupling. libtest with --test-threads=1 (the documented invariant for this file's DDP tests) runs tests alphabetically, so the new proc_per_gpu_dropout_is_live_and_p0_matches_no_dropout ('d') runs BEFORE proc_per_gpu_matches_single_gpu_and_thread_path ('m'). It sets ENV_DROPOUT=0.2 via std::env::set_var; if left in place, the correctness test's spawned workers would inherit it (Command inherits parent env by default) and build with cfg.dropout=0.2 while its single-GPU baseline (run_single_gpu → test_config → dropout=0) stays at 0, so GATE (a) `max_rel_single < 1e-3` would be exceeded by orders of magnitude. Two defenses: (1) the correctness test calls remove_var(ENV_DROPOUT) before spawn (belt): even if the dropout test forgot to clean up, this test starts from a clean env; (2) the dropout test removes ENV_DROPOUT and ENV_DUMP_DIR via remove_var at exit (suspenders): this keeps the invariant "each test leaves the env as it found it" so any future test added after these two starts clean too. The same --test-threads=1 SAFETY comment applies (no concurrent env access).	2026-07-01 14:09:42 +08:00
Gahow Wang	33a1aee9ec	test: T21-for-proc, dropout-live regression under process-per-GPU. Analogue of the ddp_dropout_is_live_and_p0_bit_identical test (T21, thread-per-GPU) for the process-per-GPU launcher. Runs launch_processes twice on the same corpus / init / config with the ONLY difference being cfg.dropout (passed launcher→worker via a new XTRAIN_TEST_DROPOUT env, since worker re-execs cannot inherit argv changes), reads rank 0's loss trajectory from both runs, and asserts GATE B: max |loss diff| > 1e-3. The threshold sits ~4 orders of magnitude above this box's KI-5 cross-rank NCCL noise floor (~1e-7), so it is an unambiguous "dropout mask is applied" signal, not a noise measurement. Pre-fix (missing cfg.dropout = ... in the worker / launcher, exactly the gap the paired launcher commit closes) both traces are bit-identical and this test FAILs. Also wires ENV_DROPOUT into the shared worker entry so the existing correctness test's contract is unchanged (absent env → 0.0 → same synth run as before). p0/ and p02/ subdirs isolate the two invocations' dumps.	2026-07-01 13:51:31 +08:00
Gahow Wang	fbf4ac2917	sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval. Enable assistant-only supervised fine-tuning and a fixed chat-prompt eval path used by the v12 SFT runs: (1) cross_entropy ignores negative targets (-100 ignore-index), normalizing by valid rows instead of all rows; CUDA fwd/bwd skip t<0 (ops.rs, nn.cu). (2) Corpus gains optional labels + load_sft_tsv_cached: two-column TSV is formatted as 'User: .. \nAssistant:' + answer + <|endoftext|>, prompt tokens masked to -100 while answer+EOS are supervised; i32 label cache alongside the u16 token cache; sample() retries windows that are fully masked; eval uses target_window so masking applies to val loss too (data.rs, train_loop.rs). (3) train + train_ddp: --sft-tsv selects the TSV loader, --init-ckpt continues training from a base checkpoint. (4) greedy_sample: --prompts-file/--prompt/--temperature for fixed chat-prompt generation eval. Test fixtures updated for the new Corpus.labels field; dropout.rs carries incidental rustfmt. Not rebuilt locally (no CUDA toolchain on this checkout); correctness rests on the documented v12 base+SFT runs on the GPU box. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:19:02 +08:00
Gahow Wang	4abb17383a	test: process-per-GPU DDP correctness (ddp_proc.rs). Self-launching test: worker mode (XTRAIN_RANK set) trains on synthetic corpus and dumps loss+params; launcher mode runs single-GPU baseline + thread-per-GPU launch + spawns 2 worker processes, then asserts (a) proc loss == single-GPU <1e-3, (b) cross-rank params <1e-6 (KI-5 ULP), (c) proc loss == thread-per-GPU <1e-3. Run with --test-threads=1 (distributed harness property). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 17:48:52 +08:00
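
The env-inheritance hazard described in the two T21-for-proc commits can be illustrated with a minimal sketch. It is not the code in ddp_proc.rs: the helper names `worker_command` and `dropout_override` are hypothetical, and it swaps in a different technique from the commits, which call remove_var on the parent environment. Here each spawned `Command` instead states its dropout setting explicitly, so a value left behind by an earlier test cannot leak into the worker. The variable names XTRAIN_RANK and XTRAIN_TEST_DROPOUT are taken from the commit messages.

```rust
use std::ffi::OsStr;
use std::process::Command;

// Env name from the commit message; everything else here is illustrative.
const ENV_DROPOUT: &str = "XTRAIN_TEST_DROPOUT";

/// Hypothetical helper: build a worker command whose dropout setting is
/// explicit rather than inherited from the parent process environment.
fn worker_command(exe: &str, rank: usize, dropout: Option<f32>) -> Command {
    let mut cmd = Command::new(exe);
    cmd.env("XTRAIN_RANK", rank.to_string());
    match dropout {
        // Pass the value launcher -> worker through the environment.
        Some(p) => {
            cmd.env(ENV_DROPOUT, p.to_string());
        }
        // Explicitly scrub the variable so a stale parent value is not inherited.
        None => {
            cmd.env_remove(ENV_DROPOUT);
        }
    }
    cmd
}

/// Hypothetical helper: inspect the explicit override recorded on the command.
/// Outer None = no override (child inherits), Some(None) = explicitly removed,
/// Some(Some(v)) = explicitly set to v.
fn dropout_override(cmd: &Command) -> Option<Option<String>> {
    cmd.get_envs()
        .find(|(k, _)| *k == OsStr::new(ENV_DROPOUT))
        .map(|(_, v)| v.map(|s| s.to_string_lossy().into_owned()))
}
```

Scrubbing per command avoids mutating process-wide state, which is the reason the commits need a --test-threads=1 SAFETY note for set_var and remove_var; whether that trade-off suits the real harness is a judgment for its author.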

4 Commits
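
The ignore-index cross-entropy described in commit fbf4ac2917 (negative targets skipped, loss normalized by valid rows) can be sketched as a CPU reference. This is an assumption-laden illustration, not the ops.rs or nn.cu implementation: the function names, the row-major layout, the `Option` return for a fully masked batch, and the unshifted label layout in `mask_prompt` are all hypothetical choices made for the example.

```rust
/// Sketch: mean cross-entropy over rows whose target is non-negative.
/// `logits` is assumed row-major [rows x vocab]; a negative target
/// (e.g. -100) marks a row that contributes nothing to the loss.
/// Returns None when every row is masked (an assumption; the commit only
/// says sample() retries fully masked windows).
fn cross_entropy_ignore_index(logits: &[f32], targets: &[i32], vocab: usize) -> Option<f32> {
    assert!(vocab > 0);
    assert_eq!(logits.len(), targets.len() * vocab);
    let mut total = 0.0f64;
    let mut valid = 0usize;
    for (row, &t) in logits.chunks_exact(vocab).zip(targets) {
        if t < 0 {
            continue; // ignored row: skipped in both the sum and the count
        }
        let t = t as usize;
        assert!(t < vocab);
        // Numerically stable log-softmax: subtract the row maximum first.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let sum_exp: f64 = row.iter().map(|&x| ((x - max) as f64).exp()).sum();
        let log_prob = (row[t] - max) as f64 - sum_exp.ln();
        total -= log_prob;
        valid += 1;
    }
    if valid == 0 {
        None
    } else {
        // Normalize by valid rows, not by all rows.
        Some((total / valid as f64) as f32)
    }
}

/// Sketch: labels for one formatted example, with the first `prompt_len`
/// positions masked to -100 and the remaining tokens (answer + EOS) supervised.
/// No next-token shift is applied here; that detail is not given in the commit.
fn mask_prompt(tokens: &[u16], prompt_len: usize) -> Vec<i32> {
    tokens
        .iter()
        .enumerate()
        .map(|(i, &tok)| if i < prompt_len { -100 } else { tok as i32 })
        .collect()
}
```

Dividing by the valid-row count keeps the loss scale comparable between examples with long and short prompts, which is the normalization change the commit message calls out.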