xtrain

Author	SHA1	Message	Date
Gahow Wang	1574e21d89	post-train: M1 — verifiable-arith eval scorer + SFT format-baseline result eval_arith: load ckpt, greedy-generate per held-out prompt, parse \boxed{} via the shared task checker, report format(boxed) + correctness pass-rates. Reused as the verifiable-eval harness for M3 (DPO) / M4 (GRPO). M1 result (100 held-out prompts, v12 1.05B base): SFT moves answer-format adherence 0% -> 100%, arithmetic correctness 8% -- the intended split (SFT buys the format; correctness is the verifiable-reward job of M3/M4). Logged in docs/18 implementation log + a Phase-3 row in docs/evolution.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 11:13:19 +08:00
Gahow Wang	cb64604496	post-train: M1 fix — enlarge arith key space + saturation guard The default operand ranges (max_add=99, max_mul=12) gave only ~20k unique problems, so 'gen_arith_task --n 20000 --eval 500' (a) made train dedup pathologically slow near saturation and (b) made the disjoint-eval loop never terminate. A background run stalled after ~10k train rows with no eval files. Fix (root cause, not a workaround): - enlarge default ranges to max_add=999, max_mul=99 (~2.01M key space) so 20k+ requests are a tiny fraction and dedup stays trivial; - add unique_space() + a generator guard that errors clearly when n+eval exceeds 80% of the key space, instead of looping forever. Verified: cargo test 10/10; full 20000/500 gen now 0.2s, all 3 files, 0 train/eval leakage; guard panics on an oversized (--max-add 99) request. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 23:28:25 +08:00
Gahow Wang	9c70e99ae4	post-train: M1 — verifiable arithmetic task + SFT data generator First post-training milestone (docs/18). Lands the verifiable task + its data pipeline, all verified host-side (no CUDA); the SFT run itself reuses the existing --sft-tsv path on the GPU box. - task.rs: the shared task spec — two-operand integer arithmetic, answer in \boxed{N}, with parse_boxed_answer + check_answer (exact-match rule-based reward). One module reused by M1 (SFT data), M3 (DPO pairs), M4 (GRPO reward). - gen_arith_task bin: writes arith_sft.tsv (--sft-tsv format) + held-out arith_eval_prompts.txt (greedy_sample format) + arith_eval_gold.txt; train deduped, eval disjoint from train. - data.rs: extract assistant-only masking into a pure, testable sft_row() (behavior-preserving; single-turn bit-identical to `fbf4ac2`). Gate (verified locally, no_cuda): cargo test -p xtrain-train --lib = 9/9 pass (masking, SFT-target self-consistency over 2000 samples, parser edges, seed determinism); a 200/50 gen run = clean 2-col TSV, correct gold incl. negatives, 0 train/eval leakage. SFT training run + format-eval pending on dash5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 22:52:25 +08:00
Gahow Wang	ab32168dcc	docs: post-training stack design — SFT → KV-cache → DPO → GRPO (docs/18) Design doc for a from-scratch post-training infra on top of xtrain. Ladder: SFT (have it) → DPO → reward model (optional) → GRPO, each rung one new post-training systems concept + a hard correctness gate (grad-check, PyTorch parity, degenerate checks, a falsifiable 'it learns' signal). Decisions aligned with the user (D1-D4): - D1 scope: DPO → GRPO, reward model optional. - D2 reward: rule-based / verifiable first; learned RM deferred. - D3 rollout: build the KV-cache incremental-decode engine UP FRONT (not naive-first) as the foundational milestone before DPO/GRPO. - D4 task: a verifiable task (arithmetic/format) with deterministic exact-match reward, for a clean RL signal. Locked milestone order: M1 SFT task baseline → M2 KV-cache decode engine (token-identical gate) → M3 DPO → M4 GRPO → M5 optional reward model. Status: design only, no implementation yet. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 22:44:25 +08:00
Gahow Wang	7a1fba95b5	docs: v12 — 1.05B long-ctx base + chat-alpha SFT quality check - run 12: dim1664/22L true-GQA 1.05B base, seq1024, 6.765B FineWeb tokens, 81h on 8x5090. Fixed eval v1 @seq1024 = 2.7410 vs v11 2.7467 — a real but marginal gain; v11->v12 is a capacity-only step on fixed data, so the ~0.2% return confirms the 1B base is now data-limited. - run 13: three SFT stages from the v12 base (synthetic / anchor / real-mix-repair). The pipeline works and produces a chat-shaped model that follows the format and stops, but none of the variants is a stable high-quality chat model — bottleneck is SFT data quality + selection signal (val loss decouples from generation quality), not infra. - scripts/run_v12_phase.sh wrapper + chat_alpha_fixed_prompts.txt eval set. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:19:12 +08:00
Gahow Wang	fbf4ac2917	sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval Enable assistant-only supervised fine-tuning and a fixed chat-prompt eval path used by the v12 SFT runs: - cross_entropy ignores negative targets (-100 ignore-index), normalizing by valid rows instead of all rows; CUDA fwd/bwd skip t<0 (ops.rs, nn.cu). - Corpus gains optional labels + load_sft_tsv_cached: two-column TSV is formatted as 'User: .. \nAssistant:' + answer + <|endoftext|>, prompt tokens masked to -100 while answer+EOS are supervised; i32 label cache alongside the u16 token cache; sample() retries windows that are fully masked; eval uses target_window so masking applies to val loss too (data.rs, train_loop.rs). - train + train_ddp: --sft-tsv selects the TSV loader, --init-ckpt continues training from a base checkpoint. - greedy_sample: --prompts-file/--prompt/--temperature for fixed chat-prompt generation eval. Test fixtures updated for the new Corpus.labels field; dropout.rs carries incidental rustfmt. Not rebuilt locally (no CUDA toolchain on this checkout); correctness rests on the documented v12 base+SFT runs on the GPU box. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:19:02 +08:00
Gahow Wang	5c27493a90	docs: backfill v9/v10 scaling runs + reframe README to v0–v10 / three phases Add per-run design+result docs for the two Chinchilla-axis runs that were done but never committed: - v9 (dim1280 true-GQA, core 357M, 6.01B FineWeb tokens): double-axis scale, best moving-tail val 2.8854 (~3.2% below v8) — direction validated, gain still incremental, greedy repetition remains. - v10 (same arch, data-only top-up to 6.765B): moving-tail 2.8816; fixed eval v1 v6→v10 = 3.2328/3.1850/3.1515/2.9278/2.8814. Extend the comparison tables in docs/runs/README.md and docs/evolution.md to v10, and reframe README to v0–v10 with Phase 3 = the v9 double-axis run. No code changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:18:48 +08:00
Gahow Wang	a1370446fe	docs: T21 — record DDP-dropout wiring gap + fix (known-issues / evolution / dropout doc) - known-issues.md: new "DDP-dropout wiring" Fixed entry (gap + fix + regression test), with the meta-lesson that op/single-GPU unit tests can miss launcher-level integration gaps — only the V9-PILOT end-to-end run on the real launcher path exposed it. - 17-dropout.md: annotate the DDP-combination note with the T18 wiring gap and its T21 fix. - evolution.md: T21 row (Infra) recording the fix + meta-lesson. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 21:22:49 +08:00
Gahow Wang	980605474b	test: T21 — DDP-dropout regression (live under DDP + p=0 bit-identical) Adds ddp_dropout_is_live_and_p0_bit_identical, run via the real launcher path (DdpContext::init + train_rank). It would have caught the original bug: - GATE A (world=1, ONE step — the deterministic scope): the p=0 FORWARD is byte-identical to no-dropout (ops::dropout(p=0) is a graph no-op) so the step loss is BIT-IDENTICAL (== 0.0). At world=1 the NCCL all-reduce short-circuits and one step has no optimizer-state compounding; the only residual non-determinism is the engine's atomicAdd backward-reduction order (the documented fresh-train md5 caveat — dropout-independent), so the post-step params are checked against that tight ULP floor (< 1e-7). - GATE A2 (world=2): p=0 matches a separate no-dropout baseline within NCCL's run-to-run ULP noise (< 1e-6, KI-5 — the all-reduce is not bit-reproducible on this PCIe box). Enabling dropout=0 doesn't perturb the DDP path beyond it. - GATE B (world=2): a p=0.2 run's loss trace DIFFERS by > 1e-3 from p=0 — orders of magnitude above every noise floor here (~3e-2 observed). On the pre-T21 code the model stays in eval mode, so p=0.2 would be an identity and the trace would match p=0 at the noise floor — this gate fails. (Verified by simulating the bug: with model.train() removed, GATE B drops to 2.4e-7.) - GATE C: a dedicated no-eval run ends with model.is_training() == true, direct proof that train_rank called model.train(). - p>0 run is finite (no NaN/Inf). eval_every < steps so a periodic eval fires mid-run (flipping to eval mode), exercising the per-step model.train() restore discipline the pilot called out. Run with --test-threads=1 like the other DDP tests (shared-GPU deadlock). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 21:22:49 +08:00
Gahow Wang	81f3cf59e5	distributed: T21 — wire dropout into the DDP path (--dropout + model.train()) V9-PILOT caught a launcher-level integration gap: T18 wired dropout into the single-GPU bin/train, but the DDP path never did. train_ddp had no --dropout flag and never set cfg.dropout, and ddp.rs::train_rank never called model.train() — so under DDP every forward ran in the default eval mode and dropout was a silent identity, regardless of config. Fix, mirroring the single-GPU train/eval discipline: - train_ddp.rs: add a --dropout <p> flag (default 0 = off, matching the prior behavior) and set cfg.dropout from it; log it when on. - ddp.rs::train_rank: call model.train() at the start of each step (before the micro-batch loop). eval_loss() flips the model to eval mode and does not restore it, so re-asserting train() each step keeps dropout live across eval boundaries. --dropout 0 (default) is bit-identical to the prior DDP path: cfg.dropout stays 0 and ops::dropout(p=0) is a clone no-op regardless of training mode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 21:08:17 +08:00
Gahow Wang	db70abe450	docs: T20 — Phase-2 systems-depth capstone (reframe README to two phases) Re-conclude xtrain as TWO phases now that Phase-2 (T14–T18) is merged on main: README.md - Status header: "complete (T1–T13) + scaling v0–v8" → "complete — two phases" (Phase 1 = from-scratch stack T1–T13 + v0–v8 scaling study; Phase 2 = the five deferred systems-stack features T14–T18). - Crate table: note the Phase-2 additions (fused flash-attn + repeat_kv + dropout in autodiff; GQA + dropout in model; grad-accum in train; process-per-GPU launcher in distributed). - Build-journey section retitled Phase 1 + Phase 2; replaced the run-on T14–T18 prose with a structured "## Phase 2" summary (5 features + honest results: flash = mem-not-walltime win, GQA group-sum backward, grad-accum −74% mem, dropout × recompute bit-exact, T17 throughput-neutral falsification). - Engineering lessons: T17 added as the THIRD profile-first falsification; reinforced honest-correctness with the Phase-2 hard gates + md5 b04fc9f9. - Doc index: doc range …14-* → …17-*; KI status line (process-per-GPU CLOSED, KI-4 accepted tradeoff). docs/evolution.md - New "3.5. Phase 2 systems-depth synthesis": ties the 5 features into the per-axis (algorithm/architecture/infra/data) narrative + the two integration notes. docs/known-issues.md - KI-4 reframed as a deliberately-accepted modeling tradeoff (preserves the xserv closed loop; T19 DROPPED), not "open". - New integration notes: (a) DDP tests need --test-threads=1 (parallel deadlock); (b) fresh-train md5 is non-deterministic (atomicAdd reduction order) → the valid determinism gate is export re-determinism, not fresh-train reproduction. - (process-per-GPU item was already CLOSED=measured no-op in T17.) Docs-only; no code touched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 18:11:47 +08:00
Gahow Wang	71b0a1621f	docs: T17 process-per-GPU results — measured throughput-neutral Records the key empirical finding: process-per-GPU is statistically identical to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all 8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the old KI-5/T11 note speculated — measurement falsifies that hypothesis (same methodology as T11 falsifying "bucket the all-reduce"). Correctness all green: proc==thread loss 1.5e-7, cross-rank 1.2e-7, full regression + xserv md5 b04fc9f9 identical. Closes the process-per-GPU backlog item (measured no-op); default training path unchanged. evolution.md Infra row + README T17 row + known-issues entry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 18:03:14 +08:00
Gahow Wang	4abb17383a	test: process-per-GPU DDP correctness (ddp_proc.rs) Self-launching test: worker mode (XTRAIN_RANK set) trains on synthetic corpus and dumps loss+params; launcher mode runs single-GPU baseline + thread-per-GPU launch + spawns 2 worker processes, then asserts (a) proc loss == single-GPU <1e-3, (b) cross-rank params <1e-6 (KI-5 ULP), (c) proc loss == thread-per-GPU <1e-3. Run with --test-threads=1 (distributed harness property). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 17:48:52 +08:00
Gahow Wang	a188c8a277	distributed: train_ddp_mp bin (process-per-GPU launcher/worker) Dual-mode binary self-detecting via XTRAIN_RANK: launcher spawns one worker per visible GPU forwarding full argv; worker rebuilds config from argv and runs run_worker. CLI flags identical to train_ddp (thread-per-GPU, kept), so it doubles as the before->after throughput driver. thread-per-GPU path untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 17:48:52 +08:00
Gahow Wang	ffd548b80b	distributed: process-per-GPU launcher + worker (proc.rs) torchrun-style process-per-GPU: launch_processes spawns one worker process per GPU (re-exec current_exe with XTRAIN_{RANK,WORLD,LOCAL_RANK,NCCL_ID} env), mints the ncclUniqueId once in the launcher and hex-injects it via env (no shared FS/TCP, race-free). worker_env/run_worker read the env, bind the device (own CUDA context), DdpContext::init + build_model + train_rank reused from T8 UNCHANGED. hex_encode/decode_unique_id are host-testable pure fns. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 17:48:43 +08:00
Gahow Wang	c470c627a7	docs: Phase T17 — process-per-GPU DDP design torchrun-style: launcher spawns N worker processes, each with its own CUDA context; cross-process ncclUniqueId distributed via launcher-minted hex env injection (race-free, no shared FS / TCP); train_rank + grad all-reduce reused unchanged. Keeps thread-per-GPU path as regression baseline. ZeRO-1 dropped (user scope decision). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 17:44:38 +08:00
Gahow Wang	2ff4573a31	docs: T15 GQA results + evolution row (model architecture) + README build-journey row Backfill docs/14-gqa.md gate table (dash5 numbers); add T15 evolution row + cumulative model-architecture line; README build-journey T15 row + Phase 2 prose + doc index range (00..14). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:44:58 +08:00
Gahow Wang	39df0b40c1	gqa: fix kv-proj shape test param indices (embed,attn_norm precede wq) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:38:42 +08:00
Gahow Wang	830d06ad01	gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests) - repeat_kv CUDA kernel: fwd head-block gather, bwd DETERMINISTIC group-sum (each kv head sums its group of query-head grads; no atomics) + Tensor/ops node. - Config gains num_kv_heads (default = n_heads → MHA); wk/wv project to kv_dim; attention() repeat_kv-broadcasts K/V to nh heads before the UNCHANGED composed & flash SDPA → GQA on both paths. group=1 is identity → MHA bit-identical. - --kv-heads flag on train/train_ddp/export_safetensors/greedy_sample; export writes real num_key_value_heads (xserv repeat_kv grouping aligned). - Tests: repeat_kv grad-check (group>1 grad-sum + group=1 identity); model gqa.rs (GQA flash==composed fp32/bf16, group=1 bit-identical to MHA, kv-proj shape); parity_dump+parity.py GQA path (repeat_interleave) via XTRAIN_PARITY_KV_HEADS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:37:37 +08:00
Gahow Wang	62b1cb5dc7	docs: Phase T15 — GQA design (repeat_kv broadcast op + backward grad-sum) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:30:34 +08:00
Gahow Wang	4b6d3e0a79	test: flash+dropout cross-feature grad-check (Phase-2 integration) Add flash_plus_dropout_grad_check_fp32 to xtrain-model dropout tests: the two orthogonal Phase-2 features (T14 flash-attn, T18 dropout) in the same model must still grad-check. Both models run train-mode p=0.2 (identical masks, seed is flash-independent) so the only delta is the SDPA reduction order — checked against the flash-vs-composed tolerance. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:43:54 +08:00
Gahow Wang	c36cdf74d1	Merge t18-dropout into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> # Conflicts: # README.md # crates/xtrain-autodiff/tests/autograd.rs # crates/xtrain-model/src/model.rs # crates/xtrain-train/src/bin/train.rs # crates/xtrain-train/src/train_loop.rs # docs/evolution.md	2026-06-18 00:41:41 +08:00
Gahow Wang	f26db882e5	Merge t16-grad-accum into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> # Conflicts: # README.md # docs/evolution.md	2026-06-18 00:37:11 +08:00
Gahow Wang	9e958cb0f9	Merge t14-flash-attention into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:35:46 +08:00
Gahow Wang	80fafa1914	docs: T18 evolution row + README build-journey row (dropout) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:06:06 +08:00
Gahow Wang	e625aa05dd	dropout: wire into model (residual sites) + train/eval switch + flag (T18) Config.dropout (default 0). TinyTransformer gets a Cell<bool> training switch (train()/eval()/with_training, default eval = safe) + a Cell<u64> step_seed bumped once per training forward. forward_batched derives a per-layer block_seed (pure fn of step_seed×layer) and block_forward derives two per-site seeds, inserting ops::dropout at the attn and ffn sub-block outputs (before each residual). The seed is a pure function of (step_seed, layer, site) so the checkpoint (T13) recompute re-derives the same masks → grads stay exact. p=0 or eval → no dropout node → graph bit-identical to pre-T18. train_loop: model.train() per step (restored after eval flips to eval); eval_loss runs model.eval(). bin/train: --dropout flag → cfg.dropout. Export/sampling run in eval (default), so exported weights are dropout-free (xserv closed loop unaffected). Model-level tests (dropout.rs): p=0 bit-identical to no-dropout (logits/loss/grads); eval(p>0) == p=0 identity; train differs from eval + finite; recompute-with-dropout grads match non-recompute (fp32 + bf16). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:05:32 +08:00
Gahow Wang	5eb27783f8	dropout: autodiff op + fixed-seed grad-check (T18) ops::dropout(x,p,seed): fwd runs Tensor::dropout, caches the mask in the backward closure, bwd pushes dx=d⊙mask. p==0 returns x.clone() (no node) so the default graph is unchanged. Tests in autograd.rs: fixed-seed finite-diff grad-check (mask held constant across the ± perturbation — dropout is a fixed elementwise linear map of x); E[out]≈input + keep-rate≈1-p over a seed sweep; p=0 kernel identity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:05:32 +08:00
Gahow Wang	1fdd0c5002	dropout: device RNG kernel + Tensor fwd/bwd (T18) csrc/ops/dropout.cu: counter-based RNG (splitmix64 over seed^index) → fp32 uniform → Bernoulli(keep=1-p); fwd writes out=x⊙mask + an fp32 mask buffer (per-element 1/(1-p) or 0); bwd applies the same mask (dx=d⊙mask). fp32 + bf16 activation variants (mask fp32 in both; uniform is dtype-independent so masks match across precisions). Stateless → re-run with same seed = same mask (T13 recompute-safe). Registered in build.rs + FFI decls. Tensor::dropout(p,seed)->(out,mask) and Tensor::dropout_backward(d,mask) wrap the launches (contiguous F32/BF16, default stream, per-op sync via the kernels). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:05:18 +08:00
Gahow Wang	6b8c1e4e0f	docs: Phase T18 — dropout design (device RNG + mask) Counter-based (stateless) RNG → Bernoulli(keep=1-p) mask, inverted 1/(1-p) scaling at train, identity at eval. New autodiff `dropout` op (fwd generates + applies mask, bwd applies the SAME cached mask). Wired at the two residual-path sites (attn / ffn outputs); attention-probs dropout deliberately skipped (fused SDPA doesn't materialise probs). Documents the RNG choice, per-site deterministic seed (so T13 recompute reproduces the same mask), train/eval switch, p=0 bit-identity, and the acceptance gates. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:05:08 +08:00
Gahow Wang	8bd7db16e1	docs: T16 grad-accum results — evolution row + README build-journey dash5-verified gate numbers: accum=N bit-close to N× big batch (loss 8.5e-8 / grad 3.8e-5), accum=1 bit-identical (0.0), DDP+accum matches single-GPU (5.7e-7), memory flat (same effective batch 64: 27.7GB big → 7.2GB accum, −74%), xserv closed loop md5-identical + token-identical. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:52:32 +08:00
Gahow Wang	b06b553f99	test: drop unused Var import in grad_accum Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:49:04 +08:00
Gahow Wang	abe5ceb913	test: grad-accum equivalence + accum=1 bit-identity + DDP+accum - grad_accum.rs: accum=N×B grads bit-close to a single N·B big batch; accum_steps=1 bit-identical (max|Δ|==0) to no-accum; real train() loop with accum tracks a big-batch baseline over 20 AdamW steps. - ddp_correctness.rs: world=2 + accum=2 matches a single-GPU big batch of the same effective size (loss + cross-rank + vs-baseline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:45:40 +08:00
Gahow Wang	7a03b0054a	train+ddp: micro-batch gradient accumulation (--accum-steps) Accumulate grads over N micro-batches, then one AdamW step + zero_grad, for an effective batch of N×micro at one micro-batch's activation cost. Each micro-loss is scaled by 1/N before backward (the tape SUM-accumulates the scaled grads) so the boundary grad equals a single step over an N× batch. accum==1 skips the scale → bit-identical to the pre-T16 path. DDP: the cross-rank all-reduce fires ONLY at the accumulation boundary (intermediate micro-steps are local-only, no NCCL); the /world average is orthogonal to the per-micro 1/N, so the boundary grad is the effective global-batch mean. New --accum-steps flag in both train binaries; effective batch is printed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:45:33 +08:00
Gahow Wang	d01fec6639	docs: Phase T16 — gradient accumulation design Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:41:17 +08:00
Gahow Wang	9064ced4c2	docs: T14 flash-attention results + evolution/README rows Fill in the design doc's measured results (grad-check, flash==composed, PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to evolution.md (algorithm/Infra) and the README build-journey table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:34:10 +08:00
Gahow Wang	d217f4fbd3	perf: spread flash bwd dK/dV atomics across all threads Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:27:33 +08:00
Gahow Wang	4d7b69f8d4	perf: cache softmax weights in shared mem (drop hd× redundant expf) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:24:56 +08:00
Gahow Wang	9b05f4f93f	test: flash==composed bf16 uses robust mean/p99 metric (repo convention) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:19:08 +08:00
Gahow Wang	c0f0b67510	test: eps=2e-3 for flash dQ/dK finite-diff (cuts f32 rounding term) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:17:44 +08:00
Gahow Wang	80602099dc	test: scale Q/K in flash grad-check for well-conditioned grads Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:17:04 +08:00
Gahow Wang	f38beb0346	test: flash finite-diff grad-check uses single-tile clean regime Match the trusted composed grad-check dims (seq=5<FA_TILE); the multi-tile online-softmax path is gated by flash_bwd_matches_composed_bwd (seq=40), sharper than finite-diff on the near-zero grads a long softmax produces. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:16:20 +08:00
Gahow Wang	01fb22d114	test: flash bwd vs composed bwd (sharper than finite-diff) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:12:30 +08:00
Gahow Wang	5f3b81ac96	test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag autograd: flash_attention_batched_bwd (dQ/dK/dV finite-diff, seq>tile) + flash_matches_composed_fwd. model/tests/flash.rs: flash==composed on-vs-off (logits/loss/every param grad), fp32 + bf16. parity_dump: XTRAIN_PARITY_FLASH dumps the flash path for the same parity.py oracle (PyTorch SDPA parity at B>1). train + train_ddp get the --flash flag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:39 +08:00
Gahow Wang	0e20821633	autodiff+model: flash-attention op + --flash opt-in wiring ops::flash_attention autograd node (fwd caches O(N) logsumexp instead of O(N²) probs; bwd via Tensor::flash_attention_backward). Model gets a use_flash bool + with_flash(bool) builder; the SDPA core in attention() picks ops::flash_attention vs ops::attention. flash threads through block_forward so the recompute (T13) segment also runs flash. Default off = composed path, graph unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:32 +08:00
Gahow Wang	326a6fadfe	cuda: fused flash-attention kernel (fwd + flash-style bwd) csrc/ops/flash_attention.cu: a single fused fwd kernel (one block per query row, streams KV in tiles of 32, online softmax — running max/sum + rescaled V accumulator, causal mask inlined, never materializes the [bh,S,S] scores) writing out[bh,S,hd] + the per-row logsumexp L (O(N), saved for backward). flash-style bwd: recompute scores from Q/K/V + L, collapse the softmax Jacobian with D[i]=ΣdO·O, dQ owned per row, dK/dV atomicAdd across rows. Tensor::flash_attention / flash_attention_backward wrap them (bf16 upcasts Q/K/V→f32 for the kernel, same fp32-softmax policy as composed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:25 +08:00
Gahow Wang	65a2264227	docs: Phase T14 — fused flash-attention design Design doc for the hand-written single fused flash-attention kernel: online softmax tiled over KV, NEVER materializing the [bh,S,S] score matrix; flash-style backward (recompute scores from saved logsumexp + D=ΣdO·O, dQ/dK/dV). Opt-in --flash; composed T10 path stays default. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:16 +08:00
Gahow Wang	31cc2bf745	docs: capstone README — full-stack + scaling study (v0-v8) writeup Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 16:17:26 +08:00
Gahow Wang	511f35d40c	docs: run v8 — dim1024 capacity helps (val 2.98) v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute. 8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h. Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 / v8 2.9801. Capacity helps — v8 (1.05ep) beats v6 at the same ~1ep by 0.085 AND beats v7 (smaller model, 1.45ep more old data) by 0.035 ⇒ v6/v7 were partly capacity-limited, scaling capacity > repeating old data. But the gain is only ~3% (same magnitude as the data-axis single-step lever), and v8's val was still descending at the end (not saturated). Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6, capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress, scale capacity AND data together (Chinchilla, reproduced at toy scale). - docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples - docs/runs/README.md: +v8 row, v9 proposal - docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 15:12:01 +08:00
Gahow Wang	0150263055	perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K Per-block activation recompute (T13) measured on dash5 (1× RTX 5090 32GB, bf16, batch32 seq256, steady-state): - Correctness (exact, hard gate): recompute on-vs-off grads are BIT-IDENTICAL — fp32 AND bf16: loss / logits / every param grad max rel = 0.00e0 (not "within tol", exactly equal). Full suite green with recompute on/off; DDP loss-match 5.67e-7; DDP+recompute 2-rank descends 11.079→6.010. - dim768 (18L/24h ffn2048, core 127M): peak mem 31144→14562 MiB (−53%), tok/s 39.7K→31.5K (−20%, the extra-forward tradeoff, in the predicted 20–35% band). - dim1024 (18L/32h ffn2730, core 226M): recompute OFF OOMs (hits 32100/32607 MiB → OutOfMemory); recompute ON fits at 16596 MiB, ~23K tok/s, converges. → KI-3 payoff achieved: dim1024 batch32 unblocked, v8 can proceed. Fill docs/12 bench table; mark KI-3 FIXED in docs/known-issues.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:50:29 +08:00
Gahow Wang	69c5f07359	docs: Phase T13 — activation recompute Design doc for per-block gradient checkpointing (KI-3): the no-tape forward + recompute-on-backward design, the `checkpoint` primitive, per-block wrapping, the exactness/correctness argument (same kernels + inputs → identical grads), composition with bf16+DDP+batched, and the verification plan (on-vs-off grad gate + memory/throughput before→after, dim1024-fits). Bench table left as TBD to fill after the dash5 run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:45:16 +08:00
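
Note on 7a03b0054a (--accum-steps): the commit message states that scaling each micro-loss by 1/N before backward makes the summed gradient equal to a single step over an N× batch. A minimal host-side sketch of that identity follows, using a one-parameter least-squares model instead of the real tape. The names mse_grad and accumulated_grad are hypothetical and do not come from the repository; equal-sized micro-batches are assumed.

```rust
/// Gradient of the mean squared error of y ≈ w*x with respect to w.
/// Hypothetical stand-in for "backward over one batch".
fn mse_grad(w: f64, xs: &[f64], ys: &[f64]) -> f64 {
    let n = xs.len() as f64;
    xs.iter()
        .zip(ys)
        .map(|(x, y)| 2.0 * (w * x - y) * x)
        .sum::<f64>()
        / n
}

/// Split the batch into `accum_steps` equal micro-batches, scale each
/// micro-gradient by 1/N, and sum. This mirrors the scaled-loss,
/// sum-accumulate scheme described in the commit message.
fn accumulated_grad(w: f64, xs: &[f64], ys: &[f64], accum_steps: usize) -> f64 {
    assert!(!xs.is_empty() && xs.len() == ys.len());
    assert!(accum_steps > 0 && xs.len() % accum_steps == 0);
    let micro = xs.len() / accum_steps;
    let scale = 1.0 / accum_steps as f64;
    let mut grad = 0.0;
    for (mx, my) in xs.chunks(micro).zip(ys.chunks(micro)) {
        // Each micro-batch contributes its mean gradient times 1/N.
        grad += scale * mse_grad(w, mx, my);
    }
    grad
}
```

With 8 samples and accum_steps = 4, accumulated_grad agrees with mse_grad over the full batch up to floating-point rounding. With accum_steps = 1 the scale is exactly 1.0, so the two results are identical; the commit instead skips the scale entirely in that case, which has the same effect.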

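Note on 9c70e99ae4 (task.rs): the commit message describes an exact-match, rule-based reward built from parse_boxed_answer and check_answer over answers written as \boxed{N}. The sketch below shows one way such a pair could look. The signatures, the choice of i64, and the rule of reading the last \boxed{...} occurrence are assumptions of this sketch, not the repository's actual code.

```rust
/// Return the integer inside the LAST `\boxed{...}` in `text`, if any.
/// Assumption: the final boxed value is the model's answer.
fn parse_boxed_answer(text: &str) -> Option<i64> {
    let marker = "\\boxed{";
    // Byte offset just past the opening brace of the last marker.
    let start = text.rfind(marker)? + marker.len();
    let rest = &text[start..];
    let end = rest.find('}')?;
    // Anything that is not a plain integer (e.g. "4x") yields None.
    rest[..end].trim().parse::<i64>().ok()
}

/// Exact-match check: true only when a boxed integer is present
/// and equals the gold answer.
fn check_answer(text: &str, gold: i64) -> bool {
    parse_boxed_answer(text) == Some(gold)
}
```

For example, check_answer("3 * 4 = \\boxed{12}", 12) is true, while an unboxed "the answer is 12" fails the format check and scores false, which matches the format-versus-correctness split reported in 1574e21d89.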

134 Commits