xtrain

Author	SHA1	Message	Date
Gahow Wang	0e82b2438e	test: M2d — ragged-forward + batched-op equivalence gates + throughput bench Two exact correctness gates (composed = the end-to-end batched GRPO step == looped): - xtrain-model forward_batched_ragged_matches_looped: forward_batched on RIGHT-padded ragged sequences == per-sequence single-seq forward on the real rows. fp32 max|Δlogit| = 3.7e-7, bf16 = 0.0, both composed + flash SDPA. Pins "right-pad is free under causal". - xtrain-autodiff clipped_pg_loss_batched_matches_looped: batched op == looped Σ_s (1/N)·clipped_pg_loss_s. loss Δ=1.5e-8, grad max|Δ|=7.5e-9 (f32). bench_grpo_batch: weight-independent micro-bench of the per-sample training forwards (loads v12 base as policy, N realistic ragged samples, teacher-forced argmax targets so the closeness smoke isn't −log-amplified by random low-prob tokens). Measured on dash5 (v12 1.05B, N=48, micro=16): capture 622→71 ms (8.7×), inner 1907→208 ms (9.2×), training forwards 2526→280 ms (9.0×). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 23:03:09 +08:00
Gahow Wang	c2ebf62ae1	post-train: M2d — batch the GRPO training-side forwards (op + module + wiring) After M2b/M2c made the rollout cheap, the GRPO step is dominated by the per-sample single-sequence training-side forwards: the per_token_logp captures (policy + reference) and the inner clipped-PG forward/backwards. M2d packs all N=B·G ragged samples of a step into ONE forward_batched. Enabling property — right-padding is free under causal attention: a real completion row sits at an earlier position than the trailing pad, and causal masking forbids attending forward, so its logits equal the unpadded single-sequence forward; pad rows are masked out (target=-100). - ops::clipped_pg_loss_batched: like clipped_pg_loss but takes per-row advantage[t] (the owning sample's A) and per-row weight[t] (the full normaliser). It does NOT compute its own 1/n_tokens, so the caller passing weight=1/(N·n_s) reproduces the looped Σ_s (1/N)(1/n_s)·clipped_pg_loss_s bit-for-bit (per-row CE backward is row-local). - grpo_batch.rs (shared module): per_token_logp_batched (right-pad → one forward_batched(N) → slice back to real length) + looped baselines + inner_pg_step_{looped,batched}. A --micro knob chunks the pack to bound the [chunk·Lmax, vocab] logits memory; weight uses the GLOBAL N so chunked grad-accumulation stays exact. - train_grpo restructured to collect-all-samples-then-batch; per-window phase timers (rollout / capture / inner) to keep the step decomposition honest. Default micro = B·G; bench-measured 9× on the training forwards. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 23:02:56 +08:00
Gahow Wang	aaa77082ef	post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op) The GRPO (M4) token-level loss op + the one primitive it needs: - scale_rows(x[r,c], s[r]): per-row scale (new ~5-line CUDA kernel). The clipped-PG backward scales each completion token's row of (probs − onehot) by its own per-token coefficient, which cross_entropy_backward's single scalar scale can't express. - clipped_pg_loss(logits, target, logp_old, logp_ref, A, eps, beta): per-token ρ_t = exp(logπθ_t − logp_old_t), L = −mean min(ρA, clip(ρ,1±ε)A) + β·mean KL (k3 estimator), masked to completion tokens. Backward reuses the CE machinery (probs − onehot) + scale_rows. Gates: grad-check the active PG path + the A=0 (KL-only) path; degenerate value checks ε→∞ ⇒ vanilla PG, β=0 ⇒ no KL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 14:07:02 +08:00
Gahow Wang	f3c764ce95	post-train: M3 — seq_logprob + dpo_loss autograd ops Two new ops for DPO (M3), both reusing existing kernels (no new CUDA): - seq_logprob(logits, target): Σ log πθ(target) over non-ignored (target≥0) positions — the per-sequence logprob DPO compares between policy and reference. = −Σ per_row of cross_entropy (ignored rows already 0, like SFT masking); backward = cross_entropy_backward(probs, target, −upstream) (sum, no mean division). Gate: finite-diff grad-check with a -100 completion mask. - dpo_loss(lpθ_chosen, lpθ_rejected, lpref_chosen, lpref_rejected, β): scalar L = −log σ(Δ) = softplus(−Δ) with the two policy logprobs as parents (ref logprobs constant). Gate: grad-check both parents + degenerate points (policy==ref ⇒ Δ=0, L=log2, grads ∓β/2; β=0 ⇒ grads 0). Same formula as TRL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:11:01 +08:00
Gahow Wang	fbf4ac2917	sft: assistant-only SFT (ignore-index CE) + chat-prompt greedy eval Enable assistant-only supervised fine-tuning and a fixed chat-prompt eval path used by the v12 SFT runs: - cross_entropy ignores negative targets (-100 ignore-index), normalizing by valid rows instead of all rows; CUDA fwd/bwd skip t<0 (ops.rs, nn.cu). - Corpus gains optional labels + load_sft_tsv_cached: two-column TSV is formatted as 'User: .. \nAssistant:' + answer + <|endoftext|>, prompt tokens masked to -100 while answer+EOS are supervised; i32 label cache alongside the u16 token cache; sample() retries windows that are fully masked; eval uses target_window so masking applies to val loss too (data.rs, train_loop.rs). - train + train_ddp: --sft-tsv selects the TSV loader, --init-ckpt continues training from a base checkpoint. - greedy_sample: --prompts-file/--prompt/--temperature for fixed chat-prompt generation eval. Test fixtures updated for the new Corpus.labels field; dropout.rs carries incidental rustfmt. Not rebuilt locally (no CUDA toolchain on this checkout); correctness rests on the documented v12 base+SFT runs on the GPU box. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 16:19:02 +08:00
Gahow Wang	830d06ad01	gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests) - repeat_kv CUDA kernel: fwd head-block gather, bwd DETERMINISTIC group-sum (each kv head sums its group of query-head grads; no atomics) + Tensor/ops node. - Config gains num_kv_heads (default = n_heads → MHA); wk/wv project to kv_dim; attention() repeat_kv-broadcasts K/V to nh heads before the UNCHANGED composed & flash SDPA → GQA on both paths. group=1 is identity → MHA bit-identical. - --kv-heads flag on train/train_ddp/export_safetensors/greedy_sample; export writes real num_key_value_heads (xserv repeat_kv grouping aligned). - Tests: repeat_kv grad-check (group>1 grad-sum + group=1 identity); model gqa.rs (GQA flash==composed fp32/bf16, group=1 bit-identical to MHA, kv-proj shape); parity_dump+parity.py GQA path (repeat_interleave) via XTRAIN_PARITY_KV_HEADS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:37:37 +08:00
Gahow Wang	c36cdf74d1	Merge t18-dropout into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> # Conflicts: # README.md # crates/xtrain-autodiff/tests/autograd.rs # crates/xtrain-model/src/model.rs # crates/xtrain-train/src/bin/train.rs # crates/xtrain-train/src/train_loop.rs # docs/evolution.md	2026-06-18 00:41:41 +08:00
Gahow Wang	5eb27783f8	dropout: autodiff op + fixed-seed grad-check (T18) ops::dropout(x,p,seed): fwd runs Tensor::dropout, caches the mask in the backward closure, bwd pushes dx=d⊙mask. p==0 returns x.clone() (no node) so the default graph is unchanged. Tests in autograd.rs: fixed-seed finite-diff grad-check (mask held constant across the ± perturbation — dropout is a fixed elementwise linear map of x); E[out]≈input + keep-rate≈1-p over a seed sweep; p=0 kernel identity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:05:32 +08:00
Gahow Wang	c0f0b67510	test: eps=2e-3 for flash dQ/dK finite-diff (cuts f32 rounding term) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:17:44 +08:00
Gahow Wang	80602099dc	test: scale Q/K in flash grad-check for well-conditioned grads Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:17:04 +08:00
Gahow Wang	f38beb0346	test: flash finite-diff grad-check uses single-tile clean regime Match the trusted composed grad-check dims (seq=5<FA_TILE); the multi-tile online-softmax path is gated by flash_bwd_matches_composed_bwd (seq=40), sharper than finite-diff on the near-zero grads a long softmax produces. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:16:20 +08:00
Gahow Wang	01fb22d114	test: flash bwd vs composed bwd (sharper than finite-diff) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:12:30 +08:00
Gahow Wang	5f3b81ac96	test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag autograd: flash_attention_batched_bwd (dQ/dK/dV finite-diff, seq>tile) + flash_matches_composed_fwd. model/tests/flash.rs: flash==composed on-vs-off (logits/loss/every param grad), fp32 + bf16. parity_dump: XTRAIN_PARITY_FLASH dumps the flash path for the same parity.py oracle (PyTorch SDPA parity at B>1). train + train_ddp get the --flash flag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:39 +08:00
Gahow Wang	0e20821633	autodiff+model: flash-attention op + --flash opt-in wiring ops::flash_attention autograd node (fwd caches O(N) logsumexp instead of O(N²) probs; bwd via Tensor::flash_attention_backward). Model gets a use_flash bool + with_flash(bool) builder; the SDPA core in attention() picks ops::flash_attention vs ops::attention. flash threads through block_forward so the recompute (T13) segment also runs flash. Default off = composed path, graph unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:32 +08:00
Gahow Wang	c396b39483	autodiff: checkpoint primitive (recompute-on-backward) Add `xtrain_autodiff::checkpoint::checkpoint(segment_fn, input, params)`, a higher-order autograd node (à la torch.utils.checkpoint) for activation recomputation (Phase T13 / KI-3): - forward: run `segment_fn` on detached leaves so its internal ops are NOT recorded on the outer tape; keep only the output value (the local sub-tape — and thus the segment's intermediate activations — drops immediately). The checkpoint node's parents are [input, ..params]. - backward: re-run `segment_fn` from the saved input + (unchanged) param values into a fresh local tape, seed the recomputed output with the upstream grad, backprop, then push the recovered input/param grads to the real parents. Local tape drops at the end → recomputed activations freed. Exact by construction (same deterministic kernels, same inputs) → grads match the non-checkpointed path. Composes with bf16 (T12, same path on recompute) and DDP (T8, per-rank). Supporting change: `Var::backward_seeded(seed)` — backward from an explicit non-scalar upstream grad (the segment output is generally not a scalar); `backward()` is now the scalar wrapper that seeds ones. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:42:31 +08:00
Gahow Wang	48922cb628	perf: keep bf16 logits (no persistent fp32 logits buffer) At vocab 50257 the logits tensor [B*S, vocab] is ~1.6GB fp32 at batch 32 — held across the whole backward. Keep it bf16: cross_entropy upcasts the bf16 logits to fp32 internally (transient) + caches fp32 probs, and its backward casts dx back to bf16 to chain into the bf16 lm_head matmul backward. The sampler casts bf16 logits→f32 before the host argmax/softmax. Halves the persistent logits activation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:20:48 +08:00
Gahow Wang	b0086b5214	autodiff: bf16 mixed-precision path (fp32 master via cast op) Tensor ops dispatch on dtype: fp32 branch unchanged (bit-identical), bf16 branch routes matmul/attention through GemmEx and elementwise through the bf16 kernels. Norm/softmax/RoPE/cross-entropy upcast to fp32 around the existing fp32 kernels (standard AMP: reductions/loss fp32, matmuls bf16). Transposes route bf16 through fp32 (pure layout). New autodiff `cast` op is the AMP bridge: forward downcasts a fp32 master leaf to bf16 for the matmul; backward upcasts the bf16 grad back to fp32. So the fp32 leaf accumulates an fp32 grad and AdamW / clip / DDP all-reduce stay fp32 and completely unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:14:48 +08:00
Gahow Wang	7821bd9c34	autograd: batch dim for ops (flatten linears, batched attention) Add the batched-forward primitives. Linears/norms/elementwise/embedding/CE already act on flat [rows,dim], so they work unchanged on [BS,dim]; only attention + RoPE need sequence awareness: - RoPE: kernel takes a `period` (= seq len) so position = row % period, i.e. per-sequence position on a flattened batch (period == tokens = single seq). - Fused batched causal attention: new `Tensor::attention`/`attention_backward` + ops node, running QKᵀ and PV as cublasSgemmStridedBatched over the Bnh (sequence,head) blocks (new sgemm_strided_batched binding) and a causal softmax kernel (scale + per-row causal mask inline) — the whole attention is 3 launches regardless of B*nh, no per-head/per-seq loop, no host round-trip. - transpose_4d12 ([B,S,nh,hd] <-> [B,nh,S,hd]) to lay out the batched heads. grad-checks: new batched-rope, transpose_4d12, batched-attention dQ/dK/dV all pass finite-diff (attn dK 1.5e-2, dQ 7.5e-3, dV 2.9e-4; rest tighter) alongside the existing 12. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:15 +08:00
Gahow Wang	0acfa5df11	ops: grad-check the T5 structural ops Finite-diff grad-checks (same L=sum(W∘out) harness as autograd.rs) for embedding (incl. repeated ids), reshape, transpose_3d01, transpose_2d, and split/merge_heads round-trip. Gated #![cfg(not(no_cuda))]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:20 +08:00
Gahow Wang	7fb1a29057	ops: embedding/reshape/transpose/split-merge-heads fwd+bwd Phase T5 structural ops on top of the T4 set, needed to assemble the tiny transformer: - embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward (atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi. - reshape: contiguous metadata-only view (Tensor::reshape), no kernel. - transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel). - autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus split_heads (->Vec<Var>) / merge_heads for per-head attention. - tape: Var::zero_grad + set_value so a hand-written GD step can update params and clear grads between steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:09 +08:00
Gahow Wang	e7ce504b1f	ops: differentiable autograd nodes + per-op grad-check tests ops.rs wraps each Tensor op as a Var node with its backward closure (forward caches captured by move). swiglu = mul(silu(gate), up); attention is composed (matmul+scale+softmax+matmul), no fused kernel. tests/autograd.rs grad-checks every op via the L=sum(W∘out) template, plus a fan-out grad-accumulation test (dL/dx=4x) and an end-to-end composed-attention grad-check (dQ/dK/dV). Adds xtrain-cuda dev-dep for device selection in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:53:55 +08:00
Gahow Wang	224f750ee4	autograd: tape engine + grad accumulation Var = Rc<RefCell<VarNode>> on a define-by-run tape: value + optional grad + parents + backward closure. backward() seeds a scalar loss, walks reverse topo order, and pushes grads to parents. push_grad always SUMs into the grad slot — the fan-out accumulation path T3 lacked. Per-crate build.rs emits the no_cuda cfg (does not propagate); engine gated, grad_check stays host-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:44:17 +08:00
Gahow Wang	9ca98efd98	autodiff: finite-diff gradient-check harness New xtrain-autodiff crate with a reusable central finite-difference gradient check: grad_check(x, shape, f, analytic_grad, cfg) compares an analytic gradient against (f(x+ε)-f(x-ε))/2ε per element with a relative tolerance. Host-only (no CUDA): the loss closure owns any GPU work, so T4's per-op backward checks can reuse it directly. Includes host unit tests (sum(x²) grad 2x passes; a wrong grad is rejected). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:42 +08:00
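
The central finite-difference check described in the oldest commit (9ca98efd98) can be sketched on the host as follows. This is a minimal illustration, not the crate's code: `GradCheckCfg`, its fields, and the `grad_check_sketch` signature are hypothetical (the log names `grad_check(x, shape, f, analytic_grad, cfg)` but does not show its types), and the sketch works in f64 on a flat slice, without the `shape` argument.

```rust
/// Hypothetical config: the real `cfg` fields are not shown in the log.
pub struct GradCheckCfg {
    /// Perturbation size ε used for the central difference.
    pub eps: f64,
    /// Maximum allowed relative error per element.
    pub rel_tol: f64,
}

/// Compares `analytic` against (f(x+ε) - f(x-ε)) / 2ε, one element at a time.
/// Returns the worst relative error on success, or the index of the first
/// element that exceeds the tolerance.
pub fn grad_check_sketch<F>(
    x: &[f64],
    f: F,
    analytic: &[f64],
    cfg: &GradCheckCfg,
) -> Result<f64, usize>
where
    F: Fn(&[f64]) -> f64,
{
    assert_eq!(x.len(), analytic.len(), "gradient must match input length");
    let mut probe = x.to_vec();
    let mut worst = 0.0_f64;
    for i in 0..x.len() {
        // Perturb only element i, in both directions.
        probe[i] = x[i] + cfg.eps;
        let up = f(&probe);
        probe[i] = x[i] - cfg.eps;
        let down = f(&probe);
        probe[i] = x[i]; // restore before moving to the next element
        let numeric = (up - down) / (2.0 * cfg.eps);
        // Relative error, with a small floor so near-zero grads do not divide by zero.
        let denom = numeric.abs().max(analytic[i].abs()).max(1e-8);
        let rel = (numeric - analytic[i]).abs() / denom;
        if rel > cfg.rel_tol {
            return Err(i);
        }
        worst = worst.max(rel);
    }
    Ok(worst)
}
```

Calling it with `f = Σ x²` and the analytic gradient `2x` should pass, while a deliberately wrong gradient such as `3x` should be rejected at index 0, mirroring the two host unit tests the commit message mentions.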

23 Commits
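
The M2d commits rest on the claim that right-padding is free under causal attention: a real row never attends to a later position, so trailing pad rows cannot change its output. The sketch below illustrates that property with a single-head, host-side f64 attention. It is an assumption-laden toy, not the repository's `forward_batched`: the names `causal_attention` and `right_pad` are hypothetical, and inputs are assumed non-empty.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Single-head causal attention: row i attends to positions 0..=i only.
/// Illustrative host code; assumes q, k, v are non-empty and equally long.
pub fn causal_attention(q: &[Vec<f64>], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let scale = 1.0 / (q[0].len() as f64).sqrt();
    let mut out = Vec::with_capacity(q.len());
    for i in 0..q.len() {
        // Scores against the causal prefix; later rows are never read.
        let scores: Vec<f64> = (0..=i).map(|j| scale * dot(&q[i], &k[j])).collect();
        // Numerically stable softmax over the prefix.
        let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
        let denom: f64 = exps.iter().sum();
        // Weighted sum of the value rows in the prefix.
        let mut row = vec![0.0; v[0].len()];
        for (j, e) in exps.iter().enumerate() {
            for (o, x) in row.iter_mut().zip(&v[j]) {
                *o += e / denom * x;
            }
        }
        out.push(row);
    }
    out
}

/// Appends pad rows filled with `fill` until the sequence has `total` rows.
pub fn right_pad(rows: &[Vec<f64>], total: usize, fill: f64) -> Vec<Vec<f64>> {
    let dim = rows[0].len();
    let mut padded = rows.to_vec();
    padded.resize(total, vec![fill; dim]);
    padded
}
```

Running `causal_attention` on a 3-row sequence and on the same sequence right-padded to 5 rows with arbitrary fill values gives identical outputs for the first 3 rows in this sketch, because the per-row arithmetic is the same. On the GPU the batched and single-sequence kernels need not be bit-identical, which is consistent with the small fp32 difference the log reports.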