xtrain

Author	SHA1	Message	Date
Gahow Wang	3a3425960c	post-train: M2c — device-side KV cache (cat_seq), profile-first bottleneck shift Device-resident KV cache: keep K/V on the GPU as [bh,T,hd], grow by one token per step via a new cat_seq kernel (concat along seq) — removes the M2a/M2b per-layer host round-trip (to_cpu/from_slice/re-upload) AND the transpose_3d01. Both single-seq and batched decode refactored to it; cache is Option<Tensor> per layer (cleaner than the host Vec + rebuild). Gates all hold: cat_seq == host concat; decode_kv single-seq + decode_batch G-way both still TOKEN-IDENTICAL; GQA training path unaffected. Honest measurement (the point): removing the host round-trip buys ~10% on pure single-seq decode (133 → 147 tok/s @128) but does NOT move the GRPO step (~8.5 s/step unchanged) — because after M2b batching the rollout is no longer the step's bottleneck; the per-sample per_token_logp captures + the PG-update forwards/backwards (model.forward, full-seq) now dominate. Measure-first lesson (cf. T11/T17/M2a): the long pole shifted to the training-side forwards; the next decode lever (ragged batched prefill) targets those, not the cache. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 17:38:16 +08:00
Gahow Wang	2c9b58cb3b	post-train: M2b — batched KV-cache decode (G-way, token-identical) The rollout long-pole fix deferred from M2a: decode the G samples of one prompt in lockstep (one forward per step over the group → G× fewer kernel launches). - rope_pos(x, positions[]): RoPE with a per-row absolute position (new forward- only kernel) — G rows share one decode position. Gate: == full rope for [0..n], == rope_at(P) per row for uniform P (bit-identical). - generate_cached_batch: BatchKVCache [T, G·num_kv, hd] + batched decode_step. decode_attention is already batch-agnostic (bh = G·nh); repeat_kv(nh, batch=G) broadcasts per group. No finished-mask / ragged prompts yet (perf-only / next). - Gate (tests/decode_batch.rs): all G greedy rows token-identical to the single- sequence decode (8 query / 2 kv heads → exercises repeat_kv batching). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 17:18:54 +08:00
Gahow Wang	aaa77082ef	post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op) The GRPO (M4) token-level loss op + the one primitive it needs: - scale_rows(x[r,c], s[r]): per-row scale (new ~5-line CUDA kernel). The clipped-PG backward scales each completion token's row of (probs − onehot) by its own per-token coefficient, which cross_entropy_backward's single scalar scale can't express. - clipped_pg_loss(logits, target, logp_old, logp_ref, A, eps, beta): per-token ρ_t = exp(logπθ_t − logp_old_t), L = −mean min(ρA, clip(ρ,1±ε)A) + β·mean KL (k3 estimator), masked to completion tokens. Backward reuses the CE machinery (probs − onehot) + scale_rows. Gates: grad-check the active PG path + the A=0 (KL-only) path; degenerate value checks ε→∞ ⇒ vanilla PG, β=0 ⇒ no KL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 14:07:02 +08:00
Gahow Wang	c88e2ab88c	post-train: M2 — decode primitives (rope_at + decode_attention) Two forward-only Tensor primitives the KV-cache decode engine is built on, each gated by an isolated correctness test: - rope_at(theta, pos0): RoPE at an absolute position (pos = pos0 + row, no modulo) for a single decode token, vs the training rope_k (pos = row % period) left untouched. New forward-only CUDA kernel, no training-path risk. Gate: bit-identical to the full-sequence rope's corresponding row. - decode_attention(k, v, scale): single-query × cached-K/V SDPA, composed from the existing strided batched GEMM + plain (non-causal) softmax — no new kernel. Gate: equals the full causal attention's last query row (max \|Δ\| 6e-8). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:00:03 +08:00
Gahow Wang	830d06ad01	gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests) - repeat_kv CUDA kernel: fwd head-block gather, bwd DETERMINISTIC group-sum (each kv head sums its group of query-head grads; no atomics) + Tensor/ops node. - Config gains num_kv_heads (default = n_heads → MHA); wk/wv project to kv_dim; attention() repeat_kv-broadcasts K/V to nh heads before the UNCHANGED composed & flash SDPA → GQA on both paths. group=1 is identity → MHA bit-identical. - --kv-heads flag on train/train_ddp/export_safetensors/greedy_sample; export writes real num_key_value_heads (xserv repeat_kv grouping aligned). - Tests: repeat_kv grad-check (group>1 grad-sum + group=1 identity); model gqa.rs (GQA flash==composed fp32/bf16, group=1 bit-identical to MHA, kv-proj shape); parity_dump+parity.py GQA path (repeat_interleave) via XTRAIN_PARITY_KV_HEADS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:37:37 +08:00
Gahow Wang	c36cdf74d1	Merge t18-dropout into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> # Conflicts: # README.md # crates/xtrain-autodiff/tests/autograd.rs # crates/xtrain-model/src/model.rs # crates/xtrain-train/src/bin/train.rs # crates/xtrain-train/src/train_loop.rs # docs/evolution.md	2026-06-18 00:41:41 +08:00
Gahow Wang	1fdd0c5002	dropout: device RNG kernel + Tensor fwd/bwd (T18) csrc/ops/dropout.cu: counter-based RNG (splitmix64 over seed^index) → fp32 uniform → Bernoulli(keep=1-p); fwd writes out=x⊙mask + an fp32 mask buffer (per-element 1/(1-p) or 0); bwd applies the same mask (dx=d⊙mask). fp32 + bf16 activation variants (mask fp32 in both; uniform is dtype-independent so masks match across precisions). Stateless → re-run with same seed = same mask (T13 recompute-safe). Registered in build.rs + FFI decls. Tensor::dropout(p,seed)->(out,mask) and Tensor::dropout_backward(d,mask) wrap the launches (contiguous F32/BF16, default stream, per-op sync via the kernels). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:05:18 +08:00
Gahow Wang	326a6fadfe	cuda: fused flash-attention kernel (fwd + flash-style bwd) csrc/ops/flash_attention.cu: a single fused fwd kernel (one block per query row, streams KV in tiles of 32, online softmax — running max/sum + rescaled V accumulator, causal mask inlined, never materializes the [bh,S,S] scores) writing out[bh,S,hd] + the per-row logsumexp L (O(N), saved for backward). flash-style bwd: recompute scores from Q/K/V + L, collapse the softmax Jacobian with D[i]=ΣdO·O, dQ owned per row, dK/dV atomicAdd across rows. Tensor::flash_attention / flash_attention_backward wrap them (bf16 upcasts Q/K/V→f32 for the kernel, same fp32-softmax policy as composed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:25 +08:00
Gahow Wang	b0086b5214	autodiff: bf16 mixed-precision path (fp32 master via cast op) Tensor ops dispatch on dtype: fp32 branch unchanged (bit-identical), bf16 branch routes matmul/attention through GemmEx and elementwise through the bf16 kernels. Norm/softmax/RoPE/cross-entropy upcast to fp32 around the existing fp32 kernels (standard AMP: reductions/loss fp32, matmuls bf16). Transposes route bf16 through fp32 (pure layout). New autodiff `cast` op is the AMP bridge: forward downcasts a fp32 master leaf to bf16 for the matmul; backward upcasts the bf16 grad back to fp32. So the fp32 leaf accumulates an fp32 grad and AdamW / clip / DDP all-reduce stay fp32 and completely unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:14:48 +08:00
Gahow Wang	d05115ddf3	cuda: bf16 cuBLAS GemmEx (16BF in/out, fp32 accum) + cast kernels Add the bf16 compute primitives for T12 mixed precision: - DType::BF16 (half::bf16 as TensorDType), 2 bytes. - cublasGemmEx / cublasGemmStridedBatchedEx FFI + CUDA_R_16BF / CUBLAS_COMPUTE_32F constants (values per xserv gemm.rs). - cublas::gemm_ex / gemm_ex_strided_batched: same row-major⟺col-major transpose algebra as sgemm, bf16 in/out, fp32 accumulation. - csrc/ops/cast.cu: f32<->bf16 cast + bf16 elementwise (add/mul/scale/ silu(+dx)/add_bias/sum_rows), each load->fp32->compute->store bf16. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:14:39 +08:00
Gahow Wang	7821bd9c34	autograd: batch dim for ops (flatten linears, batched attention) Add the batched-forward primitives. Linears/norms/elementwise/embedding/CE already act on flat [rows,dim], so they work unchanged on [BS,dim]; only attention + RoPE need sequence awareness: - RoPE: kernel takes a `period` (= seq len) so position = row % period, i.e. per-sequence position on a flattened batch (period == tokens = single seq). - Fused batched causal attention: new `Tensor::attention`/`attention_backward` + ops node, running QKᵀ and PV as cublasSgemmStridedBatched over the Bnh (sequence,head) blocks (new sgemm_strided_batched binding) and a causal softmax kernel (scale + per-row causal mask inline) — the whole attention is 3 launches regardless of B*nh, no per-head/per-seq loop, no host round-trip. - transpose_4d12 ([B,S,nh,hd] <-> [B,nh,S,hd]) to lay out the batched heads. grad-checks: new batched-rope, transpose_4d12, batched-attention dQ/dK/dV all pass finite-diff (attn dK 1.5e-2, dQ 7.5e-3, dV 2.9e-4; rest tighter) alongside the existing 12. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:15 +08:00
Gahow Wang	a842e432b5	perf: streams / drop per-op sync Default-stream kernels run in order and every host read goes through a stream-ordered cudaMemcpy (to_device), so the per-op cudaDeviceSynchronize after each kernel was pure overhead — remove all 21 in tensor.rs. Host data is still correctly ordered by the D2H memcpy that reads it. Also zero op-output buffers with cudaMemset (device-side, async) instead of a blocking H2D memcpy of a host zero buffer on every allocation — that copy was itself a hidden per-op sync point. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:56:17 +08:00
Gahow Wang	0e5c7d22e2	perf: cuBLAS matmul fwd/bwd Route Tensor::matmul and matmul_backward through cuBLAS Sgemm instead of the hand-written tiled kernel. fp32 → same GEMM up to rounding order, so the T3 cuBLAS tolerance and downstream grad-checks are preserved. - cublas.rs: thread-local persistent handle + row-major sgemm helper with transpose flags (col-major⟺row-major as the T3 oracle does). - matmul_backward: dA/dB via cuBLAS OP_T, dropping the two transpose kernels + their allocations the T3 version ran. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:48:35 +08:00
Gahow Wang	7fb1a29057	ops: embedding/reshape/transpose/split-merge-heads fwd+bwd Phase T5 structural ops on top of the T4 set, needed to assemble the tiny transformer: - embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward (atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi. - reshape: contiguous metadata-only view (Tensor::reshape), no kernel. - transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel). - autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus split_heads (->Vec<Var>) / merge_heads for per-head attention. - tape: Var::zero_grad + set_value so a hand-written GD step can update params and clear grads between steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:09 +08:00
Gahow Wang	5aef3742d6	ops: transformer op fwd/bwd CUDA kernels + Tensor wrappers add/mul/add_bias(+sum_rows)/rms_norm/silu/rope/softmax/cross_entropy, each with its analytic backward, in csrc/ops/nn.cu (inlined warp/block reductions). FFI declarations + nn.cu in build.rs (no_cuda gated). Tensor gains the matching thin wrappers; DType grows I32 for cross-entropy targets. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:44:09 +08:00
Gahow Wang	88fbe0a85d	gemm: realistic f32 tolerances in GEMM acceptance tests Forward: compare via matrix relative error (max abs error / max\|ref\|) instead of a per-element ratio, so near-zero outputs where two correct f32 GEMMs differ only in rounding order don't inflate the metric. Backward: L = sum(W∘C) is bilinear, so central differences are truncation-free — use eps=1e-2 (sharper f32 resolution of the difference) and atol=1e-3 to floor near-zero-gradient subtraction noise. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:28:57 +08:00
Gahow Wang	1384044f27	gemm: GPU acceptance tests vs cuBLAS + finite-diff Forward: hand-written tiled GEMM vs cuBLAS sgemm on random matrices (square / non-tile-aligned rect / 256³), max relative error < 1e-3, using the row-major⟺col-major identity to drive cuBLAS without explicit transposes. Backward: scalar loss L = sum(W∘C) (so dC = W), dA/dB from matmul_backward checked against the finite-diff harness. Gated behind not(no_cuda). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:58 +08:00
Gahow Wang	08c88bf360	gemm: tiled F32 forward + transpose + backward (dA/dB) Hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32 accumulate, boundary-masked) plus an out-of-place transpose kernel. Wire both through xtrain-cuda FFI (no_cuda-gated) and expose at the tensor level: Tensor::matmul, transpose_2d, and matmul_backward computing dA = dC·Bᵀ and dB = Aᵀ·dC by materializing transposes and reusing the forward. Also declare cuBLAS sgemm FFI + link cublas, used only as a correctness reference in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:51 +08:00
Gahow Wang	fbd07a578c	tensor: minimal Tensor crate over xtrain-cuda New xtrain-tensor crate: DType (F32), shape/stride helpers, Arc-counted host/device Storage with CPU↔CUDA copy, and a contiguous Tensor with creation, host↔device transfer, and a scale() op driving the elementwise kernel. GPU integration tests (host↔device roundtrip + scale correctness) gated behind not(no_cuda); a thin build.rs emits the no_cuda cfg so the kernel call sites compile out locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:13:06 +08:00

19 Commits