xtrain

Author	SHA1	Message	Date
Gahow Wang	3a3425960c	post-train: M2c — device-side KV cache (cat_seq), profile-first bottleneck shift Device-resident KV cache: keep K/V on the GPU as [bh,T,hd], grow by one token per step via a new cat_seq kernel (concat along seq) — removes the M2a/M2b per-layer host round-trip (to_cpu/from_slice/re-upload) AND the transpose_3d01. Both single-seq and batched decode refactored to it; cache is Option<Tensor> per layer (cleaner than the host Vec + rebuild). Gates all hold: cat_seq == host concat; decode_kv single-seq + decode_batch G-way both still TOKEN-IDENTICAL; GQA training path unaffected. Honest measurement (the point): removing the host round-trip buys ~10% on pure single-seq decode (133 → 147 tok/s @128) but does NOT move the GRPO step (~8.5 s/step unchanged) — because after M2b batching the rollout is no longer the step's bottleneck; the per-sample per_token_logp captures + the PG-update forwards/backwards (model.forward, full-seq) now dominate. Measure-first lesson (cf. T11/T17/M2a): the long pole shifted to the training-side forwards; the next decode lever (ragged batched prefill) targets those, not the cache. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 17:38:16 +08:00
Gahow Wang	2c9b58cb3b	post-train: M2b — batched KV-cache decode (G-way, token-identical) The rollout long-pole fix deferred from M2a: decode the G samples of one prompt in lockstep (one forward per step over the group → G× fewer kernel launches). - rope_pos(x, positions[]): RoPE with a per-row absolute position (new forward- only kernel) — G rows share one decode position. Gate: == full rope for [0..n], == rope_at(P) per row for uniform P (bit-identical). - generate_cached_batch: BatchKVCache [T, G·num_kv, hd] + batched decode_step. decode_attention is already batch-agnostic (bh = G·nh); repeat_kv(nh, batch=G) broadcasts per group. No finished-mask / ragged prompts yet (perf-only / next). - Gate (tests/decode_batch.rs): all G greedy rows token-identical to the single- sequence decode (8 query / 2 kv heads → exercises repeat_kv batching). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 17:18:54 +08:00
Gahow Wang	aaa77082ef	post-train: M4 — clipped_pg_loss + scale_rows (GRPO policy-gradient op) The GRPO (M4) token-level loss op + the one primitive it needs: - scale_rows(x[r,c], s[r]): per-row scale (new ~5-line CUDA kernel). The clipped-PG backward scales each completion token's row of (probs − onehot) by its own per-token coefficient, which cross_entropy_backward's single scalar scale can't express. - clipped_pg_loss(logits, target, logp_old, logp_ref, A, eps, beta): per-token ρ_t = exp(logπθ_t − logp_old_t), L = −mean min(ρA, clip(ρ,1±ε)A) + β·mean KL (k3 estimator), masked to completion tokens. Backward reuses the CE machinery (probs − onehot) + scale_rows. Gates: grad-check the active PG path + the A=0 (KL-only) path; degenerate value checks ε→∞ ⇒ vanilla PG, β=0 ⇒ no KL. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 14:07:02 +08:00
Gahow Wang	c88e2ab88c	post-train: M2 — decode primitives (rope_at + decode_attention) Two forward-only Tensor primitives the KV-cache decode engine is built on, each gated by an isolated correctness test: - rope_at(theta, pos0): RoPE at an absolute position (pos = pos0 + row, no modulo) for a single decode token, vs the training rope_k (pos = row % period) left untouched. New forward-only CUDA kernel, no training-path risk. Gate: bit-identical to the full-sequence rope's corresponding row. - decode_attention(k, v, scale): single-query × cached-K/V SDPA, composed from the existing strided batched GEMM + plain (non-causal) softmax — no new kernel. Gate: equals the full causal attention's last query row (max \|Δ\| 6e-8). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:00:03 +08:00
Gahow Wang	830d06ad01	gqa: real grouped-query attention (repeat_kv op + both SDPA paths + wiring + tests) - repeat_kv CUDA kernel: fwd head-block gather, bwd DETERMINISTIC group-sum (each kv head sums its group of query-head grads; no atomics) + Tensor/ops node. - Config gains num_kv_heads (default = n_heads → MHA); wk/wv project to kv_dim; attention() repeat_kv-broadcasts K/V to nh heads before the UNCHANGED composed & flash SDPA → GQA on both paths. group=1 is identity → MHA bit-identical. - --kv-heads flag on train/train_ddp/export_safetensors/greedy_sample; export writes real num_key_value_heads (xserv repeat_kv grouping aligned). - Tests: repeat_kv grad-check (group>1 grad-sum + group=1 identity); model gqa.rs (GQA flash==composed fp32/bf16, group=1 bit-identical to MHA, kv-proj shape); parity_dump+parity.py GQA path (repeat_interleave) via XTRAIN_PARITY_KV_HEADS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 01:37:37 +08:00
Gahow Wang	c36cdf74d1	Merge t18-dropout into main Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> # Conflicts: # README.md # crates/xtrain-autodiff/tests/autograd.rs # crates/xtrain-model/src/model.rs # crates/xtrain-train/src/bin/train.rs # crates/xtrain-train/src/train_loop.rs # docs/evolution.md	2026-06-18 00:41:41 +08:00
Gahow Wang	1fdd0c5002	dropout: device RNG kernel + Tensor fwd/bwd (T18) csrc/ops/dropout.cu: counter-based RNG (splitmix64 over seed^index) → fp32 uniform → Bernoulli(keep=1-p); fwd writes out=x⊙mask + an fp32 mask buffer (per-element 1/(1-p) or 0); bwd applies the same mask (dx=d⊙mask). fp32 + bf16 activation variants (mask fp32 in both; uniform is dtype-independent so masks match across precisions). Stateless → re-run with same seed = same mask (T13 recompute-safe). Registered in build.rs + FFI decls. Tensor::dropout(p,seed)->(out,mask) and Tensor::dropout_backward(d,mask) wrap the launches (contiguous F32/BF16, default stream, per-op sync via the kernels). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 00:05:18 +08:00
Gahow Wang	326a6fadfe	cuda: fused flash-attention kernel (fwd + flash-style bwd) csrc/ops/flash_attention.cu: a single fused fwd kernel (one block per query row, streams KV in tiles of 32, online softmax — running max/sum + rescaled V accumulator, causal mask inlined, never materializes the [bh,S,S] scores) writing out[bh,S,hd] + the per-row logsumexp L (O(N), saved for backward). flash-style bwd: recompute scores from Q/K/V + L, collapse the softmax Jacobian with D[i]=ΣdO·O, dQ owned per row, dK/dV atomicAdd across rows. Tensor::flash_attention / flash_attention_backward wrap them (bf16 upcasts Q/K/V→f32 for the kernel, same fp32-softmax policy as composed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:25 +08:00
Gahow Wang	d05115ddf3	cuda: bf16 cuBLAS GemmEx (16BF in/out, fp32 accum) + cast kernels Add the bf16 compute primitives for T12 mixed precision: - DType::BF16 (half::bf16 as TensorDType), 2 bytes. - cublasGemmEx / cublasGemmStridedBatchedEx FFI + CUDA_R_16BF / CUBLAS_COMPUTE_32F constants (values per xserv gemm.rs). - cublas::gemm_ex / gemm_ex_strided_batched: same row-major⟺col-major transpose algebra as sgemm, bf16 in/out, fp32 accumulation. - csrc/ops/cast.cu: f32<->bf16 cast + bf16 elementwise (add/mul/scale/ silu(+dx)/add_bias/sum_rows), each load->fp32->compute->store bf16. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:14:39 +08:00
Gahow Wang	28801fbfe5	cuda: device caching allocator (pool GpuBuffer alloc) Every tape op allocates its output via Tensor::zeros -> GpuBuffer::alloc -> cudaMalloc, a synchronous process-serialized driver call. Under the single- process thread-per-GPU DDP model the rank threads' hundreds of per-step allocs serialize through the driver (KI-5 root cause); it costs single-GPU too. Add a per-device, size-classed caching pool: GpuBuffer::alloc serves from a free-list (request rounded up to a size class so repeating training shapes reuse buffers), only cudaMalloc on a miss; Drop returns the buffer to the pool instead of cudaFree. Thread-safe via a global registry keyed by device id with each device's free-list behind its own Mutex (registry lock held only to clone out the per-device Arc<Mutex<_>>, so rank threads don't contend across devices). The buffer records its alloc-time device so Drop returns to the right pool. Transparent: physical capacity may be rounded up, but len()/memset/copy bounds all use the requested length, so the rounded tail is never read and numerics are unchanged. zeros() still memsets (reused buffers hold stale bytes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:02 +08:00
Gahow Wang	88c2c15768	Revert "dist: coalesce grads into buckets for all-reduce (KI-5)" This reverts commit `b8b58212dc`.	2026-06-16 09:39:38 +08:00
Gahow Wang	b8b58212dc	dist: coalesce grads into buckets for all-reduce (KI-5) Replace the per-parameter eager all-reduce (~150 tiny serial NCCL calls for dim512, DDP's dominant cost after T10's batched forward) with a coalesced bucketed all-reduce: pack grads into a few large contiguous scratch buffers, all-reduce each bucket once (fused via ncclGroupStart/ End), fold the 1/world average into one per-bucket scale, unpack back. The packed buffer is the concatenation of the grad tensors, so NCCL's element-wise sum over a bucket equals the per-tensor sums — bit-identical to the un-bucketed path; only launch/latency overhead is removed. DDP cross-rank param identity + loss-match are preserved. Adds xtrain_cuda::device::copy_d2d (cudaMemcpy D2D) for the pack/unpack. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:09:44 +08:00
Gahow Wang	7821bd9c34	autograd: batch dim for ops (flatten linears, batched attention) Add the batched-forward primitives. Linears/norms/elementwise/embedding/CE already act on flat [rows,dim], so they work unchanged on [BS,dim]; only attention + RoPE need sequence awareness: - RoPE: kernel takes a `period` (= seq len) so position = row % period, i.e. per-sequence position on a flattened batch (period == tokens = single seq). - Fused batched causal attention: new `Tensor::attention`/`attention_backward` + ops node, running QKᵀ and PV as cublasSgemmStridedBatched over the Bnh (sequence,head) blocks (new sgemm_strided_batched binding) and a causal softmax kernel (scale + per-row causal mask inline) — the whole attention is 3 launches regardless of B*nh, no per-head/per-seq loop, no host round-trip. - transpose_4d12 ([B,S,nh,hd] <-> [B,nh,S,hd]) to lay out the batched heads. grad-checks: new batched-rope, transpose_4d12, batched-attention dQ/dK/dV all pass finite-diff (attn dK 1.5e-2, dQ 7.5e-3, dV 2.9e-4; rest tighter) alongside the existing 12. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:15 +08:00
Gahow Wang	a842e432b5	perf: streams / drop per-op sync Default-stream kernels run in order and every host read goes through a stream-ordered cudaMemcpy (to_device), so the per-op cudaDeviceSynchronize after each kernel was pure overhead — remove all 21 in tensor.rs. Host data is still correctly ordered by the D2H memcpy that reads it. Also zero op-output buffers with cudaMemset (device-side, async) instead of a blocking H2D memcpy of a host zero buffer on every allocation — that copy was itself a hidden per-op sync point. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:56:17 +08:00
Gahow Wang	b0e397ca81	perf: GPU AdamW + grad-norm Eliminate the per-step GPU↔host roundtrip of every parameter/gradient. - optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum (block-reduced global grad sum-of-squares), scale_inplace. - GpuAdamW: device m/v state per param; step launches the kernel reading each param's .grad() and rewriting the param buffer in place — no host roundtrip. Host AdamW kept as the torch-parity reference. - clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm comes back), in-place rescale of grads by pre_scale·clip_factor. - train_loop: use GpuAdamW + clip_grad_norm_gpu. - test: GPU AdamW vs host reference parity (max abs err < 1e-6). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:09 +08:00
Gahow Wang	0e5c7d22e2	perf: cuBLAS matmul fwd/bwd Route Tensor::matmul and matmul_backward through cuBLAS Sgemm instead of the hand-written tiled kernel. fp32 → same GEMM up to rounding order, so the T3 cuBLAS tolerance and downstream grad-checks are preserved. - cublas.rs: thread-local persistent handle + row-major sgemm helper with transpose flags (col-major⟺row-major as the T3 oracle does). - matmul_backward: dA/dB via cuBLAS OP_T, dropping the two transpose kernels + their allocations the T3 version ran. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:48:35 +08:00
Gahow Wang	7fb1a29057	ops: embedding/reshape/transpose/split-merge-heads fwd+bwd Phase T5 structural ops on top of the T4 set, needed to assemble the tiny transformer: - embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward (atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi. - reshape: contiguous metadata-only view (Tensor::reshape), no kernel. - transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel). - autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus split_heads (->Vec<Var>) / merge_heads for per-head attention. - tape: Var::zero_grad + set_value so a hand-written GD step can update params and clear grads between steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:05:09 +08:00
Gahow Wang	5aef3742d6	ops: transformer op fwd/bwd CUDA kernels + Tensor wrappers add/mul/add_bias(+sum_rows)/rms_norm/silu/rope/softmax/cross_entropy, each with its analytic backward, in csrc/ops/nn.cu (inlined warp/block reductions). FFI declarations + nn.cu in build.rs (no_cuda gated). Tensor gains the matching thin wrappers; DType grows I32 for cross-entropy targets. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:44:09 +08:00
Gahow Wang	08c88bf360	gemm: tiled F32 forward + transpose + backward (dA/dB) Hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32 accumulate, boundary-masked) plus an out-of-place transpose kernel. Wire both through xtrain-cuda FFI (no_cuda-gated) and expose at the tensor level: Tensor::matmul, transpose_2d, and matmul_backward computing dA = dC·Bᵀ and dB = Aᵀ·dC by materializing transposes and reusing the forward. Also declare cuBLAS sgemm FFI + link cublas, used only as a correctness reference in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:26:51 +08:00
Gahow Wang	63dc05fd10	tensor: add scale elementwise CUDA kernel + FFI New csrc/ops/elementwise.cu (out[i]=in[i]*alpha), compiled by xtrain-cuda/build.rs and exposed via launch_scale_f32 FFI, gated behind not(no_cuda) like the existing vecadd smoke test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:13:06 +08:00
Gahow Wang	92acf9f413	T1: scaffold repo + Rust/CUDA build chain (vecadd smoke test) Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu via the cc crate targeting sm_120 (RTX 5090) and links cudart. A gated integration test runs the vector-add kernel on the GPU and asserts the result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA compilation and sets a `no_cuda` cfg so host-side cargo check still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 14:42:43 +08:00

21 Commits