Device-resident KV cache: keep K/V on the GPU as [bh,T,hd], grow by one token
per step via a new cat_seq kernel (concat along seq) — removes the M2a/M2b
per-layer host round-trip (to_cpu/from_slice/re-upload) AND the transpose_3d01.
Both single-seq and batched decode refactored to it; cache is Option<Tensor>
per layer (cleaner than the host Vec + rebuild).
Gates all hold: cat_seq == host concat; decode_kv single-seq + decode_batch
G-way both still TOKEN-IDENTICAL; GQA training path unaffected.
Honest measurement (the point): removing the host round-trip buys ~10% on pure
single-seq decode (133 → 147 tok/s @128) but does NOT move the GRPO step
(~8.5 s/step unchanged) — because after M2b batching the rollout is no longer
the step's bottleneck; the per-sample per_token_logp captures + the PG-update
forwards/backwards (model.forward, full-seq) now dominate. Measure-first lesson
(cf. T11/T17/M2a): the long pole shifted to the training-side forwards; the next
decode lever (ragged batched prefill) targets those, not the cache.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The rollout long-pole fix deferred from M2a: decode the G samples of one prompt
in lockstep (one forward per step over the group → G× fewer kernel launches).
- rope_pos(x, positions[]): RoPE with a per-row absolute position (new forward-
only kernel) — G rows share one decode position. Gate: == full rope for
[0..n], == rope_at(P) per row for uniform P (bit-identical).
- generate_cached_batch: BatchKVCache [T, G·num_kv, hd] + batched decode_step.
decode_attention is already batch-agnostic (bh = G·nh); repeat_kv(nh, batch=G)
broadcasts per group. No finished-mask / ragged prompts yet (perf-only / next).
- Gate (tests/decode_batch.rs): all G greedy rows token-identical to the single-
sequence decode (8 query / 2 kv heads → exercises repeat_kv batching).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The GRPO (M4) token-level loss op + the one primitive it needs:
- scale_rows(x[r,c], s[r]): per-row scale (new ~5-line CUDA kernel). The
clipped-PG backward scales each completion token's row of (probs − onehot) by
its own per-token coefficient, which cross_entropy_backward's single scalar
scale can't express.
- clipped_pg_loss(logits, target, logp_old, logp_ref, A, eps, beta): per-token
ρ_t = exp(logπθ_t − logp_old_t), L = −mean min(ρA, clip(ρ,1±ε)A) + β·mean KL
(k3 estimator), masked to completion tokens. Backward reuses the CE machinery
(probs − onehot) + scale_rows. Gates: grad-check the active PG path + the A=0
(KL-only) path; degenerate value checks ε→∞ ⇒ vanilla PG, β=0 ⇒ no KL.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two forward-only Tensor primitives the KV-cache decode engine is built on,
each gated by an isolated correctness test:
- rope_at(theta, pos0): RoPE at an absolute position (pos = pos0 + row, no
modulo) for a single decode token, vs the training rope_k (pos = row %
period) left untouched. New forward-only CUDA kernel, no training-path risk.
Gate: bit-identical to the full-sequence rope's corresponding row.
- decode_attention(k, v, scale): single-query × cached-K/V SDPA, composed from
the existing strided batched GEMM + plain (non-causal) softmax — no new
kernel. Gate: equals the full causal attention's last query row (max |Δ| 6e-8).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- repeat_kv CUDA kernel: fwd head-block gather, bwd DETERMINISTIC group-sum (each
kv head sums its group of query-head grads; no atomics) + Tensor/ops node.
- Config gains num_kv_heads (default = n_heads → MHA); wk/wv project to kv_dim;
attention() repeat_kv-broadcasts K/V to nh heads before the UNCHANGED composed
& flash SDPA → GQA on both paths. group=1 is identity → MHA bit-identical.
- --kv-heads flag on train/train_ddp/export_safetensors/greedy_sample; export
writes real num_key_value_heads (xserv repeat_kv grouping aligned).
- Tests: repeat_kv grad-check (group>1 grad-sum + group=1 identity); model gqa.rs
(GQA flash==composed fp32/bf16, group=1 bit-identical to MHA, kv-proj shape);
parity_dump+parity.py GQA path (repeat_interleave) via XTRAIN_PARITY_KV_HEADS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
csrc/ops/dropout.cu: counter-based RNG (splitmix64 over seed^index) → fp32
uniform → Bernoulli(keep=1-p); fwd writes out=x⊙mask + an fp32 mask buffer
(per-element 1/(1-p) or 0); bwd applies the same mask (dx=d⊙mask). fp32 + bf16
activation variants (mask fp32 in both; uniform is dtype-independent so masks
match across precisions). Stateless → re-run with same seed = same mask (T13
recompute-safe). Registered in build.rs + FFI decls.
Tensor::dropout(p,seed)->(out,mask) and Tensor::dropout_backward(d,mask) wrap the
launches (contiguous F32/BF16, default stream, per-op sync via the kernels).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
csrc/ops/flash_attention.cu: a single fused fwd kernel (one block per
query row, streams KV in tiles of 32, online softmax — running max/sum
+ rescaled V accumulator, causal mask inlined, never materializes the
[bh,S,S] scores) writing out[bh,S,hd] + the per-row logsumexp L (O(N),
saved for backward). flash-style bwd: recompute scores from Q/K/V + L,
collapse the softmax Jacobian with D[i]=ΣdO·O, dQ owned per row, dK/dV
atomicAdd across rows. Tensor::flash_attention / flash_attention_backward
wrap them (bf16 upcasts Q/K/V→f32 for the kernel, same fp32-softmax
policy as composed).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Every tape op allocates its output via Tensor::zeros -> GpuBuffer::alloc ->
cudaMalloc, a synchronous process-serialized driver call. Under the single-
process thread-per-GPU DDP model the rank threads' hundreds of per-step allocs
serialize through the driver (KI-5 root cause); it costs single-GPU too.
Add a per-device, size-classed caching pool: GpuBuffer::alloc serves from a
free-list (request rounded up to a size class so repeating training shapes
reuse buffers), only cudaMalloc on a miss; Drop returns the buffer to the pool
instead of cudaFree. Thread-safe via a global registry keyed by device id with
each device's free-list behind its own Mutex (registry lock held only to clone
out the per-device Arc<Mutex<_>>, so rank threads don't contend across devices).
The buffer records its alloc-time device so Drop returns to the right pool.
Transparent: physical capacity may be rounded up, but len()/memset/copy bounds
all use the requested length, so the rounded tail is never read and numerics are
unchanged. zeros() still memsets (reused buffers hold stale bytes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the per-parameter eager all-reduce (~150 tiny serial NCCL calls
for dim512, DDP's dominant cost after T10's batched forward) with a
coalesced bucketed all-reduce: pack grads into a few large contiguous
scratch buffers, all-reduce each bucket once (fused via ncclGroupStart/
End), fold the 1/world average into one per-bucket scale, unpack back.
The packed buffer is the concatenation of the grad tensors, so NCCL's
element-wise sum over a bucket equals the per-tensor sums — bit-identical
to the un-bucketed path; only launch/latency overhead is removed. DDP
cross-rank param identity + loss-match are preserved.
Adds xtrain_cuda::device::copy_d2d (cudaMemcpy D2D) for the pack/unpack.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the batched-forward primitives. Linears/norms/elementwise/embedding/CE
already act on flat [rows,dim], so they work unchanged on [B*S,dim]; only
attention + RoPE need sequence awareness:
- RoPE: kernel takes a `period` (= seq len) so position = row % period, i.e.
per-sequence position on a flattened batch (period == tokens = single seq).
- Fused batched causal attention: new `Tensor::attention`/`attention_backward`
+ ops node, running QKᵀ and PV as cublasSgemmStridedBatched over the B*nh
(sequence,head) blocks (new sgemm_strided_batched binding) and a causal
softmax kernel (scale + per-row causal mask inline) — the whole attention is
3 launches regardless of B*nh, no per-head/per-seq loop, no host round-trip.
- transpose_4d12 ([B,S,nh,hd] <-> [B,nh,S,hd]) to lay out the batched heads.
grad-checks: new batched-rope, transpose_4d12, batched-attention dQ/dK/dV all
pass finite-diff (attn dK 1.5e-2, dQ 7.5e-3, dV 2.9e-4; rest tighter) alongside
the existing 12.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Default-stream kernels run in order and every host read goes through a
stream-ordered cudaMemcpy (to_device), so the per-op cudaDeviceSynchronize
after each kernel was pure overhead — remove all 21 in tensor.rs. Host
data is still correctly ordered by the D2H memcpy that reads it.
Also zero op-output buffers with cudaMemset (device-side, async) instead of
a blocking H2D memcpy of a host zero buffer on every allocation — that
copy was itself a hidden per-op sync point.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Eliminate the per-step GPU↔host roundtrip of every parameter/gradient.
- optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum
(block-reduced global grad sum-of-squares), scale_inplace.
- GpuAdamW: device m/v state per param; step launches the kernel reading
each param's .grad() and rewriting the param buffer in place — no host
roundtrip. Host AdamW kept as the torch-parity reference.
- clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm
comes back), in-place rescale of grads by pre_scale·clip_factor.
- train_loop: use GpuAdamW + clip_grad_norm_gpu.
- test: GPU AdamW vs host reference parity (max abs err < 1e-6).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Route Tensor::matmul and matmul_backward through cuBLAS Sgemm instead of
the hand-written tiled kernel. fp32 → same GEMM up to rounding order, so
the T3 cuBLAS tolerance and downstream grad-checks are preserved.
- cublas.rs: thread-local persistent handle + row-major sgemm helper with
transpose flags (col-major⟺row-major as the T3 oracle does).
- matmul_backward: dA/dB via cuBLAS OP_T, dropping the two transpose
kernels + their allocations the T3 version ran.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase T5 structural ops on top of the T4 set, needed to assemble the
tiny transformer:
- embedding: gather rows by I32 ids (CUDA kernel) / scatter-add backward
(atomic, so repeated ids accumulate). csrc/ops/model.cu + ffi.
- reshape: contiguous metadata-only view (Tensor::reshape), no kernel.
- transpose_3d01: [a,b,c]->[b,a,c] for the multi-head layout (kernel).
- autograd nodes: embedding/reshape/transpose_3d01/transpose_2d, plus
split_heads (->Vec<Var>) / merge_heads for per-head attention.
- tape: Var::zero_grad + set_value so a hand-written GD step can update
params and clear grads between steps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
add/mul/add_bias(+sum_rows)/rms_norm/silu/rope/softmax/cross_entropy,
each with its analytic backward, in csrc/ops/nn.cu (inlined warp/block
reductions). FFI declarations + nn.cu in build.rs (no_cuda gated). Tensor
gains the matching thin wrappers; DType grows I32 for cross-entropy targets.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Hand-written tiled GEMM (csrc/ops/gemm.cu, TILE_SIZE=32, FP32 accumulate,
boundary-masked) plus an out-of-place transpose kernel. Wire both through
xtrain-cuda FFI (no_cuda-gated) and expose at the tensor level:
Tensor::matmul, transpose_2d, and matmul_backward computing
dA = dC·Bᵀ and dB = Aᵀ·dC by materializing transposes and reusing the
forward. Also declare cuBLAS sgemm FFI + link cublas, used only as a
correctness reference in tests.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New csrc/ops/elementwise.cu (out[i]=in[i]*alpha), compiled by
xtrain-cuda/build.rs and exposed via launch_scale_f32 FFI, gated behind
not(no_cuda) like the existing vecadd smoke test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stand up the xtrain project skeleton: a Cargo workspace mirroring xserv's
csrc/ + crates/ layout, with a single xtrain-cuda crate that wraps the CUDA
Runtime over hand-written extern "C" FFI. build.rs compiles csrc/test/vecadd.cu
via the cc crate targeting sm_120 (RTX 5090) and links cudart.
A gated integration test runs the vector-add kernel on the GPU and asserts the
result. When nvcc is absent (local GPU-less machine), build.rs skips CUDA
compilation and sets a `no_cuda` cfg so host-side cargo check still works.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>