xtrain

Files

Gahow Wang c88e2ab88c post-train: M2 — decode primitives (rope_at + decode_attention)

Two forward-only Tensor primitives the KV-cache decode engine is built on,
each gated by an isolated correctness test:

- rope_at(theta, pos0): RoPE at an absolute position (pos = pos0 + row, no
  modulo) for a single decode token. The training rope_k (pos = row % period)
  is left untouched. Implemented as a new forward-only CUDA kernel, so there
  is no training-path risk. Gate: bit-identical to the corresponding row of
  the full-sequence rope.
- decode_attention(k, v, scale): single-query × cached-K/V SDPA, composed from
  the existing strided batched GEMM and a plain (non-causal) softmax, with no
  new kernel. Gate: equals the last query row of full causal attention
  (max |Δ| 6e-8).
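
The two gates above can be illustrated with a small pure-Python reference. This is only a sketch under stated assumptions, not the repository's CUDA code: the interleaved (even, odd) pairing, the reading of theta as the RoPE frequency base, the explicit q argument, and the helper names rope_row, softmax, dot and causal_attention are all hypothetical choices made for illustration.

```python
import math

def rope_row(x, theta, pos):
    """Rotate one row by RoPE angles for position pos.
    Assumption: interleaved (even, odd) pairs, theta is the frequency base."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def rope_k(x, theta, period):
    """Training-style RoPE over a full sequence: pos = row % period."""
    return [rope_row(r, theta, i % period) for i, r in enumerate(x)]

def rope_at(x, theta, pos0):
    """Decode-style RoPE: pos = pos0 + row, no modulo."""
    return [rope_row(r, theta, pos0 + i) for i, r in enumerate(x)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def softmax(xs):
    """Numerically stable softmax; masked (-inf) entries become exactly 0."""
    m = max(xs)
    e = [math.exp(v - m) for v in xs]
    z = sum(e)
    return [v / z for v in e]

def decode_attention(q, k, v, scale):
    """One query row against all cached K/V rows, with a plain softmax.
    Assumption: q is passed explicitly here; the commit's signature omits it."""
    w = softmax([scale * dot(q, kj) for kj in k])
    return [sum(w[t] * v[t][c] for t in range(len(v))) for c in range(len(v[0]))]

def causal_attention(q, k, v, scale):
    """Full-sequence attention where query i only sees keys j <= i."""
    out = []
    for i, qi in enumerate(q):
        scores = [scale * dot(qi, kj) if j <= i else float("-inf")
                  for j, kj in enumerate(k)]
        w = softmax(scores)
        out.append([sum(w[t] * v[t][c] for t in range(len(v)))
                    for c in range(len(v[0]))])
    return out
```

In this sketch the last query row of causal_attention has no masked keys, which is why a plain softmax over the cached keys reproduces it. Likewise, rope_at with pos0 = r matches row r of rope_k only while r is below the period; beyond that the two positions diverge by design.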

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-30 12:00:03 +08:00

attention.cu

autograd: batch dim for ops (flatten linears, batched attention)

2026-06-16 00:44:15 +08:00

cast.cu

cuda: bf16 cuBLAS GemmEx (16BF in/out, fp32 accum) + cast kernels

2026-06-16 14:14:39 +08:00

dropout.cu

dropout: device RNG kernel + Tensor fwd/bwd (T18)

2026-06-18 00:05:18 +08:00

elementwise.cu

tensor: add scale elementwise CUDA kernel + FFI