xtrain

Files

Gahow Wang b0e397ca81 perf: GPU AdamW + grad-norm

Eliminate the per-step GPU↔host roundtrip of every parameter/gradient.

- optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum
  (block-reduced global grad sum-of-squares), scale_inplace.
- GpuAdamW: device m/v state per param; step launches the kernel reading
  each param's .grad() and rewriting the param buffer in place — no host
  roundtrip. Host AdamW kept as the torch-parity reference.
- clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm
  comes back), in-place rescale of grads by pre_scale·clip_factor.
- train_loop: use GpuAdamW + clip_grad_norm_gpu.
- test: GPU AdamW vs host reference parity (max abs err < 1e-6).
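
For readers unfamiliar with the math behind the bullets above, here is a minimal host-side sketch of what an AdamW step and a global grad-norm clip compute. It is not the code from optim.cu. The names adamw_step_ref, sumsq_accum_ref, clip_grad_norm_ref, and AdamWConfig are hypothetical, as are the default hyperparameters. The update order follows the decoupled weight-decay form used by torch.optim.AdamW. How pre_scale enters the norm is an assumption: the sketch measures the norm on the pre-scaled gradients.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical hyperparameter bundle; defaults are illustrative only.
struct AdamWConfig {
    float lr = 1e-3f;
    float beta1 = 0.9f;
    float beta2 = 0.999f;
    float eps = 1e-8f;
    float weight_decay = 1e-2f;
};

// One AdamW update for a single parameter buffer, in place.
// t is the 1-based step count used for bias correction.
void adamw_step_ref(std::vector<float>& p, const std::vector<float>& g,
                    std::vector<float>& m, std::vector<float>& v,
                    int t, const AdamWConfig& c) {
    assert(p.size() == g.size() && p.size() == m.size() && p.size() == v.size());
    const float bc1 = 1.0f - std::pow(c.beta1, static_cast<float>(t));
    const float bc2 = 1.0f - std::pow(c.beta2, static_cast<float>(t));
    for (std::size_t i = 0; i < p.size(); ++i) {
        p[i] *= 1.0f - c.lr * c.weight_decay;                    // decoupled weight decay
        m[i] = c.beta1 * m[i] + (1.0f - c.beta1) * g[i];         // first moment
        v[i] = c.beta2 * v[i] + (1.0f - c.beta2) * g[i] * g[i];  // second moment
        const float denom = std::sqrt(v[i]) / std::sqrt(bc2) + c.eps;
        p[i] -= (c.lr / bc1) * m[i] / denom;
    }
}

// Adds the sum of squares of one gradient buffer to a running total.
double sumsq_accum_ref(const std::vector<float>& g, double acc) {
    for (float x : g) acc += static_cast<double>(x) * x;
    return acc;
}

// Rescales all gradients in place by pre_scale * clip_factor and returns the
// pre-clip norm. Assumption: the norm is taken over the pre-scaled gradients.
float clip_grad_norm_ref(std::vector<std::vector<float>>& grads,
                         float max_norm, float pre_scale) {
    double sumsq = 0.0;
    for (const auto& g : grads) sumsq = sumsq_accum_ref(g, sumsq);
    const float norm = pre_scale * static_cast<float>(std::sqrt(sumsq));
    const float clip_factor = std::min(1.0f, max_norm / (norm + 1e-6f));
    const float s = pre_scale * clip_factor;
    for (auto& g : grads)
        for (float& x : g) x *= s;
    return norm;
}
```

In a device version only the scalar sum of squares would need to cross back to the host. The per-element loops map directly onto one thread per element.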

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 16:53:09 +08:00

elementwise.cu

tensor: add scale elementwise CUDA kernel + FFI

2026-06-15 15:13:06 +08:00

gemm.cu

gemm: tiled F32 forward + transpose + backward (dA/dB)

2026-06-15 15:26:51 +08:00

model.cu

ops: embedding/reshape/transpose/split-merge-heads fwd+bwd

2026-06-15 16:05:09 +08:00

nn.cu

ops: transformer op fwd/bwd CUDA kernels + Tensor wrappers

2026-06-15 15:44:09 +08:00

optim.cu

perf: GPU AdamW + grad-norm

2026-06-15 16:53:09 +08:00