3 Commits

Author SHA1 Message Date
8070c1949a perf: make xtrain-cuda a regular dep of xtrain-optim (GPU AdamW)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:53:52 +08:00
b0e397ca81 perf: GPU AdamW + grad-norm
Eliminate the per-step GPU↔host roundtrip of every parameter/gradient.

- optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum
  (block-reduced global grad sum-of-squares), scale_inplace.
- GpuAdamW: device m/v state per param; step launches the kernel reading
  each param's .grad() and rewriting the param buffer in place — no host
  roundtrip. Host AdamW kept as the torch-parity reference.
- clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm
  comes back), in-place rescale of grads by pre_scale·clip_factor.
- train_loop: use GpuAdamW + clip_grad_norm_gpu.
- test: GPU AdamW vs host reference parity (max abs err < 1e-6).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:53:09 +08:00
f22429f5b8 optim: hand-written AdamW (decoupled weight decay + bias correction)
New xtrain-optim crate. AdamW with per-param m/v moments keyed by params()
index, global bias correction, and decoupled weight decay (matches
torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers,
unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips
each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr
argument leaves room for an LR schedule.

Host unit test checks the update against an independent reference recurrence
over 20 steps and the pure-decay (g=0) boundary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 16:28:23 +08:00