New xtrain-optim crate. AdamW with per-param m/v moments keyed by params()
index, global bias correction, and decoupled weight decay (matches
torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers,
unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips
each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr
argument leaves room for an LR schedule.
Host unit test checks the update against an independent reference recurrence
over 20 steps and the pure-decay (g=0) boundary.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>