xtrain

gahow/xtrain

Fork 0

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	8070c1949a	perf: make xtrain-cuda a regular dep of xtrain-optim (GPU AdamW) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:52 +08:00
Gahow Wang	b0e397ca81	perf: GPU AdamW + grad-norm Eliminate the per-step GPU↔host roundtrip of every parameter/gradient. - optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum (block-reduced global grad sum-of-squares), scale_inplace. - GpuAdamW: device m/v state per param; step launches the kernel reading each param's .grad() and rewriting the param buffer in place — no host roundtrip. Host AdamW kept as the torch-parity reference. - clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm comes back), in-place rescale of grads by pre_scale·clip_factor. - train_loop: use GpuAdamW + clip_grad_norm_gpu. - test: GPU AdamW vs host reference parity (max abs err < 1e-6). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:53:09 +08:00
Gahow Wang	f22429f5b8	optim: hand-written AdamW (decoupled weight decay + bias correction) New xtrain-optim crate. AdamW with per-param m/v moments keyed by params() index, global bias correction, and decoupled weight decay (matches torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers, unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr argument leaves room for an LR schedule. Host unit test checks the update against an independent reference recurrence over 20 steps and the pure-decay (g=0) boundary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 16:28:23 +08:00

Author

SHA1

Message

Date

Gahow Wang

8070c1949a

perf: make xtrain-cuda a regular dep of xtrain-optim (GPU AdamW)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 16:53:52 +08:00

Gahow Wang

b0e397ca81

perf: GPU AdamW + grad-norm

Eliminate the per-step GPU↔host roundtrip of every parameter/gradient.

- optim.cu: adamw_step (m/v on device, in-place param update), sumsq_accum
  (block-reduced global grad sum-of-squares), scale_inplace.
- GpuAdamW: device m/v state per param; step launches the kernel reading
  each param's .grad() and rewriting the param buffer in place — no host
  roundtrip. Host AdamW kept as the torch-parity reference.
- clip_grad_norm_gpu: device sum-of-squares reduction (only the scalar norm
  comes back), in-place rescale of grads by pre_scale·clip_factor.
- train_loop: use GpuAdamW + clip_grad_norm_gpu.
- test: GPU AdamW vs host reference parity (max abs err < 1e-6).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 16:53:09 +08:00

Gahow Wang

f22429f5b8

optim: hand-written AdamW (decoupled weight decay + bias correction)

New xtrain-optim crate. AdamW with per-param m/v moments keyed by params()
index, global bias correction, and decoupled weight decay (matches
torch.optim.AdamW). Split into a pure-host step_host (flat f32 buffers,
unit-testable on a GPU-less host) and a step(&[Var]) wrapper that round-trips
each param value/grad through the GPU tensor (gated not(no_cuda)). Per-step lr
argument leaves room for an LR schedule.

Host unit test checks the update against an independent reference recurrence
over 20 steps and the pure-decay (g=0) boundary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 16:28:23 +08:00

3 Commits