xtrain

Gahow Wang a842e432b5 perf: streams / drop per-op sync

Kernels on the default stream run in submission order, and every host read
goes through a stream-ordered cudaMemcpy (to_device), so the
cudaDeviceSynchronize after each kernel was pure overhead. Remove all 21
calls in tensor.rs. Host data is still correctly ordered by the D2H memcpy
that reads it.

Also zero op-output buffers with cudaMemset (device-side, async) instead of
a blocking H2D memcpy of a host zero buffer on every allocation. That
copy was itself a hidden per-op sync point.
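
The ordering argument above can be sketched with a toy, CPU-only model of
one in-order stream. This is not the code in tensor.rs and it calls no CUDA
API: `Stream`, `Op`, `upload_blocking`, `copy_to_host` and `run` are
hypothetical names invented for illustration, and the model assumes that a
blocking copy makes the host wait for all previously queued work. Under
those assumptions, both paths produce the same host data, but the old path
makes the host wait once per op while the new path waits only at the final
read.

```rust
use std::collections::VecDeque;

/// Work items for the toy stream (hypothetical, not a CUDA type).
enum Op {
    /// Device-side fill, standing in for an async cudaMemset.
    Memset(f32),
    /// Elementwise kernel, standing in for any op kernel.
    AddScalar(f32),
}

/// Toy model of a single in-order stream plus one device buffer.
struct Stream {
    queue: VecDeque<Op>,
    device_buf: Vec<f32>,
    /// How many times the host had to block on the stream.
    host_waits: usize,
}

impl Stream {
    fn new(len: usize) -> Self {
        // NaN stands in for uninitialised device memory.
        Stream { queue: VecDeque::new(), device_buf: vec![f32::NAN; len], host_waits: 0 }
    }

    /// Asynchronous submission: returns without running anything.
    fn enqueue(&mut self, op: Op) {
        self.queue.push_back(op);
    }

    /// Host blocks until all queued work has run, in submission order.
    fn synchronize(&mut self) {
        self.host_waits += 1;
        while let Some(op) = self.queue.pop_front() {
            match op {
                Op::Memset(v) => self.device_buf.iter_mut().for_each(|x| *x = v),
                Op::AddScalar(v) => self.device_buf.iter_mut().for_each(|x| *x += v),
            }
        }
    }

    /// Blocking host-to-device copy: assumed to wait for earlier work first.
    fn upload_blocking(&mut self, host: &[f32]) {
        self.synchronize();
        self.device_buf.copy_from_slice(host);
    }

    /// Stream-ordered device-to-host read: earlier work finishes, then copy.
    fn copy_to_host(&mut self) -> Vec<f32> {
        self.synchronize();
        self.device_buf.clone()
    }
}

/// Runs zero-fill + two kernels + one host read.
/// `old_path = true`: blocking zero upload and a sync after every kernel.
/// `old_path = false`: queued memset, no per-op sync.
fn run(old_path: bool) -> (Vec<f32>, usize) {
    let mut s = Stream::new(4);
    if old_path {
        s.upload_blocking(&[0.0; 4]);
    } else {
        s.enqueue(Op::Memset(0.0));
    }
    for v in vec![1.5_f32, 2.0] {
        s.enqueue(Op::AddScalar(v));
        if old_path {
            s.synchronize();
        }
    }
    let host = s.copy_to_host();
    (host, s.host_waits)
}

fn main() {
    let (old_data, old_waits) = run(true);
    let (new_data, new_waits) = run(false);
    assert_eq!(old_data, new_data);
    println!("old path: {} host waits, new path: {} host waits", old_waits, new_waits);
}
```

The model only captures ordering, not timing. On real hardware the benefit
comes from the host queuing further work while earlier kernels are still
running, which this sketch does not measure.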

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 16:56:17 +08:00

xtrain-autodiff

ops: grad-check the T5 structural ops

2026-06-15 16:05:20 +08:00

xtrain-cuda

perf: streams / drop per-op sync

2026-06-15 16:56:17 +08:00

xtrain-model

model: silence torch parity warning (read loss before backward)

2026-06-15 16:09:30 +08:00

xtrain-optim

perf: make xtrain-cuda a regular dep of xtrain-optim (GPU AdamW)

2026-06-15 16:53:52 +08:00

xtrain-tensor

perf: streams / drop per-op sync

2026-06-15 16:56:17 +08:00

xtrain-train

perf: GPU AdamW + grad-norm

2026-06-15 16:53:09 +08:00