Files
xtrain/crates
Gahow Wang 28801fbfe5 cuda: device caching allocator (pool GpuBuffer alloc)
Every tape op allocates its output via Tensor::zeros -> GpuBuffer::alloc ->
cudaMalloc, a synchronous process-serialized driver call. Under the single-
process thread-per-GPU DDP model the rank threads' hundreds of per-step allocs
serialize through the driver (KI-5 root cause); it costs single-GPU too.

Add a per-device, size-classed caching pool: GpuBuffer::alloc serves from a
free-list (request rounded up to a size class so repeating training shapes
reuse buffers), only cudaMalloc on a miss; Drop returns the buffer to the pool
instead of cudaFree. Thread-safe via a global registry keyed by device id with
each device's free-list behind its own Mutex (registry lock held only to clone
out the per-device Arc<Mutex<_>>, so rank threads don't contend across devices).
The buffer records its alloc-time device so Drop returns to the right pool.

Transparent: physical capacity may be rounded up, but len()/memset/copy bounds
all use the requested length, so the rounded tail is never read and numerics are
unchanged. zeros() still memsets (reused buffers hold stale bytes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:04:02 +08:00
..
2026-06-16 00:44:25 +08:00