Device caching/pool allocator removes the per-op cudaMalloc serialization that
was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX
5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s):
single-GPU: 40226 -> 92638 tok/s (~2.3x)
DDP scaling (global batch 32*world):
world before after
1 39801 1.00x 92385 1.00x
2 47229 1.19x 146821 1.59x
4 52854 1.33x 269867 2.92x
8 48996 1.23x 461270 4.99x
8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs
at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are
bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export
of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it.
Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the
design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the
~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever
if v4 needs higher linearity.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>