xtrain

gahow/xtrain

Fork 0

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	f85bd4d276	perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8 Device caching/pool allocator removes the per-op cudaMalloc serialization that was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX 5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s): single-GPU: 40226 -> 92638 tok/s (~2.3x) DDP scaling (global batch 32*world): world before after 1 39801 1.00x 92385 1.00x 2 47229 1.19x 146821 1.59x 4 52854 1.33x 269867 2.92x 8 48996 1.23x 461270 4.99x 8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it. Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the ~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever if v4 needs higher linearity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:15:02 +08:00
Gahow Wang	4c3f332f64	docs: Phase T11 — caching allocator Design doc for the device caching/pool allocator (KI-5 re-diagnosis recap, size classes, per-device + thread-safety, Drop->return, transparency/correctness argument, why skip-memset uninit is deferred, dual verification gates). Before/ after numbers filled after dash5 measurement. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:11 +08:00

Author

SHA1

Message

Date

Gahow Wang

f85bd4d276

perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8

Device caching/pool allocator removes the per-op cudaMalloc serialization that
was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX
5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s):

  single-GPU: 40226 -> 92638 tok/s  (~2.3x)
  DDP scaling (global batch 32*world):
    world  before        after
      1    39801 1.00x    92385 1.00x
      2    47229 1.19x   146821 1.59x
      4    52854 1.33x   269867 2.92x
      8    48996 1.23x   461270 4.99x

8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs
at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are
bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export
of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it.

Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the
design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the
~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever
if v4 needs higher linearity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-16 11:15:02 +08:00

Gahow Wang

4c3f332f64

docs: Phase T11 — caching allocator

Design doc for the device caching/pool allocator (KI-5 re-diagnosis recap, size
classes, per-device + thread-safety, Drop->return, transparency/correctness
argument, why skip-memset uninit is deferred, dual verification gates). Before/
after numbers filled after dash5 measurement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-16 11:04:11 +08:00

2 Commits