Files
xtrain/docs
Gahow Wang 4ccab0fb42 perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x)
Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:

  before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
  after  (batched):    25627 tok/s (batch16) / 40263 (batch32),
                       util 37% mean / 54% peak, ~10 GB
  → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.

A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x.
The v3 falsification history (larger batch doesn't help a single-seq design) is
kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching
exposes (eager all-reduce of all params each step) → recorded as KI-5
(bucketed/overlapped all-reduce), out of T10 scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:43 +08:00
..