Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:
  before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
  after (batched):     25627 tok/s (batch16) / 40263 tok/s (batch32),
                       GPU util 37% mean / 54% peak, ~10 GB
→ single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.
A single GPU at batch 32 (~40K tok/s) now beats the old 4-GPU setup (3163 tok/s)
by ~12.7x.
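For anyone re-checking the ratios, a minimal sketch that re-derives them from the
raw tok/s figures quoted above. The variable and function names are illustrative
only and are not taken from the repo.

```python
# Throughput figures (tok/s) copied from the A/B numbers in this message.
# All names below are hypothetical, chosen for illustration.
BEFORE_SINGLE_SEQ = 1653   # 1 GPU, single-sequence forward
AFTER_BATCH16 = 25627      # 1 GPU, batched forward, batch 16
AFTER_BATCH32 = 40263      # 1 GPU, batched forward, batch 32
OLD_4GPU = 3163            # old 4-GPU setup


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Return the throughput ratio new/old."""
    return new_tok_s / old_tok_s


if __name__ == "__main__":
    print(f"batch16 vs single-seq: {speedup(AFTER_BATCH16, BEFORE_SINGLE_SEQ):.1f}x")
    print(f"batch32 vs single-seq: {speedup(AFTER_BATCH32, BEFORE_SINGLE_SEQ):.1f}x")
    print(f"1 GPU batch32 vs old 4-GPU: {speedup(AFTER_BATCH32, OLD_4GPU):.1f}x")
```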
The v3 falsification history (a larger batch doesn't help a single-seq design) is
kept. The residual DDP weak scaling is a NEW, higher-level bottleneck that
batching exposes (eager all-reduce of all params each step); it is recorded as
KI-5 (bucketed/overlapped all-reduce) and is out of T10 scope.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>