The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the
original ungrouped all-reduce is itself run-to-run nondeterministic on
this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7,
1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here
(passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP,
numerically benign; loss-match stays ~6e-7). Noted as a follow-up to
loosen the assertion to a tight tolerance; coalescing was reverted purely
because it gives ~0 scaling benefit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
T11 set out to coalesce/overlap the gradient all-reduce per the original
KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch
32, seq 256) falsifies that hypothesis:
- grad all-reduce is only ~6-7% of each step;
- per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the
SAME per-rank workload) and dominates;
- coalescing the ~150 per-tensor all-reduces into one grouped/flat
launch gives ~0 scaling gain AND breaks cross-rank bit-identity
(max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so
the coalescing commit (b8b5821) was reverted.
Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy
at a time; CPU not starved; per-thread default stream doesn't help):
single-process thread-per-GPU ranks serialize on the single CUDA
context's per-op cudaMalloc / driver calls. Fix direction (out of T11
scope): a caching/pool allocator, or process-per-GPU. Recorded in
docs/known-issues.md with the measured table; KI-5 stays Open.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:
before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
after (batched): 25627 tok/s (batch16) / 40263 (batch32),
util 37% mean / 54% peak, ~10 GB
→ single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.
A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x.
The v3 falsification history (larger batch doesn't help a single-seq design) is
kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching
exposes (eager all-reduce of all params each step) → recorded as KI-5
(bucketed/overlapped all-reduce), out of T10 scope.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v3 tested the documented mitigation (raise global_batch to amortize the
per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L,
seq256:
global_batch 32 (8/rank) → 3163 tok/s
global_batch 256 (64/rank)→ 3200 tok/s (8× batch, +1.2%, within noise)
8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce
is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is
launch-bound: the single-sequence model design (each sequence its own tiny
forward/backward, per-op kernel launches) starves the GPU, and batching only
adds proportionally more serial launches. Real fix is batched (multi-sequence)
forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob.
Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is
fixed). KI-1 kept Open with the corrected root cause.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1
single-GPU. Root cause + proposed fixes recorded; also consolidates deferred
T7 items (bf16, activation recompute) and the large-vocab modeling note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>