Commit Graph

4 Commits

Author SHA1 Message Date
84092fb28d docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11)
T11 set out to coalesce/overlap the gradient all-reduce per the original
KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch
32, seq 256) falsifies that hypothesis:

  - grad all-reduce is only ~6-7% of each step;
  - per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the
    SAME per-rank workload) and dominates;
  - coalescing the ~150 per-tensor all-reduces into one grouped/flat
    launch gives ~0 scaling gain AND breaks cross-rank bit-identity
    (max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so
    the coalescing commit (b8b5821) was reverted.

Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy
at a time; CPU not starved; per-thread default stream doesn't help):
single-process thread-per-GPU ranks serialize on the single CUDA
context's per-op cudaMalloc / driver calls. Fix direction (out of T11
scope): a caching/pool allocator, or process-per-GPU. Recorded in
docs/known-issues.md with the measured table; KI-5 stays Open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:40:45 +08:00
4ccab0fb42 perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x)
Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:

  before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
  after  (batched):    25627 tok/s (batch16) / 40263 (batch32),
                       util 37% mean / 54% peak, ~10 GB
  → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.

A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x.
The v3 falsification history (larger batch doesn't help a single-seq design) is
kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching
exposes (eager all-reduce of all params each step) → recorded as KI-5
(bucketed/overlapped all-reduce), out of T10 scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:43 +08:00
d2a585c5cb docs: KI-1 re-diagnosed in v3 — larger batch does NOT fix DDP weak scaling
v3 tested the documented mitigation (raise global_batch to amortize the
per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L,
seq256:

  global_batch 32 (8/rank)  → 3163 tok/s
  global_batch 256 (64/rank)→ 3200 tok/s   (8× batch, +1.2%, within noise)

8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce
is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is
launch-bound: the single-sequence model design (each sequence its own tiny
forward/backward, per-op kernel launches) starves the GPU, and batching only
adds proportionally more serial launches. Real fix is batched (multi-sequence)
forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob.
Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is
fixed). KI-1 kept Open with the corrected root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 23:20:26 +08:00
c87a0bc44e docs: known-issues / perf backlog — KI-1 DDP weak scaling at small global batch
Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1
single-GPU. Root cause + proposed fixes recorded; also consolidates deferred
T7 items (bf16, activation recompute) and the large-vocab modeling note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:56:58 +08:00