xtrain

Author	SHA1	Message	Date
Gahow Wang	d422c68704	docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky) The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the original ungrouped all-reduce is itself run-to-run nondeterministic on this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7, 1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here (passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP, numerically benign; loss-match stays ~6e-7). Noted as a follow-up to loosen the assertion to a tight tolerance; coalescing was reverted purely because it gives ~0 scaling benefit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:42:13 +08:00
Gahow Wang	84092fb28d	docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11) T11 set out to coalesce/overlap the gradient all-reduce per the original KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch 32, seq 256) falsifies that hypothesis: - grad all-reduce is only ~6-7% of each step; - per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the SAME per-rank workload) and dominates; - coalescing the ~150 per-tensor all-reduces into one grouped/flat launch gives ~0 scaling gain AND breaks cross-rank bit-identity (max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so the coalescing commit (`b8b5821`) was reverted. Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy at a time; CPU not starved; per-thread default stream doesn't help): single-process thread-per-GPU ranks serialize on the single CUDA context's per-op cudaMalloc / driver calls. Fix direction (out of T11 scope): a caching/pool allocator, or process-per-GPU. Recorded in docs/known-issues.md with the measured table; KI-5 stays Open. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:40:45 +08:00
Gahow Wang	4ccab0fb42	perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x) Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling") FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU, back-to-back A/B: before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB after (batched): 25627 tok/s (batch16) / 40263 (batch32), util 37% mean / 54% peak, ~10 GB → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%. A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x. The v3 falsification history (larger batch doesn't help a single-seq design) is kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching exposes (eager all-reduce of all params each step) → recorded as KI-5 (bucketed/overlapped all-reduce), out of T10 scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:43 +08:00
Gahow Wang	d2a585c5cb	docs: KI-1 re-diagnosed in v3 — larger batch does NOT fix DDP weak scaling v3 tested the documented mitigation (raise global_batch to amortize the per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L, seq256: global_batch 32 (8/rank) → 3163 tok/s global_batch 256 (64/rank)→ 3200 tok/s (8× batch, +1.2%, within noise) 8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is launch-bound: the single-sequence model design (each sequence its own tiny forward/backward, per-op kernel launches) starves the GPU, and batching only adds proportionally more serial launches. Real fix is batched (multi-sequence) forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob. Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is fixed). KI-1 kept Open with the corrected root cause. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 23:20:26 +08:00
Gahow Wang	c87a0bc44e	docs: known-issues / perf backlog — KI-1 DDP weak scaling at small global batch Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1 single-GPU. Root cause + proposed fixes recorded; also consolidates deferred T7 items (bf16, activation recompute) and the large-vocab modeling note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:56:58 +08:00
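
Commit d422c68704 notes a follow-up to loosen the T8 `max|p0-p1| == 0.0` assertion to a tight tolerance. A minimal sketch of what such a check might look like is given below. It is not the repository's test: the function names, the flat-list parameter representation, and the 4-epsilon bound are all assumptions made for illustration. The bound is an absolute one, which is only reasonable if parameter magnitudes are on the order of 1 and stored as float32.

```python
# Hypothetical tolerance-based replacement for an exact cross-rank equality check.
# All names and the 4 * eps bound are illustrative assumptions, not project code.

FLOAT32_EPS = 2.0 ** -23  # about 1.19e-7, the float32 spacing just above 1.0


def max_abs_diff(p0, p1):
    """Return the largest element-wise absolute difference of two flat lists."""
    if len(p0) != len(p1):
        raise ValueError("parameter lists differ in length")
    return max((abs(a - b) for a, b in zip(p0, p1)), default=0.0)


def ranks_agree(p0, p1, tol=4 * FLOAT32_EPS):
    """True when two ranks' parameters match to within a few float32 epsilons."""
    return max_abs_diff(p0, p1) <= tol


# Cross-rank diffs reported for the six reruns in commit d422c68704.
observed_diffs = [0.0, 0.0, 5.96e-8, 5.96e-8, 1.19e-7, 1.19e-7]
all_within_tolerance = all(d <= 4 * FLOAT32_EPS for d in observed_diffs)
```

Under these assumptions every diff reported in the six reruns falls inside the bound, while a genuine synchronization bug (a diff orders of magnitude larger) would still fail the check.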

5 Commits