
Gahow Wang d2a585c5cb docs: KI-1 re-diagnosed in v3 — larger batch does NOT fix DDP weak scaling

v3 tested the documented mitigation (raise global_batch to amortize the
per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L,
seq256:

  global_batch 32 (8/rank)  → 3163 tok/s
  global_batch 256 (64/rank)→ 3200 tok/s   (8× batch, +1.2%, within noise)

8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce
is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is
launch-bound: the single-sequence model design (each sequence its own tiny
forward/backward, per-op kernel launches) starves the GPU, and batching only
adds proportionally more serial launches. Real fix is batched (multi-sequence)
forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob.
Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is
fixed). KI-1 kept Open with the corrected root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-15 23:20:26 +08:00

3.2 KiB


xtrain — Known Issues & Perf Backlog

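A living document of known issues (performance / correctness / modeling) and deferred items. Each entry records symptom, reproduction, root cause, proposed fix, priority, and status. Log an issue as soon as it is found; mark it FIXED (with the commit) as soon as it is fixed.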

Open

KI-1 · DDP weak scaling (throughput limited by the single-sequence, launch-bound design) · `P1` · exposed by v2, re-diagnosed in v3

Symptom: 4-GPU DDP reaches only ~3.2K tok/s, roughly 2× single-GPU throughput and far below near-linear scaling (T8 measured 3.0× at 4 GPUs on the tiny micro-bench).
Repro: dim384/12L, world=4, seq 256.

v3 measurements (dash5, 4× RTX 5090, dim384, isolated back-to-back A/B):

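| global_batch | per rank | tok/s (4 GPUs) | GPU util | memory |
|---|---|---|---|---|
| 32 | 8 | 3163 | 5–69% (spiky) | ~2–3 GB / 32 GB |
| 256 | 64 | 3200 | 0–15% | ~2–3 GB / 32 GB |

→ An 8× larger batch yields only +1.2% throughput (within noise). 1 GPU at dim384 ≈ 1653 tok/s, so 4 GPUs at 3163 tok/s ≈ 1.9×.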

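The original proposed fix (raise the global batch) was falsified by the v3 measurements: at gbatch256 each token incurs only 1/8 as many all-reduces as at gbatch32, so if all-reduce were the bottleneck, throughput should have risen substantially. It did not → all-reduce / communication is not the bottleneck.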
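The amortization argument can be checked with a few lines of arithmetic. This is a minimal sketch: the throughput figures are copied from the v3 measurements, while the cost model (exactly one gradient all-reduce per optimizer step, `global_batch * seq` tokens per step) and the helper name `allreduces_per_token` are assumptions made for illustration.

```python
# Throughput numbers are from the v3 A/B run; the cost model is an assumption:
# one gradient all-reduce per optimizer step, global_batch * seq tokens per step.
SEQ = 256
tok_per_s = {32: 3163.0, 256: 3200.0}  # global_batch -> tok/s on 4 GPUs


def allreduces_per_token(global_batch: int, seq: int = SEQ) -> float:
    """All-reduce calls paid per trained token under the assumed cost model."""
    return 1.0 / (global_batch * seq)


# How much less communication per token the larger batch pays.
comm_ratio = allreduces_per_token(32) / allreduces_per_token(256)
# Relative throughput change actually observed between the two runs.
gain = tok_per_s[256] / tok_per_s[32] - 1.0

print(f"all-reduces per token: {comm_ratio:.0f}x fewer at global_batch 256")
print(f"observed throughput gain: {gain * 100:.1f}%")
```

If communication dominated the step time, an 8× reduction in all-reduces per token would show up as a large throughput gain rather than a change within noise.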
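Re-diagnosed root cause: the bottleneck is the single-sequence model design (T5: each sequence runs its own independent forward/backward and pays per-op kernel-launch overhead; see the latency-bottleneck discussion in docs/06). GPU util is only 0–15% and memory use is only ~8% of capacity → the workload is severely launch-bound / under-utilized; the GEMMs are too small to keep the GPU fed. A larger batch only adds proportionally more serial launches and cannot amortize them. The fixed ~2× ceiling of 4 GPUs over a single GPU comes from the cross-rank synchronization tax, but it is not something batch tuning can fix.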
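The real fix (requires implementation work, not parameter tuning):
1. Batched (multi-sequence) forward: push all sequences of a step through the model at once along the batch dimension, so the GEMMs are large enough to fill the GPU. This is the fundamental fix for the launch-bound behavior, but it means changing the single-sequence autograd/model from T4/T5, which is a large effort and carries correctness risk.
2. Only after (1) does bucketed gradient all-reduce overlapped with backward become worthwhile (all-reduce is not the bottleneck today, so doing it now would yield no gain).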
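To make item 1 concrete, the following is a minimal NumPy sketch of the shape change only, not xtrain's actual model code: the function names, the single linear layer, and the batch size of 8 are hypothetical. NumPy runs on the CPU and has no kernel launches, so the sketch illustrates how many GEMM calls are issued and how large each one is, not GPU timing.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, dim = 8, 256, 384  # batch of 8 is illustrative; seq/dim match the repro
x = rng.standard_normal((batch, seq, dim)).astype(np.float32)
w = rng.standard_normal((dim, dim)).astype(np.float32)


def per_sequence(x: np.ndarray, w: np.ndarray):
    """Single-sequence style: one (seq x dim) GEMM per sequence, issued serially."""
    outs = [x[i] @ w for i in range(x.shape[0])]
    return np.stack(outs), len(outs)  # output, number of GEMM calls


def batched(x: np.ndarray, w: np.ndarray):
    """Batched style: fold batch into the row dimension and issue one large GEMM."""
    b, s, d = x.shape
    out = x.reshape(b * s, d) @ w
    return out.reshape(b, s, -1), 1  # output, number of GEMM calls


y_loop, calls_loop = per_sequence(x, w)
y_batched, calls_batched = batched(x, w)

print(f"per-sequence: {calls_loop} GEMMs of {seq}x{dim}")
print(f"batched:      {calls_batched} GEMM of {batch * seq}x{dim}")
print("same result:", np.allclose(y_loop, y_batched, rtol=1e-4, atol=1e-3))
```

The two paths compute the same values; the batched path simply replaces many small matrix multiplies with one large one, which is the property the fix relies on.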
References: docs/07-distributed.md, docs/06-performance.md.

Deferred (from T7; revisit after scale-up)

KI-2 · bf16 mixed precision (fp32 master) · `deferred`

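Why T7 deferred it: at tiny scale the bottleneck is latency, and bf16 changes the numerics, which threatens the fp32 correctness gate.
Revisit when: once the model is scaled up (v2+, dim≥384), GEMMs gradually become compute-bound and tensor-core gains start to show. Requires fp32 master weights, a separate looser-tolerance test, and a convergence comparison.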
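A small illustration of why fp32 master weights are listed as a requirement. This is a sketch under stated assumptions: NumPy has no native bf16, so `to_bf16` (a hypothetical helper) emulates it by truncating an fp32 value to its top 16 bits, whereas real hardware typically rounds to nearest; the weight value and step size are made up for the demonstration.

```python
import numpy as np


def to_bf16(x: np.ndarray) -> np.ndarray:
    """Emulate bf16 by keeping only the top 16 bits of each fp32 value.

    Assumption: truncation (round toward zero) stands in for the hardware's
    rounding mode; the result is still stored as fp32.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)


step = np.float32(1e-4)                       # a small per-step weight update
w_bf16_only = to_bf16(np.array([1.0], dtype=np.float32))
w_master = np.array([1.0], dtype=np.float32)  # fp32 master copy

for _ in range(100):
    # Weights kept only in bf16: the update is smaller than the bf16 spacing
    # near 1.0 (2**-7), so it is lost every step.
    w_bf16_only = to_bf16(w_bf16_only + step)
    # fp32 master: updates accumulate, and bf16 is only a cast for compute.
    w_master = w_master + step

w_compute = to_bf16(w_master)
print("bf16-only weight:     ", w_bf16_only[0])
print("fp32 master weight:   ", w_master[0])
print("bf16 cast of master:  ", w_compute[0])
```

The bf16-only weight never moves, while the master copy accumulates the updates and its bf16 cast eventually steps to the next representable value.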

KI-3 · Activation recomputation (gradient checkpointing) · `deferred`

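Why T7 deferred it: single-sequence design, so memory is not tight.
Revisit when: memory becomes a constraint with a larger model / longer seq / larger batch.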
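For reference, the trade that activation recomputation makes can be sketched on a toy chain of elementwise layers. Everything here is hypothetical (the `tanh(a * x)` layer, the function names, the segment length of 4) and is unrelated to xtrain's autograd; the sketch only shows that storing segment inputs and recomputing the rest yields the same gradient while holding fewer activations at once.

```python
import numpy as np


def layer(x, a):
    return np.tanh(a * x)


def layer_grad(x, a):
    # d/dx tanh(a * x) = a * (1 - tanh(a * x)^2)
    return a * (1.0 - np.tanh(a * x) ** 2)


def grad_full(x0, scales):
    """Baseline backward: keep the input of every layer."""
    inputs, x = [], x0
    for a in scales:
        inputs.append(x)
        x = layer(x, a)
    g = np.ones_like(x0)
    for x_in, a in zip(reversed(inputs), reversed(scales)):
        g = g * layer_grad(x_in, a)
    return g, len(inputs)  # gradient wrt x0, activations held


def grad_checkpointed(x0, scales, every):
    """Keep only each segment's input; recompute the segment during backward."""
    checkpoints, x = {}, x0
    for i, a in enumerate(scales):
        if i % every == 0:
            checkpoints[i] = x
        x = layer(x, a)
    g = np.ones_like(x0)
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg = scales[start:start + every]
        inputs, x = [], checkpoints[start]
        for a in seg:  # recompute this segment's activations
            inputs.append(x)
            x = layer(x, a)
        peak = max(peak, len(checkpoints) + len(inputs))
        for x_in, a in zip(reversed(inputs), reversed(seg)):
            g = g * layer_grad(x_in, a)
    return g, peak  # gradient wrt x0, upper bound on activations held


x0 = np.linspace(-1.0, 1.0, 5)
scales = [0.5 + 0.1 * i for i in range(12)]  # 12 toy layers
g_full, held_full = grad_full(x0, scales)
g_ckpt, held_ckpt = grad_checkpointed(x0, scales, every=4)
print("activations held:", held_full, "vs", held_ckpt)
print("gradients match:", np.allclose(g_full, g_ckpt))
```

The cost is one extra forward pass per segment, which is why the technique only pays off once memory is the binding constraint.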

Modeling notes

KI-4 · Large-vocab embeddings take too large a share of parameters

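With the gpt2 vocab (50257) and a small dim, embed+lm_head dominate the parameter count: v1 25.7M/34M, v2 38.6M/66.9M. The core transformer is what actually does the learning.
Options going forward: switch to a smaller vocab better suited to TinyStories (at the cost of no longer reusing the xserv gpt2 tokenizer), or let the core become the dominant part naturally at larger dim (continued scaling alone reduces the share).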
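The shares above can be reproduced with a short calculation. This is a sketch under two assumptions not stated in the source: that embed and lm_head are untied matrices of shape vocab × dim each, and that v2 uses dim=384. The totals (34M, 66.9M) are taken from the text as given.

```python
VOCAB = 50257  # gpt2 vocabulary size


def embed_plus_head_params(dim: int, vocab: int = VOCAB) -> int:
    """Parameters in embed + lm_head, assuming two untied vocab x dim matrices."""
    return 2 * vocab * dim


# (embed+lm_head, total) in millions of parameters, as quoted in the text.
reported = {"v1": (25.7, 34.0), "v2": (38.6, 66.9)}
shares = {name: emb / total for name, (emb, total) in reported.items()}

for name, share in shares.items():
    print(f"{name}: embed+lm_head is {share * 100:.1f}% of all parameters")

# Cross-check the v2 figure against the assumed dim=384.
v2_estimate = embed_plus_head_params(384)
print(f"2 * {VOCAB} * 384 = {v2_estimate / 1e6:.1f}M")
```

Because the embedding term grows linearly in dim while the core transformer grows roughly quadratically, the share shrinks as dim increases, consistent with the drop from v1 to v2.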
