perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K
Per-block activation recompute (T13) measured on dash5 (1× RTX 5090 32GB, bf16, batch32 seq256, steady-state):

- Correctness (exact, hard gate): recompute on-vs-off grads are BIT-IDENTICAL in both fp32 AND bf16: loss / logits / every param grad max rel = 0.00e0 (not "within tol", exactly equal). Full suite green with recompute on/off; DDP loss-match 5.67e-7; DDP+recompute 2-rank descends 11.079→6.010.
- dim768 (18L/24h ffn2048, core 127M): peak mem 31144→14562 MiB (−53%), tok/s 39.7K→31.5K (−20%, the extra-forward tradeoff, in the predicted 20–35% band).
- dim1024 (18L/32h ffn2730, core 226M): recompute OFF OOMs (hits 32100/32607 MiB → OutOfMemory); recompute ON fits at 16596 MiB, ~23K tok/s, converges.

→ KI-3 payoff achieved: dim1024 batch32 unblocked, v8 can proceed. Fill docs/12 bench table; mark KI-3 FIXED in docs/known-issues.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -74,15 +74,17 @@
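- Measure steady-state tok/s with recompute on vs. off under the same config and report the slowdown (expected ~20–35%, from one extra forward pass).

## Measured results (dash5 1× RTX 5090 32GB, dim768/18L/24h×32 ffn2048 seq256, bf16, steady-state)
## Measured results (dash5 1× RTX 5090 32GB, bf16, batch32 seq256, steady-state)

> To be filled in after the actual run on dash5.
**Correctness (exact, hard gate)**: on-vs-off gradient comparison is **bit-identical on both the fp32 and bf16 paths**: loss rel `0.00e0`, logits max rel `0.00e0`, **max rel `0.00e0` for every parameter gradient** (not "within tolerance" but bit-for-bit equal, which confirms that recomputation is exact). The full regression suite is green with recompute on and off: T4 grad-check of 15 ops, 5 structural tests, batched, bf16, overfit, AdamW (GPU + host), GEMM, checkpoint roundtrip, **T8 DDP loss vs. single GPU 5.67e-7 + cross-rank 0.0**; a short 2-GPU DDP+recompute training run shows monotonically decreasing loss (11.079→6.010).
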
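The "max rel `0.00e0`" gate above can be illustrated with a small check. This is a minimal sketch, not the project's test harness: the source does not define "max rel", so the formula in `max_rel` (elementwise `|a − b| / max(|b|, eps)`) and the helper names `f32_bits` and `bit_identical` are assumptions made for illustration, and the gradient values are made up.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def max_rel(a, b, eps=1e-12):
    """Largest elementwise relative difference (assumed definition; eps guards /0)."""
    return max((abs(x - y) / max(abs(y), eps) for x, y in zip(a, b)), default=0.0)

def bit_identical(a, b):
    """Stricter than max_rel == 0: every float32 bit pattern must match."""
    return len(a) == len(b) and all(f32_bits(x) == f32_bits(y) for x, y in zip(a, b))

# Hypothetical gradients from a run with recompute off and a run with it on.
grads_off = [0.125, -3.5, 1e-7]
grads_on = list(grads_off)

print(f"max rel = {max_rel(grads_on, grads_off):.2e}, "
      f"bit-identical = {bit_identical(grads_on, grads_off)}")
```

A tolerance-based gate would accept a small nonzero `max_rel`; the hard gate described here only passes when the difference is exactly zero.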
**Memory + throughput** (dim768 = 18L/24h×32/ffn2048, core 127M; dim1024 = 18L/32h×32/ffn2730, core 226M):

| config | per-rank batch | peak memory | tok/s | fits 32GB? |
|---|---|---|---|---|
| dim768 recompute **off** | 32 | TBD | TBD | ✅ |
| dim768 recompute **on** | 32 | **TBD (↓)** | TBD (↓~xx%) | ✅ |
| **dim1024** recompute **off** | 32 | — | — | ❌ **OOM** |
| **dim1024** recompute **on** | 32 | **TBD** | TBD | ✅ **OOM resolved** |
| dim768 recompute **off** | 32 | 31144 MiB | 39.7K | ✅ |
| **dim768 recompute on** | 32 | **14562 MiB (−53%)** | **31.5K (−20%)** | ✅ |
| **dim1024** recompute **off** | 32 | 32100 → **OOM** | — | ❌ **OOM** |
| **dim1024 recompute on** | 32 | **16596 MiB** | 23.1K | ✅ **OOM resolved** |

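**Correctness**: on-vs-off gradient max rel = TBD (fp32) / TBD (bf16); full regression suite + xserv closed loop all green.
→ dim768: recompute cuts peak memory **31.1→14.6GB (−53%, roughly halving activations)** at a cost of **−20%** tok/s (one extra forward pass, within the predicted 20–35% band). dim1024 batch32: **without recompute it OOMs (hits the 32100/32607 MiB limit) → with recompute it fits comfortably at 16.6GB** and training converges normally at ~23K tok/s. **KI-3's goal is achieved and dim1024 is unlocked.**
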
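The percentages quoted for dim768 follow directly from the measured values in the table. A minimal sketch of that arithmetic, using only numbers reported above (the helper name `pct_change` is hypothetical):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# dim768, bf16, batch32 seq256: recompute off -> on (values from the table above)
mem_delta = pct_change(31144, 14562)   # peak memory, MiB
tps_delta = pct_change(39.7e3, 31.5e3) # throughput, tok/s

# dim1024 with recompute on: headroom below the 32607 MiB limit that the
# recompute-off run hit before failing with OutOfMemory
headroom_mib = 32607 - 16596

print(f"peak memory {mem_delta:+.1f}%, tok/s {tps_delta:+.1f}%, "
      f"dim1024 headroom {headroom_mib} MiB")
```

The memory figure rounds to the −53% quoted in the source; the throughput change comes out slightly above 20% in magnitude, which the source reports as −20% and which sits at the low end of the predicted 20–35% band.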
@@ -7,12 +7,30 @@
## Open

_(KI-1 fixed in T10. KI-5 fixed in T11. **KI-2 (bf16 mixed precision) FIXED in T12**: fp32 master + bf16 linears/activations; the dim768 fp32-batch32 OOM is resolved (bf16 fits, 31.1/32.6GB); bf16 vs. fp32 at the same batch: memory −29% / tok/s +13%; convergence matches fp32. See Fixed below.)_
_(KI-1 fixed in T10. KI-5 fixed in T11. KI-2 fixed in T12. **KI-3 (activation recomputation / gradient checkpointing) FIXED in T13**: per-block checkpoint (no-tape forward + recompute-on-backward); gradients are **bit-identical** to the non-recompute version (fp32/bf16 max rel 0.00e0); dim768 bf16 batch32 peak memory 31.1→14.6GB (−53%) / tok/s −20%; **dim1024 batch32 OOMs without recompute → fits at 16.6GB with it**, unlocking v8. See Fixed below.)_

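---

## Fixed

### KI-3 · Activation recomputation (gradient checkpointing) - `FIXED` (T13)
- **Trigger (surfaced in v8)**: the capacity axis was scaled up to dim1024 (core ~210M+) to test whether the model is capacity-limited. The autograd tape saves every intermediate activation for the backward pass, and activation memory grows linearly with dim. dim768 bf16 batch32 already uses 31.1GB (the T12 sweet spot), and **dim1024 batch32 OOMs again** (measured hitting 32100/32607 MiB → `OutOfMemory`).
- **Design (per-block gradient checkpointing, opt-in, [docs/12-activation-recompute.md](12-activation-recompute.md))**: adds a higher-order primitive `xtrain_autodiff::checkpoint(segment_fn, input, params)` (analogous to `torch.utils.checkpoint`). **Forward**: detach input/params into local leaves, run `segment_fn`, keep only the output value, and drop the local tape immediately → the segment's internal activations are freed (they never stay on the outer tape); the checkpoint node has parents=[input, ..params]. **Backward**: rerun `segment_fn` from the saved input + the unchanged param values to rebuild the local tape, backpropagate with the upstream grad as the seed (`Var::backward_seeded`, newly added because the segment output is non-scalar), push the recovered input/param gradients to the real parents, then drop the local tape → the recomputed activations are freed. The model wraps each transformer block's forward pass in it (`--recompute` flag, off by default). Granularity = one checkpoint per block.
- **Correctness (exact, all hard gates green)**: recomputation is mathematically exact (same `segment_fn`, same input, same param values, deterministic kernels → the recomputed activations are bit-for-bit equal to the originals). **The on-vs-off gradient comparison is bit-identical on both the fp32 and bf16 paths**: loss rel `0.00e0`, logits max rel `0.00e0`, **max rel `0.00e0` for every parameter gradient** (not within tolerance, but bit-for-bit). The full regression suite is green with recompute on and off: autograd 15, structural 5, batched, bf16, overfit 27/27, AdamW (GPU bit-exact + host vs. torch), checkpoint roundtrip, **DDP loss vs. single GPU 5.67e-7 + cross-rank 0.0**; a short 2-GPU DDP+recompute run decreases monotonically (11.079→6.010). The graph of the non-recompute path is unchanged (off by default) → no numerical regression for T10/T11/T12.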
- **Memory + throughput (payoff; dash5 1× RTX 5090 32GB, bf16, batch32 seq256, steady-state)**:

| config | per-rank batch | peak memory | tok/s | fits 32GB? |
|---|---|---|---|---|
| dim768 (18L/24h ffn2048, core 127M) off | 32 | 31144 MiB | 39.7K | ✅ |
| **dim768 on** | 32 | **14562 MiB (−53%)** | **31.5K (−20%)** | ✅ |
| **dim1024** (18L/32h ffn2730, core 226M) off | 32 | 32100 → **OOM** | — | ❌ **OOM** |
| **dim1024 on** | 32 | **16596 MiB** | 23.1K | ✅ **OOM resolved** |

→ dim768: recompute cuts roughly half of the activations (**31.1→14.6GB, −53%**) at a cost of **−20%** tok/s (one extra forward pass, within the predicted 20–35%). **dim1024 batch32: OOMs without recompute → fits comfortably at 16.6GB with it**, converging normally at ~23K tok/s → **dim1024 is unlocked and v8 can proceed**.
- **commit**: the T13 commit chain (`autodiff: checkpoint primitive (recompute-on-backward)` / `model: per-block activation recompute (--recompute)` / `perf: KI-3 fixed …` (this entry, with before/after) / docs `docs: Phase T13 — activation recompute`).

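The forward/backward scheme in the Design bullet above can be sketched on a toy scalar autodiff. This is an illustrative Python model, not the `xtrain_autodiff` implementation: the `Var` class, `topo`, `block`, and `run` are all hypothetical stand-ins, and `backward(seed=...)` plays the role that the source assigns to `Var::backward_seeded`.

```python
import math

class Var:
    """Scalar autodiff node: a value, an accumulated grad, and edges to its parents."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents    # tuple of (parent Var, local derivative or None)
        self.backward_fn = None   # set only on checkpoint nodes

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, ((self, 1.0 - t * t),))

    def backward(self, seed=1.0):
        """Seeded reverse pass: `seed` is the upstream gradient of this node."""
        self.grad += seed
        for node in reversed(topo(self)):
            if node.backward_fn is not None:
                node.backward_fn(node.grad)
            else:
                for parent, local in node.parents:
                    parent.grad += local * node.grad

def topo(root):
    """Nodes reachable from root, parents before children (the live tape)."""
    order, seen = [], set()
    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)
    visit(root)
    return order

def checkpoint(segment_fn, x, params):
    # Forward: run the segment on detached leaves and keep only the output value;
    # the local tape becomes garbage as soon as this expression finishes.
    out_value = segment_fn(Var(x.value), [Var(p.value) for p in params]).value
    node = Var(out_value, tuple((v, None) for v in (x, *params)))

    def backward_fn(upstream):
        # Backward: rebuild the local tape from the saved input and current params,
        # seed it with the upstream grad, then push grads to the real parents.
        local_x = Var(x.value)
        local_params = [Var(p.value) for p in params]
        segment_fn(local_x, local_params).backward(seed=upstream)
        x.grad += local_x.grad
        for p, lp in zip(params, local_params):
            p.grad += lp.grad

    node.backward_fn = backward_fn
    return node

def block(x, ps):
    """Toy stand-in for a transformer block: tanh(x * w + b)."""
    w, b = ps
    return (x * w + b).tanh()

def run(use_checkpoint):
    params = [[Var(0.5 + 0.1 * i), Var(-0.2 * i)] for i in range(3)]
    h = Var(0.7)
    for ps in params:
        h = checkpoint(block, h, ps) if use_checkpoint else block(h, ps)
    tape_nodes = len(topo(h))  # nodes kept alive for the backward pass
    h.backward()
    return h.value, [p.grad for ps in params for p in ps], tape_nodes

value_off, grads_off, nodes_off = run(False)
value_on, grads_on, nodes_on = run(True)
print(grads_off == grads_on, nodes_off, nodes_on)
```

In this toy, the checkpointed run keeps one node per block on the outer tape instead of every intermediate, and the gradients compare equal with `==` because the recomputed segment performs the same floating-point operations in the same order. That mirrors the exactness argument in the Correctness bullet, under the assumption of deterministic kernels.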
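---

### KI-2 · bf16 mixed precision (fp32 master) - `FIXED` (T12)
- **Trigger (surfaced in v4)**: dim768 fp32 OOMs on a single 32GB GPU at per-rank batch 32 (global 256), forcing a drop to per-rank 16. bf16 (which halves activations) recovers the batch-32 sweet spot and speeds up the already compute-bound dim768 GEMMs; as a bonus, xserv inference is BF16-only, so bf16 training matches the closed loop more closely.
- **Design (standard AMP, opt-in, [docs/11-bf16-mixed-precision.md](11-bf16-mixed-precision.md))**: **fp32 master weights** + AdamW/clip/DDP all-reduce in fp32 throughout; **bf16 compute** = linears (q/k/v/o, gate/up/down, lm_head) go through `cublasGemmEx` (CUDA_R_16BF in/out, CUBLAS_COMPUTE_32F accumulation) + a bf16 activation stream (including attention probs / logits); **fp32 for stability** = RMSNorm/QK-norm, softmax, RoPE, and cross-entropy upcast internally → fp32 → downcast. **No loss scaling** (bf16's 8-bit exponent = fp32's dynamic range). The key hook is the autodiff `cast` op: the forward pass casts the fp32 master leaf down to bf16 to feed the matmul, and **the backward pass casts the bf16 grad back up to fp32** → the fp32 leaf accumulates an fp32 grad, so the optimizer needs no changes. The fp32 path dispatches by dtype and is byte-for-byte unchanged (hard gate).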
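The "no loss scaling" claim rests on bf16 sharing fp32's 8-bit exponent. A minimal sketch that emulates the fp32→bf16 down-cast in pure Python can show this; the helper `to_bf16` and the sample values are hypothetical, the rounding mode (round-to-nearest-even) is an assumption about the cast, and NaN handling is deliberately ignored.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to bfloat16 (nearest-even on the fp32 bit pattern); NaN not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias; ties go to the even mantissa
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision, for comparison only."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

master_w = 1.001                 # hypothetical fp32 master weight
w_bf16 = to_bf16(master_w)       # forward: cast down before the matmul

tiny_grad = 1e-30                # hypothetical very small gradient
grad_bf16 = to_bf16(tiny_grad)   # still nonzero: bf16 keeps fp32's exponent range
grad_fp16 = to_fp16(tiny_grad)   # underflows to zero in fp16

print(w_bf16, grad_bf16 != 0.0, grad_fp16)
```

The down-cast loses mantissa precision (the weight rounds to a coarser grid) but not range, which is why the small gradient survives in bf16 while fp16 would need loss scaling to keep it. Casting the bf16 gradient back up to fp32 is exact, since every bf16 value is representable in fp32.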
@@ -112,9 +130,7 @@ _(KI-1 fixed in T10. KI-5 fixed in T11. **KI-2 (bf16 mixed precision) FIXED in T1
## Deferred (from T7; to be reopened after scaling up)

### KI-3 · Activation recomputation (gradient checkpointing) - `deferred`
- Reason for deferral in T7: single sequence, memory not tight.
- **Reopen condition**: memory becomes a constraint after moving to a larger model / longer seq / larger batch.
_(KI-3 was FIXED in T13; see Fixed above. This section currently has no pending items.)_

---