docs: T15 GQA results + evolution row (model architecture) + README build-journey row
Backfill the docs/14-gqa.md gate table (dash5 numbers); add the T15 evolution row and the cumulative model-architecture line; update the README build-journey T15 row, the Phase 2 prose, and the doc index range (00..14).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -51,6 +51,7 @@ Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`]
 | **T12** | **bf16 mixed precision** (fp32 master, fixes KI-2) | dim768 OOM solved; −29% mem |
 | **T13** | **activation recompute** / checkpointing (fixes KI-3) | dim1024 fits; grads bit-identical |
 | **T14** | **fused flash-attention** kernel (online softmax, no materialized N×N; opt-in `--flash`) | peak mem −16%@1k / −23%@2k seq; flash==composed (grads/PyTorch) |
+| **T15** | **grouped-query attention** (`num_kv_heads<num_heads`; `repeat_kv` broadcast feeds both SDPA paths; backward sums each kv head's group; `--kv-heads`) | repeat_kv grad-check + **group=1 bit-identical to MHA**; GQA flash==composed; PyTorch GQA B>1; **xserv closed loop with real `num_key_value_heads`** token-identical |
 | **T16** | **gradient accumulation** (`--accum-steps`; DDP all-reduces only at the boundary) | equiv to N× big batch (grad 3.8e-5); same effective-64 batch 27.7GB→7.2GB (−74%) |
 | **T18** | **dropout** (hand counter-based device RNG + mask, inverted scaling, train/eval switch) | fixed-seed grad-check; **p=0 bit-identical**; recompute-safe |
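The T14 row mentions an online softmax that never materializes the N×N score matrix. As a rough illustration of that idea only, the sketch below processes one query row tile by tile and compares it against the plain softmax-weighted sum. It is plain Python rather than the repository's kernel; the function names, the scalar values, and the `tile` parameter are assumptions made for the example.

```python
import math


def attention_row_reference(scores, values):
    """Plain softmax(scores) . values for one query row (all scores held at once)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_online(scores, values, tile=2):
    """Same result, visiting the keys in tiles and keeping only running statistics."""
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running softmax denominator, relative to m
    acc = 0.0          # running weighted sum of values, relative to m
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))
        # Rescale the old statistics to the new max (exp(-inf) == 0.0 on the first tile).
        scale = math.exp(m - m_new)
        l = l * scale + sum(math.exp(s - m_new) for s in s_tile)
        acc = acc * scale + sum(math.exp(s - m_new) * v for s, v in zip(s_tile, v_tile))
        m = m_new
    return acc / l


if __name__ == "__main__":
    scores = [0.5, -1.0, 2.0, 0.0, 3.5]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(attention_row_reference(scores, values))
    print(attention_row_online(scores, values, tile=2))
```

Only three running scalars survive between tiles, which is why the memory saving grows with sequence length.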
@@ -58,6 +59,9 @@ The four performance fixes (T10–T13) each removed a real bottleneck — see
 [`docs/known-issues.md`](docs/known-issues.md). **Phase 2 (systems-stack depth, T14–)**
 revisits hand-writing deferred training-stack features: T14 = the fused
 flash-attention kernel ([`docs/13-flash-attention.md`](docs/13-flash-attention.md));
+T15 = real grouped-query attention ([`docs/14-gqa.md`](docs/14-gqa.md), `num_kv_heads <
+num_heads` via a `repeat_kv` broadcast op whose backward sums each kv head's query-head
+group — feeding both SDPA paths unchanged, default MHA bit-identical);
 T16 = micro-batch gradient accumulation ([`docs/15-grad-accum.md`](docs/15-grad-accum.md)),
 which decouples the effective batch from activation memory (memory tracks the micro-batch,
 not N×); T18 = dropout ([`docs/17-dropout.md`](docs/17-dropout.md), hand counter-based
@@ -145,5 +149,5 @@ cargo test --workspace # autograd grad-checks, PyTorch parity, DDP, e
 - [`docs/evolution.md`](docs/evolution.md) — per-milestone changes across algorithm / architecture / infra / dataset.
 - [`docs/runs/README.md`](docs/runs/README.md) — the v0–v8 comparison; [`docs/runs/0N-*.md`](docs/runs/) — per-run detail.
-- [`docs/00-*` … `12-*`](docs/) — per-phase design docs (build chain → tensor → autograd → transformer → training → perf → distributed → export → batched → allocator → bf16 → recompute).
+- [`docs/00-*` … `14-*`](docs/) — per-phase design docs (build chain → tensor → autograd → transformer → training → perf → distributed → export → batched → allocator → bf16 → recompute → flash-attention → GQA).
 - [`docs/known-issues.md`](docs/known-issues.md) — perf backlog (KI-1/2/3/5 fixed; KI-4 + process-per-GPU open).
@@ -162,6 +162,19 @@ broadcast op, one kernel launch each for fwd/bwd; the simplest option, and it can be grad-checked on its own.
 **token-for-token identical to xtrain itself** (BF16 inference vs f32 training, the same criterion as v1–v8). This is the proof that GQA has really landed:
 the grouping on the training side, the grouping in the export, and the repeat_kv grouping on the xserv inference side all line up.

-## Measured results (dash5)
+## Measured results (dash5, 1× / 2× RTX 5090)

-> To be backfilled after a real dash5 run (gate table + numbers).
+**All hard gates green:**
+
+| Gate | Result |
+|---|---|
+| ① repeat_kv grad-check (**gradients from multiple q heads summed into one kv head**, group=3) | **pass**: din max_rel **2.05e-4**; group=1 identity is **bit-exact** in both directions (fwd/bwd \|Δ\|=0) |
+| GQA flash==composed (model level, 8h/2kv; logits / loss / per-parameter gradients) | fp32: loss rel **0.0**, logits 3.0e-4, grad **4.1e-5**; bf16: loss 9.0e-5, logits mean 2.9e-3 / p99 1.0e-2, grad scaled-mean 8.9e-3 |
+| group=1 **bit-identical** to MHA (regression guard) | **pass**: logits + loss + all gradients \|Δ\|=0 |
+| ② PyTorch GQA parity check at B>1 (composed & flash, grouping aligned with repeat_interleave) | composed: loss **1.74e-8** / logits 2.04e-5 / 25 grads within rtol; flash: loss 1.74e-8 / logits 2.28e-5 / 25 grads within rtol |
+| ③ Short training run of a small GQA config converges (8h/2kv/hd32/4L/ffn1024, 600 steps) | train **10.90→3.15**, no NaN, gnorm stable at ~1.2, samples are coherent English (~200K tok/s) |
+| ④ **xserv closed loop with real GQA** (export `num_key_value_heads=2 < num_attention_heads=8`; xserv loads `heads=8/2 kv`; greedy decoding) | the two prompts "One day" / "The little" are **token-for-token identical**; "Once upon a time" diverges late at `...Lily's mommy ` because of BF16 drift (said vs came), the same criterion as v1/v2/v3/T14 |
+| ⑤ Regression suite: autograd 23 (incl. 2 for repeat_kv) / structural 5 / batched / bf16 / flash 2 / **gqa 4** / overfit 27/27 / recompute 2 / dropout 6 / grad_accum 3 / checkpoint-roundtrip / AdamW (host vs torch 4.8e-6) / DDP 3 (`--test-threads=1`, loss 5.67e-7 + consistent across ranks) / GEMM / tensor | **all green** |
+| ⑤ Default-MHA export md5 (v3 ckpt re-exported to safetensors with the T15 code) | **bit-identical** `b04fc9f9a0c9af04c47d9ca649aea12e` (same as registry/T14) → zero drift in the default (kv=heads) export |
+
+> **Honest record**: 2 of the 3 closed-loop prompts are fully token-identical, and 1 of 3 diverges late at a BF16 drift point. This is exactly what shows the GQA grouping is **correct**: if the kv→q head mapping were wrong, attention would break from the very first generated token (it would not be a late divergence at a deep near-tie). GQA materializes K/V in device memory at the full head count, `[B·nh,S,hd]` (the cost of the broadcast-op approach); that is acceptable at this scale, and in-kernel GQA (which would save this memory) is left as a follow-up. No tolerance was loosened to get a green result.
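To make gate ① concrete, here is a minimal Python sketch of a `repeat_kv` forward and backward that follows the `dst = kvh·group + r` convention named in this commit. It is an illustration under assumptions, not the repository's kernel: heads are modeled as flat Python lists, and the function names and signatures are hypothetical.

```python
def repeat_kv(kv, group):
    """Broadcast each kv head to `group` consecutive output heads.

    kv: list of heads, each a flat list of floats (a flattened [S, hd] slab).
    Output head index follows dst = kvh * group + r for r in range(group).
    """
    out = []
    for head in kv:
        for _ in range(group):
            out.append(list(head))
    return out


def repeat_kv_backward(dout, group):
    """Sum the gradients of the `group` output heads that share one kv head."""
    assert len(dout) % group == 0
    num_kv = len(dout) // group
    din = []
    for kvh in range(num_kv):
        rows = dout[kvh * group:(kvh + 1) * group]
        # Fixed summation order (r = 0 .. group-1), so the result is deterministic.
        din.append([sum(col) for col in zip(*rows)])
    return din


if __name__ == "__main__":
    kv = [[1.0, 2.0], [3.0, 4.0]]                # 2 kv heads
    out = repeat_kv(kv, group=3)                 # 6 query-side heads
    dout = [[float(i), 1.0] for i in range(6)]   # one upstream gradient per output head
    print(out)
    print(repeat_kv_backward(dout, group=3))
```

With `group=1` both functions reduce to a copy, which is the property the "group=1 bit-identical to MHA" gate relies on.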
@@ -25,6 +25,7 @@
 | T12 | Algorithm/Infra | **bf16 mixed precision** (fp32 master, cuBLAS GemmEx, norm/softmax/CE kept in fp32) | dim768 OOM resolved, −29% memory / +13% tok/s (fixes KI-2) |
 | T13 | Algorithm/Infra | **activation recompute** (per-block gradient checkpointing: no-tape forward + recompute in backward, `backward_seeded`) | gradients **bit-identical** to the non-recompute version (0.00); dim768 31.1→14.6GB; **dim1024 batch32 OOM → fits in 16.6GB** (fixes KI-3, unlocks v8) |
 | T14 | Algorithm/Infra | **fused flash-attention kernel** (hand-written single kernel: online softmax, tiled over KV, **no materialized N×N scores**; flash-style bwd: recompute scores + `D=ΣdO·O` to simplify the Jacobian + dQ/dK/dV); opt-in `--flash`, default stays composed (Phase 2) | fwd vs composed 6.7e-5, bwd vs composed dQ 1.7e-5, PyTorch B>1 7.9e-6, flash==composed loss rel 0.0; **peak memory −16%@seq1024 / −23%@seq2048** (no materialized N×N; the gain grows with seq); tok/s ~2.3–2.8× slower (at hd=64 the small head dim cannot beat cuBLAS tensor cores; the known flash trade-off is that it wins on memory); md5 closed loop bit-identical |
+| T15 | Model architecture | **real GQA** (`num_kv_heads<num_heads`: wk/wv project to `kv_dim`; a new `repeat_kv` broadcast op copies K/V `group=nh/num_kv` times and feeds the **unchanged** composed/flash SDPA paths; the grouping convention matches xserv repeat_kv `dst=kvh·group+r`); `repeat_kv` backward = **deterministic sum** over the `group` rows of each group (no atomics) → gradients from multiple q heads flow into one kv head; `num_kv_heads` added to Config (default = nh → MHA), `--kv-heads` flag, export writes the real `num_key_value_heads` (Phase 2) | repeat_kv grad-check 2.1e-4 (group 3) + group-1 identity bit-exact; GQA flash==composed fp32 grad 4.1e-5 / bf16 within the tolerance band; **group 1 bit-identical to MHA** (regression guard); PyTorch GQA B>1 parity for composed/flash, each with loss 1.7e-8 / logits 2.3e-5 / 25 grads within rtol; small GQA (8h/2kv) trained for 600 steps, 10.9→3.15, coherent; **xserv closed loop with real GQA** (num_kv 2<8): 2/3 prompts token-identical, 1 diverges late at a BF16 drift point; default-MHA export md5 bit-identical (b04fc9f9) |
 | T16 | Algorithm/Infra | **gradient accumulation** (N micro-steps: each micro-loss is scaled `×1/N` before backward, the tape's SUM accumulates → one AdamW step + zero; `--accum-steps`); **DDP all-reduces only at the accumulation boundary** (no NCCL traffic on intermediate micro-steps; `/world` and `1/N` are orthogonal); memory tracks the micro-batch, not the effective batch | **tracks the equivalent big batch to near bit level** (loss rel 8.5e-8, grad rel 3.8e-5); `accum=1` bit-exact regression (0.00); DDP+accum vs single GPU loss 5.7e-7 / consistent across ranks; **flat memory**: at the same effective batch 64, big-batch 27.7GB → accum (4×16) **7.2GB (−74%)** (big-batch OOMs while accum fits); full regression + xserv closed loop md5 identical |
 | T18 | Algorithm | **dropout** (hand-written counter-based device RNG → Bernoulli mask; inverted 1/(1-p) scaling in training, identity in eval); new autodiff `dropout` op (fwd generates + applies the mask, bwd reuses the same mask), wired in at two sites, residual and ffn; `--dropout` flag defaults to 0 | fixed-seed grad-check passes; E[out]≈input and keep≈1-p; **p=0 bit-identical to no dropout**; gradients stay bit-identical when combined with recompute (T13) (the counter-based seed reproduces the same mask on recompute); full regression + xserv closed loop green (dropout off for export/inference) |
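The T16 row rests on a small identity: scaling each micro-loss by `1/N` and summing the resulting gradients gives the gradient of the mean loss over the whole batch, provided the micro-batches are equal in size. The sketch below checks that on a one-parameter model `y = w·x` with a squared-error loss. The model, the data, and the function names are assumptions chosen for illustration, not code from the repository.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accum_steps):
    """Split the batch into equal micro-batches and accumulate grads scaled by 1/N."""
    assert len(xs) % accum_steps == 0
    micro = len(xs) // accum_steps
    total = 0.0
    for i in range(accum_steps):
        sl = slice(i * micro, (i + 1) * micro)
        # Scaling a micro-loss by 1/N scales its gradient by 1/N as well.
        total += grad_mse(w, xs[sl], ys[sl]) / accum_steps
    return total


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 9)]
    ys = [2.0 * x + 1.0 for x in xs]
    print(grad_mse(0.5, xs, ys))             # one big batch of 8
    print(accumulated_grad(0.5, xs, ys, 4))  # 4 micro-batches of 2
```

The two values agree up to floating-point summation order, which is consistent with the small but nonzero relative differences reported in the row.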
@@ -53,7 +54,7 @@
 ## 3. Cumulative evolution along each dimension (how each line progressed, viewed per axis)

 - **Algorithm**: hand-written autograd (tape) + fan-out accumulation → AdamW/LR-sched/grad-clip → +QK-norm (Qwen3) → batched forward → bf16 mixed precision (fp32 master) → activation recompute (T13) → fused flash-attention (T14, online softmax + flash-style bwd) → gradient accumulation (T16, reuses the tape's SUM; equivalent to a big batch while memory tracks the micro-batch) → dropout (T18, counter-based device RNG + inverted scaling, train/eval switch).
-- **Model architecture**: fixed Qwen3-style; dim **32→256→384→512→768→1024** (v8 is the first to turn the capacity dial, heads 24→32); core parameters **41K→226M** (total 3.26M→329M).
+- **Model architecture**: fixed Qwen3-style; dim **32→256→384→512→768→1024** (v8 is the first to turn the capacity dial, heads 24→32); core parameters **41K→226M** (total 3.26M→329M). +QK-norm (T9, Qwen3-compatible) → **real GQA (T15, `num_kv_heads<num_heads`, repeat_kv broadcast + within-group gradient sum; default = nh → bit-exact MHA regression)**: the architecture is now filled out to the modern LLM standard (MHA/GQA/MQA on a single `num_kv_heads` axis), both SDPA paths (composed/flash) share the same broadcast, and the export writes the real `num_key_value_heads` with the xserv closed loop passing.
 - **Infra**: single-GPU fp32 → cuBLAS/GPU-optim (T7) → NCCL DDP (T8) → batched forward (T10) → caching allocator (T11) → bf16 (T12) → activation recompute (T13, unlocks dim1024) → flash-attention (T14, no materialized N×N; the attention memory gain grows with seq) → gradient accumulation (T16, DDP communicates only at the accumulation boundary; memory tracks the micro-batch, not the effective batch). Throughput **3.3K→217K tok/s** (dim768 bf16), dim1024 + recompute ~129K (the recompute tax); MFU **0.4%→17%** (each gain corresponds to a piece of perf infrastructure; see known-issues + the MFU analysis). T13/T14/T16 are three **memory levers** (recompute lowers the activation peak, flash avoids materializing the N×N attention scores, gradient accumulation decouples the effective batch from activation memory) that can be stacked to enlarge the effective batch.
 - **Dataset**: 3MB TinyStories slice → full TinyStories (epoch 0.01→5.33, **to saturation**) → **v6 graduates to real FineWeb-edu web text** (2.255B corpus, 1.02ep) → **v7 same subset, multiple epochs (1.45ep, near the ceiling) → v8 same subset with a larger model** (dim1024, 1.05ep). The tokenizer is gpt2 BPE throughout (reusing xserv-tokenizer; v6 deliberately keeps the tokenizer unchanged to isolate the "data source" variable, and KI-4 is left for a later version).
 - **The qualitative shift on the data axis, v5→v6**: v0–v5 all consumed synthetic children's stories (TinyStories: low entropy, controlled vocabulary), and v5 showed that a model of that size had already saturated on it; v6 is the first version to switch to **real educational web text** (FineWeb-edu), a qualitative change in the kind of language: samples go from "can only write little stories" to "can write history / science / expository text".
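The T18 properties listed in this file (p=0 identity, identity in eval, and the same mask on recompute) all follow from deriving the mask as a pure function of `(seed, element index)`. The sketch below illustrates that with a SplitMix64-style integer mixer in plain Python. The mixer, the function names, and the signatures are assumptions for the example; the repository's device RNG is not shown in this diff and may differ.

```python
MASK64 = (1 << 64) - 1


def mix(seed, counter):
    """SplitMix64-style mixer: a pure function of (seed, counter), no hidden state."""
    z = (seed + 0x9E3779B97F4A7C15 * (counter + 1)) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


def keep_mask(n, p, seed):
    """Keep element i when its uniform draw in [0, 1) is >= p."""
    return [(mix(seed, i) >> 11) / float(1 << 53) >= p for i in range(n)]


def dropout(xs, p, seed, training=True):
    """Inverted dropout: scale kept elements by 1/(1-p) in training, identity in eval."""
    if not training:
        return list(xs)
    scale = 1.0 / (1.0 - p)
    return [x * scale if keep else 0.0 for x, keep in zip(xs, keep_mask(len(xs), p, seed))]


def dropout_backward(douts, p, seed):
    """Regenerate the same mask from (seed, index) and apply it to the gradient."""
    scale = 1.0 / (1.0 - p)
    return [g * scale if keep else 0.0 for g, keep in zip(douts, keep_mask(len(douts), p, seed))]


if __name__ == "__main__":
    xs = [0.1 * i for i in range(8)]
    print(dropout(xs, 0.5, seed=7))
    print(dropout(xs, 0.5, seed=7))  # same seed, same mask
```

Because no RNG state is carried between calls, a recomputed forward pass regenerates exactly the mask the first pass used.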