From 71b0a1621fc0926f266418af49d38f5c304460e2 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Thu, 18 Jun 2026 18:03:14 +0800 Subject: [PATCH] =?UTF-8?q?docs:=20T17=20process-per-GPU=20results=20?= =?UTF-8?q?=E2=80=94=20measured=20throughput-neutral?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Records the key empirical finding: process-per-GPU is statistically identical to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all 8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the old KI-5/T11 note speculated — measurement falsifies that hypothesis (same methodology as T11 falsifying "bucket the all-reduce"). Correctness all green: proc==thread loss 1.5e-7, cross-rank 1.2e-7, full regression + xserv md5 b04fc9f9 identical. Closes the process-per-GPU backlog item (measured no-op); default training path unchanged. evolution.md Infra row + README T17 row + known-issues entry. Co-Authored-By: Claude Opus 4.8 --- README.md | 13 +++++++-- docs/16-process-per-gpu.md | 57 +++++++++++++++++++++++++++++++++++--- docs/evolution.md | 8 ++++-- docs/known-issues.md | 22 ++++++++++++++- 4 files changed, 89 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index 35fba11..08f9cca 100644 --- a/README.md +++ b/README.md @@ -26,7 +26,7 @@ borrows, the rest hand-written CUDA + Rust: | `xtrain-model` | tiny **Qwen3-style** transformer (RoPE + RMSNorm + QK-norm + SwiGLU), batched forward | | `xtrain-optim` | hand-written **AdamW** (host + GPU kernels) | | `xtrain-train` | training loop, LR schedule, grad clip, checkpoint, BPE corpus + cache, samplers, safetensors export | -| `xtrain-distributed` | **NCCL DDP** (thread-per-GPU, all-reduce) | +| `xtrain-distributed` | **NCCL DDP** (thread-per-GPU + torchrun-style process-per-GPU, all-reduce) | Every op's backward is verified against **finite differences** and against **PyTorch** (forward + per-parameter grads, batch > 1). Trained weights export to HF-safetensors and @@ -53,6 +53,7 @@ Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`] | **T14** | **fused flash-attention** kernel (online softmax, no materialized N×N; opt-in `--flash`) | peak mem −16%@1k / −23%@2k seq; flash==composed (grads/PyTorch) | | **T15** | **grouped-query attention** (`num_kv_heads1; **xserv closed loop with real `num_key_value_heads`** token-identical | | **T16** | **gradient accumulation** (`--accum-steps`; DDP all-reduces only at the boundary) | equiv to N× big batch (grad 3.8e-5); same effective-64 batch 27.7GB→7.2GB (−74%) | +| **T17** | **process-per-GPU** DDP (torchrun-style: 1 worker process / CUDA context per GPU; launcher mints `ncclUniqueId` → hex env injection; `train_rank` reused unchanged; thread-per-GPU path kept) | proc==thread loss 1.5e-7, cross-rank 1.2e-7, xserv md5 identical · **measured no-op on throughput**: thread 5.27× vs proc 5.31×@8 (8 GPUs 95–99% util) → residual non-linearity is NCCL/PCIe, *not* CUDA-context serialization (falsifies the old KI-5 hypothesis) | | **T18** | **dropout** (hand counter-based device RNG + mask, inverted scaling, train/eval switch) | fixed-seed grad-check; **p=0 bit-identical**; recompute-safe | The four performance fixes (T10–T13) each removed a real bottleneck — see @@ -64,8 +65,14 @@ num_heads` via a `repeat_kv` broadcast op whose backward sums each kv head's que group — feeding both SDPA paths unchanged, default MHA bit-identical); T16 = micro-batch gradient accumulation ([`docs/15-grad-accum.md`](docs/15-grad-accum.md)), which decouples the effective batch from activation memory (memory tracks the micro-batch, -not N×); T18 = dropout ([`docs/17-dropout.md`](docs/17-dropout.md), hand counter-based -device RNG + mask, inverted scaling, train/eval switch). +not N×); T17 = torchrun-style process-per-GPU DDP +([`docs/16-process-per-gpu.md`](docs/16-process-per-gpu.md), one process + CUDA context per +GPU, launcher-minted `ncclUniqueId` via env injection, reusing the T8 training step +unchanged) — which **measured** that, at this scale, separate contexts give no throughput +gain over thread-per-GPU (the residual ~5.3×@8 is the NCCL/PCIe communication wall, not +single-context serialization as the old KI-5 note speculated); T18 = dropout +([`docs/17-dropout.md`](docs/17-dropout.md), hand counter-based device RNG + mask, inverted +scaling, train/eval switch). ## The scaling study — v0 → v8 diff --git a/docs/16-process-per-gpu.md b/docs/16-process-per-gpu.md index 3946f18..47ff192 100644 --- a/docs/16-process-per-gpu.md +++ b/docs/16-process-per-gpu.md @@ -128,9 +128,12 @@ thread-per-GPU 的残留非线性(KI-5 / T11 doc)来自:**N rank 线程共 pool allocator 消掉了 malloc 这一最大笔,但 launch / cuBLAS 串行仍在,表现为 8 卡 ~5× 而非 ~8×。 process-per-GPU 下**每个 rank 是独立进程 → 独立 CUDA context → 独立 driver 状态**:各进程的 kernel -launch / cuBLAS 调用**互不在同一 context 排队**,残留串行被结构性移除。这正是闸门 ③(before→after 线性度) -要量出来的东西——若 process-per-GPU 把 8 卡从 ~5× 推到明显更高,即验证了根因诊断。**诚实原则**:若提升 -有限,如实报告(说明残留瓶颈转移到 NCCL all-reduce / PCIe 拓扑,那是另一层,非本任务 scope)。 +launch / cuBLAS 调用**互不在同一 context 排队**,残留串行(按此假设)应被结构性移除。这正是闸门 ③ +(before→after 线性度)要量出来的东西——若 process-per-GPU 把 8 卡从 ~5× 推到明显更高,即验证此假设。 +**诚实原则**:若提升有限,如实报告(说明残留瓶颈在 NCCL all-reduce / PCIe 拓扑,那是另一层,非本任务 scope)。 + +> ⚠️ **此假设被实测证伪**——见下方「实测结果 · 闸门 ③」:process-per-GPU 与 thread-per-GPU 吞吐统计上一致 +> (~5.3×@8 都一样),且 8 卡全 95–99% util。残留非线性是通信/PCIe 墙,不是单 context 串行。结论钉死、留档。 ### ④ 训练 step / 一致性论证:原样复用 T8,零改动 @@ -206,7 +209,53 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 cargo run -p xtrain-distributed --release --bin tra --dim 384 --heads 12 --head-dim 32 --layers 12 --ffn 1536 --steps 200 --batch 128 --seq 256 ``` -实测数字回填见 xtrain.md T17 note / commit。 +实测数字见下方「实测结果」。 + +## 实测结果(dash5, 8× RTX 5090, sm_120) + +### 正确性(闸门 ①②④ 全绿) + +- **闸门 ① loss 对单卡 / 对 thread 路径**(`ddp_proc`, world=2,合成语料 20 步): + proc-per-GPU vs single-GPU `max_rel = 5.67e-7`;**proc-per-GPU vs thread-per-GPU `max_rel = 1.5e-7`** + (两条路径数值同量级,符合预期——只差进程/线程,sharding+all-reduce 同)。 +- **闸门 ② 跨 rank 参数**:`max|p0−p1| = 1.19e-7`(< 1e-6,KI-5 既有 ULP 容差,PCIe NCCL run-to-run 抖动)。 +- **闸门 ④ 全回归**:全 workspace `--test-threads=1` 全绿(autograd/structural/batched/bf16/recompute/ + overfit/AdamW/既有 DDP/flash/gqa/grad_accum/dropout)+ **xserv 闭环**:v3 ckpt 用 T17 代码重导 + safetensors 与 registry **md5 逐位一致 `b04fc9f9a0c9af04c47d9ca649aea12e`**(T17 不碰任何数值路径 → 必然一致)。 + +### 闸门 ③ 线性度 before→after —— **本任务的关键发现:process-per-GPU 在本尺度对吞吐中性** + +固定每卡 batch 32 / dim384 / seq256 / 150 步(与 T11 KI-5 表同口径),steady-state tok/s: + +| world | thread-per-GPU (`train_ddp`) | speedup | process-per-GPU (`train_ddp_mp`) | speedup | +|---|---|---|---|---| +| 1 | 93257 | 1.00× | 92952 | 1.00× | +| 2 | 149747 | 1.61× | 148809 | 1.60× | +| 4 | 278276 | 2.98× | 273308 | 2.94× | +| 8 | **491360** | **5.27×** | **493128** | **5.31×** | + +(world=8 重复 2 次确认非噪声:thread 493671/493292,proc 491102/494123——**两路差异 < 1%,落在 run-to-run 噪声内**。) + +→ **process-per-GPU 与 thread-per-GPU 吞吐统计上一致(~5.3×@8 都一样)**。本 doc 设计假设 ③ +(「残留 5×@8 来自单 CUDA context 的 kernel-launch/cuBLAS 串行,process-per-GPU 给独立 context 即可移除」) +**被实测证伪**——这正是 ③ 里预留的「诚实原则」分支。 + +**根因重定位(实测佐证)**:proc-per-GPU world=8 跑时 `nvidia-smi` 抽样 **8 卡全部 95–99% util** +(每卡 ~23GB)——GPU **已 compute-bound 喂满、并非串行空转**(KI-5 当年「1–2/8 在忙」的串行病在 T11 的 +caching allocator 就已治好)。8 卡已满载却仍只 5.3×,缺的 ~35% 吞吐只能去向**每步 grad all-reduce + +本机 PCIe-only 拓扑在 8 rank 下的通信开销**——即 T11 早已点明的「~7% all-reduce + 8 卡 PCIe 余量」那一层, +在 8 卡下被放大。换独立 context 不动这一层,故吞吐不变。 + +**这与 T11 自身的方法论一致**:T11 实测证伪了「分桶 all-reduce」;T17 实测证伪了「process-per-GPU 解残留 +串行」。两次都靠 profile/measure 推翻假设而非硬上。**结论**:本尺度(dim384–1024、单机 8× PCIe RTX 5090) +残留非线性是**通信/拓扑墙**,不是 launch 模型;要再逼近线性得动 all-reduce overlap / 更快互联(NVLink), +那是另一条线,**非 T17 scope**。 + +**T17 的净价值(诚实记账)**:① 学到 / 落地了 torchrun 式 process-per-GPU 这条训练栈标准链路(独立进程 + +独立 CUDA context + 跨进程 NCCL bootstrap)——**项目本职「学训练全栈」的目标达成**;② **实测把「process-per-GPU +是残留非线性的解」这个长期挂在 KI-5/T11 doc 里的猜想钉死为「在本尺度无吞吐收益」**,移除一个误导性 backlog +项;③ 正确性零回归、与 thread 路径数值对齐。**吞吐上它与 thread-per-GPU 等价**——故默认训练路径**不变** +(thread-per-GPU 仍是 v1–v8 用的那条),process-per-GPU 作为并列可选路径 + 这条诊断结论留档。 ## 不做(本任务范围外,记 follow-up) diff --git a/docs/evolution.md b/docs/evolution.md index 7f61eae..86cbccc 100644 --- a/docs/evolution.md +++ b/docs/evolution.md @@ -28,6 +28,7 @@ | T15 | 模型架构 | **真 GQA**(`num_kv_heads1 对拍 composed/flash 各 loss 1.7e-8/logits 2.3e-5/25 grad 进 rtol;小 GQA(8h/2kv) 训 600 步 10.9→3.15 连贯;**xserv 闭环真 GQA**(num_kv 2<8):2/3 prompt token-identical、1 在 BF16 漂移处晚分叉;MHA 默认 export md5 逐位一致(b04fc9f9) | | T16 | 算法/Infra | **梯度累积**(N 个 micro-step:每个 micro-loss `×1/N` 再 backward,tape SUM 累加 → 一次 AdamW step+zero;`--accum-steps`);**DDP 只在累积边界 all-reduce**(中间 micro-step 不发 NCCL,`/world` 与 `1/N` 正交);显存随 micro 不随有效 batch | 等效大 batch**逐位贴合**(loss rel 8.5e-8、grad rel 3.8e-5);`accum=1` 逐位回归(0.00);DDP+accum 对单卡 loss 5.7e-7/跨 rank 一致;**显存平**:同有效 batch 64,big-batch 27.7GB→accum(4×16) **7.2GB(−74%)**(big-batch OOM 而 accum 装下);全回归+xserv 闭环 md5 一致 | | T18 | 算法 | **dropout**(手写 counter-based 设备 RNG → Bernoulli mask,训练 inverted 1/(1-p) scaling、eval 恒等);新 autodiff `dropout` 算子(fwd 生成+施加 mask,bwd 用同 mask),接 residual/ffn 两处;`--dropout` flag 默认 0 | 固定 seed grad-check 过;E[out]≈input + keep≈1-p;**p=0 与无 dropout 逐位一致**;recompute(T13) 组合下梯度仍逐位一致(counter-based seed 重算复现同 mask);全回归 + xserv 闭环绿(导出/推理 dropout 关) | +| T17 | Infra | **process-per-GPU**(torchrun 式:`launch_processes` 每卡 spawn 一个 worker 进程=独立 CUDA context;launcher 一次性铸 `ncclUniqueId` 后 **hex 编码注入子进程 env**——无共享 FS/TCP、无竞态;worker 读 env→bind device→`DdpContext::init`+`build_model`+`train_rank` **全复用 T8 零改动**;新 `train_ddp_mp` bin/`ddp_proc` test,**保留 thread-per-GPU 旧路径**);scope=process-per-GPU only(ZeRO-1 用户 drop)(Phase 2) | 正确性全绿:proc vs 单卡 loss 5.67e-7、**proc vs thread-per-GPU 1.5e-7**、跨 rank 1.19e-7(<1e-6)、全回归+xserv 闭环 md5 逐位一致 `b04fc9f9`。**⚠️关键发现(实测证伪原假设):本尺度 process-per-GPU 对吞吐中性**——thread vs proc @ {1,2,4,8} = {1.00/1.61/2.98/**5.27**}× vs {1.00/1.60/2.94/**5.31**}×(差<1% 噪声内);8 卡全 95–99% util ⇒ 残留 ~5.3×@8 非线性是 **NCCL all-reduce + 本机 PCIe 拓扑墙**,**非**单 CUDA context 串行(KI-5/T11 doc 的猜想被钉死推翻,方法论同 T11 证伪「分桶 all-reduce」)。净价值=落地 torchrun 式标准链路 + 把误导性 backlog 项实测关闭;默认训练路径不变 | --- @@ -55,7 +56,7 @@ - **算法**:手写 autograd(tape)+扇出累加 → AdamW/LR-sched/grad-clip → +QK-norm(Qwen3) → batched forward → bf16 混合精度(fp32 master) → 激活重计算(T13) → 融合 flash-attention(T14,online softmax + flash 式 bwd) → 梯度累积(T16,复用 tape SUM,等效大 batch 而显存随 micro) → dropout(T18,counter-based 设备 RNG + inverted scaling,train/eval 切换)。 - **模型架构**:固定 Qwen3-style;dim **32→256→384→512→768→1024**(v8 首拨容量轴,头数 24→32);核心参数 **41K→226M**(总 3.26M→329M)。+QK-norm(T9,Qwen3 兼容) → **真 GQA(T15,`num_kv_heads