From 71b0a1621fc0926f266418af49d38f5c304460e2 Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Thu, 18 Jun 2026 18:03:14 +0800
Subject: [PATCH] =?UTF-8?q?docs:=20T17=20process-per-GPU=20results=20?=
 =?UTF-8?q?=E2=80=94=20measured=20throughput-neutral?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Records the key empirical finding: process-per-GPU is statistically identical
to thread-per-GPU at this scale (thread 5.27x vs proc 5.31x @8, <1% noise; all
8 GPUs 95-99% util). The residual ~5.3x@8 non-linearity is the NCCL/PCIe
communication wall, NOT single-CUDA-context launch/cuBLAS serialization as the
old KI-5/T11 note speculated — measurement falsifies that hypothesis (same
methodology as T11 falsifying "bucket the all-reduce"). Correctness all green:
proc==thread loss 1.5e-7, cross-rank 1.2e-7, full regression + xserv md5
b04fc9f9 identical. Closes the process-per-GPU backlog item (measured no-op);
default training path unchanged. Adds the evolution.md Infra row, the README T17
row, and a known-issues entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 README.md                  | 13 +++++++--
 docs/16-process-per-gpu.md | 57 +++++++++++++++++++++++++++++++++++---
 docs/evolution.md          |  8 ++++--
 docs/known-issues.md       | 22 ++++++++++++++-
 4 files changed, 89 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 35fba11..08f9cca 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@ borrows, the rest hand-written CUDA + Rust:
 | `xtrain-model` | tiny **Qwen3-style** transformer (RoPE + RMSNorm + QK-norm + SwiGLU), batched forward |
 | `xtrain-optim` | hand-written **AdamW** (host + GPU kernels) |
 | `xtrain-train` | training loop, LR schedule, grad clip, checkpoint, BPE corpus + cache, samplers, safetensors export |
-| `xtrain-distributed` | **NCCL DDP** (thread-per-GPU, all-reduce) |
+| `xtrain-distributed` | **NCCL DDP** (thread-per-GPU + torchrun-style process-per-GPU, all-reduce) |
 
 Every op's backward is verified against **finite differences** and against **PyTorch**
 (forward + per-parameter grads, batch > 1). Trained weights export to HF-safetensors and
@@ -53,6 +53,7 @@ Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`]
 | **T14** | **fused flash-attention** kernel (online softmax, no materialized N×N; opt-in `--flash`) | peak mem −16%@1k / −23%@2k seq; flash==composed (grads/PyTorch) |
 | **T15** | **grouped-query attention** (`num_kv_heads<num_heads`; `repeat_kv` broadcast feeds both SDPA paths; backward sums each kv head's group; `--kv-heads`) | repeat_kv grad-check + **group=1 bit-identical to MHA**; GQA flash==composed; PyTorch GQA B>1; **xserv closed loop with real `num_key_value_heads`** token-identical |
 | **T16** | **gradient accumulation** (`--accum-steps`; DDP all-reduces only at the boundary) | equiv to N× big batch (grad 3.8e-5); same effective-64 batch 27.7GB→7.2GB (−74%) |
+| **T17** | **process-per-GPU** DDP (torchrun-style: 1 worker process / CUDA context per GPU; launcher mints `ncclUniqueId` → hex env injection; `train_rank` reused unchanged; thread-per-GPU path kept) | proc==thread loss 1.5e-7, cross-rank 1.2e-7, xserv md5 identical · **measured no-op on throughput**: thread 5.27× vs proc 5.31×@8 (8 GPUs 95–99% util) → residual non-linearity is NCCL/PCIe, *not* CUDA-context serialization (falsifies the old KI-5 hypothesis) |
 | **T18** | **dropout** (hand counter-based device RNG + mask, inverted scaling, train/eval switch) | fixed-seed grad-check; **p=0 bit-identical**; recompute-safe |
 
 The four performance fixes (T10–T13) each removed a real bottleneck — see
@@ -64,8 +65,14 @@ num_heads` via a `repeat_kv` broadcast op whose backward sums each kv head's que
 group — feeding both SDPA paths unchanged, default MHA bit-identical);
 T16 = micro-batch gradient accumulation ([`docs/15-grad-accum.md`](docs/15-grad-accum.md)),
 which decouples the effective batch from activation memory (memory tracks the micro-batch,
-not N×); T18 = dropout ([`docs/17-dropout.md`](docs/17-dropout.md), hand counter-based
-device RNG + mask, inverted scaling, train/eval switch).
+not N×); T17 = torchrun-style process-per-GPU DDP
+([`docs/16-process-per-gpu.md`](docs/16-process-per-gpu.md), one process + CUDA context per
+GPU, launcher-minted `ncclUniqueId` via env injection, reusing the T8 training step
+unchanged) — which **measured** that, at this scale, separate contexts give no throughput
+gain over thread-per-GPU (the residual ~5.3×@8 is the NCCL/PCIe communication wall, not
+single-context serialization as the old KI-5 note speculated); T18 = dropout
+([`docs/17-dropout.md`](docs/17-dropout.md), hand counter-based device RNG + mask, inverted
+scaling, train/eval switch).
 
 ## The scaling study — v0 → v8
 
diff --git a/docs/16-process-per-gpu.md b/docs/16-process-per-gpu.md
index 3946f18..47ff192 100644
--- a/docs/16-process-per-gpu.md
+++ b/docs/16-process-per-gpu.md
@@ -128,9 +128,12 @@ thread-per-GPU 的残留非线性（KI-5 / T11 doc）来自：**N rank 线程共
 The pool allocator removed malloc, the largest cost, but launch / cuBLAS serialization remains, showing up as ~5× on 8 GPUs instead of ~8×.
 
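 Under process-per-GPU, **each rank is its own process → its own CUDA context → its own driver state**: each process's kernel
-launch / cuBLAS calls **never queue in the same context**, so the residual serialization is structurally removed. This is exactly what gate ③ (before→after linearity)
-is meant to measure: if process-per-GPU pushes 8 GPUs from ~5× to clearly higher, the root-cause diagnosis is confirmed. **Honesty principle**: if the gain
-is limited, report it as measured (it means the residual bottleneck has moved to NCCL all-reduce / PCIe topology, a separate layer outside this task's scope).
+launch / cuBLAS calls **never queue in the same context**, so the residual serialization should (under this hypothesis) be structurally removed. This is exactly what gate ③
+(before→after linearity) is meant to measure: if process-per-GPU pushes 8 GPUs from ~5× to clearly higher, the hypothesis is confirmed.
+**Honesty principle**: if the gain is limited, report it as measured (it means the residual bottleneck lies in NCCL all-reduce / PCIe topology, a separate layer outside this task's scope).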
+
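+> ⚠️ **This hypothesis was falsified by measurement** (see "Measured results · Gate ③" below): process-per-GPU and thread-per-GPU throughput are statistically identical
+> (both ~5.3×@8), with all 8 GPUs at 95–99% util. The residual non-linearity is the communication/PCIe wall, not single-context serialization. The conclusion is settled and kept on record.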
 
 ### ④ Training step / consistency argument: reuse T8 as-is, zero changes
 
@@ -206,7 +209,53 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 cargo run -p xtrain-distributed --release --bin tra
   --dim 384 --heads 12 --head-dim 32 --layers 12 --ffn 1536 --steps 200 --batch 128 --seq 256
 ```
 
-Measured numbers to be backfilled: see the xtrain.md T17 note / commit.
+For the measured numbers, see "Measured results" below.
+
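+## Measured results (dash5, 8× RTX 5090, sm_120)
+
+### Correctness (gates ①②④ all green)
+
+- **Gate ① loss vs single-GPU / vs the thread path** (`ddp_proc`, world=2, synthetic corpus, 20 steps):
+  proc-per-GPU vs single-GPU `max_rel = 5.67e-7`; **proc-per-GPU vs thread-per-GPU `max_rel = 1.5e-7`**
+  (the two paths agree to the same order of magnitude, as expected: they differ only in process vs thread, while sharding + all-reduce are the same).
+- **Gate ② cross-rank parameters**: `max|p0−p1| = 1.19e-7` (< 1e-6, the existing KI-5 ULP tolerance for run-to-run jitter of NCCL over PCIe).
+- **Gate ④ full regression**: the whole workspace passes with `--test-threads=1` (autograd/structural/batched/bf16/recompute/
+  overfit/AdamW/existing DDP/flash/gqa/grad_accum/dropout) + **xserv closed loop**: the v3 ckpt re-exported with T17 code yields
+  safetensors whose **md5 is bit-identical to the registry's, `b04fc9f9a0c9af04c47d9ca649aea12e`** (T17 touches no numerical path, so they must match).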
+
+### Gate ③ linearity before→after: **this task's key finding is that process-per-GPU is throughput-neutral at this scale**
+
+Fixed per-GPU batch 32 / dim384 / seq256 / 150 steps (same setup as the T11 KI-5 table), steady-state tok/s:
+
+| world | thread-per-GPU (`train_ddp`) | speedup | process-per-GPU (`train_ddp_mp`) | speedup |
+|---|---|---|---|---|
+| 1 | 93257 | 1.00× | 92952 | 1.00× |
+| 2 | 149747 | 1.61× | 148809 | 1.60× |
+| 4 | 278276 | 2.98× | 273308 | 2.94× |
+| 8 | **491360** | **5.27×** | **493128** | **5.31×** |
+
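+(world=8 was repeated twice to check against noise: thread 493671/493292, proc 491102/494123. **The two paths differ by < 1%, within run-to-run noise**.)
+
+→ **process-per-GPU and thread-per-GPU throughput are statistically identical (both ~5.3×@8)**. This doc's design hypothesis ③
+("the residual 5×@8 comes from kernel-launch/cuBLAS serialization in a single CUDA context, and process-per-GPU removes it by giving each rank its own context")
+**is falsified by measurement**. This is exactly the "honesty principle" branch reserved in ③.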
+
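+**Root cause relocated (backed by measurement)**: during the proc-per-GPU world=8 run, `nvidia-smi` sampling showed **all 8 GPUs at 95–99% util**
+(~23GB per GPU). The GPUs are **already compute-bound and fully fed, not idling on serialization** (the KI-5-era serialization symptom of "1–2 of 8 busy" was already cured by T11's
+caching allocator). With 8 GPUs saturated yet only 5.3×, the missing ~35% throughput can only go to **per-step grad all-reduce +
+the communication overhead of this machine's PCIe-only topology at 8 ranks**, i.e. the "~7% all-reduce + 8-GPU PCIe headroom" layer that T11 already pointed out,
+amplified at 8 GPUs. Separate contexts do not touch this layer, so throughput is unchanged.
+
+**This matches T11's own methodology**: T11 falsified "bucket the all-reduce" by measurement; T17 falsified "process-per-GPU fixes the residual
+serialization" by measurement. Both times a profile/measurement overturned the hypothesis instead of forcing the change through. **Conclusion**: at this scale (dim384–1024, single node, 8× PCIe RTX 5090)
+the residual non-linearity is a **communication/topology wall**, not the launch model; getting closer to linear requires all-reduce overlap or a faster interconnect (NVLink),
+which is a separate line of work and **out of T17 scope**.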
+
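+**Net value of T17 (honest accounting)**: ① learned and landed the torchrun-style process-per-GPU path that is standard in training stacks (separate processes +
+separate CUDA contexts + cross-process NCCL bootstrap), so **the project's core goal of "learning the full training stack" is met**; ② **measurement settled the conjecture that "process-per-GPU
+is the fix for the residual non-linearity", long parked in the KI-5/T11 doc, as "no throughput gain at this scale"**, removing a misleading backlog
+item; ③ zero correctness regressions, numerically aligned with the thread path. **On throughput it is equivalent to thread-per-GPU**, so the default training path is **unchanged**
+(thread-per-GPU remains the path used by v1–v8); process-per-GPU stays as a parallel optional path, with this diagnostic conclusion on record.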
 
 ## Not done (out of scope for this task; logged as follow-up)
 
diff --git a/docs/evolution.md b/docs/evolution.md
index 7f61eae..86cbccc 100644
--- a/docs/evolution.md
+++ b/docs/evolution.md
@@ -28,6 +28,7 @@
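 | T15 | Model architecture | **True GQA** (`num_kv_heads<num_heads`: wk/wv project to `kv_dim`; a new `repeat_kv` broadcast op replicates K/V `group=nh/num_kv` times and feeds the **unchanged** composed/flash SDPA paths; grouping convention matches xserv repeat_kv `dst=kvh·group+r`); `repeat_kv` backward = **deterministic sum** over each group's rows (no atomics) → gradients from multiple q heads reduce into one kv head; `num_kv_heads` added to Config (default=nh→MHA), `--kv-heads` flag, export writes the real `num_key_value_heads` (Phase 2) | repeat_kv grad-check 2.1e-4 (group3) + group1 identity bit-exact; GQA flash==composed fp32 grad 4.1e-5 / bf16 within band; **group1 bit-identical to MHA** (regression guard); PyTorch GQA B>1 cross-check, composed/flash each: loss 1.7e-8 / logits 2.3e-5 / 25 grads within rtol; small GQA (8h/2kv) trained 600 steps 10.9→3.15, coherent; **xserv closed loop with true GQA** (num_kv 2<8): 2/3 prompts token-identical, 1 diverges late at BF16 drift; MHA default export md5 bit-identical (b04fc9f9) |
 | T16 | Algorithm/Infra | **Gradient accumulation** (N micro-steps: each micro-loss scaled `×1/N` then backward, tape SUM accumulates → one AdamW step+zero; `--accum-steps`); **DDP all-reduces only at the accumulation boundary** (intermediate micro-steps issue no NCCL; `/world` and `1/N` are orthogonal); memory tracks the micro-batch, not the effective batch | Equivalent to the big batch, **matching closely** (loss rel 8.5e-8, grad rel 3.8e-5); `accum=1` bit-exact regression (0.00); DDP+accum vs single-GPU loss 5.7e-7 / cross-rank consistent; **memory flat**: same effective batch 64, big-batch 27.7GB→accum(4×16) **7.2GB (−74%)** (big-batch OOMs where accum fits); full regression + xserv closed-loop md5 identical |
 | T18 | Algorithm | **dropout** (hand-written counter-based device RNG → Bernoulli mask; inverted 1/(1-p) scaling in training, identity in eval); new autodiff `dropout` op (fwd generates and applies the mask, bwd reuses the same mask), wired in at two sites, residual and ffn; `--dropout` flag defaults to 0 | Fixed-seed grad-check passes; E[out]≈input + keep≈1-p; **p=0 bit-identical to no dropout**; gradients stay bit-identical when combined with recompute (T13) (the counter-based seed reproduces the same mask on recompute); full regression + xserv closed loop green (dropout off for export/inference) |
+| T17 | Infra | **process-per-GPU** (torchrun-style: `launch_processes` spawns one worker process per GPU = its own CUDA context; the launcher mints the `ncclUniqueId` once and **injects it hex-encoded into the child env** (no shared FS/TCP, no race); worker reads env → binds device → `DdpContext::init` + `build_model` + `train_rank`, **all reused from T8 with zero changes**; new `train_ddp_mp` bin / `ddp_proc` test, **the old thread-per-GPU path is kept**); scope = process-per-GPU only (ZeRO-1 dropped by the user) (Phase 2) | Correctness all green: proc vs single-GPU loss 5.67e-7, **proc vs thread-per-GPU 1.5e-7**, cross-rank 1.19e-7 (<1e-6), full regression + xserv closed-loop md5 bit-identical `b04fc9f9`. **⚠️ Key finding (measurement falsified the original hypothesis): process-per-GPU is throughput-neutral at this scale**: thread vs proc @ {1,2,4,8} = {1.00/1.61/2.98/**5.27**}× vs {1.00/1.60/2.94/**5.31**}× (difference <1%, within noise); all 8 GPUs at 95–99% util ⇒ the residual ~5.3×@8 non-linearity is the **NCCL all-reduce + local PCIe topology wall**, **not** single-CUDA-context serialization (the KI-5/T11 doc conjecture is overturned, same methodology as T11 falsifying "bucket the all-reduce"). Net value = landing the standard torchrun-style path + closing a misleading backlog item by measurement; default training path unchanged |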
 
 ---
 
@@ -55,7 +56,7 @@
 
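 - **Algorithm**: hand-written autograd (tape) + fan-out accumulation → AdamW/LR-sched/grad-clip → +QK-norm (Qwen3) → batched forward → bf16 mixed precision (fp32 master) → activation recompute (T13) → fused flash-attention (T14, online softmax + flash-style bwd) → gradient accumulation (T16, reuses tape SUM; equivalent to a big batch while memory tracks the micro-batch) → dropout (T18, counter-based device RNG + inverted scaling, train/eval switch).
 - **Model architecture**: fixed Qwen3-style; dim **32→256→384→512→768→1024** (v8 is the first move along the capacity axis, heads 24→32); core parameters **41K→226M** (total 3.26M→329M). +QK-norm (T9, Qwen3-compatible) → **true GQA (T15, `num_kv_heads<num_heads`, repeat_kv broadcast + in-group gradient sum; default=nh→MHA bit-exact regression)**, which brings the architecture up to the modern LLM baseline (MHA/GQA/MQA on a single `num_kv_heads` axis); both SDPA paths (composed/flash) share the same broadcast, and export writes the real `num_key_value_heads` with the xserv loop closed.
-- **Infra**: single-GPU fp32 → cuBLAS/GPU-optim (T7) → NCCL DDP (T8) → batched forward (T10) → caching allocator (T11) → bf16 (T12) → activation recompute (T13, unlocks dim1024) → flash-attention (T14, no materialized N×N; attention memory savings grow with seq) → gradient accumulation (T16, DDP communicates only at the accumulation boundary; memory tracks the micro-batch, not the effective batch). Throughput **3.3K→217K tok/s** (dim768 bf16), dim1024+recompute ~129K (the recompute tax); MFU **0.4%→17%** (each gain maps to one piece of perf infrastructure; see known-issues + the MFU analysis). T13/T14/T16 are three **memory levers** (recompute cuts peak activations, flash avoids materializing the N×N attention scores, gradient accumulation decouples effective batch from activation memory) that stack to enlarge the effective batch.
+- **Infra**: single-GPU fp32 → cuBLAS/GPU-optim (T7) → NCCL DDP (T8) → batched forward (T10) → caching allocator (T11) → bf16 (T12) → activation recompute (T13, unlocks dim1024) → flash-attention (T14, no materialized N×N; attention memory savings grow with seq) → gradient accumulation (T16, DDP communicates only at the accumulation boundary; memory tracks the micro-batch, not the effective batch) → process-per-GPU (T17, torchrun-style separate processes / CUDA contexts, reusing T8's train_rank with zero changes). Throughput **3.3K→217K tok/s** (dim768 bf16), dim1024+recompute ~129K (the recompute tax); MFU **0.4%→17%** (each gain maps to one piece of perf infrastructure; see known-issues + the MFU analysis). T13/T14/T16 are three **memory levers** (recompute cuts peak activations, flash avoids materializing the N×N attention scores, gradient accumulation decouples effective batch from activation memory) that stack to enlarge the effective batch. **T17 measurement = a negative result on the books**: process-per-GPU is throughput-**neutral** at this scale (thread ~5.27× vs proc ~5.31×@8, difference <1%, within noise), all 8 GPUs at 95–99% util ⇒ the residual non-linearity is the NCCL/PCIe communication wall, **not** single-context serialization. Measurement overturns the conjecture, long parked in the KI-5/T11 doc, that "process-per-GPU is the fix for the residual serialization" (same methodology as T11 falsifying "bucket the all-reduce").
 - **Dataset**: TinyStories 3MB slice → full TinyStories (epoch 0.01→5.33, **to saturation**) → **v6 graduates to real web text from FineWeb-edu** (2.255B-token corpus, 1.02ep) → **v7 more epochs on the same subset (1.45ep, near the ceiling) → v8 same subset with a larger model** (dim1024, 1.05ep). The tokenizer is gpt2 BPE throughout (reusing xserv-tokenizer; v6 deliberately keeps the tokenizer to isolate the "data source" variable, with KI-4 left for a later version).
   - **The qualitative shift on the data axis, v5→v6**: v0–v5 all trained on synthetic children's stories (TinyStories: low entropy, controlled vocabulary), and v5 showed that a model of this size had saturated on it; v6 is the first version to switch to **real educational web text** (FineWeb-edu), a qualitative change in the kind of language: samples go from "can only write little stories" to "can write history/science/expository text".
   - ⚠️ **More epochs on the same subset also hit a ceiling (v6→v7)**: v6 trained only 1.02ep with FineWeb val still falling monotonically at the last step, which was read as "not fed enough yet"; v7 took **the same 2.255B subset** to 1.45ep (~1B more tokens), and FineWeb val dropped only 0.05 (3.07→3.01), flattened after ~step44000, and samples showed no qualitative change ⇒ **this subset is near its ceiling at dim768**. This is **the same kind of phenomenon** as v5's TinyStories data saturation: **"re-feeding old data" has thin marginal returns, whether v5's extra epochs on the same corpus or v7's extra epochs on the same subset**. What actually raised the ceiling was v6's move to a broader new corpus: **the lever is "more diverse new tokens", not "reading the same data a few more times"**. To keep lowering val, the next step must add **new FineWeb shards** (more diverse, non-repeating), not more epochs on the same subset.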
@@ -66,5 +67,6 @@
 ## 4. Perf lever ledger (see [known-issues.md](known-issues.md) for details)
 
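 - **Fixed**: KI-1 single-sequence launch-bound (T10) · KI-5 per-op cudaMalloc serialization (T11) · KI-2 bf16/OOM (T12) · KI-3 activation recompute (T13, unlocks dim1024, used by v8).
-- **To do**: KI-4 large vocab table / small vocab · process-per-GPU (when higher multi-GPU linearity is needed).
-- Twice, "profile before acting" falsified a wrong proposed fix (KI-1 "increase the batch", KI-5 "bucket the all-reduce"), avoiding large changes with no payoff: profile-first.
+- **Closed by measurement (negative result)**: process-per-GPU (T17), once parked in the KI-5/T11 doc as the proposed fix for the residual non-linearity; T17 measured it as **throughput-neutral** (thread ~5.27× vs proc ~5.31×@8, all 8 GPUs saturated), and the residual is the NCCL/PCIe communication wall, not context serialization → no longer a perf to-do; the path itself has landed and stays as an optional route.
+- **To do**: KI-4 large vocab table / small vocab (an accepted modeling trade-off) · higher multi-GPU linearity → all-reduce overlap / NVLink interconnect (not a priority at this scale).
+- **Three times, "profile/measure before acting" falsified a wrong proposed fix** (KI-1 "increase the batch", KI-5 "bucket the all-reduce", T17 "process-per-GPU fixes the residual serialization"), avoiding large changes with no payoff: profile/measure-first.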
diff --git a/docs/known-issues.md b/docs/known-issues.md
index bb3f120..1cd4d4e 100644
--- a/docs/known-issues.md
+++ b/docs/known-issues.md
@@ -13,6 +13,26 @@ _(KI-1 fixed in T10. KI-5 fixed in T11. KI-2 fixed in T12. **KI-3（激活重计
 
 ## Fixed
 
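+### process-per-GPU (torchrun-style separate CUDA contexts) — `CLOSED / measured negative result` (T17)
+- **Background**: after KI-5 (T11) fixed per-op `cudaMalloc` serialization, 8-GPU scaling recovered from ~1.3× to **~5×@8**, but the residual ~5×@8 is still short of perfectly linear. The T11 doc / KI-5 "Residual" note speculated that the next step was **process-per-GPU** (one process + one CUDA context per rank, torchrun-style), on the grounds that "N rank threads share a single CUDA primary context, so kernel-launch/cuBLAS still serialize at the context level". **T17 landed this torchrun-style path, measured it, and falsified that speculation.**
+- **Implementation ([docs/16-process-per-gpu.md](16-process-per-gpu.md))**: `xtrain-distributed` gains `proc.rs`: `launch_processes` spawns one worker process per GPU (re-exec of current_exe + `XTRAIN_{RANK,WORLD,LOCAL_RANK,NCCL_ID}` env); **the launcher mints the `ncclUniqueId` once and injects it hex-encoded into the child env** (no shared FS/TCP, no polling, no race: the id is atomically ready before the child is born); worker reads env → binds device (its own CUDA context) → `DdpContext::init` + `build_model` + `train_rank`, **all reused from T8 with zero changes**. New `train_ddp_mp` bin + `ddp_proc` test; **the old thread-per-GPU path is kept** (regression baseline). scope = process-per-GPU only (ZeRO-1 dropped by the user).
+- **Correctness (all green, no regressions)**: proc vs single-GPU loss `5.67e-7`, **proc vs thread-per-GPU `1.5e-7`** (the two paths agree to the same order of magnitude), cross-rank `1.19e-7` (<1e-6); full regression suite green with `--test-threads=1` + **xserv closed-loop v3 re-export md5 bit-identical `b04fc9f9`** (T17 touches no numerical path).
+- **Measured results (key; dash5 8× RTX 5090, dim384, per-rank batch32, seq256, steady-state tok/s)**: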
+
+  | world | thread-per-GPU (`train_ddp`) | speedup | process-per-GPU (`train_ddp_mp`) | speedup |
+  |---|---|---|---|---|
+  | 1 | 93257 | 1.00× | 92952 | 1.00× |
+  | 2 | 149747 | 1.61× | 148809 | 1.60× |
+  | 4 | 278276 | 2.98× | 273308 | 2.94× |
+  | 8 | **491360** | **5.27×** | **493128** | **5.31×** |
+
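+  (world=8, each repeated twice: thread 493671/493292, proc 491102/494123. **The two paths differ by <1%, within noise**.)
+- **Diagnosis (falsifies the original speculation)**: during the process-per-GPU world=8 run, `nvidia-smi` sampling showed **all 8 GPUs at 95–99% util** (~23GB per GPU). The GPUs are **already compute-bound and fully fed, not idling on serialization** (the KI-5 serialization symptom of "1–2 of 8 busy" was already cured by the T11 allocator). 8 GPUs saturated yet only 5.3× ⇒ the missing ~35% throughput goes to **per-step grad all-reduce + the communication overhead of this machine's PCIe-only topology at 8 ranks** (the "~7% all-reduce + PCIe headroom" layer T11 already pointed out, amplified at 8 GPUs), which separate contexts do not touch. **Conclusion: at this scale (dim384–1024, single node, 8× PCIe RTX 5090) the residual non-linearity is a communication/topology wall, not the launch model**; getting closer to linear requires all-reduce overlap / NVLink interconnect (not a priority at this scale).
+- **Consistent methodology**: T11 falsified "bucket the all-reduce" by measurement, and T17 falsified "process-per-GPU fixes the residual serialization" by measurement; both times measurement overturned the hypothesis instead of forcing the change through (profile/measure-first). **Net value**: landing the standard torchrun-style process-per-GPU path (the project's core goal of "learning the full training stack") + **closing this misleading backlog item by measurement**. **The default training path is unchanged** (thread-per-GPU); process-per-GPU stays on record as a parallel optional path.
+- **commit**: see the T17 commit chain (`distributed: process-per-GPU launcher + worker` / `distributed: train_ddp_mp bin` / `test: process-per-GPU DDP correctness` / design doc `docs: Phase T17 — process-per-GPU DDP design`).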
+
+---
+
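 ### KI-3 · Activation recompute (gradient checkpointing) — `FIXED` (T13)
 - **Trigger (surfaced by v8)**: scaling the capacity axis to dim1024 (core ~210M+) to test whether the model is capacity-limited. The autograd tape saves every intermediate activation for backward, and activation memory grows linearly with dim: dim768 bf16 batch32 is already at 31.1GB (the T12 sweet spot), and **dim1024 batch32 OOMs again** (measured hitting 32100/32607MiB → `OutOfMemory`).
 - **Design (per-block gradient checkpointing, opt-in, [docs/12-activation-recompute.md](12-activation-recompute.md))**: adds the higher-order primitive `xtrain_autodiff::checkpoint(segment_fn, input, params)` (analogous to `torch.utils.checkpoint`). **Forward**: detach input/params into local leaves, run `segment_fn`, keep only the output value, and drop the local tape immediately → the segment's activations are freed (not retained on the outer tape); the checkpoint node has parents=[input, ..params]. **Backward**: re-run `segment_fn` from the saved input + the unchanged param values to rebuild the local tape, backpropagate seeded with the upstream grad (`Var::backward_seeded`, newly added because the segment output is not a scalar), push the recovered input/param gradients to the real parents, then drop the local tape → the recomputed activations are freed. Each transformer block's forward is wrapped with it in the model (`--recompute` flag, off by default). Granularity = per block.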
@@ -68,7 +88,7 @@ _(KI-1 fixed in T10. KI-5 fixed in T11. KI-2 fixed in T12. **KI-3（激活重计
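   → **single-GPU 40226→92638 tok/s (~2.3×)**; **8-GPU 49K→461K tok/s (9.4×)**, scaling recovered from a ~1.3× cap to **~5×@8**; 8-GPU `nvidia-smi` sampling shows **all 8 GPUs at 95–99% util** (only 1–2 of 8 busy under KI-5). The loss trajectory matches bit-for-bit (single-GPU 10.9026→4.8453, identical before/after).
 - **Correctness (all green, no regressions)**: 15 op grad-checks, 5 structural, GEMM vs cuBLAS, batched==looped, overfit 27/27, AdamW GPU bit-exact + host vs torch, checkpoint bit-exact, DDP loss vs single-GPU **5.67e-7** + cross-rank diff 0.0 (loosened to `<1e-6`), **xserv closed loop** (v3 ckpt re-exported safetensors md5 bit-identical to the registry + xserv loads and serves it with greedy "Once upon a time," matching).
 - **While at it**: the cross-rank check in DDP `ddp_correctness` goes from `==0.0` → `<1e-6` (on this machine's PCIe-only NCCL, cross-rank results are not bit-reproducible run to run; diff≤1.2e-7 is a few ULPs and harmless; the load-bearing gate is the loss match of 5.67e-7); `ddp_throughput_scaling` extended to world=8.
-- **Residual**: ~5×@8, short of perfectly linear (grad all-reduce ~7% + 8-GPU PCIe/launch headroom), but the weak-scaling cliff is gone. If v4 needs higher linearity, the next step is **process-per-GPU** (one CUDA context per rank, torchrun-style).
+- **Residual**: ~5×@8, short of perfectly linear (grad all-reduce ~7% + 8-GPU PCIe/launch headroom), but the weak-scaling cliff is gone. The next step was once thought to be **process-per-GPU** (one CUDA context per rank, torchrun-style), but **T17 measurement falsified that direction** (see the "process-per-GPU (T17)" entry above): the residual is the **communication/PCIe wall**, not launch/cuBLAS serialization within a single CUDA context.
 - **commit**: see the T11 commit chain (`cuda: device caching allocator` / the `perf: KI-5 …` one that carries the before/after).
 - **The historical diagnosis is kept below** (the process of falsifying "bucket the all-reduce"):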