From f85bd4d27618fc3ad39b774eedbfed19d5818cb8 Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Tue, 16 Jun 2026 11:15:02 +0800
Subject: [PATCH] =?UTF-8?q?perf:=20KI-5=20FIXED=20=E2=80=94=20single-GPU?=
 =?UTF-8?q?=2040K->93K=20tok/s,=20DDP=20scaling=201.3x->5x@8?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Device caching/pool allocator removes the per-op cudaMalloc serialization that
was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX
5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s):

  single-GPU: 40226 -> 92638 tok/s  (~2.3x)
  DDP scaling (global batch 32*world):
    world  before        after
      1    39801 1.00x    92385 1.00x
      2    47229 1.19x   146821 1.59x
      4    52854 1.33x   269867 2.92x
      8    48996 1.23x   461270 4.99x

8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs
at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are
bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export
of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it.

Mark KI-5 FIXED in docs/known-issues.md with a before/after table; fill in the
design doc's measured numbers. The residual gap (~5x@8, not perfectly linear)
comes from the ~7% all-reduce plus 8-GPU PCIe/launch overhead; process-per-GPU
is the next lever if v4 needs higher linearity.
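
For readers new to the idea, a minimal host-only sketch of a per-device,
size-classed free-list pool follows. It is not the xtrain-cuda code: the names
(size_class, DevicePool, Registry, PooledBuffer, raw_allocs_after), the
power-of-two classes with a 256-byte floor, and the Vec<u8> standing in for
device memory are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical size-class rule: next power of two, at least 256 bytes.
fn size_class(len: usize) -> usize {
    len.max(256).next_power_of_two()
}

/// Free lists for one device, keyed by size class.
/// `Vec<u8>` stands in for a device allocation (assumption: no real CUDA here).
#[derive(Default)]
struct DevicePool {
    free: HashMap<usize, Vec<Vec<u8>>>,
    raw_allocs: usize, // times we fell through to the simulated driver malloc
}

/// Global registry: device id -> that device's pool behind its own Mutex.
#[derive(Default)]
struct Registry {
    pools: Mutex<HashMap<u32, Arc<Mutex<DevicePool>>>>,
}

impl Registry {
    /// The registry lock is held only while cloning the Arc out.
    fn pool(&self, device: u32) -> Arc<Mutex<DevicePool>> {
        let mut pools = self.pools.lock().unwrap();
        Arc::clone(pools.entry(device).or_default())
    }
}

struct PooledBuffer {
    storage: Option<Vec<u8>>,     // physical capacity == size class
    len: usize,                   // requested length; bounds use this
    pool: Arc<Mutex<DevicePool>>, // pool of the device we were allocated on
}

impl PooledBuffer {
    fn alloc(reg: &Registry, device: u32, len: usize) -> Self {
        let pool = reg.pool(device);
        let class = size_class(len);
        let reused = {
            let mut p = pool.lock().unwrap();
            let hit = p.free.get_mut(&class).and_then(|v| v.pop());
            if hit.is_none() {
                p.raw_allocs += 1; // a miss: this is where cudaMalloc would go
            }
            hit
        };
        let mut storage = reused.unwrap_or_else(|| vec![0u8; class]);
        storage[..len].fill(0); // memset kept: a reused buffer has stale bytes
        PooledBuffer { storage: Some(storage), len, pool }
    }

    fn len(&self) -> usize {
        self.len
    }
}

impl Drop for PooledBuffer {
    /// Return the storage to the free list instead of freeing it.
    fn drop(&mut self) {
        if let Some(buf) = self.storage.take() {
            let class = buf.len();
            self.pool.lock().unwrap().free.entry(class).or_default().push(buf);
        }
    }
}

/// Simulate `steps` fixed-shape training steps on device 0 and report how
/// many simulated driver allocations happened in total.
fn raw_allocs_after(steps: usize) -> usize {
    let reg = Registry::default();
    for _ in 0..steps {
        let a = PooledBuffer::alloc(&reg, 0, 1000);
        let b = PooledBuffer::alloc(&reg, 0, 5000);
        debug_assert_eq!(a.len() + b.len(), 6000);
    } // a and b drop here and go back to their free lists
    let n = reg.pool(0).lock().unwrap().raw_allocs;
    n
}
```

With fixed shapes, only the first step misses; every later step is served from
the free lists, which is the behavior the numbers above rely on.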

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/10-caching-allocator.md | 30 ++++++++++++++++++++++++++++--
 docs/known-issues.md         | 32 ++++++++++++++++++++++++++++----
 2 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/docs/10-caching-allocator.md b/docs/10-caching-allocator.md
index cb6828d..ea13195 100644
--- a/docs/10-caching-allocator.md
+++ b/docs/10-caching-allocator.md
@@ -129,6 +129,32 @@ cuBLAS handle 仍串行），如实报告，并说明 **process-per-GPU**（每
   承重闸门是 loss-match（~5.7e-7）；本机 PCIe-only NCCL all-reduce run-to-run 跨 rank 非逐位可复现，
   diff ≤1.2e-7（几 ULP，数值无害）。`== 0.0` 过严 flaky。
 
-## Before → After
+## Before → After (dash5, 8× RTX 5090, sm_120)
 
-（dash5, 8× RTX 5090, 实测填入；见 known-issues.md KI-5 的 before/after 表与 commit。）
+Measured (`train_ddp`, dim384/12L/12h·hd32 ffn1536 core 28.3M, per-rank batch 32, seq 256,
+steady-state tok/s; before = parent `d422c68`, after = pooled):
+
+**Single GPU (KI-5 hypothesis: per-op malloc also costs the single-GPU case)**
+
+| | tok/s | GPU util |
+|---|---|---|
+| before | 40226 | 8 GPUs busy in turns, 1–2/8 |
+| after | **92638** | — |
+
+→ Single GPU **~2.3×**; loss trajectories match bit for bit (10.9026→4.8453, identical before/after).
+
+**DDP scaling at 1/2/4/8 GPUs (global batch = 32×world)**
+
+| world | before tok/s | before speedup | after tok/s | after speedup |
+|---|---|---|---|---|
+| 1 | 39801 | 1.00× | 92385 | 1.00× |
+| 2 | 47229 | 1.19× | 146821 | 1.59× |
+| 4 | 52854 | 1.33× | 269867 | 2.92× |
+| 8 | 48996 | 1.23× | **461270** | **4.99×** |
+
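+→ 8-GPU absolute throughput **49K → 461K tok/s = 9.4×**; scaling recovers from a "~1.3× ceiling" to **~5×@8**.
+During the 8-GPU run, `nvidia-smi` sampling shows **all 8 GPUs at 95–99% util** (under KI-5 only 1–2/8 were busy):
+per-op cudaMalloc serialization was indeed the root cause; with the pool removing it, the GPUs become compute-bound and stay fed.
+
+**Residual**: 5×@8 is not perfectly linear (grad all-reduce ~7% + 8-GPU PCIe / launch overhead), but the weak-scaling cliff is gone.
+KI-5 is marked **FIXED**. If v4 needs higher linearity, only then is the next step process-per-GPU (an independent context per rank).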
diff --git a/docs/known-issues.md b/docs/known-issues.md
index fa16e80..b5058e5 100644
--- a/docs/known-issues.md
+++ b/docs/known-issues.md
@@ -7,9 +7,35 @@
 
 ## Open
 
-_(KI-1 fixed in T10. KI-5 仍 Open，但 T11 实测把根因从「all-reduce 未分桶」**改诊断**为「单进程多 rank 的逐 rank compute 互相串行」——见下。原拟修复（分桶 all-reduce）经实测证伪。)_
+_(KI-1 fixed in T10. KI-5 **FIXED** in T11: the device caching/pool allocator removes per-op cudaMalloc serialization; single GPU ~2.3×, DDP scaling recovers from a ~1.3× ceiling to ~5×@8. See Fixed below.)_
 
-### KI-5 · DDP 弱扩展性 — `P2` · 由 T10 暴露，T11 重新诊断（all-reduce **不是**瓶颈）
+---
+
+## Fixed
+
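+### KI-5 · DDP weak scaling: `FIXED` (T11, device caching/pool allocator)
+- **Root cause** (T11 re-diagnosis: all-reduce is **not** the bottleneck): every tape op's output goes through `Tensor::zeros`→`GpuBuffer::alloc`→`cudaMalloc` (a synchronous driver call serialized process-wide). With a single process and thread-per-GPU, the hundreds of allocs that each of the N ranks issues per step queue up serially in the single CUDA context (with `NOCOMM=1`, i.e. no communication at all, fwd+bwd still balloons 136→780ms, ~6×, and `nvidia-smi` sampling shows only 1–2 of the 8 GPUs busy at a time, taking turns); the single-GPU case pays this per-op alloc cost too.
+- **The originally planned fix, "bucketed all-reduce", was falsified by T11 measurements and reverted**: grad all-reduce takes only ~6–7% of each step, and fusing it into a single call makes almost no difference at 1/2/4/8 GPUs (see the historical diagnosis below).
+- **Fix**: `xtrain-cuda` gains a **device caching/pool allocator** ([docs/10-caching-allocator.md](10-caching-allocator.md)): `GpuBuffer::alloc` takes from a per-device, size-classed free-list and calls `cudaMalloc` only on a miss; `Drop` returns the buffer to the free-list (no `cudaFree`). Training shapes are fixed, so the hit rate is very high and `cudaMalloc` calls per step are ≈0 after warm-up. Thread safety: the global registry is bucketed by device id and each device's free-list has its own `Mutex` (the registry lock is held only briefly while cloning out the `Arc<Mutex<_>>`, so devices run truly concurrently); each buffer records the device it was allocated on, and Drop returns it to the matching pool. **Transparent**: the physical capacity may be rounded up, but `len()`/memset/copy bounds all use the requested `len`, so the tail bytes are never read and results are bit-identical. memset is kept (a reused buffer holds stale bytes); skip-memset (uninit) is out of scope here (malloc was the bottleneck, async memset is cheap, and proving full coverage op by op is risky).
+- **before → after** (dash5, 8× RTX 5090, dim384/12L per-rank batch 32 seq 256, steady-state tok/s; before=`d422c68` after=pooled):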
+
+  | world | before tok/s | before speedup | after tok/s | after speedup |
+  |---|---|---|---|---|
+  | 1 | 39801 | 1.00× | **92385** | 1.00× |
+  | 2 | 47229 | 1.19× | 146821 | 1.59× |
+  | 4 | 52854 | 1.33× | 269867 | 2.92× |
+  | 8 | 48996 | 1.23× | **461270** | **4.99×** |
+
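+  → **Single GPU 40226→92638 tok/s (~2.3×)**; **8 GPUs 49K→461K tok/s (9.4×)**, scaling recovers from a ~1.3× ceiling to **~5×@8**; 8-GPU `nvidia-smi` sampling shows **all 8 GPUs at 95–99% util** (under KI-5 only 1–2/8 were busy). Loss trajectories match bit for bit (single GPU 10.9026→4.8453, identical before/after).
+- **Correctness (all green, no regressions)**: 15-operator grad-check, 5 structure checks, GEMM vs cuBLAS, batched==looped, overfit 27/27, AdamW GPU bit-exact + host vs torch, checkpoint bit-exact, DDP loss vs single GPU **5.67e-7** + cross-rank diff 0.0 (loosened `<1e-6`), **xserv closed loop** (re-exporting the v3 ckpt to safetensors is md5-identical to the registry copy + xserv loads and serves it, greedy "Once upon a time," matches).
+- **In passing**: the cross-rank check in DDP `ddp_correctness` changes from `==0.0` to `<1e-6` (on this machine PCIe-only NCCL is not bit-reproducible across ranks run-to-run; diff≤1.2e-7, a few ULP, is harmless; the load-bearing gate is loss-match 5.67e-7); `ddp_throughput_scaling` is extended to world=8.
+- **Residual**: ~5×@8 is not perfectly linear (grad all-reduce ~7% + 8-GPU PCIe/launch overhead), but the weak-scaling cliff is gone. If v4 needs higher linearity, the next step is **process-per-GPU** (an independent CUDA context per rank, torchrun-style).
+- **commit**: see the T11 commit chain (`cuda: device caching allocator` / `perf: KI-5 …`, the one carrying the before/after table).
+- **Historical diagnosis kept below** (how "bucketed all-reduce" was falsified):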
+
+---
+
+### KI-5 historical diagnosis · DDP weak scaling: exposed by T10, re-diagnosed in T11 (all-reduce is **not** the bottleneck)
 - **现象**：batched forward 修掉单卡 launch-bound 后，dim384/per-rank batch 32：1 卡 40.3K → 4 卡 47.2K tok/s（global），仅 ~1.17×。
 - **T11 实测（dash5, 8× RTX 5090, dim384/12L, per-rank batch 32, seq 256, 原 ungrouped all-reduce, 50 步均, ms/step）**：
 
@@ -29,8 +55,6 @@ _(KI-1 fixed in T10. KI-5 仍 Open，但 T11 实测把根因从「all-reduce 未
 
 ---
 
-## Fixed
-
 ### KI-1 · 单序列 launch-bound（"DDP 弱扩展性"的根因）— `FIXED` (T10, batched forward)
 - **修复**：T10 给 model + autograd 加 batch 维——linears 摊平成 `[B*S, dim]` 一个大 GEMM 填满 GPU；attention 走 fused 批量 SDPA（`cublasSgemmStridedBatched` ×2 + 一个 causal-softmax kernel），RoPE 位置 per-sequence 复位（`row % S`）；训练 loop 用真 batch 一次 forward/backward 替代 "loop B 次 + SUM"。详见 [docs/09-batched-forward.md](09-batched-forward.md)。
 - **before → after**（dim384/12L/12h, batch 16, seq 256, 1 卡, back-to-back A/B）：