Gahow Wang f85bd4d276 perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8

Device caching/pool allocator removes the per-op cudaMalloc serialization that
was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX
5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s):

  single-GPU: 40226 -> 92638 tok/s  (~2.3x)
  DDP scaling (global batch 32*world):
    world  before        after
      1    39801 1.00x    92385 1.00x
      2    47229 1.19x   146821 1.59x
      4    52854 1.33x   269867 2.92x
      8    48996 1.23x   461270 4.99x

8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs
at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are
bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export
of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it.

Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the
design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the
~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever
if v4 needs higher linearity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-16 11:15:02 +08:00


Phase T11: Device Caching / Pool Allocator — Design Document

Goal

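Fix the root cause of KI-5. After T10 removed the single-GPU launch-bound bottleneck (1653→40K tok/s), multi-GPU DDP still showed only ~1.4× weak scaling. The first fix proposed for T11 (bucketed all-reduce) was falsified by measurements on dash5 and reverted: the gradient all-reduce takes only ~6–7% of each step, and fusing it into a single call made almost no difference at 1/2/4/8 GPUs (see the KI-5 table in docs/known-issues.md).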

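The root cause, as relocated by measurement: every tape op's output goes through Tensor::zeros → GpuBuffer::alloc → cudaMalloc + cudaMemset. cudaMalloc/cudaFree are synchronous driver calls that serialize process-wide. Under the single-process, thread-per-GPU DDP model, the N rank threads each issue several hundred allocs per step, and these queue up and serialize against each other inside a single CUDA context. (With NOCOMM=1, i.e. no communication at all, fwd+bwd still inflates ~6×, from 136 to 780 ms, and nvidia-smi sampling shows only 1–2 of the 8 GPUs busy at any given moment, taking turns.) A single GPU pays this per-op alloc cost too: training uses fixed shapes, so every step mallocs and frees the same few hundred buffers, which is pure waste.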

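The T11 fix: add a device caching / pool allocator to xtrain-cuda (where GpuBuffer/cudaMalloc/cudaFree live). Freed device memory goes into a per-device, size-classed free-list for reuse instead of being passed to cudaFree; alloc takes from the free-list first and calls cudaMalloc only on a miss. Because training shapes are fixed, the hit rate is very high: after warm-up, cudaMalloc calls per step ≈ 0, which eliminates the storm of serialized driver calls.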
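
The reuse idea can be sketched on the host, with a counter standing in for real cudaMalloc calls. This is only an illustration under stated assumptions: `MockPool`, `acquire`, `release`, `driver_allocs`, and `simulate` are hypothetical names rather than the xtrain-cuda API, a single device is assumed, and blocks are keyed by exact size (no size classes).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one device's pool. Addresses are plain integers
/// so the sketch runs without a GPU.
#[derive(Default)]
pub struct MockPool {
    free: HashMap<usize, Vec<usize>>, // size -> cached addresses
    next_addr: usize,                 // fake address generator
    pub driver_allocs: usize,         // how many times "cudaMalloc" would run
}

impl MockPool {
    /// Reuse a cached block of this size if one exists; otherwise "cudaMalloc".
    pub fn acquire(&mut self, size: usize) -> usize {
        if let Some(addr) = self.free.get_mut(&size).and_then(|v| v.pop()) {
            return addr; // hit: no driver call
        }
        self.driver_allocs += 1; // miss: this is where cudaMalloc would go
        self.next_addr += size.max(1);
        self.next_addr
    }

    /// Return a block to the free-list instead of calling "cudaFree".
    pub fn release(&mut self, addr: usize, size: usize) {
        self.free.entry(size).or_default().push(addr);
    }
}

/// Run `steps` fixed-shape steps and return the cumulative driver-call count
/// observed after each step.
pub fn simulate(steps: usize, shapes: &[usize]) -> Vec<usize> {
    let mut pool = MockPool::default();
    let mut counts = Vec::new();
    for _ in 0..steps {
        // Allocate every buffer of the step, then free them all at step end.
        let live: Vec<(usize, usize)> =
            shapes.iter().map(|&s| (pool.acquire(s), s)).collect();
        for (addr, size) in live {
            pool.release(addr, size);
        }
        counts.push(pool.driver_allocs);
    }
    counts
}
```

With three buffers per step, the cumulative count stops growing after the first step: every later `acquire` is served from the free-list.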

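Correctness is the hard gate: the allocator must be transparent. The bytes and values it hands out must be bit-identical to those before the change, and every existing check (grad-check, PyTorch comparison, overfit, DDP, xserv closed loop) must still pass. The throughput gain is collected only on top of that.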

Module Layout

crates/xtrain-cuda/src/
  pool.rs      ← new: global per-device free-list registry + size-class logic
  memory.rs    ← GpuBuffer::alloc takes from the pool; Drop returns to the pool (no cudaFree)
  ffi.rs       ← add cudaGetDevice (Drop needs to know which device pool a buffer belongs to)
  lib.rs       ← `mod pool;`

xtrain-tensor is unchanged: Storage::zeros still does GpuBuffer::alloc + memset(0), with the same signature. The pool is hidden entirely behind GpuBuffer, so the upper layers are unaffected.

Key Design Decisions

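1. Size class (round up by granularity → reusable across steps)

The requested byte count is rounded up to a size class, so outputs of same-shaped ops land in the same free-list and are reused across steps: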

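const MIN_CLASS: usize = 512;          // alignment granularity for small allocations
const POW2_THRESHOLD: usize = 1 << 20; // 1 MiB

fn size_class(len: usize) -> usize {
    if len <= POW2_THRESHOLD {
        len.div_ceil(MIN_CLASS) * MIN_CLASS // fine-grained: wastes < 512 B
    } else {
        len.next_power_of_two() // coarse-grained: bounded number of classes
    }
}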

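Small allocations are rounded up to a multiple of 512 B (negligible waste). Large allocations are rounded up to a power of two: the number of classes is bounded, so the free-lists stay shallow. This wastes up to ~2× in the worst case, but the memory is reused rather than leaked, and fixed-shape training has only a handful of large-buffer classes.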
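
A few worked values make the two regimes concrete. The block below restates the rounding rule so it is self-contained; `waste` is a hypothetical helper added for illustration, and the 12 MiB figure assumes an f32 activation of shape 32 × 256 × 384, which the document does not state.

```rust
const MIN_CLASS: usize = 512;
const POW2_THRESHOLD: usize = 1 << 20; // 1 MiB

/// Round a request up to its size class, following the rule described above.
fn size_class(len: usize) -> usize {
    if len <= POW2_THRESHOLD {
        len.div_ceil(MIN_CLASS) * MIN_CLASS
    } else {
        len.next_power_of_two()
    }
}

/// Bytes allocated but never used for a request of `len` bytes
/// (hypothetical helper, not part of the design).
fn waste(len: usize) -> usize {
    size_class(len) - len
}

/// Assumed example: an f32 tensor of shape [32, 256, 384] (12 MiB).
fn example_activation_bytes() -> usize {
    32 * 256 * 384 * std::mem::size_of::<f32>()
}
```

Under these assumptions a 1-byte request occupies 512 B, a request of exactly 1 MiB stays at 1 MiB, one byte more jumps to 2 MiB, and the 12 MiB activation lands in the 16 MiB class.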

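Key to transparency: the physical allocation is cap bytes (after rounding), but GpuBuffer::len() still returns the requested len:

memset(0) zeroes only the logical len bytes (not cap);
all copies (H2D/D2H) are bounds-checked against len, and a D2H copy back to the host copies only len bytes;
op kernels read and write only according to shape (= len).

→ The cap - len tail bytes are never read by anyone, so the round-up is fully transparent to the numerics.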
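
The len-versus-cap rule can be modelled on the host. In this sketch a `Vec<u8>` plays the role of the device allocation; `MockBuffer`, `zeros_from_recycled`, and `to_host` are hypothetical names chosen for illustration and are not the real GpuBuffer interface.

```rust
/// Hypothetical host-side model of a pooled buffer: `storage` plays the role
/// of the device allocation (cap bytes), `len` is the requested size.
pub struct MockBuffer {
    storage: Vec<u8>,
    len: usize,
}

impl MockBuffer {
    /// Wrap a recycled block (which may hold stale bytes) and zero only the
    /// logical `len` bytes, mirroring the memset(0) rule.
    pub fn zeros_from_recycled(mut storage: Vec<u8>, len: usize) -> Self {
        assert!(len <= storage.len(), "request must fit in the size class");
        storage[..len].fill(0);
        MockBuffer { storage, len }
    }

    /// Logical size: what the caller asked for.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Physical size: the rounded-up size class.
    pub fn cap(&self) -> usize {
        self.storage.len()
    }

    /// Model of a D2H copy: bounded by `len`, never by `cap`.
    pub fn to_host(&self) -> Vec<u8> {
        self.storage[..self.len].to_vec()
    }
}
```

A 600-byte request served from a stale 1024-byte block reports `len() == 600`, and the host copy contains only zeros; the 424 stale tail bytes stay in the allocation but are never observed.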

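2. Per-device + thread-safe (DDP thread-per-GPU)

DDP is single-process, thread-per-GPU, so the pool must be thread-safe across rank threads, and it must not make threads of different devices serialize against each other (otherwise the problem is not solved):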

global REGISTRY: Mutex<HashMap<device_id, Arc<Mutex<DevicePool>>>>
DevicePool { free: HashMap<size_class, Vec<*mut u8>> }

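Two-level locking: the registry lock is held only for the instant it takes to look up (or insert on first use) the device's Arc<Mutex<DevicePool>> by device_id. The Arc is cloned out immediately, the registry lock is released, and only then is that device's own pool locked. → Rank threads of different devices each lock their own pool and run truly concurrently; the registry lock is just a very short table lookup.
At alloc time a buffer records the calling thread's current CUDA device (cudaGetDevice; each DDP rank thread sets its device once at startup) and stores it in GpuBuffer.device. Drop returns the buffer by that device, so the ptr goes back to the pool of the context it belongs to (this holds even if the drop happens on another device's thread).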
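
One way the two-level locking could look in Rust is sketched below. It is an assumption-laden illustration, not the contents of pool.rs: device ids are taken to be `i32`, device addresses are stored as `usize` (so the pool is `Send` without unsafe code), and `pool_for`, `acquire`, and `release` are hypothetical signatures. A miss returns `None`, which is where a real allocator would fall back to cudaMalloc.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

#[derive(Default)]
pub struct DevicePool {
    // size class -> cached device addresses
    free: HashMap<usize, Vec<usize>>,
}

type Registry = Mutex<HashMap<i32, Arc<Mutex<DevicePool>>>>;
static REGISTRY: OnceLock<Registry> = OnceLock::new();

/// Level 1: hold the registry lock only long enough to clone the Arc.
fn pool_for(device: i32) -> Arc<Mutex<DevicePool>> {
    let registry = REGISTRY.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = registry.lock().unwrap();
    let pool = map.entry(device).or_default().clone();
    drop(map); // registry lock released before any pool lock is taken
    pool
}

/// Level 2: lock only this device's pool and try to pop a cached block.
pub fn acquire(device: i32, class: usize) -> Option<usize> {
    let pool = pool_for(device);
    let mut guard = pool.lock().unwrap();
    let hit = guard.free.get_mut(&class).and_then(|v| v.pop());
    hit
}

/// Return a block to the pool of the device it was allocated on.
pub fn release(device: i32, class: usize, addr: usize) {
    let pool = pool_for(device);
    pool.lock().unwrap().free.entry(class).or_default().push(addr);
}
```

Threads working on different device ids contend only on the brief registry lookup; the free-list operations themselves take per-device locks.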

3. Drop → return to pool (no cudaFree)

impl Drop for GpuBuffer {
    fn drop(&mut self) { pool::release(self.ptr, self.device, self.cap); }
}

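The free-list is unbounded (lightweight, no eviction). The working set of fixed-shape training is bounded and every step reuses the same set of buffers, so free-list depth converges naturally and does not grow without limit. Pointers held by the pool live until process exit, at which point the OS reclaims the entire device context, so this is not a leak.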

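Double-free / leak boundary audit: GpuBuffer does not implement Clone and owns its ptr exclusively. Storage shares it via Arc<GpuBuffer>; when the last Arc goes away the buffer is dropped exactly once → released exactly once. acquire pops one ptr from the free-list and hands it to exactly one new GpuBuffer, so there is no aliasing. Hence no double free and no aliasing.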
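
The exactly-once argument rests on standard `Arc` semantics, which the following sketch exercises. `MockGpuBuffer` and `releases_after_sharing` are hypothetical; the atomic counter stands in for the call to pool::release.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical buffer: deliberately not Clone, so it owns `ptr` exclusively.
pub struct MockGpuBuffer {
    pub ptr: usize,
    releases: Arc<AtomicUsize>, // stands in for pool::release bookkeeping
}

impl Drop for MockGpuBuffer {
    fn drop(&mut self) {
        // The real Drop would hand `ptr` back to the pool here.
        self.releases.fetch_add(1, Ordering::SeqCst);
    }
}

/// Share one buffer across `n` extra handles, drop them all, and report how
/// many releases happened.
pub fn releases_after_sharing(n: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let first = Arc::new(MockGpuBuffer {
        ptr: 0x1000,
        releases: Arc::clone(&counter),
    });
    let clones: Vec<Arc<MockGpuBuffer>> =
        (0..n).map(|_| Arc::clone(&first)).collect();
    drop(first); // if clones exist, the buffer is still alive here
    drop(clones); // last handle gone: Drop has run exactly once by now
    counter.load(Ordering::SeqCst)
}
```

However many handles share the buffer, the count comes out as one release.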

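4. memset: kept (correctness first); no skip-memset uninit

A buffer reused by Storage::zeros holds stale bytes, so memset(0) stays (correctness).

The OPTIONAL bonus in the task (an uninit/skip-memset path for ops that fully overwrite their output) is not done this time. The honest reasons:

What actually serializes is cudaMalloc, which the pool has already eliminated; cudaMemset is async on the default stream and cheap.
Skipping requires proving, op by op, that the output is fully overwritten. matmul (beta=0, writes everything) could skip, but embedding_bwd (scatter-add), sumsq_accum/sum_rows (accumulators), and adamw (reads and writes m/v) must be pre-zeroed. The audit surface is large, the gain small, and the correctness risk high.
Correctness is the hard gate; it is not worth risking for an async memset that is no longer the bottleneck. Deferred as follow-up work (to be done if profiling shows memset becoming the new bottleneck).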
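
The distinction between ops that may skip the memset and ops that may not can be shown with two toy host functions. These are hypothetical stand-ins for the kernels named above, not their implementations: `overwrite_op` models a full-overwrite op such as a GEMM with beta = 0, and `scatter_add` models an accumulating op.

```rust
/// Full-overwrite op: every output element is written, so whatever the
/// recycled buffer held before is irrelevant.
pub fn overwrite_op(out: &mut [f32], src: &[f32]) {
    out.copy_from_slice(src);
}

/// Accumulating op: adds into whatever is already in `out`, so stale bytes
/// leak into the result unless the buffer was zeroed first.
pub fn scatter_add(out: &mut [f32], indices: &[usize], values: &[f32]) {
    for (&i, &v) in indices.iter().zip(values) {
        out[i] += v;
    }
}
```

Run on a buffer pre-filled with a stale value, `overwrite_op` gives the same answer as on a zeroed buffer, while `scatter_add` gives a different one, which is why the latter class needs the memset.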

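Verification (two gates)

Gate 1: correctness (transparent, zero regressions)

The allocator changes no numerical values. The full regression suite must stay green:

T3 GEMM vs cuBLAS; T4 finite-difference grad-check for each op (15 of them);
T5 structure + overfit (27/27) + PyTorch comparison (B>1, logits and per-parameter grads);
T6 AdamW vs torch + bit-exact checkpoint;
T8 DDP loss vs single-GPU (~5.7e-7) + cross-rank consistency; T10 batched==looped;
xserv closed loop: exported weights still match xtrain greedy decoding token for token.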

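Gate 2: throughput (the gain)

Single-GPU tok/s before/after (should rise once the malloc storm is gone) + GPU util;
DDP 1/2/4/8-GPU scaling before/after (the table from the KI-5 investigation); extend the ddp_throughput_scaling test to world=8.

Honesty principle: if single-GPU speeds up but multi-GPU is still limited, the serialization runs deeper than malloc (for example, kernel launches or cuBLAS handles still serializing under a single context). Report that as-is, and state whether process-per-GPU (an independent context per rank, torchrun-style) is the remaining fix direction (confirmed by profiling, as in the previous two investigations).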

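Housekeeping

Relax the flaky DDP assertion: in ddp_correctness, cross-rank max|p0−p1| == 0.0 → < 1e-6. The load-bearing gate is loss-match (~5.7e-7). On this machine, PCIe-only NCCL all-reduce is not bit-reproducible across ranks from run to run; the diff is ≤1.2e-7 (a few ULP, numerically harmless). == 0.0 is too strict and flaky.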
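
A tolerance-based comparison of this kind might look as follows. `max_abs_diff` is a hypothetical helper, not the actual ddp_correctness test code; only the 1e-6 bound comes from the text above.

```rust
/// Largest absolute element-wise difference between two parameter vectors,
/// e.g. rank 0's and rank 1's copy of the same weights.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "ranks must hold the same parameter count");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0, f32::max)
}

/// Relaxed cross-rank check: passes for differences of a few ULP, which an
/// exact `== 0.0` comparison would reject.
pub fn ranks_agree(p0: &[f32], p1: &[f32]) -> bool {
    max_abs_diff(p0, p1) < 1e-6
}
```

A difference on the order of 1e-7 passes the relaxed check, while a genuine divergence such as 1e-3 still fails it.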

Before → After (dash5, 8× RTX 5090, sm_120)

Measured (train_ddp, dim384/12L/12h·hd32 ffn1536 core 28.3M, per-rank batch 32, seq 256, steady-state tok/s; before = parent d422c68, after = pooled):

Single GPU (KI-5 hypothesis: a single GPU also pays the per-op malloc cost)

	tok/s	GPU util
before	40226	8 GPUs take turns; 1–2 of 8 busy
after	92638	—

→ Single GPU ~2.3×; the loss trajectory matches bit for bit (10.9026→4.8453, identical before and after).

DDP 1/2/4/8-GPU scaling (global batch = 32×world)

world	before tok/s	before speedup	after tok/s	after speedup
1	39801	1.00×	92385	1.00×
2	47229	1.19×	146821	1.59×
4	52854	1.33×	269867	2.92×
8	48996	1.23×	461270	4.99×

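→ 8-GPU absolute throughput 49K → 461K tok/s = 9.4×; scaling recovers from a ~1.3× cap to ~5×@8. During the 8-GPU run, nvidia-smi sampling shows all 8 GPUs at 95–99% util (under KI-5 only 1–2 of 8 were busy). Per-op cudaMalloc serialization was indeed the root cause; with the pool removing it, the GPUs become compute-bound and stay fully fed.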
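
The headline ratios follow directly from the table. The helpers below are hypothetical and only restate the arithmetic; "efficiency" (speedup divided by world size) is a derived figure that the table itself does not list.

```rust
/// Speedup of `tok_s` relative to a baseline throughput.
pub fn speedup(tok_s: f64, baseline: f64) -> f64 {
    tok_s / baseline
}

/// Scaling efficiency: achieved speedup divided by the ideal (world size).
pub fn efficiency(tok_s: f64, baseline: f64, world: u32) -> f64 {
    speedup(tok_s, baseline) / f64::from(world)
}
```

Plugging in the measured numbers: 461270 / 92385 ≈ 4.99 (after, world 8 vs world 1), 461270 / 48996 ≈ 9.4 (after vs before at world 8), and 4.99 / 8 ≈ 0.62, i.e. roughly 62% scaling efficiency at 8 GPUs, compared with about 15% (1.23 / 8) before the change.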

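Residual: 5×@8 is not perfectly linear (grad all-reduce ~7% + 8-GPU PCIe / launch overhead), but the weak-scaling cliff is gone. KI-5 is marked FIXED. If v4 needs higher linearity, the next step is process-per-GPU (an independent context per rank).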
