From 6b8c1e4e0f51f80d52842cc0cf84a209742f9dc7 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Thu, 18 Jun 2026 00:05:08 +0800 Subject: [PATCH 1/5] =?UTF-8?q?docs:=20Phase=20T18=20=E2=80=94=20dropout?= =?UTF-8?q?=20design=20(device=20RNG=20+=20mask)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Counter-based (stateless) RNG → Bernoulli(keep=1-p) mask, inverted 1/(1-p) scaling at train, identity at eval. New autodiff `dropout` op (fwd generates + applies mask, bwd applies the SAME cached mask). Wired at the two residual-path sites (attn / ffn outputs); attention-probs dropout deliberately skipped (fused SDPA doesn't materialise probs). Documents the RNG choice, per-site deterministic seed (so T13 recompute reproduces the same mask), train/eval switch, p=0 bit-identity, and the acceptance gates. Co-Authored-By: Claude Opus 4.8 --- docs/17-dropout.md | 155 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+) create mode 100644 docs/17-dropout.md diff --git a/docs/17-dropout.md b/docs/17-dropout.md new file mode 100644 index 0000000..d4844f3 --- /dev/null +++ b/docs/17-dropout.md @@ -0,0 +1,155 @@ +# Phase T18: Dropout(device RNG + mask)— Design Document + +## Goal + +在已有的 tape autograd 引擎(T4)+ tiny transformer(T5)之上,**手写一个 dropout 算子**: +训练时按 Bernoulli(keep = 1−p) 生成一个 0/1 mask,丢弃的元素置 0、保留的元素按 +**inverted dropout** 乘 `1/(1−p)`(让训练期望与推理一致);推理(eval)时 dropout 是**恒等**。 +新增一个 autodiff `dropout` 节点:**前向生成并施加 mask,反向施加同一个 mask**。 +接到模型的标准位置(residual 之前的 attention / MLP 子块输出;attention-probs dropout 不做,见下)。 +通过 `Config.dropout` / `--dropout` 暴露 `p`,**默认 `p=0`**。 + +明确范围(T18 只做这些): + +1. 一个 device 端 **counter-based RNG**(Philox 风格的 bit-mix),按 `(seed, 元素下标)` 无状态地产出 + 每元素的 Bernoulli 抽样 → 0/1 mask(保留=1,丢弃=0),同 seed **逐位可复现**。 +2. 一个 `dropout` autodiff 节点(fwd 生成 mask + 施加 inverted scaling;bwd 用**缓存的同一 mask**)。 +3. 模型里加 **training / eval 开关**:train 走 dropout、eval/采样/导出走恒等。 +4. `p` 经 `Config.dropout` 落地,`bin/train` 加 `--dropout` flag。 + +明确**不做**:attention-probs(softmax 后)dropout——本项目 attention 是**一个 fused batched SDPA 算子** +(`ops::attention`,softmax 在 kernel 内部不物化 probs 给外部施加 mask),在其上插 dropout 要么改 fused kernel、 +要么退回组合路径,**不值当**且偏离「标准 residual/ffn dropout」这条主线。文档明确记下「只做 residual-path dropout」。 + +## Module Layout + +``` +csrc/ops/dropout.cu # 新:counter-based RNG mask 生成 + 施加 (fwd) / 反向施加同 mask + # fp32 + bf16 两条(activation 流可能是 bf16,对齐 cast.cu 风格) + +crates/xtrain-cuda/ +├── build.rs # 新增 dropout.cu +└── src/ffi.rs # 新增 launch_dropout_{f32,bf16} 声明(no_cuda 门控) + +crates/xtrain-tensor/ +└── src/tensor.rs # 新增 Tensor::dropout_mask_apply(p, seed) -> (out, mask) + # Tensor::dropout_apply_mask(&mask) -> out(bwd 用) + +crates/xtrain-autodiff/ +├── src/ops.rs # 新增节点 dropout(x, p, seed)(p==0 提前返回 x.clone(),零节点) +└── tests/autograd.rs # 新增:固定 seed grad-check(mask 跨 ± 扰动固定)+ 期望保持数值检查 + +crates/xtrain-model/ +├── src/config.rs # Config 加 dropout: f32(默认 0) +├── src/model.rs # train/eval 开关(Cell)+ 在 attn/ffn 子块输出接 dropout; +│ # per-site 确定性 seed(与 checkpoint recompute 兼容) +└── tests/dropout.rs # 新增:p=0 逐位一致 / eval 恒等 / 期望保持 / p>0 小训练收敛 + +crates/xtrain-train/src/bin/train.rs # --dropout flag → Config.dropout;训练 model.train(),sample 前 model.eval() +``` + +为什么 RNG/mask 落在 `tensor.rs`(而非引擎):和 `scale`/`silu` 一样是一个 device kernel 的薄封装; +autodiff 层只负责把它包成带 backward 的 `Var` 节点(对齐 T4 既有分层)。 + +## Key Design Decisions + +### RNG:counter-based(Philox 风格),无状态、可复现、与重计算兼容 + +mask[i] 只由 `(seed, i)` 决定,**不读取任何可变 RNG 状态**: + +``` +key = seed XOR (i * 0x9E3779B97F4A7C15) // golden-ratio 常数打散下标 +h = splitmix64(key) // 几轮 bit-mix(xorshift+乘法) +u = (h >> 40) as f32 / 2^24 // [0,1) 均匀 +keep = u >= p // Bernoulli(keep = 1−p) +out[i] = keep ? x[i] * (1/(1−p)) : 0 +``` + +选 counter-based 而非「per-step 推进一个全局 LCG 状态」的关键原因 = **激活重计算(T13)**: +checkpoint 的 segment 在 backward 时会**重跑一遍 forward**(`segment_fn` 再执行)。 +若 dropout 用「调用时推进的可变状态」,重跑会拿到**不同的 mask** → 梯度与前向用的 mask 不一致 → 错。 +counter-based + **每个 dropout 站点一个确定性 seed**(见下)保证:重跑同 seed → **同 mask**, +重计算依旧逐位一致(T13 的硬闸门不被 dropout 破坏)。 + +> 复现性:同一 `(seed, p, shape)` 下 mask 逐位确定;fp32/bf16 mask 判定都在 fp32 里算 `u`(bf16 仅存/取 +> activation),所以两精度的 mask **同分布**(drop 与否由 fp32 `u` 决定,不受 bf16 舍入影响)。 + +### 每个 dropout 站点的确定性 seed(兼容 checkpoint 重算) + +模型持有一个 `base_seed`(`Cell`,每个训练 step 自增一次 → 每步换 mask)。`block_forward` +收到 `block_seed = base_seed XOR layer_index`,块内两处 dropout 再各 XOR 一个站点常量 +(attn=0xA77, ffn=0xF7N)派生出**该站点的 seed**。这些都是**纯函数**(只看 `base_seed + layer_index + +站点常量`,无可变推进),所以: + +- 同一 step 内不同站点 mask 不同(seed 不同); +- checkpoint 重算 `block_forward` 时,`block_seed` 由捕获的 `base_seed`/`layer_index` 重新算出 → **同 seed → 同 mask**; +- 跨 step mask 变化(`base_seed` 每步 +1)。 + +`base_seed` 的自增放在**训练入口**(`loss_batched` 训练态调用时 advance 一次)。eval/`forward`/采样 +**不 advance、不插 dropout**(恒等)。 + +### train / eval 开关 + +`TinyTransformer` 加一个 `Cell training`(默认 **false** = eval,安全:未显式开训练就不丢弃): + +- `model.train()` / `model.eval()` 切换(builder 风格 `with_training(bool)` 也提供,给测试)。 +- `forward_batched` 里:`p > 0 && training` 才在 attn/ffn 子块输出插 `ops::dropout`;否则**完全不建 dropout 节点**。 +- 因此 **`p == 0`** 或 **eval** → forward 图与改动前**逐字节相同**(`ops::dropout` 在 `p==0` 时也提前 + `return x.clone()`,双保险)→ 满足「p=0 与无 dropout 逐位一致」回归闸门。 + +训练 loop(`train`)开 `model.train()`;`eval_loss` / `generate` / 导出 `forward` 走 eval(恒等)—— +导出的模型权重不含任何 dropout,xserv 闭环不受影响。 + +### dropout 接在哪(wiring) + +接**两处 residual-path dropout**(标准 Pre-LN transformer 位置,对齐 GPT/LLaMA 训练实践): + +``` +h = h + dropout( attention(rms_norm(h)) ) # attn 子块输出,残差前 +h = h + dropout( swiglu_mlp(rms_norm(h)) ) # ffn 子块输出,残差前 +``` + +**不做** attention-probs dropout(理由见 Goal:fused SDPA 不物化 probs)。embedding dropout 也不做(非必需)。 + +### dropout 节点的 backward(为什么 grad-check 成立) + +``` +fwd: out = x ⊙ mask ⊙ (1/(1−p)) # mask 由 seed 生成,缓存进 backward 闭包 +bwd: dx = d ⊙ mask ⊙ (1/(1−p)) # 用同一个缓存 mask +``` + +dropout 在 **固定 mask** 下是一个逐元素线性映射 `out_i = c_i · x_i`(`c_i ∈ {0, 1/(1−p)}`), +其梯度就是 `dx_i = c_i · d_i`。finite-diff grad-check 之所以成立,关键是**前向缓存的 mask 在 ± 扰动两次 +forward 里保持不变**——本设计天然满足:mask 只由 `(seed, i)` 决定,与 `x` 的值无关,扰动 `x` 不改 mask。 +(grad-check 直接对 `ops::dropout` 节点跑:同一个 `seed` 调两次 forward 得到同一 mask,函数处处可微。) + +### 与既有特性的组合 + +- **bf16(T12)**:activation 流是 bf16 时,dropout kernel 走 bf16 分支(load→fp32 判 mask→store bf16), + mask 判定在 fp32,和 cast.cu 既有 bf16 elementwise 同风格;grad 也在 activation dtype(接回 bf16 链)。 +- **重计算(T13)**:见上「counter-based + 确定性 seed」——重算 mask 与前向逐位相同,T13 闸门不破。 +- **DDP(T8)**:每 rank 独立跑自己的 forward/backward,各自的 mask 由各 rank 的 `base_seed` 决定。 + 本任务的 DDP 闸门是「loss 对单卡 / 跨 rank 参数一致」,在 **dropout 关(默认 p=0)** 的回归配置下跑, + 不引入跨 rank mask 同步需求(p>0 时各 rank mask 本就该不同,属正常 DDP 语义)。 +- **梯度累积(T16)/ flash(T14)**:本分支独立于二者,不依赖其未合并改动。 + +## 验证方法 + +全部 `#![cfg(not(no_cuda))]` 门控;本地只 `cargo check`/`fmt`,构建 + 实跑在 dash5(8× RTX 5090, sm_120)。 + +**硬闸门(全绿,诚实正确性,不放宽容差)**: + +1. **固定 seed grad-check**(`autograd.rs::dropout_bwd`):对 `ops::dropout(x, p, seed)` 同一 seed + 跑 finite-diff(mask 跨 ± 扰动固定)→ `dx` 对中心差分通过(线性 op,用 `cfg_linear` 容差)。 +2. **train/eval + 期望保持**(`dropout.rs`): + - eval 恒等:`dropout` 关时 `out == x` **逐位**; + - 期望保持:大张量、训练态、对多组随机 mask 取均值,`E[out] ≈ x`(inverted scaling 正确),给数值; + - 实际 keep 比例 ≈ `1−p`(验证 RNG 分布)。 +3. **p=0 逐位一致**(`dropout.rs`):同 init 两个模型,一个不设 dropout、一个 `dropout=0`, + 同 batch forward+backward → **logits/loss/每参数 grad 逐位相同**(`|Δ| == 0`)。 +4. **p>0 小训练收敛**(`dropout.rs`,或 dash5 短跑):小模型开 `p=0.1` 训若干步,**loss 下降、无 NaN**。 +5. **全回归套绿**:autograd grad-checks、structural、batched==looped、bf16、recompute(逐位一致)、 + overfit 27/27、AdamW(GPU bit-exact + host vs torch)、DDP(loss-match + 跨 rank)、 + **xserv 闭环**(导出 md5 vs registry、token-identical;导出/推理 dropout **关**,导出模型不受影响)。 + +dash5 capture 每个闸门的 pass + 关键数字(max rel-err、期望 vs input、p=0 的 `|Δ|`、训练 loss 轨迹)。 From 1fdd0c5002ff841c1ae9a04628a30dbafbe67b92 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Thu, 18 Jun 2026 00:05:18 +0800 Subject: [PATCH 2/5] dropout: device RNG kernel + Tensor fwd/bwd (T18) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit csrc/ops/dropout.cu: counter-based RNG (splitmix64 over seed^index) → fp32 uniform → Bernoulli(keep=1-p); fwd writes out=x⊙mask + an fp32 mask buffer (per-element 1/(1-p) or 0); bwd applies the same mask (dx=d⊙mask). fp32 + bf16 activation variants (mask fp32 in both; uniform is dtype-independent so masks match across precisions). Stateless → re-run with same seed = same mask (T13 recompute-safe). Registered in build.rs + FFI decls. Tensor::dropout(p,seed)->(out,mask) and Tensor::dropout_backward(d,mask) wrap the launches (contiguous F32/BF16, default stream, per-op sync via the kernels). Co-Authored-By: Claude Opus 4.8 --- crates/xtrain-cuda/build.rs | 1 + crates/xtrain-cuda/src/ffi.rs | 45 ++++++++++++ crates/xtrain-tensor/src/tensor.rs | 86 +++++++++++++++++++++++ csrc/ops/dropout.cu | 109 +++++++++++++++++++++++++++++ 4 files changed, 241 insertions(+) create mode 100644 csrc/ops/dropout.cu diff --git a/crates/xtrain-cuda/build.rs b/crates/xtrain-cuda/build.rs index b8d5a8c..04dbb14 100644 --- a/crates/xtrain-cuda/build.rs +++ b/crates/xtrain-cuda/build.rs @@ -37,6 +37,7 @@ fn main() { .file("../../csrc/ops/optim.cu") .file("../../csrc/ops/attention.cu") .file("../../csrc/ops/cast.cu") + .file("../../csrc/ops/dropout.cu") .compile("xtrain_cuda_kernels"); } diff --git a/crates/xtrain-cuda/src/ffi.rs b/crates/xtrain-cuda/src/ffi.rs index 8b10d16..f308673 100644 --- a/crates/xtrain-cuda/src/ffi.rs +++ b/crates/xtrain-cuda/src/ffi.rs @@ -447,3 +447,48 @@ unsafe extern "C" { s: CudaStream, ); } + +// Dropout (Phase T18, csrc/ops/dropout.cu). A counter-based (stateless) RNG: the +// keep/drop decision for element `i` is `hash(seed, i)` — no global state, so a +// re-run with the same `seed` reproduces the same mask (compatible with T13 +// activation recomputation). Forward writes `out = x ⊙ mask` and the fp32 `mask` +// buffer (mask[i] = (1/(1-p)) if kept else 0, the inverted-dropout scale); +// backward applies the SAME mask: dx = d ⊙ mask. fp32 + bf16 activation variants +// (mask is fp32 in both; the uniform is computed in fp32, dtype-independent). +#[cfg(not(no_cuda))] +unsafe extern "C" { + pub fn launch_dropout_fwd_f32( + x: *const f32, + out: *mut f32, + mask: *mut f32, + p: f32, + scale: f32, + seed: u64, + n: i32, + s: CudaStream, + ); + pub fn launch_dropout_bwd_f32( + d: *const f32, + mask: *const f32, + dx: *mut f32, + n: i32, + s: CudaStream, + ); + pub fn launch_dropout_fwd_bf16( + x: *const c_void, + out: *mut c_void, + mask: *mut f32, + p: f32, + scale: f32, + seed: u64, + n: i32, + s: CudaStream, + ); + pub fn launch_dropout_bwd_bf16( + d: *const c_void, + mask: *const f32, + dx: *mut c_void, + n: i32, + s: CudaStream, + ); +} diff --git a/crates/xtrain-tensor/src/tensor.rs b/crates/xtrain-tensor/src/tensor.rs index 4132c8e..d8c2dbf 100644 --- a/crates/xtrain-tensor/src/tensor.rs +++ b/crates/xtrain-tensor/src/tensor.rs @@ -668,6 +668,92 @@ impl Tensor { dx } + /// Dropout forward (Phase T18). Returns `(out, mask)` where, for each element + /// `i`, a counter-based RNG draws `u = hash(seed, i) ∈ [0,1)` and keeps the + /// element iff `u >= p`; kept elements are scaled by `1/(1-p)` (inverted + /// dropout, so `E[out] == x`). `mask[i]` stores that per-element factor + /// (`1/(1-p)` if kept, else `0`) for the backward to reuse — the same mask, so + /// the op is a fixed elementwise scale w.r.t. `x` (and finite-diff-checkable). + /// + /// The mask depends only on `(seed, i)`, NOT on `self`'s values, so a re-run + /// with the same `seed` reproduces the same mask (T13 recompute stays exact). + /// `mask` is always fp32 (the uniform is computed in fp32, dtype-independent); + /// `out` matches `self`'s dtype. Requires `0 <= p < 1`. + #[cfg(not(no_cuda))] + pub fn dropout(&self, p: f32, seed: u64) -> (Self, Self) { + assert!( + matches!(self.dtype, DType::F32 | DType::BF16), + "dropout supports F32/BF16" + ); + assert!((0.0..1.0).contains(&p), "dropout p must be in [0,1)"); + assert!(self.is_contiguous(), "dropout requires contiguous tensor"); + let scale = 1.0 / (1.0 - p); + let out = Tensor::zeros(&self.shape, self.dtype, self.device()); + let mask = Tensor::zeros(&self.shape, DType::F32, self.device()); + let n = self.numel() as i32; + match self.dtype { + DType::F32 => unsafe { + xtrain_cuda::ffi::launch_dropout_fwd_f32( + self.data_ptr() as *const f32, + out.data_ptr() as *mut f32, + mask.data_ptr() as *mut f32, + p, + scale, + seed, + n, + std::ptr::null_mut(), + ); + }, + DType::BF16 => unsafe { + xtrain_cuda::ffi::launch_dropout_fwd_bf16( + self.data_ptr() as *const std::ffi::c_void, + out.data_ptr() as *mut std::ffi::c_void, + mask.data_ptr() as *mut f32, + p, + scale, + seed, + n, + std::ptr::null_mut(), + ); + }, + _ => unreachable!(), + } + (out, mask) + } + + /// Dropout backward: `dx = d ⊙ mask` (the SAME `mask` the forward cached). + /// `d` is the upstream grad (activation dtype); `mask` is the fp32 factor + /// tensor from [`Self::dropout`]. Output matches `d`'s dtype. + #[cfg(not(no_cuda))] + pub fn dropout_backward(d: &Tensor, mask: &Tensor) -> Self { + assert_eq!(d.numel(), mask.numel(), "dropout_backward shape mismatch"); + assert_eq!(mask.dtype, DType::F32, "dropout mask must be F32"); + let dx = Tensor::zeros(&d.shape, d.dtype, d.device()); + let n = d.numel() as i32; + match d.dtype { + DType::F32 => unsafe { + xtrain_cuda::ffi::launch_dropout_bwd_f32( + d.data_ptr() as *const f32, + mask.data_ptr() as *const f32, + dx.data_ptr() as *mut f32, + n, + std::ptr::null_mut(), + ); + }, + DType::BF16 => unsafe { + xtrain_cuda::ffi::launch_dropout_bwd_bf16( + d.data_ptr() as *const std::ffi::c_void, + mask.data_ptr() as *const f32, + dx.data_ptr() as *mut std::ffi::c_void, + n, + std::ptr::null_mut(), + ); + }, + _ => panic!("dropout_backward supports F32/BF16"), + } + dx + } + /// RoPE forward (rotate_half). `self`:[tokens,heads,head_dim]; each token's /// position is `row % period`. `period` = sequence length, so a flattened /// batch `[B*S,heads,head_dim]` gets per-sequence positions (pass `period=S`); diff --git a/csrc/ops/dropout.cu b/csrc/ops/dropout.cu new file mode 100644 index 0000000..beeb8af --- /dev/null +++ b/csrc/ops/dropout.cu @@ -0,0 +1,109 @@ +// Dropout kernels (Phase T18). +// +// A counter-based (stateless) RNG: the keep/drop decision for element `i` is a +// pure function of (seed, i) — no global RNG state is advanced. This is what +// makes dropout compatible with activation recomputation (T13): when a +// checkpointed block re-runs its forward in backward, the SAME seed regenerates +// the SAME mask, so the recomputed activations / grads stay bit-identical to the +// forward (no mask drift). +// +// Inverted dropout: at training time kept elements are scaled by 1/(1-p) so the +// expectation E[out] == x (no inference-time rescale needed; eval is identity, +// handled in Rust by simply not calling dropout). +// +// key = seed ^ (i * GOLDEN) +// h = splitmix64(key) // a few rounds of xorshift/multiply +// u = (h >> 40) / 2^24 in [0,1) // 24-bit uniform +// keep = u >= p // Bernoulli(keep = 1-p) +// out = keep ? x * scale : 0 // scale = 1/(1-p) +// mask = keep ? scale : 0 // cached for backward (dx = d * mask) +// +// fp32 + bf16 variants: bf16 loads/stores half-size activations but the uniform +// `u` is always computed in fp32, so the mask distribution is identical across +// dtypes (drop decisions don't depend on bf16 rounding). The mask buffer is fp32 +// in both cases (it stores `scale` or 0 — exactly representable, tiny relative to +// the activation, reused only elementwise in backward). + +#include +#include + +extern "C" { + +// splitmix64: cheap, well-mixed counter hash. Maps a 64-bit counter to a 64-bit +// pseudo-random output; we only need the high bits for a uniform. +__device__ __forceinline__ uint64_t splitmix64(uint64_t x) { + x += 0x9E3779B97F4A7C15ULL; + x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL; + x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL; + return x ^ (x >> 31); +} + +// Uniform [0,1) for element i under `seed`, computed in fp32 (dtype-independent). +__device__ __forceinline__ float dropout_uniform(uint64_t seed, int i) { + uint64_t key = seed ^ ((uint64_t)i * 0x9E3779B97F4A7C15ULL); + uint64_t h = splitmix64(key); + // Top 24 bits → [0,1) with 2^-24 resolution. + return (float)(h >> 40) * (1.0f / 16777216.0f); // 1/2^24 +} + +// fp32 forward: out[i] = keep ? x[i]*scale : 0 ; mask[i] = keep ? scale : 0. +__global__ void dropout_fwd_f32_k(const float* x, float* out, float* mask, + float p, float scale, uint64_t seed, int n) { + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) { + float keep = (dropout_uniform(seed, i) >= p) ? scale : 0.0f; + mask[i] = keep; + out[i] = x[i] * keep; + } +} +void launch_dropout_fwd_f32(const float* x, float* out, float* mask, float p, + float scale, uint64_t seed, int n, void* s) { + int blk = 256, grid = (n + blk - 1) / blk; + dropout_fwd_f32_k<<>>(x, out, mask, p, scale, + seed, n); +} + +// Backward applies the SAME cached mask elementwise: dx[i] = d[i] * mask[i]. +__global__ void dropout_bwd_f32_k(const float* d, const float* mask, float* dx, + int n) { + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) dx[i] = d[i] * mask[i]; +} +void launch_dropout_bwd_f32(const float* d, const float* mask, float* dx, int n, + void* s) { + int blk = 256, grid = (n + blk - 1) / blk; + dropout_bwd_f32_k<<>>(d, mask, dx, n); +} + +// bf16 forward: activation is bf16; mask is fp32 (stores `scale` or 0). Uniform +// is fp32, so the mask matches the fp32 path bit-for-bit (same drop decisions). +__global__ void dropout_fwd_bf16_k(const __nv_bfloat16* x, __nv_bfloat16* out, + float* mask, float p, float scale, + uint64_t seed, int n) { + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) { + float keep = (dropout_uniform(seed, i) >= p) ? scale : 0.0f; + mask[i] = keep; + out[i] = __float2bfloat16(__bfloat162float(x[i]) * keep); + } +} +void launch_dropout_fwd_bf16(const void* x, void* out, float* mask, float p, + float scale, uint64_t seed, int n, void* s) { + int blk = 256, grid = (n + blk - 1) / blk; + dropout_fwd_bf16_k<<>>( + (const __nv_bfloat16*)x, (__nv_bfloat16*)out, mask, p, scale, seed, n); +} + +__global__ void dropout_bwd_bf16_k(const __nv_bfloat16* d, const float* mask, + __nv_bfloat16* dx, int n) { + int i = blockIdx.x * blockDim.x + threadIdx.x; + if (i < n) dx[i] = __float2bfloat16(__bfloat162float(d[i]) * mask[i]); +} +void launch_dropout_bwd_bf16(const void* d, const float* mask, void* dx, int n, + void* s) { + int blk = 256, grid = (n + blk - 1) / blk; + dropout_bwd_bf16_k<<>>( + (const __nv_bfloat16*)d, mask, (__nv_bfloat16*)dx, n); +} + +} // extern "C" From 5eb27783f88c3ecc8e0f791022c72bb3a187cfd1 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Thu, 18 Jun 2026 00:05:32 +0800 Subject: [PATCH 3/5] dropout: autodiff op + fixed-seed grad-check (T18) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ops::dropout(x,p,seed): fwd runs Tensor::dropout, caches the mask in the backward closure, bwd pushes dx=d⊙mask. p==0 returns x.clone() (no node) so the default graph is unchanged. Tests in autograd.rs: fixed-seed finite-diff grad-check (mask held constant across the ± perturbation — dropout is a fixed elementwise linear map of x); E[out]≈input + keep-rate≈1-p over a seed sweep; p=0 kernel identity. Co-Authored-By: Claude Opus 4.8 --- crates/xtrain-autodiff/src/ops.rs | 25 +++++++ crates/xtrain-autodiff/tests/autograd.rs | 90 ++++++++++++++++++++++++ 2 files changed, 115 insertions(+) diff --git a/crates/xtrain-autodiff/src/ops.rs b/crates/xtrain-autodiff/src/ops.rs index 0a5e489..b6cc9d8 100644 --- a/crates/xtrain-autodiff/src/ops.rs +++ b/crates/xtrain-autodiff/src/ops.rs @@ -140,6 +140,31 @@ pub fn swiglu(gate: &Var, up: &Var) -> Var { mul(&silu(gate), up) } +/// Dropout (Phase T18). With probability `p` zero each element, scale the kept +/// ones by `1/(1-p)` (inverted dropout — `E[out] == x`). The keep/drop mask is +/// drawn by a counter-based RNG from `(seed, element index)`, so it is fully +/// determined by `seed` (same `seed` ⇒ same mask: stable across the T13 recompute +/// re-run, and held fixed across the ± perturbation of a finite-diff grad-check). +/// Forward caches the per-element scale `mask`; **backward applies the same mask** +/// (`dx = d ⊙ mask`), making dropout a fixed elementwise linear map of `x`. +/// +/// `p == 0` is a no-op: returns `x.clone()` (no node added) so the default graph +/// is bit-identical to the no-dropout path. eval-time identity is handled by the +/// caller simply not invoking dropout (the model's train/eval switch). +pub fn dropout(x: &Var, p: f32, seed: u64) -> Var { + if p == 0.0 { + return x.clone(); + } + let (out, mask) = x.value().dropout(p, seed); + Var::from_op( + out, + vec![x.clone()], + Box::new(move |d, parents| { + Var::push_grad(&parents[0], Tensor::dropout_backward(d, &mask)); + }), + ) +} + /// RoPE (rotate_half) over `x:[tokens,heads,head_dim]` with per-sequence position /// `row % period` (`period` = sequence length; `period == tokens` for a single /// sequence). Orthogonal map, so the backward is the inverse rotation of `dy` — no diff --git a/crates/xtrain-autodiff/tests/autograd.rs b/crates/xtrain-autodiff/tests/autograd.rs index 9c2b48a..ac81b17 100644 --- a/crates/xtrain-autodiff/tests/autograd.rs +++ b/crates/xtrain-autodiff/tests/autograd.rs @@ -625,6 +625,96 @@ fn attention_batched_bwd() { ); } +// ---- dropout (Phase T18) ---- +// +// Fixed-seed finite-diff grad-check. Under a fixed `seed` the mask is constant +// (it depends only on (seed, index), NOT on x), so dropout is a fixed elementwise +// linear map `out_i = c_i·x_i` and the central difference of L is differentiable: +// the ± perturbation of each x_i sees the SAME mask. The forward function in the +// closure calls `ops::dropout(x, p, SEED)` with the same SEED, so it reproduces +// the same mask both times. +#[test] +fn dropout_bwd() { + require_gpu(); + const SEED: u64 = 0xD120_FE5E; + let p = 0.3f32; + let (m, n) = (16, 12); + let x_h = fill(m * n, 71); + let w = fill(m * n, 72); + + let x = Var::leaf(cuda(&x_h, &[m, n])); + let out = ops::dropout(&x, p, SEED); + scalar_loss(&out, &w).backward(); + let dx = x.grad().unwrap().to_device(Device::Cpu); + + let wf = w.clone(); + let lx = move |v: &[f32], s: &[usize]| { + let o = ops::dropout(&Var::leaf(cuda(v, s)), p, SEED); + weighted_sum(&o.value(), &wf) + }; + report( + "dropout dX", + &grad_check(&x_h, &[m, n], &lx, dx.as_slice::(), cfg_linear()), + ); +} + +// Inverted-dropout expectation + keep-rate check. Over a large tensor and a sweep +// of seeds, the mean of dropout(x) tracks the mean of x (E[out] ≈ x, the inverted +// 1/(1-p) scaling), and the kept fraction tracks 1-p (the RNG is ~Bernoulli). +#[test] +fn dropout_expectation_and_keep_rate() { + require_gpu(); + let p = 0.25f32; + let n = 200_000usize; + let x_h = vec![1.0f32; n]; // mean(x) = 1 → mean(out) should ≈ 1 + let x = cuda(&x_h, &[n]); + + let trials = 8; + let mut mean_out_acc = 0.0f64; + let mut keep_acc = 0.0f64; + for t in 0..trials { + let (out, mask) = x.dropout(p, 0x5EED_0000 + t as u64); + let out_h = out.to_device(Device::Cpu); + let mask_h = mask.to_device(Device::Cpu); + let mean_out: f64 = + out_h.as_slice::().iter().map(|&v| v as f64).sum::() / n as f64; + let kept = mask_h.as_slice::().iter().filter(|&&m| m != 0.0).count(); + mean_out_acc += mean_out; + keep_acc += kept as f64 / n as f64; + } + let mean_out = mean_out_acc / trials as f64; + let keep_rate = keep_acc / trials as f64; + println!( + "dropout p={p}: E[out]={mean_out:.5} (input mean 1.0), keep_rate={keep_rate:.5} (1-p={:.3})", + 1.0 - p + ); + assert!( + (mean_out - 1.0).abs() < 0.01, + "E[out] {mean_out} not ≈ input mean 1.0 (inverted scaling broken)" + ); + assert!( + (keep_rate - (1.0 - p) as f64).abs() < 0.01, + "keep_rate {keep_rate} not ≈ 1-p {}", + 1.0 - p + ); +} + +// p=0 is a no-op (the op returns x.clone(), no node) → output is bit-identical to +// x and its grad flows straight through (the default-graph regression guard at the +// op level; the model-level bit-identity is in xtrain-model/tests/dropout.rs). +#[test] +fn dropout_p0_is_identity() { + require_gpu(); + let (m, n) = (8, 5); + let x_h = fill(m * n, 91); + let x = cuda(&x_h, &[m, n]); + let (out, _mask) = x.dropout(0.0, 12345); + let out_h = out.to_device(Device::Cpu); + for (a, b) in x_h.iter().zip(out_h.as_slice::()) { + assert_eq!(*a, *b, "p=0 dropout must be identity"); + } +} + // --- test helpers --- // Scalar loss node L = sum(W ∘ out): wraps a fixed-weight Var and reduces. We From e625aa05dd607e0e8797bfae6f48d58ff9dda021 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Thu, 18 Jun 2026 00:05:32 +0800 Subject: [PATCH 4/5] dropout: wire into model (residual sites) + train/eval switch + flag (T18) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Config.dropout (default 0). TinyTransformer gets a Cell training switch (train()/eval()/with_training, default eval = safe) + a Cell step_seed bumped once per training forward. forward_batched derives a per-layer block_seed (pure fn of step_seed×layer) and block_forward derives two per-site seeds, inserting ops::dropout at the attn and ffn sub-block outputs (before each residual). The seed is a pure function of (step_seed, layer, site) so the checkpoint (T13) recompute re-derives the same masks → grads stay exact. p=0 or eval → no dropout node → graph bit-identical to pre-T18. train_loop: model.train() per step (restored after eval flips to eval); eval_loss runs model.eval(). bin/train: --dropout flag → cfg.dropout. Export/sampling run in eval (default), so exported weights are dropout-free (xserv closed loop unaffected). Model-level tests (dropout.rs): p=0 bit-identical to no-dropout (logits/loss/grads); eval(p>0) == p=0 identity; train differs from eval + finite; recompute-with-dropout grads match non-recompute (fp32 + bf16). Co-Authored-By: Claude Opus 4.8 --- crates/xtrain-model/src/config.rs | 7 + crates/xtrain-model/src/model.rs | 105 ++++++++++-- crates/xtrain-model/tests/dropout.rs | 222 ++++++++++++++++++++++++++ crates/xtrain-train/src/bin/train.rs | 10 +- crates/xtrain-train/src/train_loop.rs | 5 + 5 files changed, 339 insertions(+), 10 deletions(-) create mode 100644 crates/xtrain-model/tests/dropout.rs diff --git a/crates/xtrain-model/src/config.rs b/crates/xtrain-model/src/config.rs index 3eba07a..554f930 100644 --- a/crates/xtrain-model/src/config.rs +++ b/crates/xtrain-model/src/config.rs @@ -20,6 +20,11 @@ pub struct Config { pub eps: f32, /// RoPE base frequency (theta). pub rope_theta: f32, + /// Dropout probability `p` (Phase T18). Applied at the attention/MLP sub-block + /// outputs (before each residual add) at TRAINING time, with inverted scaling + /// `1/(1-p)`; disabled (identity) at eval. Default `0.0` = no dropout, and the + /// forward graph is then bit-identical to the pre-T18 path. + pub dropout: f32, } impl Config { @@ -36,6 +41,7 @@ impl Config { ffn_hidden: 64, eps: 1e-5, rope_theta: 10000.0, + dropout: 0.0, } } @@ -60,6 +66,7 @@ impl Config { ffn_hidden, eps: 1e-5, rope_theta: 10000.0, + dropout: 0.0, } } diff --git a/crates/xtrain-model/src/model.rs b/crates/xtrain-model/src/model.rs index 830f068..bb57a3b 100644 --- a/crates/xtrain-model/src/model.rs +++ b/crates/xtrain-model/src/model.rs @@ -2,6 +2,8 @@ #![cfg(not(no_cuda))] +use std::cell::Cell; + use crate::config::Config; use xtrain_autodiff::ops; use xtrain_autodiff::tape::Var; @@ -47,6 +49,19 @@ pub struct TinyTransformer { /// existing numerics are bit-identical; recompute is mathematically exact, so /// grads match the non-checkpointed path within fp tolerance. recompute: bool, + /// Training mode for dropout (Phase T18). `true` → the attn/MLP sub-block + /// outputs pass through `ops::dropout` (with `cfg.dropout` and a per-step, + /// per-site seed); `false` (default) → dropout is identity (eval/sampling/ + /// export). `Cell` so `train()`/`eval()` flip it through `&self` (the forward + /// takes `&self`). When `cfg.dropout == 0` this flag is irrelevant — the graph + /// is bit-identical to the no-dropout path either way. + training: Cell, + /// Per-step dropout RNG seed (Phase T18). Bumped once at the start of each + /// TRAINING forward so every step draws fresh masks; combined with the layer + /// index + a per-site constant to give each dropout site its own seed. The RNG + /// is counter-based, so re-running a checkpointed block's forward in backward + /// (T13) reproduces the same seed → the same mask (recompute stays exact). + step_seed: Cell, } impl TinyTransformer { @@ -90,6 +105,8 @@ impl TinyTransformer { lm_head, compute_dtype: DType::F32, recompute: false, + training: Cell::new(false), + step_seed: Cell::new(0), } } @@ -127,6 +144,30 @@ impl TinyTransformer { self.recompute } + /// Switch to training mode (Phase T18): dropout (if `cfg.dropout > 0`) is + /// active in subsequent forwards. The training loop calls this before stepping. + pub fn train(&self) { + self.training.set(true); + } + + /// Switch to eval mode (Phase T18): dropout is identity. Held-out eval, + /// autoregressive sampling, and weight export all run in this mode (default). + pub fn eval(&self) { + self.training.set(false); + } + + pub fn is_training(&self) -> bool { + self.training.get() + } + + /// Builder-style train/eval toggle (Phase T18) — handy for tests that want a + /// model fixed in one mode. Equivalent to [`train`](Self::train) / + /// [`eval`](Self::eval) but chains off `new(..)`. + pub fn with_training(self, training: bool) -> Self { + self.training.set(training); + self + } + /// All learnable parameters, in a stable order. The optimizer (a hand-written /// GD step in T5, AdamW in T6) iterates this; each holds its `.grad()` after /// `backward()`. @@ -176,13 +217,34 @@ impl TinyTransformer { ); let seq = total / batch; + // Dropout (T18) is active only in training mode with p>0; otherwise it is + // identity (`ops::dropout` no-ops at p==0). Bump the per-step seed ONCE per + // training forward so each step draws fresh masks (counter-based RNG, so a + // checkpointed block's recompute reproduces the same seed → same mask). + let dropout_p = if self.training.get() { + self.cfg.dropout + } else { + 0.0 + }; + if dropout_p > 0.0 { + self.step_seed.set(self.step_seed.get().wrapping_add(1)); + } + let base_seed = self.step_seed.get(); + // Embedding gathers from the fp32 master table; in bf16 mode cast the // activation stream to bf16 here (norms are cast to bf16 gammas too). let mut h = ops::embedding(&self.embed, ids); // [batch*seq, dim], fp32 if self.compute_dtype == DType::BF16 { h = ops::cast(&h, DType::BF16); } - for b in &self.blocks { + for (li, b) in self.blocks.iter().enumerate() { + // Per-layer dropout seed: a deterministic function of (base_seed, + // layer index) — NOT a mutable counter — so the checkpoint recompute + // (which re-derives it from the captured base_seed/li) gets the same + // masks. The block derives its two per-site seeds from this. + let block_seed = base_seed + .wrapping_mul(0x100000001B3) + .wrapping_add(li as u64); h = if self.recompute { // Activation recomputation (T13): run the whole block forward inside // `checkpoint` so its internal activations aren't kept on the tape; @@ -190,7 +252,9 @@ impl TinyTransformer { // segment fn captures only `Copy` config (no borrow of `self`) and // receives the block's params via the slice, in `block_params` order. let (cfg, cdt) = (self.cfg, self.compute_dtype); - let seg = move |x: &Var, p: &[Var]| block_forward(cfg, cdt, batch, seq, x, p); + let seg = move |x: &Var, p: &[Var]| { + block_forward(cfg, cdt, batch, seq, dropout_p, block_seed, x, p) + }; xtrain_autodiff::checkpoint::checkpoint(seg, &h, &b.block_params()) } else { block_forward( @@ -198,6 +262,8 @@ impl TinyTransformer { self.compute_dtype, batch, seq, + dropout_p, + block_seed, &h, &b.block_params(), ) @@ -275,25 +341,46 @@ fn norm_gamma(cdt: DType, gamma: &Var) -> Var { } /// One transformer block's forward: pre-norm + multi-head causal attention + -/// residual, then pre-norm + SwiGLU MLP + residual. Pure in `(cfg, cdt, batch, -/// seq, input, params)` (no `&self`) so it can be the segment fn of -/// [`xtrain_autodiff::checkpoint`] for activation recomputation (T13). `params` is -/// the block's leaves in [`Block::block_params`] order. -fn block_forward(cfg: Config, cdt: DType, batch: usize, seq: usize, h: &Var, p: &[Var]) -> Var { +/// (T18) dropout + residual, then pre-norm + SwiGLU MLP + dropout + residual. +/// Pure in `(cfg, cdt, batch, seq, dropout_p, block_seed, input, params)` (no +/// `&self`, all `Copy`) so it can be the segment fn of +/// [`xtrain_autodiff::checkpoint`] for activation recomputation (T13) — the +/// recompute re-derives the same per-site seeds, so the dropout masks are +/// reproduced bit-for-bit. `dropout_p == 0` makes `ops::dropout` a no-op (the +/// graph is then identical to the pre-T18 path). `params` is the block's leaves in +/// [`Block::block_params`] order. +#[allow(clippy::too_many_arguments)] +fn block_forward( + cfg: Config, + cdt: DType, + batch: usize, + seq: usize, + dropout_p: f32, + block_seed: u64, + h: &Var, + p: &[Var], +) -> Var { let (attn_norm, wq, wk, wv) = (&p[0], &p[1], &p[2], &p[3]); let (q_norm, k_norm, wo) = (&p[4], &p[5], &p[6]); let (ffn_norm, w_gate, w_up, w_down) = (&p[7], &p[8], &p[9], &p[10]); - // --- Attention sub-block (pre-norm + residual) --- + // Per-site dropout seeds (XOR a site constant into the block seed) so the two + // residual-path dropouts draw independent masks within the same step/layer. + let attn_seed = block_seed ^ 0x0A7700; + let ffn_seed = block_seed ^ 0x0FF700; + + // --- Attention sub-block (pre-norm + dropout + residual) --- let normed = ops::rms_norm(h, &norm_gamma(cdt, attn_norm), cfg.eps); let attn = attention( cfg, cdt, batch, seq, &normed, wq, wk, wv, q_norm, k_norm, wo, ); + let attn = ops::dropout(&attn, dropout_p, attn_seed); let h = ops::add(h, &attn); - // --- MLP sub-block (pre-norm + residual) --- + // --- MLP sub-block (pre-norm + dropout + residual) --- let normed = ops::rms_norm(&h, &norm_gamma(cdt, ffn_norm), cfg.eps); let mlp = swiglu_mlp(cdt, &normed, w_gate, w_up, w_down); + let mlp = ops::dropout(&mlp, dropout_p, ffn_seed); ops::add(&h, &mlp) } diff --git a/crates/xtrain-model/tests/dropout.rs b/crates/xtrain-model/tests/dropout.rs new file mode 100644 index 0000000..04d5d68 --- /dev/null +++ b/crates/xtrain-model/tests/dropout.rs @@ -0,0 +1,222 @@ +// T18 dropout model-level gates. +// +// 1. p=0 bit-identical: a model built with cfg.dropout=0 (in either train or +// eval mode) produces logits/loss/grads bit-for-bit identical to the same +// model with no dropout field touched — the default forward graph is +// unchanged (the regression guard). +// 2. eval identity: with p>0 but eval mode, the forward equals the p=0 forward +// bit-for-bit (dropout is OFF at eval). +// 3. train vs eval differ: with p>0 and train mode, the forward differs from +// eval (dropout actually does something) and grads are still finite. +// 4. recompute compatibility: with p>0 + train + recompute, grads match the +// non-recompute path (the counter-based seed reproduces the same mask on the +// backward re-run — T13 stays exact even with dropout in the block). +// +// (The fixed-seed grad-check of the dropout op and the E[out]≈x / keep-rate check +// live in xtrain-autodiff/tests/autograd.rs; p>0 training convergence is the +// dash5 short run noted in docs/17-dropout.md.) +#![cfg(not(no_cuda))] + +use xtrain_cuda::device; +use xtrain_model::{Config, TinyTransformer, batched_ids_tensor}; +use xtrain_tensor::{DType, Device}; + +fn fill(n: usize, seed: u64, scale: f32) -> Vec { + let mut state = seed + .wrapping_mul(2862933555777941757) + .wrapping_add(3037000493); + (0..n) + .map(|_| { + state = state + .wrapping_mul(6364136223846793005) + .wrapping_add(1442695040888963407); + (((state >> 33) as f32 / (1u64 << 31) as f32) - 0.5) * 2.0 * scale + }) + .collect() +} + +fn build(cfg: Config, device: Device) -> TinyTransformer { + let mut seed = 1u64; + TinyTransformer::new(cfg, device, |shape| { + seed = seed.wrapping_add(1); + let n: usize = shape.iter().product(); + if shape.len() == 1 { + fill(n, seed, 0.02).iter().map(|v| v + 1.0).collect() + } else { + fill(n, seed, 0.08) + } + }) +} + +fn host(t: &xtrain_tensor::Tensor) -> Vec { + t.to_dtype(DType::F32) + .to_device(Device::Cpu) + .as_slice::() + .to_vec() +} + +fn tiny_cfg(dropout: f32) -> Config { + let mut cfg = Config::tiny(); + cfg.vocab = 16; + cfg.n_layers = 4; + cfg.dropout = dropout; + cfg +} + +fn batch_data(cfg: &Config, device: Device) -> (xtrain_tensor::Tensor, xtrain_tensor::Tensor) { + let (batch, seq) = (3usize, 6usize); + let seqs: Vec> = (0..batch) + .map(|b| (0..seq).map(|i| ((b * 7 + i * 3 + 1) % cfg.vocab) as i32).collect()) + .collect(); + let tgts: Vec> = (0..batch) + .map(|b| (0..seq).map(|i| ((b * 5 + i * 2 + 2) % cfg.vocab) as i32).collect()) + .collect(); + ( + batched_ids_tensor(&seqs, device), + batched_ids_tensor(&tgts, device), + ) +} + +fn require_gpu() -> Device { + assert!(device::device_count().unwrap() > 0, "no CUDA device"); + device::set_device(0).unwrap(); + Device::Cuda(0) +} + +// Run forward+backward, return (logits, loss, per-param grads). +fn fwd_bwd( + m: &TinyTransformer, + ids: &xtrain_tensor::Tensor, + tgt: &xtrain_tensor::Tensor, + batch: usize, +) -> (Vec, f32, Vec>) { + let logits = host(&m.forward_batched(ids, batch).value()); + let loss = m.loss_batched(ids, tgt, batch); + let loss_val = host(&loss.value())[0]; + loss.backward(); + let grads: Vec> = m.params().iter().map(|p| host(&p.grad().unwrap())).collect(); + (logits, loss_val, grads) +} + +// --- Gate 3: p=0 is bit-identical to the no-dropout path (default graph). --- +#[test] +fn dropout_p0_bit_identical() { + let device = require_gpu(); + let batch = 3; + + // Reference: cfg.dropout default (0.0), never touched train/eval. + let cfg0 = tiny_cfg(0.0); + let (ids, tgt) = batch_data(&cfg0, device); + let ref_m = build(cfg0, device); + let (ref_logits, ref_loss, ref_grads) = fwd_bwd(&ref_m, &ids, &tgt, batch); + + // p=0 in TRAINING mode: the seed bump is gated on p>0, the op no-ops at p==0, + // so the graph must be byte-identical. + let p0_train = build(tiny_cfg(0.0), device); + p0_train.train(); + let (lt, lst, gt) = fwd_bwd(&p0_train, &ids, &tgt, batch); + + assert_eq!(ref_logits, lt, "p=0 train logits not bit-identical"); + assert_eq!(ref_loss, lst, "p=0 train loss not bit-identical"); + for (i, (a, b)) in ref_grads.iter().zip(>).enumerate() { + assert_eq!(a, b, "p=0 train grad[{i}] not bit-identical"); + } + println!("p=0 (train) vs no-dropout: logits/loss/grads bit-identical ✅"); +} + +// --- Gate 2: eval is exact identity (p>0 but eval mode == p=0). --- +#[test] +fn dropout_eval_is_identity() { + let device = require_gpu(); + let batch = 3; + let cfg = tiny_cfg(0.2); + let (ids, tgt) = batch_data(&cfg, device); + + // p=0 reference and a p=0.2 model held in eval — outputs must match bit-for-bit. + let ref_m = build(tiny_cfg(0.0), device); + let (ref_logits, ref_loss, ref_grads) = fwd_bwd(&ref_m, &ids, &tgt, batch); + + let eval_m = build(cfg, device); + eval_m.eval(); // explicit; also the default + let (el, els, eg) = fwd_bwd(&eval_m, &ids, &tgt, batch); + + assert_eq!(ref_logits, el, "eval (p>0) logits not identity"); + assert_eq!(ref_loss, els, "eval (p>0) loss not identity"); + for (i, (a, b)) in ref_grads.iter().zip(&eg).enumerate() { + assert_eq!(a, b, "eval (p>0) grad[{i}] not identity"); + } + println!("eval (p=0.2) == no-dropout: bit-identical (eval is identity) ✅"); +} + +// --- Gate (train vs eval differ): with p>0 + train, dropout actually fires. --- +#[test] +fn dropout_train_differs_from_eval() { + let device = require_gpu(); + let batch = 3; + let cfg = tiny_cfg(0.3); + let (ids, _tgt) = batch_data(&cfg, device); + + let m = build(cfg, device); + m.eval(); + let eval_logits = host(&m.forward_batched(&ids, batch).value()); + m.train(); + let train_logits = host(&m.forward_batched(&ids, batch).value()); + + let max_diff = eval_logits + .iter() + .zip(&train_logits) + .map(|(a, b)| (a - b).abs()) + .fold(0.0f32, f32::max); + assert!( + max_diff > 1e-4 && train_logits.iter().all(|v| v.is_finite()), + "train logits should differ from eval (dropout active) and be finite; max_diff={max_diff}" + ); + println!("train vs eval logits max diff {max_diff:.4e} (dropout active in train) ✅"); +} + +// --- Gate 4: p>0 + recompute grads match non-recompute (T13 stays exact). --- +// The counter-based seed is a pure function of (step_seed, layer, site); the +// checkpoint backward re-runs block_forward and re-derives the SAME seeds, so the +// recomputed dropout masks match the forward — grads stay bit-identical. +fn recompute_with_dropout(dtype: DType, grad_tol: f32) { + let device = require_gpu(); + let batch = 3; + let cfg = tiny_cfg(0.2); + let (ids, tgt) = batch_data(&cfg, device); + + // Both models: same init, train mode, p=0.2. step_seed starts at 0 and bumps + // to 1 on the first training forward in BOTH, so they draw the same masks. + let off = build(cfg, device).with_compute_dtype(dtype).with_training(true); + let on = build(cfg, device) + .with_compute_dtype(dtype) + .with_recompute(true) + .with_training(true); + + let off_loss = off.loss_batched(&ids, &tgt, batch); + off_loss.backward(); + let off_grads: Vec> = off.params().iter().map(|p| host(&p.grad().unwrap())).collect(); + + let on_loss = on.loss_batched(&ids, &tgt, batch); + on_loss.backward(); + let on_grads: Vec> = on.params().iter().map(|p| host(&p.grad().unwrap())).collect(); + + let mut max_rel = 0.0f32; + for (a, b) in off_grads.iter().flatten().zip(on_grads.iter().flatten()) { + max_rel = max_rel.max((a - b).abs() / a.abs().max(1e-3)); + } + println!("[{dtype:?}] dropout p=0.2 recompute on/off grad max rel = {max_rel:.3e}"); + assert!( + max_rel < grad_tol, + "[{dtype:?}] recompute grads diverged with dropout: {max_rel:.3e}" + ); +} + +#[test] +fn dropout_recompute_matches_fp32() { + recompute_with_dropout(DType::F32, 1e-4); +} + +#[test] +fn dropout_recompute_matches_bf16() { + recompute_with_dropout(DType::BF16, 5e-3); +} diff --git a/crates/xtrain-train/src/bin/train.rs b/crates/xtrain-train/src/bin/train.rs index b1d50cf..c11a12e 100644 --- a/crates/xtrain-train/src/bin/train.rs +++ b/crates/xtrain-train/src/bin/train.rs @@ -109,6 +109,10 @@ fn main() { let val_tokens: usize = flag(&args, "--val-tokens", 0); let eval_every: usize = flag(&args, "--eval-every", 0); let eval_batches: usize = flag(&args, "--eval-batches", 64); + // Dropout (Phase T18): residual-path dropout prob, active at training time + // only (inverted scaling), identity at eval/sampling/export. Default 0 = off + // (forward graph bit-identical to the no-dropout path). + let dropout: f32 = flag(&args, "--dropout", 0.0f32); // bf16 mixed precision (Phase T12): fp32 master weights, bf16 linears + // activations. Opt-in; default fp32 reproduces v0–v4 numerics. let bf16 = args.iter().any(|a| a == "--bf16"); @@ -149,7 +153,8 @@ fn main() { (corpus, None) }; - let cfg = Config::from_arch(vocab, n_heads, head_dim, n_layers, ffn); + let mut cfg = Config::from_arch(vocab, n_heads, head_dim, n_layers, ffn); + cfg.dropout = dropout; println!( "model: dim {} layers {} heads {} head_dim {} ffn {} → core {:.3}M params \ (+ embed/lm {:.2}M = {:.2}M total)", @@ -183,6 +188,9 @@ fn main() { model = model.with_recompute(true); println!("activation recompute: ON (per-block gradient checkpointing)"); } + if dropout > 0.0 { + println!("dropout: ON (p={dropout}, residual-path, train-only inverted scaling)"); + } // Eval-only mode: load a checkpoint and score it on the held-out val set, then // exit. Used to put an EXISTING model (e.g. v0) and a new one on the same diff --git a/crates/xtrain-train/src/train_loop.rs b/crates/xtrain-train/src/train_loop.rs index 4d93b7f..8ba5f52 100644 --- a/crates/xtrain-train/src/train_loop.rs +++ b/crates/xtrain-train/src/train_loop.rs @@ -89,6 +89,9 @@ pub fn train( } let ids = batched_ids_tensor(&inputs, device); let targets = batched_ids_tensor(&targets_v, device); + // Training mode → dropout active (T18; no-op when cfg.dropout == 0). Set + // each step so it is restored after a periodic eval flips to eval mode. + model.train(); let loss = model.loss_batched(&ids, &targets, cfg.batch_size); let step_loss = read_scalar(&loss); loss.backward(); @@ -169,6 +172,8 @@ pub fn eval_loss( if valid.len() <= seq + 1 { return f32::NAN; } + // Eval mode → dropout is identity (T18). + model.eval(); let n_win = (valid.len() - 1) / seq; // disjoint windows that fit let batches = batches.max(1).min(n_win.max(1)); let stride = (n_win / batches).max(1); From 80fafa191437decc969849cb06ec6e4fac6241d1 Mon Sep 17 00:00:00 2001 From: Gahow Wang Date: Thu, 18 Jun 2026 00:06:06 +0800 Subject: [PATCH 5/5] docs: T18 evolution row + README build-journey row (dropout) Co-Authored-By: Claude Opus 4.8 --- README.md | 1 + docs/evolution.md | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index e93cd5f..5f1bf71 100644 --- a/README.md +++ b/README.md @@ -50,6 +50,7 @@ Each phase: design doc + implementation + tests + a scoped commit (see [`docs/`] | **T11** | **device caching allocator** (fixes KI-5) | single-GPU 2.3×; **8-GPU 461K tok/s** | | **T12** | **bf16 mixed precision** (fp32 master, fixes KI-2) | dim768 OOM solved; −29% mem | | **T13** | **activation recompute** / checkpointing (fixes KI-3) | dim1024 fits; grads bit-identical | +| **T18** | **dropout** (hand counter-based device RNG + mask, inverted scaling, train/eval switch) | fixed-seed grad-check; **p=0 bit-identical**; recompute-safe | The four performance fixes (T10–T13) each removed a real bottleneck — see [`docs/known-issues.md`](docs/known-issues.md). diff --git a/docs/evolution.md b/docs/evolution.md index da5aa27..75ab741 100644 --- a/docs/evolution.md +++ b/docs/evolution.md @@ -24,6 +24,7 @@ | T11 | Infra | **device caching/pool allocator**(复用 op 输出显存,消 per-step cudaMalloc) | 单卡 2.3×;**8卡 461K tok/s** 近线性(修 KI-5) | | T12 | 算法/Infra | **bf16 混合精度**(fp32 master,cuBLAS GemmEx,norm/softmax/CE 保 fp32) | dim768 OOM 解除,−29% 显存/+13% tok/s(修 KI-2) | | T13 | 算法/Infra | **激活重计算**(per-block gradient checkpointing:前向 no-tape + 反向重算,`backward_seeded`) | 梯度对非重计算版**逐位一致**(0.00);dim768 31.1→14.6GB;**dim1024 batch32 OOM→16.6GB 装下**(修 KI-3,解锁 v8) | +| T18 | 算法 | **dropout**(手写 counter-based 设备 RNG → Bernoulli mask,训练 inverted 1/(1-p) scaling、eval 恒等);新 autodiff `dropout` 算子(fwd 生成+施加 mask,bwd 用同 mask),接 residual/ffn 两处;`--dropout` flag 默认 0 | 固定 seed grad-check 过;E[out]≈input + keep≈1-p;**p=0 与无 dropout 逐位一致**;recompute(T13) 组合下梯度仍逐位一致(counter-based seed 重算复现同 mask);全回归 + xserv 闭环绿(导出/推理 dropout 关) | --- @@ -49,7 +50,7 @@ ## 三、各维度的累积演进(轴向看一条线怎么走的) -- **算法**:手写 autograd(tape)+扇出累加 → AdamW/LR-sched/grad-clip → +QK-norm(Qwen3) → batched forward → bf16 混合精度(fp32 master) → 激活重计算(T13)。 +- **算法**:手写 autograd(tape)+扇出累加 → AdamW/LR-sched/grad-clip → +QK-norm(Qwen3) → batched forward → bf16 混合精度(fp32 master) → 激活重计算(T13) → dropout(T18,counter-based 设备 RNG + inverted scaling,train/eval 切换)。 - **模型架构**:固定 Qwen3-style;dim **32→256→384→512→768→1024**(v8 首拨容量轴,头数 24→32);核心参数 **41K→226M**(总 3.26M→329M)。 - **Infra**:单卡 fp32 → cuBLAS/GPU-optim(T7) → NCCL DDP(T8) → batched forward(T10) → caching allocator(T11) → bf16(T12) → 激活重计算(T13,解锁 dim1024)。吞吐 **3.3K→217K tok/s**(dim768 bf16),dim1024+重算 ~129K(重算税);MFU **0.4%→17%**(每次提升都对应一块 perf 基建,详见 known-issues + MFU 分析)。 - **数据集**:TinyStories 3MB 切片 → 全量 TinyStories(epoch 0.01→5.33,**至饱和**)→ **v6 毕业到 FineWeb-edu 真实网页**(2.255B 语料,1.02ep)→ **v7 同子集多 epoch(1.45ep,近顶)→ v8 同子集换大模型**(dim1024,1.05ep)。tokenizer 全程 gpt2 BPE(复用 xserv-tokenizer;v6 刻意不换 tokenizer 以隔离「数据来源」变量,KI-4 留后续版本)。