Files
xtrain/crates/xtrain-distributed
Gahow Wang f202351be5 model: per-block activation recompute (--recompute)
Wrap each transformer block's forward in the checkpoint primitive when
recompute is enabled (Phase T13 / KI-3). To make the block forward a pure
segment fn (no `&self` borrow, so it can re-run in the backward closure),
extract the block body + its helpers (linear / norm_gamma / attention /
swiglu_mlp) into free functions parameterised by (cfg, compute_dtype) and add
`Block::block_params()` (the 11 leaves in the params() per-block order). The
non-recompute path calls `block_forward` directly — identical graph to before.

- `TinyTransformer::with_recompute(bool)` builder (opt-in; default off keeps the
  unchanged tape / bit-identical numerics).
- `--recompute` flag wired into bin/train and bin/train_ddp (DDP: each rank
  checkpoints independently).

Correctness gate: tests/recompute.rs builds two identical models (recompute
on/off), runs the same batched loss+backward, and asserts the forward logits,
the loss, and EVERY parameter grad match within tight fp tol — parameterised
over fp32 and bf16 (T12 composition).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:42:42 +08:00
..
2026-06-15 17:14:56 +08:00