Commit Graph

95 Commits

Author SHA1 Message Date
80602099dc test: scale Q/K in flash grad-check for well-conditioned grads
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:17:04 +08:00
f38beb0346 test: flash finite-diff grad-check uses single-tile clean regime
Match the trusted composed grad-check dims (seq=5<FA_TILE); the multi-tile
online-softmax path is gated by flash_bwd_matches_composed_bwd (seq=40),
sharper than finite-diff on the near-zero grads a long softmax produces.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:16:20 +08:00
01fb22d114 test: flash bwd vs composed bwd (sharper than finite-diff)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:12:30 +08:00
5f3b81ac96 test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag
autograd: flash_attention_batched_bwd (dQ/dK/dV finite-diff, seq>tile)
+ flash_matches_composed_fwd. model/tests/flash.rs: flash==composed
on-vs-off (logits/loss/every param grad), fp32 + bf16. parity_dump:
XTRAIN_PARITY_FLASH dumps the flash path for the same parity.py oracle
(PyTorch SDPA parity at B>1). train + train_ddp get the --flash flag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:39 +08:00
0e20821633 autodiff+model: flash-attention op + --flash opt-in wiring
ops::flash_attention autograd node (fwd caches O(N) logsumexp instead of
O(N²) probs; bwd via Tensor::flash_attention_backward). Model gets a
use_flash bool + with_flash(bool) builder; the SDPA core in attention()
picks ops::flash_attention vs ops::attention. flash threads through
block_forward so the recompute (T13) segment also runs flash. Default
off = composed path, graph unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:32 +08:00
326a6fadfe cuda: fused flash-attention kernel (fwd + flash-style bwd)
csrc/ops/flash_attention.cu: a single fused fwd kernel (one block per
query row, streams KV in tiles of 32, online softmax — running max/sum
+ rescaled V accumulator, causal mask inlined, never materializes the
[bh,S,S] scores) writing out[bh,S,hd] + the per-row logsumexp L (O(N),
saved for backward). flash-style bwd: recompute scores from Q/K/V + L,
collapse the softmax Jacobian with D[i]=ΣdO·O, dQ owned per row, dK/dV
atomicAdd across rows. Tensor::flash_attention / flash_attention_backward
wrap them (bf16 upcasts Q/K/V→f32 for the kernel, same fp32-softmax
policy as composed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:25 +08:00
65a2264227 docs: Phase T14 — fused flash-attention design
Design doc for the hand-written single fused flash-attention kernel:
online softmax tiled over KV, NEVER materializing the [bh,S,S] score
matrix; flash-style backward (recompute scores from saved logsumexp +
D=ΣdO·O, dQ/dK/dV). Opt-in --flash; composed T10 path stays default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 23:10:16 +08:00
31cc2bf745 docs: capstone README — full-stack + scaling study (v0-v8) writeup
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 16:17:26 +08:00
511f35d40c docs: run v8 — dim1024 capacity helps (val 2.98)
v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale
dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute.
8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h.

Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 /
v8 2.9801. Capacity helps — v8 (1.05ep) beats v6 at the same ~1ep by 0.085
AND beats v7 (smaller model, 1.45ep more old data) by 0.035 ⇒ v6/v7 were
partly capacity-limited, scaling capacity > repeating old data. But the gain
is only ~3% (same magnitude as the data-axis single-step lever), and v8's
val was still descending at the end (not saturated).

Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6,
capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress,
scale capacity AND data together (Chinchilla, reproduced at toy scale).

- docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples
- docs/runs/README.md: +v8 row, v9 proposal
- docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 15:12:01 +08:00
0150263055 perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K
Per-block activation recompute (T13) measured on dash5 (1× RTX 5090 32GB, bf16,
batch32 seq256, steady-state):

- Correctness (exact, hard gate): recompute on-vs-off grads are BIT-IDENTICAL —
  fp32 AND bf16: loss / logits / every param grad max rel = 0.00e0 (not "within
  tol", exactly equal). Full suite green with recompute on/off; DDP loss-match
  5.67e-7; DDP+recompute 2-rank descends 11.079→6.010.
- dim768 (18L/24h ffn2048, core 127M): peak mem 31144→14562 MiB (−53%), tok/s
  39.7K→31.5K (−20%, the extra-forward tradeoff, in the predicted 20–35% band).
- dim1024 (18L/32h ffn2730, core 226M): recompute OFF OOMs (hits 32100/32607
  MiB → OutOfMemory); recompute ON fits at 16596 MiB, ~23K tok/s, converges.
  → KI-3 payoff achieved: dim1024 batch32 unblocked, v8 can proceed.

Fill docs/12 bench table; mark KI-3 FIXED in docs/known-issues.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:50:29 +08:00
69c5f07359 docs: Phase T13 — activation recompute
Design doc for per-block gradient checkpointing (KI-3): the no-tape forward +
recompute-on-backward design, the `checkpoint` primitive, per-block wrapping,
the exactness/correctness argument (same kernels + inputs → identical grads),
composition with bf16+DDP+batched, and the verification plan (on-vs-off grad
gate + memory/throughput before→after, dim1024-fits). Bench table left as TBD
to fill after the dash5 run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:45:16 +08:00
f202351be5 model: per-block activation recompute (--recompute)
Wrap each transformer block's forward in the checkpoint primitive when
recompute is enabled (Phase T13 / KI-3). To make the block forward a pure
segment fn (no `&self` borrow, so it can re-run in the backward closure),
extract the block body + its helpers (linear / norm_gamma / attention /
swiglu_mlp) into free functions parameterised by (cfg, compute_dtype) and add
`Block::block_params()` (the 11 leaves in the params() per-block order). The
non-recompute path calls `block_forward` directly — identical graph to before.

- `TinyTransformer::with_recompute(bool)` builder (opt-in; default off keeps the
  unchanged tape / bit-identical numerics).
- `--recompute` flag wired into bin/train and bin/train_ddp (DDP: each rank
  checkpoints independently).

Correctness gate: tests/recompute.rs builds two identical models (recompute
on/off), runs the same batched loss+backward, and asserts the forward logits,
the loss, and EVERY parameter grad match within tight fp tol — parameterised
over fp32 and bf16 (T12 composition).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:42:42 +08:00
c396b39483 autodiff: checkpoint primitive (recompute-on-backward)
Add `xtrain_autodiff::checkpoint::checkpoint(segment_fn, input, params)`, a
higher-order autograd node (à la torch.utils.checkpoint) for activation
recomputation (Phase T13 / KI-3):

- forward: run `segment_fn` on detached leaves so its internal ops are NOT
  recorded on the outer tape; keep only the output value (the local sub-tape —
  and thus the segment's intermediate activations — drops immediately). The
  checkpoint node's parents are [input, ..params].
- backward: re-run `segment_fn` from the saved input + (unchanged) param values
  into a fresh local tape, seed the recomputed output with the upstream grad,
  backprop, then push the recovered input/param grads to the real parents. Local
  tape drops at the end → recomputed activations freed.

Exact by construction (same deterministic kernels, same inputs) → grads match
the non-checkpointed path. Composes with bf16 (T12, same path on recompute) and
DDP (T8, per-rank).

Supporting change: `Var::backward_seeded(seed)` — backward from an explicit
non-scalar upstream grad (the segment output is generally not a scalar);
`backward()` is now the scalar wrapper that seeds ones.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 09:42:31 +08:00
9c557f0609 docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01)
v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256),
trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's
1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to
registry v7-fineweb-edu-dim768, serves in xserv (coherent expository
English, ~v6 quality).

Key finding: more epochs of the SAME subset gave only ~0.05 val drop and
the curve flattened (~step 44000) with no sampling quality gain → the
2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's
TinyStories data-volume saturation: repeating old data has thin margins;
true further gains need FRESH shards (more diverse tokens), as v6's
corpus-swap (which raised the ceiling) showed.

Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro
saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis
ceiling note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 03:55:47 +08:00
b4bb426d48 docs: run v6 — FineWeb-edu graduation (val 3.07, new distribution)
第一版脱离 TinyStories:纯 FineWeb-edu 真实网页文本(2.255B 语料),架构同
v4/v5(dim768/18L, core 127.43M),8 卡 DDP bf16,2.29B tok/1.02ep,~1.9h
@218K tok/s。train 11.03→3.14,best/final FineWeb val 3.0652。

方法论:FineWeb val(3.07) 与 v0–v5 的 TinyStories val(~1.1) 不可比——真实
网页熵高,~3.0 是预期非回退;判据是采样质量 + transfer eval。

- 新增 docs/runs/06-v6-fineweb-edu-dim768.md:数据管线(scripts/fineweb_to_txt.py)
  / 架构(同 v4/v5,隔离数据变量) / 超参 / 结果(val 单调降无走平=未饱和) /
  方法论说明 / transfer eval(v6→TinyStories val 2.75 vs v5 native 1.11,纯通用
  数据对窄分布有代价) / v5-vs-v6 同提示词采样对比(v6 写真实说明文 vs v5 一律
  掉进小故事)
- README 对比表加 v6 行(val 单独标注分布) + 换轴说明 + v7 提案
- evolution.md scaling 表 v6 行定稿 + 数据轴 TinyStories→FineWeb-edu 毕业说明

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 22:21:43 +08:00
88bec270af docs: evolution overview — per-milestone changes across algorithm/arch/infra/dataset axes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:30:52 +08:00
7e5ea9976b data: FineWeb-edu parquet->txt prep script (Scaling v6)
v6 broadens data from TinyStories to FineWeb-edu (HuggingFaceFW/fineweb-edu
sample/10BT) while freezing the v4/v5 arch. scripts/fineweb_to_txt.py streams
the parquet text column row-group by row-group and joins docs with
<|endoftext|> so xtrain's existing Corpus loader (gpt2 BPE, u16 cache) handles
it unchanged. Corpus .txt/.parquet/.u16.bin stay dash5-only (gitignored).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:04:45 +08:00
579365f4a0 docs: run v5 — TinyStories saturation at dim768 (val 1.11)
设计文档 05-v5-tinystories-dim768.md(中文,xserv 风格):数据 2.49B tok/5.33ep、
架构同 v4(净测数据变量)、bf16 8 卡 global 256、train 11.07→1.06 best val 1.1102。
核心发现「数据天花板」:v4(1.54ep)1.169→v5(5.33ep)1.110 仅 ↓5% 且末段 val 走平
⇒ TinyStories 在 dim768/127M-core 近饱和,v6 该换轴(更大模型/更广语料,非更多 TinyStories)。
xserv BF16 服务 3/3 prompt 逐 token 一致。

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 17:56:25 +08:00
8a1e29543b run: v5 archive + export (dim768, bf16, 5.33ep, val 1.11)
v0–v5 对比表加 v5 行 + tokens-trained / epoch 两列,让 TinyStories 数据饱和可见
(v4→v5 同 arch 数据 ×3.5 仅 val ↓5% 且末段走平)。下一档提案改为 v6 换轴。
导出 201 tensors + RUN.md 存入 dash5 registry v5-tinystories-dim768(checkpoint/safetensors 不入库)。

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 17:56:25 +08:00
5b7dde1736 test: bf16 test reads f32-cast logits (forward now returns bf16)
The `keep bf16 logits` change made forward_batched return bf16 logits
in bf16 mode; the bf16 test's host read must cast to f32 first.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:29:24 +08:00
320c1ae4fb perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K
bf16 mixed precision (fp32 master) solves the v4 dim768 fp32 batch-32
OOM and speeds up the now-compute-bound dim768 GEMMs (dash5 1× RTX
5090 32GB, dim768/18L/24h×32 ffn2048 seq256, steady-state):

  config          batch  peak mem   tok/s   fits 32GB
  fp32            16      27.2 GB    31.5K   yes
  bf16            16      19.3 GB    35.5K   yes   (-29% mem / +13% tok/s)
  fp32            32      —          —       OOM
  bf16            32      31.1 GB    40.8K   yes   (+29% vs fp32-b16)

Verified on dash5: fp32 suite green at tight tol + xserv export md5
bit-identical to registry; bf16 looser-tol (loss 1.2e-4, logits p99
6.8e-3, grad 1.0e-2) + 150-step convergence tracks fp32 (3.984 vs
3.988); 2-GPU bf16 DDP at per-rank batch 32 trains cleanly.

Mark KI-2 FIXED; fill docs/11 results.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:28:20 +08:00
48922cb628 perf: keep bf16 logits (no persistent fp32 logits buffer)
At vocab 50257 the logits tensor [B*S, vocab] is ~1.6GB fp32 at batch
32 — held across the whole backward. Keep it bf16: cross_entropy
upcasts the bf16 logits to fp32 internally (transient) + caches fp32
probs, and its backward casts dx back to bf16 to chain into the
bf16 lm_head matmul backward. The sampler casts bf16 logits→f32 before
the host argmax/softmax. Halves the persistent logits activation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:20:48 +08:00
30db62d8f2 docs: Phase T12 — bf16 mixed precision design
docs/11-bf16-mixed-precision.md: the AMP split (bf16 linears +
activations, fp32 master / norms / softmax / RoPE / CE, no loss
scaling), the cast-op bridge, module layout, and the dual
verification gate (fp32 unchanged + bf16 looser-tol + convergence +
mem/throughput). Memory/throughput before->after to be filled from
the dash5 bench.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:15:02 +08:00
0a2a4dcaa8 train: --bf16 flag (fp32-master AMP) + bf16 correctness test
- TinyTransformer::with_compute_dtype(BF16): embedding stays fp32
  master then casts to bf16; each linear casts its fp32 weight to bf16
  on the fly; logits cast back to fp32 for cross-entropy. Default F32
  reproduces the v0-v4 forward graph bit-for-bit.
- --bf16 flag on bin/train and bin/train_ddp (off by default).
- tests/bf16.rs: same fp32 master weights run fp32 vs bf16; assert
  loss/logits/grads within a loose bf16 tol, no NaN, and grads are
  fp32 (master untouched).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:14:55 +08:00
b0086b5214 autodiff: bf16 mixed-precision path (fp32 master via cast op)
Tensor ops dispatch on dtype: fp32 branch unchanged (bit-identical),
bf16 branch routes matmul/attention through GemmEx and elementwise
through the bf16 kernels. Norm/softmax/RoPE/cross-entropy upcast to
fp32 around the existing fp32 kernels (standard AMP: reductions/loss
fp32, matmuls bf16). Transposes route bf16 through fp32 (pure layout).

New autodiff `cast` op is the AMP bridge: forward downcasts a fp32
master leaf to bf16 for the matmul; backward upcasts the bf16 grad
back to fp32. So the fp32 leaf accumulates an fp32 grad and AdamW /
clip / DDP all-reduce stay fp32 and completely unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:14:48 +08:00
d05115ddf3 cuda: bf16 cuBLAS GemmEx (16BF in/out, fp32 accum) + cast kernels
Add the bf16 compute primitives for T12 mixed precision:
- DType::BF16 (half::bf16 as TensorDType), 2 bytes.
- cublasGemmEx / cublasGemmStridedBatchedEx FFI + CUDA_R_16BF /
  CUBLAS_COMPUTE_32F constants (values per xserv gemm.rs).
- cublas::gemm_ex / gemm_ex_strided_batched: same row-major⟺col-major
  transpose algebra as sgemm, bf16 in/out, fp32 accumulation.
- csrc/ops/cast.cu: f32<->bf16 cast + bf16 elementwise (add/mul/scale/
  silu(+dx)/add_bias/sum_rows), each load->fp32->compute->store bf16.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 14:14:39 +08:00
511ceebbb3 docs: KI-2 trigger — dim768 fp32 batch-32 OOM
v4 surfaced the concrete bf16 trigger: dim768 fp32 OOMs at per-rank batch 32
(global 256) in 32GB, forcing per-rank 16 (global 128). bf16 (halve activation
mem) would restore the batch-256 sweet spot. Record it on KI-2; keep KI-2 as
the backlog item it is (still deferred).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:42 +08:00
ff79fee3c5 docs: run v4 — TinyStories, dim768, val 1.17
Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep /
arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128
per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8
DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU /
v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run
validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32
batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table
to v0/v1/v2/v3/v4 and the next-rung proposal to v5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:37 +08:00
734e119db3 run: v4 archive + export (dim768, 8-GPU DDP, val 1.17)
v4 scaling run finished: dim768/18L, core 127.43M (total 204.63M), trained
720.9M tokens (~1.54 epoch) on 8x RTX 5090 DDP fp32, ~145K tok/s, ~84 min,
best val 1.1690. Checkpoint archived to registry
(~/projects/tiny-models/v4-tinystories-dim768/) and exported to xserv HF Qwen3
safetensors (201 tensors, BF16); xserv serves it and matches xtrain greedy
token-for-token on all 3 fixed prompts (40 tok).

Add `greedy_sample` bin: load a trained ckpt with its arch flags and print
xtrain's own greedy continuations for the fixed run prompts, so they can be
diffed against xserv's greedy on the exported weights (the per-run token-match
check). Same model/config/init scheme as bin/train.rs + bin/export_safetensors.rs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 13:14:28 +08:00
f85bd4d276 perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8
Device caching/pool allocator removes the per-op cudaMalloc serialization that
was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX
5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s):

  single-GPU: 40226 -> 92638 tok/s  (~2.3x)
  DDP scaling (global batch 32*world):
    world  before        after
      1    39801 1.00x    92385 1.00x
      2    47229 1.19x   146821 1.59x
      4    52854 1.33x   269867 2.92x
      8    48996 1.23x   461270 4.99x

8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs
at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are
bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export
of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it.

Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the
design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the
~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever
if v4 needs higher linearity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:15:02 +08:00
4c3f332f64 docs: Phase T11 — caching allocator
Design doc for the device caching/pool allocator (KI-5 re-diagnosis recap, size
classes, per-device + thread-safety, Drop->return, transparency/correctness
argument, why skip-memset uninit is deferred, dual verification gates). Before/
after numbers filled after dash5 measurement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:04:11 +08:00
b7104e2cb7 test: loosen flaky DDP cross-rank assertion to <1e-6; scale to world=8
The cross-rank `max|p0-p1| == 0.0` check is flaky on this PCIe-only box: NCCL's
all-reduce is not bit-reproducible run-to-run across ranks (algorithm/chunk
choice is unstable), so cross-rank params can differ by a few ULP (observed
<=1.2e-7) even with identical init + averaged grads. The load-bearing gate is the
loss-trajectory match (~5.7e-7); a tight <1e-6 tolerance is the honest invariant.

Also extend ddp_throughput_scaling to include world=8 for the KI-5 before/after
scaling table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:04:11 +08:00
28801fbfe5 cuda: device caching allocator (pool GpuBuffer alloc)
Every tape op allocates its output via Tensor::zeros -> GpuBuffer::alloc ->
cudaMalloc, a synchronous process-serialized driver call. Under the single-
process thread-per-GPU DDP model the rank threads' hundreds of per-step allocs
serialize through the driver (KI-5 root cause); it costs single-GPU too.

Add a per-device, size-classed caching pool: GpuBuffer::alloc serves from a
free-list (request rounded up to a size class so repeating training shapes
reuse buffers), only cudaMalloc on a miss; Drop returns the buffer to the pool
instead of cudaFree. Thread-safe via a global registry keyed by device id with
each device's free-list behind its own Mutex (registry lock held only to clone
out the per-device Arc<Mutex<_>>, so rank threads don't contend across devices).
The buffer records its alloc-time device so Drop returns to the right pool.

Transparent: physical capacity may be rounded up, but len()/memset/copy bounds
all use the requested length, so the rounded tail is never read and numerics are
unchanged. zeros() still memsets (reused buffers hold stale bytes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:04:02 +08:00
d422c68704 docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky)
The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the
original ungrouped all-reduce is itself run-to-run nondeterministic on
this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7,
1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here
(passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP,
numerically benign; loss-match stays ~6e-7). Noted as a follow-up to
loosen the assertion to a tight tolerance; coalescing was reverted purely
because it gives ~0 scaling benefit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:42:13 +08:00
84092fb28d docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11)
T11 set out to coalesce/overlap the gradient all-reduce per the original
KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch
32, seq 256) falsifies that hypothesis:

  - grad all-reduce is only ~6-7% of each step;
  - per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the
    SAME per-rank workload) and dominates;
  - coalescing the ~150 per-tensor all-reduces into one grouped/flat
    launch gives ~0 scaling gain AND breaks cross-rank bit-identity
    (max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so
    the coalescing commit (b8b5821) was reverted.

Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy
at a time; CPU not starved; per-thread default stream doesn't help):
single-process thread-per-GPU ranks serialize on the single CUDA
context's per-op cudaMalloc / driver calls. Fix direction (out of T11
scope): a caching/pool allocator, or process-per-GPU. Recorded in
docs/known-issues.md with the measured table; KI-5 stays Open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:40:45 +08:00
88c2c15768 Revert "dist: coalesce grads into buckets for all-reduce (KI-5)"
This reverts commit b8b58212dc.
2026-06-16 09:39:38 +08:00
b8b58212dc dist: coalesce grads into buckets for all-reduce (KI-5)
Replace the per-parameter eager all-reduce (~150 tiny serial NCCL calls
for dim512, DDP's dominant cost after T10's batched forward) with a
coalesced bucketed all-reduce: pack grads into a few large contiguous
scratch buffers, all-reduce each bucket once (fused via ncclGroupStart/
End), fold the 1/world average into one per-bucket scale, unpack back.

The packed buffer is the concatenation of the grad tensors, so NCCL's
element-wise sum over a bucket equals the per-tensor sums — bit-identical
to the un-bucketed path; only launch/latency overhead is removed. DDP
cross-rank param identity + loss-match are preserved.

Adds xtrain_cuda::device::copy_d2d (cudaMemcpy D2D) for the pack/unpack.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 09:09:44 +08:00
a78502e0f0 docs: run v3 — TinyStories, dim512, val 1.30
Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full
TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what
changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup
1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40
best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side
samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by
staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 03:37:45 +08:00
64b2a8c09e run: v3 archive + export (dim512, single-GPU batched, val 1.30)
v3 trained (30000 steps × batch 32 × seq 256 = 245.8M tok, ~0.53 epoch),
single-GPU batched via T10 (~26K tok/s, ~2.65h). Archived to registry
~/projects/tiny-models/v3-tinystories-dim512/ (xtrain.ckpt + config.json +
model.safetensors BF16 179 tensors + tokenizer.json + RUN.md) and served in
xserv (loads 16L/dim512 qwen3, 2/3 prompts token-match xtrain greedy; 3rd
diverges on BF16 drift as in v1/v2).

best/final val 1.3027 (beats ~1.4 target). val ladder on the same held-out
1M-token set: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30. T10 (batched forward)
validated at scale (KI-1 root cause = launch-bound, not all-reduce); single-GPU
avoids KI-5. Update docs/runs/README.md comparison table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 03:37:36 +08:00
9a25616a30 docs: Phase T10 — batched forward
docs/09-batched-forward.md: the launch-bound diagnosis recap, the
[B*S,dim]-flatten + fused batched-attention design (RoPE per-seq position +
causal masking inline in softmax), the attention forward/backward via
strided-batched GEMM, autograd implications, the looped-split/merge dead-end
post-mortem (1127 tok/s, host round-trips), verification methods + before→after
throughput, and the v3 recommendation (per-rank batch 16-32, single/small world
until KI-5 bucketed all-reduce lands).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:50 +08:00
4ccab0fb42 perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x)
Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling")
FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU,
back-to-back A/B:

  before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB
  after  (batched):    25627 tok/s (batch16) / 40263 (batch32),
                       util 37% mean / 54% peak, ~10 GB
  → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%.

A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x.
The v3 falsification history (larger batch doesn't help a single-seq design) is
kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching
exposes (eager all-reduce of all params each step) → recorded as KI-5
(bucketed/overlapped all-reduce), out of T10 scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:43 +08:00
25b032445d train: real batched step (drop loop+SUM)
Feed a real batch of B sequences as ONE batched forward/backward, replacing the
"loop B times + let the tape SUM grads + clip ×1/B" hack. CE mean over B*S rows
is already the batch-mean loss, so backward yields the batch-mean gradient
directly → clip pre-scale = 1.0.

DDP stays equivalent: each rank runs one batched forward over its b_local =
B_global/world sequences (local-mean grad Σ_local/b_local); all_reduce_average
(sum across ranks /world) = Σ_global/B_global = global batch-mean → clip
pre-scale 1.0. The ddp_correctness single-GPU baseline batches the same way.
DDP loss matches single-GPU 5.7e-7, cross-rank params bit-identical (0.0).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:33 +08:00
5353b38402 model: batched forward [B,S]
forward_batched(ids[B*S], batch)/loss_batched: run B equal-length sequences as
ONE forward over flattened [B*S] ids, so every linear is one big [B*S,dim] GEMM.
Attention reshapes to [B*nh,S,hd], runs the fused batched causal SDPA (per-seq
mask + RoPE period=S, no cross-sequence attention), writes back [B*S,dim]. The
old per-(batch,head) loop + host-round-tripping split/merge_heads + the additive
causal_mask leaf are gone. forward(ids[seq]) is now forward_batched(ids,1), so
the sampler / inference path (batch=1) is unchanged.

+batched_ids_tensor helper. New batched.rs test: batched forward == looped
single-sequence (logits identical 0.0, grads 6.4e-4, loss identical). PyTorch
parity now exercises B>1 (B=2,S=4): loss 5e-8, logits 6.9e-6, all 25 param
grads within rtol — verifying per-seq RoPE position + per-seq causal masking.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:25 +08:00
7821bd9c34 autograd: batch dim for ops (flatten linears, batched attention)
Add the batched-forward primitives. Linears/norms/elementwise/embedding/CE
already act on flat [rows,dim], so they work unchanged on [B*S,dim]; only
attention + RoPE need sequence awareness:

- RoPE: kernel takes a `period` (= seq len) so position = row % period, i.e.
  per-sequence position on a flattened batch (period == tokens = single seq).
- Fused batched causal attention: new `Tensor::attention`/`attention_backward`
  + ops node, running QKᵀ and PV as cublasSgemmStridedBatched over the B*nh
  (sequence,head) blocks (new sgemm_strided_batched binding) and a causal
  softmax kernel (scale + per-row causal mask inline) — the whole attention is
  3 launches regardless of B*nh, no per-head/per-seq loop, no host round-trip.
- transpose_4d12 ([B,S,nh,hd] <-> [B,nh,S,hd]) to lay out the batched heads.

grad-checks: new batched-rope, transpose_4d12, batched-attention dQ/dK/dV all
pass finite-diff (attn dK 1.5e-2, dQ 7.5e-3, dV 2.9e-4; rest tighter) alongside
the existing 12.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 00:44:15 +08:00
d2a585c5cb docs: KI-1 re-diagnosed in v3 — larger batch does NOT fix DDP weak scaling
v3 tested the documented mitigation (raise global_batch to amortize the
per-step all-reduce). Isolated back-to-back A/B on 4× RTX 5090, dim384/12L,
seq256:

  global_batch 32 (8/rank)  → 3163 tok/s
  global_batch 256 (64/rank)→ 3200 tok/s   (8× batch, +1.2%, within noise)

8× larger batch = 1/8 the all-reduces per token, yet no speedup → all-reduce
is NOT the bottleneck. GPU util 0–15%, mem ~2–3 GB/32 GB → the workload is
launch-bound: the single-sequence model design (each sequence its own tiny
forward/backward, per-op kernel launches) starves the GPU, and batching only
adds proportionally more serial launches. Real fix is batched (multi-sequence)
forward so GEMMs fill the GPU — a T4/T5 autograd/model change, not a batch knob.
Bucketed/overlapped all-reduce stays deferred (no value until launch-bound is
fixed). KI-1 kept Open with the corrected root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 23:20:26 +08:00
bf679f6f1f docs: run v2 — TinyStories, dim384/12L, DDP 4-card (val 1.71)
Scaling run v2 design doc + comparison-table update. v2 = dim384/12L/12h
SwiGLU ffn1536 (core 28.32M, total 66.92M), trained 4500 steps / ~36.9M
tokens on full TinyStories (reused v1 u16 cache) via NCCL DDP across 4
RTX 5090s. Best val 1.7055 (train 10.89→1.72), a clear jump over v1 2.58
and v0 3.80. Exported to xserv (135 BF16 tensors) and archived in the
dash5 registry; xserv greedy token-matches xtrain on 2/3 fixed prompts
(3rd diverges late under BF16 drift). Records the DDP weak-scaling caveat
(global batch too small → all-reduce dominates) → links docs/known-issues
KI-1; v3 proposal applies KI-1's fix (much larger global batch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 22:38:31 +08:00
c87a0bc44e docs: known-issues / perf backlog — KI-1 DDP weak scaling at small global batch
Surfaced by v2 (world=4, global_batch=32): ~3593 tok/s, no speedup vs v1
single-GPU. Root cause + proposed fixes recorded; also consolidates deferred
T7 items (bf16, activation recompute) and the large-vocab modeling note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:56:58 +08:00
7090b475fb train: bring DDP trainer to parity with bin/train (val + checkpoint + cache + arch)
The T8 DDP path now matches the single-GPU `bin/train`: CLI-tunable arch
(scaling-ladder rung), the cached token-id stream (`load_cached`), held-out
val-loss eval + best-val checkpointing, and LR warmup→cosine. Rank 0 owns the
val corpus and runs the no-grad eval / writes the best checkpoint (params are
bit-identical across ranks). The eval/checkpoint logic is reused from
`xtrain-train` (`eval_loss`, `checkpoint::save`) rather than duplicated.

- DdpConfig gains eval_every / eval_batches / ckpt_path.
- train_rank takes `valid: Option<&Corpus>` and returns DdpResult
  (losses + evals + best_val); launch threads the val corpus to rank 0 only.
- bin/train_ddp reworked to the bin/train CLI (positional tokenizer/corpus +
  --dim/--heads/--head-dim/--layers/--ffn/--steps/--batch/--seq/--max-lr/
  --val-tokens/--eval-every/--ckpt), reusing the u16 cache.
- DDP correctness test updated to the new signatures (semantics unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:34:40 +08:00
264660527f docs: run v1 — TinyStories full, dim256
docs/runs/01-v1-tinystories-dim256.md + docs/runs/README.md comparison table.
v1: full TinyStories train (468.3M tok, u16-cached) + dim256/8L (core 8.39M).
Same-held-out-set val loss v0 3.8050 → v1 2.5847 (−1.22); v1 samples coherent
stories vs v0's "mommy's mommy's mommy" loop; exports + serves token-identical
in xserv. Single RTX 5090, ~25.9 min, ~3310 tok/s.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:09:46 +08:00
ec8114ecbc train: --eval-ckpt eval-only mode (v0-vs-v1 same-set val loss)
Expose eval_loss() and add a --eval-ckpt <path> branch to bin/train: load an
existing checkpoint into a model of the given arch and score it on the held-out
val split, then exit. Lets v0 and v1 be measured on the identical validation set
(the acceptance metric) without a separate eval binary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 18:44:40 +08:00