xtrain

Author	SHA1	Message	Date
Gahow Wang	b06b553f99	test: drop unused Var import in grad_accum Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:49:04 +08:00
Gahow Wang	abe5ceb913	test: grad-accum equivalence + accum=1 bit-identity + DDP+accum - grad_accum.rs: accum=N×B grads bit-close to a single N·B big batch; accum_steps=1 bit-identical (max|Δ|==0) to no-accum; real train() loop with accum tracks a big-batch baseline over 20 AdamW steps. - ddp_correctness.rs: world=2 + accum=2 matches a single-GPU big batch of the same effective size (loss + cross-rank + vs-baseline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:45:40 +08:00
Gahow Wang	7a03b0054a	train+ddp: micro-batch gradient accumulation (--accum-steps) Accumulate grads over N micro-batches, then one AdamW step + zero_grad, for an effective batch of N×micro at one micro-batch's activation cost. Each micro-loss is scaled by 1/N before backward (the tape SUM-accumulates the scaled grads) so the boundary grad equals a single step over an N× batch. accum==1 skips the scale → bit-identical to the pre-T16 path. DDP: the cross-rank all-reduce fires ONLY at the accumulation boundary (intermediate micro-steps are local-only, no NCCL); the /world average is orthogonal to the per-micro 1/N, so the boundary grad is the effective global-batch mean. New --accum-steps flag in both train binaries; effective batch is printed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:45:33 +08:00
Gahow Wang	d01fec6639	docs: Phase T16 — gradient accumulation design Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:41:17 +08:00
Gahow Wang	9064ced4c2	docs: T14 flash-attention results + evolution/README rows Fill in the design doc's measured results (grad-check, flash==composed, PyTorch parity, peak mem -16%/-23%, tok/s tradeoff), add the T14 row to evolution.md (算法/Infra) and the README build-journey table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:34:10 +08:00
Gahow Wang	d217f4fbd3	perf: spread flash bwd dK/dV atomics across all threads Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:27:33 +08:00
Gahow Wang	4d7b69f8d4	perf: cache softmax weights in shared mem (drop hd× redundant expf) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:24:56 +08:00
Gahow Wang	9b05f4f93f	test: flash==composed bf16 uses robust mean/p99 metric (repo convention) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:19:08 +08:00
Gahow Wang	c0f0b67510	test: eps=2e-3 for flash dQ/dK finite-diff (cuts f32 rounding term) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:17:44 +08:00
Gahow Wang	80602099dc	test: scale Q/K in flash grad-check for well-conditioned grads Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:17:04 +08:00
Gahow Wang	f38beb0346	test: flash finite-diff grad-check uses single-tile clean regime Match the trusted composed grad-check dims (seq=5<FA_TILE); the multi-tile online-softmax path is gated by flash_bwd_matches_composed_bwd (seq=40), sharper than finite-diff on the near-zero grads a long softmax produces. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:16:20 +08:00
Gahow Wang	01fb22d114	test: flash bwd vs composed bwd (sharper than finite-diff) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:12:30 +08:00
Gahow Wang	5f3b81ac96	test+bins: flash grad-check, flash==composed, PyTorch parity, --flash flag autograd: flash_attention_batched_bwd (dQ/dK/dV finite-diff, seq>tile) + flash_matches_composed_fwd. model/tests/flash.rs: flash==composed on-vs-off (logits/loss/every param grad), fp32 + bf16. parity_dump: XTRAIN_PARITY_FLASH dumps the flash path for the same parity.py oracle (PyTorch SDPA parity at B>1). train + train_ddp get the --flash flag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:39 +08:00
Gahow Wang	0e20821633	autodiff+model: flash-attention op + --flash opt-in wiring ops::flash_attention autograd node (fwd caches O(N) logsumexp instead of O(N²) probs; bwd via Tensor::flash_attention_backward). Model gets a use_flash bool + with_flash(bool) builder; the SDPA core in attention() picks ops::flash_attention vs ops::attention. flash threads through block_forward so the recompute (T13) segment also runs flash. Default off = composed path, graph unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:32 +08:00
Gahow Wang	326a6fadfe	cuda: fused flash-attention kernel (fwd + flash-style bwd) csrc/ops/flash_attention.cu: a single fused fwd kernel (one block per query row, streams KV in tiles of 32, online softmax — running max/sum + rescaled V accumulator, causal mask inlined, never materializes the [bh,S,S] scores) writing out[bh,S,hd] + the per-row logsumexp L (O(N), saved for backward). flash-style bwd: recompute scores from Q/K/V + L, collapse the softmax Jacobian with D[i]=ΣdO·O, dQ owned per row, dK/dV atomicAdd across rows. Tensor::flash_attention / flash_attention_backward wrap them (bf16 upcasts Q/K/V→f32 for the kernel, same fp32-softmax policy as composed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:25 +08:00
Gahow Wang	65a2264227	docs: Phase T14 — fused flash-attention design Design doc for the hand-written single fused flash-attention kernel: online softmax tiled over KV, NEVER materializing the [bh,S,S] score matrix; flash-style backward (recompute scores from saved logsumexp + D=ΣdO·O, dQ/dK/dV). Opt-in --flash; composed T10 path stays default. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 23:10:16 +08:00
Gahow Wang	31cc2bf745	docs: capstone README — full-stack + scaling study (v0-v8) writeup Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 16:17:26 +08:00
Gahow Wang	511f35d40c	docs: run v8 — dim1024 capacity helps (val 2.98) v8 = capacity-axis A/B: freeze the v6/v7 2.255B FineWeb-edu subset, scale dim768→dim1024 (core 127M→226M, +78%) via bf16 + T13 activation recompute. 8-GPU DDP, 2.36B tok (1.05 ep), ~129K tok/s (recompute tax), ~5h. Result (same FineWeb val, v6/v7/v8 comparable): v6 3.0652 / v7 3.0149 / v8 2.9801. Capacity helps — v8 (1.05ep) beats v6 at the same ~1ep by 0.085 AND beats v7 (smaller model, 1.45ep more old data) by 0.035 ⇒ v6/v7 were partly capacity-limited, scaling capacity > repeating old data. But the gain is only ~3% (same magnitude as the data-axis single-step lever), and v8's val was still descending at the end (not saturated). Meta-finding: every single-axis lever (data-volume v5/v7, breadth v6, capacity v8) is now ~3%/lever ⇒ broad diminishing returns; to progress, scale capacity AND data together (Chinchilla, reproduced at toy scale). - docs/runs/08-v8-fineweb-edu-dim1024.md: full capacity experiment + v7-vs-v8 samples - docs/runs/README.md: +v8 row, v9 proposal - docs/evolution.md: +T13 infra row, +v8 scaling row, capacity-axis & diminishing-returns notes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 15:12:01 +08:00
Gahow Wang	0150263055	perf: KI-3 fixed — dim1024 batch32 fits, mem 31.1→14.6GB, tok/s 39.7K→31.5K Per-block activation recompute (T13) measured on dash5 (1× RTX 5090 32GB, bf16, batch32 seq256, steady-state): - Correctness (exact, hard gate): recompute on-vs-off grads are BIT-IDENTICAL — fp32 AND bf16: loss / logits / every param grad max rel = 0.00e0 (not "within tol", exactly equal). Full suite green with recompute on/off; DDP loss-match 5.67e-7; DDP+recompute 2-rank descends 11.079→6.010. - dim768 (18L/24h ffn2048, core 127M): peak mem 31144→14562 MiB (−53%), tok/s 39.7K→31.5K (−20%, the extra-forward tradeoff, in the predicted 20–35% band). - dim1024 (18L/32h ffn2730, core 226M): recompute OFF OOMs (hits 32100/32607 MiB → OutOfMemory); recompute ON fits at 16596 MiB, ~23K tok/s, converges. → KI-3 payoff achieved: dim1024 batch32 unblocked, v8 can proceed. Fill docs/12 bench table; mark KI-3 FIXED in docs/known-issues.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:50:29 +08:00
Gahow Wang	69c5f07359	docs: Phase T13 — activation recompute Design doc for per-block gradient checkpointing (KI-3): the no-tape forward + recompute-on-backward design, the `checkpoint` primitive, per-block wrapping, the exactness/correctness argument (same kernels + inputs → identical grads), composition with bf16+DDP+batched, and the verification plan (on-vs-off grad gate + memory/throughput before→after, dim1024-fits). Bench table left as TBD to fill after the dash5 run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:45:16 +08:00
Gahow Wang	f202351be5	model: per-block activation recompute (--recompute) Wrap each transformer block's forward in the checkpoint primitive when recompute is enabled (Phase T13 / KI-3). To make the block forward a pure segment fn (no `&self` borrow, so it can re-run in the backward closure), extract the block body + its helpers (linear / norm_gamma / attention / swiglu_mlp) into free functions parameterised by (cfg, compute_dtype) and add `Block::block_params()` (the 11 leaves in the params() per-block order). The non-recompute path calls `block_forward` directly — identical graph to before. - `TinyTransformer::with_recompute(bool)` builder (opt-in; default off keeps the unchanged tape / bit-identical numerics). - `--recompute` flag wired into bin/train and bin/train_ddp (DDP: each rank checkpoints independently). Correctness gate: tests/recompute.rs builds two identical models (recompute on/off), runs the same batched loss+backward, and asserts the forward logits, the loss, and EVERY parameter grad match within tight fp tol — parameterised over fp32 and bf16 (T12 composition). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:42:42 +08:00
Gahow Wang	c396b39483	autodiff: checkpoint primitive (recompute-on-backward) Add `xtrain_autodiff::checkpoint::checkpoint(segment_fn, input, params)`, a higher-order autograd node (à la torch.utils.checkpoint) for activation recomputation (Phase T13 / KI-3): - forward: run `segment_fn` on detached leaves so its internal ops are NOT recorded on the outer tape; keep only the output value (the local sub-tape — and thus the segment's intermediate activations — drops immediately). The checkpoint node's parents are [input, ..params]. - backward: re-run `segment_fn` from the saved input + (unchanged) param values into a fresh local tape, seed the recomputed output with the upstream grad, backprop, then push the recovered input/param grads to the real parents. Local tape drops at the end → recomputed activations freed. Exact by construction (same deterministic kernels, same inputs) → grads match the non-checkpointed path. Composes with bf16 (T12, same path on recompute) and DDP (T8, per-rank). Supporting change: `Var::backward_seeded(seed)` — backward from an explicit non-scalar upstream grad (the segment output is generally not a scalar); `backward()` is now the scalar wrapper that seeds ones. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 09:42:31 +08:00
Gahow Wang	9c557f0609	docs: run v7 — FineWeb subset near-ceiling at dim768 (val 3.01) v7 = same arch as v4/v5/v6 (dim768/18L, bf16, 8-GPU DDP global 256), trained the SAME 2.255B-token FineWeb-edu subset to 1.45 epoch (vs v6's 1.02), best FineWeb val 3.0149 (v6 3.0652). Exported + archived to registry v7-fineweb-edu-dim768, serves in xserv (coherent expository English, ~v6 quality). Key finding: more epochs of the SAME subset gave only ~0.05 val drop and the curve flattened (~step 44000) with no sampling quality gain → the 2.255B FineWeb subset is near its ceiling at dim768. Same class as v5's TinyStories data-volume saturation: repeating old data has thin margins; true further gains need FRESH shards (more diverse tokens), as v6's corpus-swap (which raised the ceiling) showed. Adds docs/runs/07-v7-*.md; updates docs/runs/README.md (+v7 row, intro saturation note, v8 proposal) and docs/evolution.md (+v7 row, dataset-axis ceiling note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 03:55:47 +08:00
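Gahow Wang	b4bb426d48	docs: run v6 — FineWeb-edu graduation (val 3.07, new distribution) First run off TinyStories: pure FineWeb-edu real web text (2.255B-token corpus), same architecture as v4/v5 (dim768/18L, core 127.43M), 8-GPU DDP bf16, 2.29B tok/1.02ep, ~1.9h @218K tok/s. train 11.03→3.14, best/final FineWeb val 3.0652. Methodology: FineWeb val (3.07) is not comparable to the v0–v5 TinyStories val (~1.1); real web text has higher entropy, so ~3.0 is expected and not a regression; the criteria are sample quality + transfer eval. - Add docs/runs/06-v6-fineweb-edu-dim768.md: data pipeline (scripts/fineweb_to_txt.py) / architecture (same as v4/v5, isolating the data variable) / hyperparameters / results (val falls monotonically with no plateau = not saturated) / methodology notes / transfer eval (v6→TinyStories val 2.75 vs v5 native 1.11; purely general data has a cost on a narrow distribution) / v5-vs-v6 same-prompt sample comparison (v6 writes real expository text vs v5 always falling into short stories) - README comparison table: add v6 row (val labeled with its own distribution) + axis-change note + v7 proposal - evolution.md: finalize the v6 row of the scaling table + note on the data-axis graduation TinyStories→FineWeb-edu Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 22:21:43 +08:00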
Gahow Wang	88bec270af	docs: evolution overview — per-milestone changes across algorithm/arch/infra/dataset axes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:30:52 +08:00
Gahow Wang	7e5ea9976b	data: FineWeb-edu parquet->txt prep script (Scaling v6) v6 broadens data from TinyStories to FineWeb-edu (HuggingFaceFW/fineweb-edu sample/10BT) while freezing the v4/v5 arch. scripts/fineweb_to_txt.py streams the parquet text column row-group by row-group and joins docs with <|endoftext|> so xtrain's existing Corpus loader (gpt2 BPE, u16 cache) handles it unchanged. Corpus .txt/.parquet/.u16.bin stay dash5-only (gitignored). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:04:45 +08:00
Gahow Wang	579365f4a0	docs: run v5 — TinyStories saturation at dim768 (val 1.11) Design doc 05-v5-tinystories-dim768.md (in Chinese, xserv style): data 2.49B tok/5.33ep, same architecture as v4 (cleanly isolating the data variable), bf16 8-GPU global 256, train 11.07→1.06, best val 1.1102. Key finding, the "data ceiling": v4 (1.54ep) 1.169→v5 (5.33ep) 1.110 is only ↓5% and val flattens at the end ⇒ TinyStories is near saturation at dim768/127M-core, so v6 should change axis (bigger model/broader corpus, not more TinyStories). xserv BF16 serving matches token-for-token on 3/3 prompts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 17:56:25 +08:00
Gahow Wang	8a1e29543b	run: v5 archive + export (dim768, bf16, 5.33ep, val 1.11) Add a v5 row to the v0–v5 comparison table plus tokens-trained / epoch columns, making TinyStories data saturation visible (v4→v5: same arch, ×3.5 data, val only ↓5% and flat at the end). Next-rung proposal changed to a v6 axis change. Exported 201 tensors + RUN.md, stored in the dash5 registry v5-tinystories-dim768 (checkpoint/safetensors not committed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 17:56:25 +08:00
Gahow Wang	5b7dde1736	test: bf16 test reads f32-cast logits (forward now returns bf16) The `keep bf16 logits` change made forward_batched return bf16 logits in bf16 mode; the bf16 test's host read must cast to f32 first. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:29:24 +08:00
Gahow Wang	320c1ae4fb	perf: KI-2 FIXED — dim768 bf16 fits batch 32, tok/s 31.5K→40.8K bf16 mixed precision (fp32 master) solves the v4 dim768 fp32 batch-32 OOM and speeds up the now-compute-bound dim768 GEMMs (dash5 1× RTX 5090 32GB, dim768/18L/24h×32 ffn2048 seq256, steady-state): config batch peak mem tok/s fits 32GB fp32 16 27.2 GB 31.5K yes bf16 16 19.3 GB 35.5K yes (-29% mem / +13% tok/s) fp32 32 — — OOM bf16 32 31.1 GB 40.8K yes (+29% vs fp32-b16) Verified on dash5: fp32 suite green at tight tol + xserv export md5 bit-identical to registry; bf16 looser-tol (loss 1.2e-4, logits p99 6.8e-3, grad 1.0e-2) + 150-step convergence tracks fp32 (3.984 vs 3.988); 2-GPU bf16 DDP at per-rank batch 32 trains cleanly. Mark KI-2 FIXED; fill docs/11 results. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:28:20 +08:00
Gahow Wang	48922cb628	perf: keep bf16 logits (no persistent fp32 logits buffer) At vocab 50257 the logits tensor [B*S, vocab] is ~1.6GB fp32 at batch 32 — held across the whole backward. Keep it bf16: cross_entropy upcasts the bf16 logits to fp32 internally (transient) + caches fp32 probs, and its backward casts dx back to bf16 to chain into the bf16 lm_head matmul backward. The sampler casts bf16 logits→f32 before the host argmax/softmax. Halves the persistent logits activation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:20:48 +08:00
Gahow Wang	30db62d8f2	docs: Phase T12 — bf16 mixed precision design docs/11-bf16-mixed-precision.md: the AMP split (bf16 linears + activations, fp32 master / norms / softmax / RoPE / CE, no loss scaling), the cast-op bridge, module layout, and the dual verification gate (fp32 unchanged + bf16 looser-tol + convergence + mem/throughput). Memory/throughput before->after to be filled from the dash5 bench. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:15:02 +08:00
Gahow Wang	0a2a4dcaa8	train: --bf16 flag (fp32-master AMP) + bf16 correctness test - TinyTransformer::with_compute_dtype(BF16): embedding stays fp32 master then casts to bf16; each linear casts its fp32 weight to bf16 on the fly; logits cast back to fp32 for cross-entropy. Default F32 reproduces the v0-v4 forward graph bit-for-bit. - --bf16 flag on bin/train and bin/train_ddp (off by default). - tests/bf16.rs: same fp32 master weights run fp32 vs bf16; assert loss/logits/grads within a loose bf16 tol, no NaN, and grads are fp32 (master untouched). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:14:55 +08:00
Gahow Wang	b0086b5214	autodiff: bf16 mixed-precision path (fp32 master via cast op) Tensor ops dispatch on dtype: fp32 branch unchanged (bit-identical), bf16 branch routes matmul/attention through GemmEx and elementwise through the bf16 kernels. Norm/softmax/RoPE/cross-entropy upcast to fp32 around the existing fp32 kernels (standard AMP: reductions/loss fp32, matmuls bf16). Transposes route bf16 through fp32 (pure layout). New autodiff `cast` op is the AMP bridge: forward downcasts a fp32 master leaf to bf16 for the matmul; backward upcasts the bf16 grad back to fp32. So the fp32 leaf accumulates an fp32 grad and AdamW / clip / DDP all-reduce stay fp32 and completely unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:14:48 +08:00
Gahow Wang	d05115ddf3	cuda: bf16 cuBLAS GemmEx (16BF in/out, fp32 accum) + cast kernels Add the bf16 compute primitives for T12 mixed precision: - DType::BF16 (half::bf16 as TensorDType), 2 bytes. - cublasGemmEx / cublasGemmStridedBatchedEx FFI + CUDA_R_16BF / CUBLAS_COMPUTE_32F constants (values per xserv gemm.rs). - cublas::gemm_ex / gemm_ex_strided_batched: same row-major⟺col-major transpose algebra as sgemm, bf16 in/out, fp32 accumulation. - csrc/ops/cast.cu: f32<->bf16 cast + bf16 elementwise (add/mul/scale/silu(+dx)/add_bias/sum_rows), each load->fp32->compute->store bf16. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 14:14:39 +08:00
Gahow Wang	511ceebbb3	docs: KI-2 trigger — dim768 fp32 batch-32 OOM v4 surfaced the concrete bf16 trigger: dim768 fp32 OOMs at per-rank batch 32 (global 256) in 32GB, forcing per-rank 16 (global 128). bf16 (halve activation mem) would restore the batch-256 sweet spot. Record it on KI-2; keep KI-2 as the backlog item it is (still deferred). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 13:14:42 +08:00
Gahow Wang	ff79fee3c5	docs: run v4 — TinyStories, dim768, val 1.17 Design doc docs/runs/04-v4-tinystories-dim768.md (data 720.9M tok ~1.54ep / arch dim768/18L core 127.4M vs v3 / hparams 22000 steps, global batch 128 per-rank 16, seq 256, lr 6e-4->6e-5 warmup 1100 + cosine, clip 1.0, world=8 DDP fp32 / results train 11.07->1.14, best val 1.1690, ~145K tok/s 8-GPU / v3->v4 improvement: val 1.30->1.17 + side-by-side samples). Notes that this run validated T11's caching allocator at dim768 multi-GPU and that dim768 fp32 batch-32 OOM is the bf16 trigger. Update docs/runs/README.md comparison table to v0/v1/v2/v3/v4 and the next-rung proposal to v5. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 13:14:37 +08:00
Gahow Wang	734e119db3	run: v4 archive + export (dim768, 8-GPU DDP, val 1.17) v4 scaling run finished: dim768/18L, core 127.43M (total 204.63M), trained 720.9M tokens (~1.54 epoch) on 8x RTX 5090 DDP fp32, ~145K tok/s, ~84 min, best val 1.1690. Checkpoint archived to registry (~/projects/tiny-models/v4-tinystories-dim768/) and exported to xserv HF Qwen3 safetensors (201 tensors, BF16); xserv serves it and matches xtrain greedy token-for-token on all 3 fixed prompts (40 tok). Add `greedy_sample` bin: load a trained ckpt with its arch flags and print xtrain's own greedy continuations for the fixed run prompts, so they can be diffed against xserv's greedy on the exported weights (the per-run token-match check). Same model/config/init scheme as bin/train.rs + bin/export_safetensors.rs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 13:14:28 +08:00
Gahow Wang	f85bd4d276	perf: KI-5 FIXED — single-GPU 40K->93K tok/s, DDP scaling 1.3x->5x@8 Device caching/pool allocator removes the per-op cudaMalloc serialization that was the real DDP bottleneck (and a single-GPU cost). Measured on dash5 (8x RTX 5090, dim384/12L, per-rank batch 32, seq 256, steady-state tok/s): single-GPU: 40226 -> 92638 tok/s (~2.3x) DDP scaling (global batch 32*world): world before after 1 39801 1.00x 92385 1.00x 2 47229 1.19x 146821 1.59x 4 52854 1.33x 269867 2.92x 8 48996 1.23x 461270 4.99x 8-GPU absolute throughput 49K -> 461K tok/s (9.4x); nvidia-smi shows all 8 GPUs at 95-99% util during the run (KI-5 saw only 1-2/8 busy). Loss trajectories are bit-identical before/after (10.9026->4.8453). xserv closed loop green: re-export of the v3 ckpt is md5-identical to the registry safetensors and xserv serves it. Mark KI-5 FIXED in docs/known-issues.md with before/after table; fill in the design doc's measured numbers. Residual ~5x@8 (not perfectly linear) is the ~7% all-reduce + 8-GPU PCIe/launch overhead; process-per-GPU is the next lever if v4 needs higher linearity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:15:02 +08:00
Gahow Wang	4c3f332f64	docs: Phase T11 — caching allocator Design doc for the device caching/pool allocator (KI-5 re-diagnosis recap, size classes, per-device + thread-safety, Drop->return, transparency/correctness argument, why skip-memset uninit is deferred, dual verification gates). Before/after numbers filled after dash5 measurement. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:11 +08:00
Gahow Wang	b7104e2cb7	test: loosen flaky DDP cross-rank assertion to <1e-6; scale to world=8 The cross-rank `max|p0-p1| == 0.0` check is flaky on this PCIe-only box: NCCL's all-reduce is not bit-reproducible run-to-run across ranks (algorithm/chunk choice is unstable), so cross-rank params can differ by a few ULP (observed <=1.2e-7) even with identical init + averaged grads. The load-bearing gate is the loss-trajectory match (~5.7e-7); a tight <1e-6 tolerance is the honest invariant. Also extend ddp_throughput_scaling to include world=8 for the KI-5 before/after scaling table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:11 +08:00
Gahow Wang	28801fbfe5	cuda: device caching allocator (pool GpuBuffer alloc) Every tape op allocates its output via Tensor::zeros -> GpuBuffer::alloc -> cudaMalloc, a synchronous process-serialized driver call. Under the single-process thread-per-GPU DDP model the rank threads' hundreds of per-step allocs serialize through the driver (KI-5 root cause); it costs single-GPU too. Add a per-device, size-classed caching pool: GpuBuffer::alloc serves from a free-list (request rounded up to a size class so repeating training shapes reuse buffers), only cudaMalloc on a miss; Drop returns the buffer to the pool instead of cudaFree. Thread-safe via a global registry keyed by device id with each device's free-list behind its own Mutex (registry lock held only to clone out the per-device Arc<Mutex<_>>, so rank threads don't contend across devices). The buffer records its alloc-time device so Drop returns to the right pool. Transparent: physical capacity may be rounded up, but len()/memset/copy bounds all use the requested length, so the rounded tail is never read and numerics are unchanged. zeros() still memsets (reused buffers hold stale bytes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 11:04:02 +08:00
Gahow Wang	d422c68704	docs: KI-5 — correct cross-rank divergence attribution (pre-existing flaky) The ~1-ULP cross-rank param divergence is NOT caused by coalescing: the original ungrouped all-reduce is itself run-to-run nondeterministic on this box (6 reruns: cross-rank diff {0, 0, 5.96e-8, 5.96e-8, 1.19e-7, 1.19e-7}), so the T8 test's `max|p0-p1| == 0.0` assertion is flaky here (passes ~1/3 of runs) independent of T11. Diffs are ≤1.19e-7 (a few ULP, numerically benign; loss-match stays ~6e-7). Noted as a follow-up to loosen the assertion to a tight tolerance; coalescing was reverted purely because it gives ~0 scaling benefit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:42:13 +08:00
Gahow Wang	84092fb28d	docs: KI-5 re-diagnosis — all-reduce is NOT the DDP bottleneck (T11) T11 set out to coalesce/overlap the gradient all-reduce per the original KI-5 hypothesis. Profiling on dash5 (8× RTX 5090, dim384, per-rank batch 32, seq 256) falsifies that hypothesis: - grad all-reduce is only ~6-7% of each step; - per-rank fwd+bwd inflates ~linearly with world (136→780 ms for the SAME per-rank workload) and dominates; - coalescing the ~150 per-tensor all-reduces into one grouped/flat launch gives ~0 scaling gain AND breaks cross-rank bit-identity (max|p0-p1| 0.0 → 1.49e-8), violating the T8 correctness gate — so the coalescing commit (`b8b5821`) was reverted. Real bottleneck (NOCOMM=1 still inflates; util shows 1-2 of 8 GPUs busy at a time; CPU not starved; per-thread default stream doesn't help): single-process thread-per-GPU ranks serialize on the single CUDA context's per-op cudaMalloc / driver calls. Fix direction (out of T11 scope): a caching/pool allocator, or process-per-GPU. Recorded in docs/known-issues.md with the measured table; KI-5 stays Open. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:40:45 +08:00
Gahow Wang	88c2c15768	Revert "dist: coalesce grads into buckets for all-reduce (KI-5)" This reverts commit `b8b58212dc`.	2026-06-16 09:39:38 +08:00
Gahow Wang	b8b58212dc	dist: coalesce grads into buckets for all-reduce (KI-5) Replace the per-parameter eager all-reduce (~150 tiny serial NCCL calls for dim512, DDP's dominant cost after T10's batched forward) with a coalesced bucketed all-reduce: pack grads into a few large contiguous scratch buffers, all-reduce each bucket once (fused via ncclGroupStart/End), fold the 1/world average into one per-bucket scale, unpack back. The packed buffer is the concatenation of the grad tensors, so NCCL's element-wise sum over a bucket equals the per-tensor sums — bit-identical to the un-bucketed path; only launch/latency overhead is removed. DDP cross-rank param identity + loss-match are preserved. Adds xtrain_cuda::device::copy_d2d (cudaMemcpy D2D) for the pack/unpack. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:09:44 +08:00
Gahow Wang	a78502e0f0	docs: run v3 — TinyStories, dim512, val 1.30 Per-run design doc docs/runs/03-v3-tinystories-dim512.md (data 245.8M tok full TinyStories ~0.53 epoch / arch dim512 16L core 67.13M vs total 118.59M, what changed vs v2 / hyperparams 30000 steps batch 32 seq 256 lr 6e-4→6e-5 warmup 1500 + cosine clip 1.0 single-GPU batched via T10 / results train 10.91→1.40 best val 1.3027 ~26K tok/s / improvement vs v2 1.71→1.30 with side-by-side samples). Notes v3 validated T10 batched forward at scale and avoided KI-5 by staying single-GPU; v4 proposal + open levers (KI-2/3/4/5, data ladder). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 03:37:45 +08:00
Gahow Wang	64b2a8c09e	run: v3 archive + export (dim512, single-GPU batched, val 1.30) v3 trained (30000 steps × batch 32 × seq 256 = 245.8M tok, ~0.53 epoch), single-GPU batched via T10 (~26K tok/s, ~2.65h). Archived to registry ~/projects/tiny-models/v3-tinystories-dim512/ (xtrain.ckpt + config.json + model.safetensors BF16 179 tensors + tokenizer.json + RUN.md) and served in xserv (loads 16L/dim512 qwen3, 2/3 prompts token-match xtrain greedy; 3rd diverges on BF16 drift as in v1/v2). best/final val 1.3027 (beats ~1.4 target). val ladder on the same held-out 1M-token set: v0 3.80 / v1 2.58 / v2 1.71 / v3 1.30. T10 (batched forward) validated at scale (KI-1 root cause = launch-bound, not all-reduce); single-GPU avoids KI-5. Update docs/runs/README.md comparison table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 03:37:36 +08:00
Gahow Wang	9a25616a30	docs: Phase T10 — batched forward docs/09-batched-forward.md: the launch-bound diagnosis recap, the [B*S,dim]-flatten + fused batched-attention design (RoPE per-seq position + causal masking inline in softmax), the attention forward/backward via strided-batched GEMM, autograd implications, the looped-split/merge dead-end post-mortem (1127 tok/s, host round-trips), verification methods + before→after throughput, and the v3 recommendation (per-rank batch 16-32, single/small world until KI-5 bucketed all-reduce lands). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:50 +08:00
Gahow Wang	4ccab0fb42	perf: KI-1 fixed — GPU util 0-15%→37-54%, tok/s 1653→25627 (15.5x) Mark KI-1 (single-sequence launch-bound, the root cause of "DDP weak scaling") FIXED by the T10 batched forward. dim384/12L, batch 16, seq 256, 1 GPU, back-to-back A/B: before (single-seq): ~1653 tok/s, GPU util 0-15%, ~3 GB after (batched): 25627 tok/s (batch16) / 40263 (batch32), util 37% mean / 54% peak, ~10 GB → single-GPU ~15.5x (batch16) / ~24x (batch32); util 0-15% → 37-54%. A single GPU at batch 32 (40K tok/s) now beats the old 4-GPU setup (3163) ~12x. The v3 falsification history (larger batch doesn't help a single-seq design) is kept. DDP residual weak scaling is a NEW, higher-level bottleneck batching exposes (eager all-reduce of all params each step) → recorded as KI-5 (bucketed/overlapped all-reduce), out of T10 scope. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:44:43 +08:00
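
The accumulation rule described in commit 7a03b0054a (scale each micro-loss by 1/N, let the grads sum, take one step at the boundary) can be illustrated with a minimal sketch. It is written in plain Python rather than the repo's Rust/CUDA, and the scalar least-squares model and every function name below are hypothetical stand-ins, not xtrain code. It assumes equal-sized micro-batches and a mean-reduced loss.

```python
def mean_sq_loss_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) over the batch, with respect to scalar w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accum_steps):
    """Sum the grads of accum_steps micro-batches, each scaled by 1/accum_steps."""
    micro = len(xs) // accum_steps
    assert micro * accum_steps == len(xs), "sketch assumes equal micro-batches"
    grad = 0.0
    for k in range(accum_steps):
        lo, hi = k * micro, (k + 1) * micro
        g = mean_sq_loss_grad(w, xs[lo:hi], ys[lo:hi])
        # Scaling the micro-loss by 1/N scales its grad by 1/N;
        # the running sum plays the role of the tape's SUM-accumulation.
        grad += g / accum_steps
    return grad


# Illustrative data only (not from any xtrain run).
xs = [0.5, -1.0, 2.0, 1.5, -0.25, 3.0, 0.75, -2.0]
ys = [1.0, 0.5, -1.5, 2.0, 0.0, 1.25, -0.5, 3.0]
w = 0.3

big = mean_sq_loss_grad(w, xs, ys)          # one step over the full batch
acc = accumulated_grad(w, xs, ys, 4)        # 4 micro-batches of 2
no_accum = accumulated_grad(w, xs, ys, 1)   # accum_steps == 1
```

In this sketch `acc` agrees with `big` up to floating-point rounding, while `no_accum` is exactly equal to `big` because dividing by 1 and adding to 0.0 change nothing. That mirrors the bit-close versus bit-identical distinction the tests in abe5ceb913 draw.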

154 Commits